Invisible implants in source code

Researchers from Cambridge describe the Trojan Source method for inserting hidden implants in source code.

University of Cambridge experts described a vulnerability they say affects most modern compilers. The novel attack method exploits a legitimate feature of development tools so that the source code displays one thing but compiles to something completely different. It works through Unicode control characters.

Unicode directionality formatting characters relevant to reordering attacks.

Most of the time, control characters do not appear on the screen with the rest of the code (although some editors display them), but they modify the text in some way. The table above, for example, lists the control codes used by the Unicode Bidirectional (bidi) Algorithm.

As you probably know, some human languages are written from left to right (e.g., English), others from right to left (e.g., Arabic). When a line contains text in only one language, there’s no problem. When a single line mixes, for example, English and Arabic words, bidi codes specify the direction of each part of the text.
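To make these codes less abstract, here is a minimal Python sketch that prints the nine bidi control characters (embeddings, overrides, isolates, and their terminators) together with their abbreviations and official Unicode names. The constant and function names are our own, chosen for illustration; only the standard unicodedata module is used.

```python
import unicodedata

# The embedding, override, and isolate controls plus their terminators.
# Written as escapes so this listing contains no hidden characters itself.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"


def describe_bidi_controls():
    """Return (code point, abbreviation, official name) for each control."""
    rows = []
    for ch in BIDI_CONTROLS:
        rows.append(
            (
                f"U+{ord(ch):04X}",
                unicodedata.bidirectional(ch),  # e.g. "RLI"
                unicodedata.name(ch),           # e.g. "RIGHT-TO-LEFT ISOLATE"
            )
        )
    return rows


for code_point, abbreviation, name in describe_bidi_controls():
    print(f"{code_point}  {abbreviation:<4} {name}")
```

Writing the characters as \u escapes, rather than pasting them in directly, keeps the listing itself free of invisible content.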

The authors used such codes to, for example, move the comment terminator in Python code from the middle of a line to the end. They applied an RLI code to shift just a few characters, leaving the rest unaffected.

Example of vulnerable Python code using bidi codes.

On the right is the version programmers see when checking the source code; the left shows how the code will be executed. Most compilers ignore control characters. Anyone checking the code will think the fifth line is a harmless comment, although in fact, an early-return statement hidden inside will cause the program to skip the operation that debits bank account funds. In this example, in other words, the simulated banking program will dispense money but not reduce the account balance.
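The effect can be reproduced in a few lines. The sketch below is loosely modeled on the early-return example described above, not copied from the paper: the function name, the bank dictionary, and the docstring text are assumptions made for illustration. It builds the source as a string with the RLI character written as an escape, then asks Python how it actually parses and runs that code.

```python
import ast

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, written as an escape on purpose

# In logical (byte) order the docstring ends before ";return".
# A bidi-aware viewer may draw the text after RLI right to left,
# so "return" can appear to sit inside the docstring.
source = (
    "def subtract_funds(account, amount):\n"
    f"    ''' Subtract funds from bank account then {RLI}''' ;return\n"
    "    bank[account] -= amount\n"
    "    return\n"
)

# What the parser sees: a docstring, then an immediate return.
function_node = ast.parse(source).body[0]
statement_kinds = [type(node).__name__ for node in function_node.body]
print(statement_kinds)

# Running the code confirms that the balance is never reduced.
bank = {"alice": 100}
namespace = {"bank": bank}
exec(compile(source, "<trojan-demo>", "exec"), namespace)
namespace["subtract_funds"]("alice", 50)
print(bank["alice"])
```

The parser reports a return statement directly after the docstring, and the balance stays at its starting value, which is the behavior a reviewer looking at the rendered text would not expect.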

Why is it dangerous?

At first glance, the vulnerability seems too simple. Who would insert invisible characters, hoping to deceive source code auditors? Nevertheless, the problem was deemed serious enough to warrant a vulnerability identifier (CVE-2021-42574). Before publishing the paper, the authors notified the developers of the most common compilers, giving them time to prepare patches.

The report outlines the basic attack capabilities. The two execution strategies are to hide a command within a comment, and to hide something in a string that, for example, is displayed on-screen. It is also possible, in theory, to achieve the opposite effect: to create code that looks like a command but is in fact part of a comment and will not be run. Even more creative methods of exploiting this weakness are bound to exist.

For example, someone could use the trick to carry out a sophisticated supply-chain attack whereby a contractor supplies a company with code that looks correct but doesn’t work as intended. Then, after the final product is released, an outside party can use the “alternative functionality” to attack customers.

How dangerous is it, really?

Shortly after the paper was published, programmer Russ Cox critiqued the Trojan Source attack. He was, to put it mildly, unimpressed. His arguments are as follows:

  • It is not a new attack at all;
  • Many code editors use syntax highlighting to show “invisible” code;
  • Patches for compilers are not necessary — carefully checking the code to detect any accidental or malicious bugs is sufficient.

Indeed, the problem with Unicode control characters surfaced, for example, way back in 2017. Also, a similar problem with homoglyphs — characters that look the same but have different codes — is hardly new and can also serve to sneak extraneous code past manual checkers.
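The homoglyph problem is easy to demonstrate. In this minimal sketch, the two strings and the helper name are hypothetical examples of our own: one identifier is plain Latin, the other swaps in the Cyrillic letter "а", which most fonts render identically.

```python
import unicodedata

latin_name = "paypal"
# Same word, but both "a" letters are U+0430 CYRILLIC SMALL LETTER A.
lookalike_name = "p\u0430yp\u0430l"


def non_ascii_characters(text):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if ord(ch) > 127
    ]


# The strings look alike on screen but compare as different.
print(latin_name == lookalike_name)
print(non_ascii_characters(lookalike_name))
```

Flagging every non-ASCII character is a blunt check that would also catch legitimate identifiers and comments in other languages, so treat it as a starting point for review rather than a verdict.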

However, Cox’s critical analysis does not deny the existence of the problem. Rather, it condemns reports as overdramatic, an apt characterization of, for example, journalist Brian Krebs’ apocalyptic article “‘Trojan Source’ Bug Threatens the Security of All Code.”

The problem is real, but fortunately the solution is quite simple. All patches already out or expected soon will block the compilation of code containing such characters. (See, for example, this security advisory from the developers of the Rust compiler.) If you use your own software build tools, we recommend adding a similar check for hidden characters, which should not normally be present in source code.

The danger of supply-chain attacks

Many companies outsource development tasks to contractors or use ready-made open-source modules in their projects. That always opens the door to attacks through the supply chain. Cybercriminals can compromise a contractor or embed code in an open-source project and slip malicious code into the final version of the software. Code audits typically reveal such backdoors, but if they don’t, end users may get software from trusted sources but still lose their data.

Trojan Source is an example of a far more elegant attack. Instead of trying to smuggle megabytes of malicious code into an end product, attackers can use such an approach to introduce a hard-to-detect implant into a critical part of the software and exploit it for years to come.

How to stay safe

To guard against Trojan Source–type attacks:

  • Update all programming language compilers you use (if a patch has been released for them), and
  • Write your own scripts that detect a limited range of control characters in source code.
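A detection script of this kind can be very short. The following is a minimal sketch, assuming UTF-8 source files and covering only the nine bidi control characters; the function names and the report format are our own choices, and a real check might also need to cover other invisible characters.

```python
from pathlib import Path

# Bidi embeddings, overrides, isolates, and their terminators.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(text):
    """Return (line, column, code point) for each bidi control in text."""
    hits = []
    # Split on "\n" only, so line numbers match what most editors show.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for column, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_number, column, f"U+{ord(ch):04X}"))
    return hits


def scan_files(paths):
    """Print every hit and return True if any file contains a bidi control."""
    found = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_number, column, code_point in find_bidi_controls(text):
            print(f"{path}:{line_number}:{column}: bidi control {code_point}")
            found = True
    return found


# Quick check on an in-memory sample instead of a real file.
sample = "ok = True\nx = 1 \u2067# note"
print(find_bidi_controls(sample))
```

In a build pipeline, scan_files could be called with the list of changed files and the build failed whenever it returns True. Legitimate right-to-left text in comments or strings would then need an explicit allowlist.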

More broadly, the fight against potential supply-chain attacks requires both manual code audits and a range of automated tests. It never hurts to look at your own code from a cybercriminal perspective, trying to spot that simple error that could rupture the whole security mechanism. If you lack the in-house resources for that kind of analysis, consider engaging outside experts instead.
