String Concatenation - Root of all Evil
After college, I spent three years in software security acting as a pen-tester. It was my job to identify security vulnerabilities in the company's software via exploitation. During that time, string concatenation was the root cause of the vast majority of my findings. For that reason, it is my general philosophy that:
If you're concatenating strings, you're doing it wrong.
What?
To make my point easier to grok, let's start with a simple example:
>>> person.first_name + " " + person.last_name
'Bob Smith'
We've joined a person's first and last name with a space. The concatenated value can be thought of as a space-delimited string. In and of itself, this is not an issue. But what happens when we need to reverse this transformation? Easy enough. Using the previous output as our input, we split on the space character.
>>> name = "Bob Smith".split(" ")
>>> print name
['Bob', 'Smith']
What happens if the first name contains a space? If an attacker controls that value, an injected space could break the application's intended behavior. It's conceivable that after deserialization, the value might be used in some security-related decision.
>>> name = "Bob Smith RealLastName".split(" ")
>>> print name
['Bob', 'Smith', 'RealLastName']
>>> if "Smith" == name[1]:
... print "Welcome Smith!"
...
Welcome Smith!
To address the vulnerability, we have a few options:
- Validate the first name, rejecting requests with spaces
- Sanitize the first name, removing spaces
- Introduce an escape sequence and escape the names
Validation and sanitization are appealing choices, but if the business logic forbids it, our only option is escaping.
A simple solution would be to replace spaces with '\ ' and backslashes with '\\'.
>>> person.first_name.replace(' ', '\ ') + " " + person.last_name.replace(' ', '\ ')
'Bob\\ Smith RealLastName'
You might be inclined to argue that this is a sufficient fix (never mind that it's not to spec). However, I'm here to suggest that's not true. Manually escaping and concatenating strings is inherently brittle. Forgetting to apply the escaping algorithm in one place is enough to introduce a severe vulnerability.
The mistake here was casually performing serialization rather than using a library whose sole responsibility is serialization. This example is somewhat contrived, but the fundamental principle can be expanded to CSVs, XML, JSON, URIs, SQL, or any formatted data.
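As a minimal sketch of handing the job to a serializer, here is the name example again using the json module from Python 3's standard library. The function names serialize_name and deserialize_name are hypothetical, and JSON is just one reasonable choice of format.

```python
import json

def serialize_name(first_name, last_name):
    # The library decides how to delimit and escape the two fields.
    return json.dumps([first_name, last_name])

def deserialize_name(data):
    first_name, last_name = json.loads(data)
    return first_name, last_name

# An attacker-controlled first name containing a space no longer
# shifts the field boundaries.
wire = serialize_name("Bob Smith", "RealLastName")
print(wire)                    # ["Bob Smith", "RealLastName"]
print(deserialize_name(wire))  # ('Bob Smith', 'RealLastName')
```

The round trip is lossless for any pair of strings, so there is no escaping rule to remember at each call site.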
Examples
Let's break down a few exploits to demonstrate my point.
SQL
Here we have a simple SQL injection in PHP:
<? "SELECT * FROM table WHERE col='" . $input . "'";
If the input contains a single quote, "'", the SQL query's structure is disrupted. An attacker can use this to alter the query. (This example is benign. There are significantly more malicious uses for SQL injection, but that's not what I'm here to discuss.)
<?
$input = "' OR ''='";
$query = "SELECT * FROM table WHERE col='" . $input . "'";
echo $query;
SELECT * FROM table WHERE col='' OR ''=''
How do we fix it? There are three general approaches:
<?
1. "...WHERE col='" . str_replace("'", "", $x) . "'"
2. "...WHERE col='" . mysql_real_escape_string($x) . "'"
3. $mysqli->prepare("...WHERE col=?", $x)
The first approach manually sanitizes the input by blacklisting the single quote character. This approach is problematic, because it doesn't account for all possible special characters. If the surrounding quotes were instead double quotes, this method would fail. When I've seen this type of sanitization, it is typically done at the beginning of the request, far away from the code which actually utilizes the variable. That increases the probability of such a mistake.
The second approach uses a library call to escape the input. The last approach parameterizes the query, sending the query and its arguments separately to the database. Both the second and third approaches use a library call, and both work for this query. If that's the case, what's wrong with using the version that relies on string concatenation? What if the column was instead a number?
<? "...WHERE number=" . mysql_real_escape_string($x);
There are no characters to escape in this case! An injection like "0 OR 1=1" will still work. If the variable $x had been validated as an integer prior to use, there wouldn't have been an issue. However, that's a brittle solution, because a simple mistake (or refactor) could still compromise the application's security.
What's really wrong here? The query is the formatted data. The library call is used to escape a value which is being serialized into the query. That's the responsibility of the library. By using string concatenation, we've effectively written part of a SQL serializer.
(Learn more about motivations to use parameterized queries.)
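To see the numeric-column case end to end, here is a small sketch in Python 3 using the standard library's sqlite3 module in place of PHP and MySQL. The accounts table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (number INTEGER, owner TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(0, "alice"), (1, "bob")],
)

x = "0 OR 1=1"  # attacker-controlled input

# Concatenation: the input becomes part of the query's structure.
concatenated = conn.execute(
    "SELECT owner FROM accounts WHERE number=" + x
).fetchall()

# Parameterization: the input is only ever treated as a value.
parameterized = conn.execute(
    "SELECT owner FROM accounts WHERE number=?", (x,)
).fetchall()

print(len(concatenated))  # 2, every row matched
print(parameterized)      # [], no row has that value
```

With the placeholder, the driver never asks whether the value needs quoting. The query's structure is fixed before the value is bound.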
URI
URIs are a fascinating target for injection attacks. They are extremely easy to construct without a library, which means it's done frequently, and the results work most of the time. As more applications move towards service-oriented architectures, URIs are increasingly used as a component of the communication. If unescaped input is placed into the URI, it might be possible to alter the application's intended behavior.
In the following example, we have a web application which transfers funds by dispatching to an HTTP service. The user controls the variable amount, but not user or target. The service domain s.xdxa.org is not internet accessible. We can assume the backend service will validate amount as a positive integer prior to use.
>>> print ('http://s.xdxa.org/pay?amount=' + amount +
... '&from=' + user +
... '&to=' + target)
http://s.xdxa.org/pay?amount=100&from=eve&to=bob
To force Bob to pay Eve, we only need to alter the URI structure.
>>> amount = '100&from=bob&to=eve#'
>>> print ...
.../pay?amount=100&from=bob&to=eve#&from=eve&to=bob
The fragment (everything after the '#') is dropped before making the backend request. As a result, the service's log will look completely normal.
To fix this, use a library to generate URIs. This is typically accomplished by using either a builder or a map of parameters to arguments.
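A minimal sketch of the map approach, using urllib.parse from Python 3's standard library. The helper name build_pay_url is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def build_pay_url(amount, user, target):
    # urlencode percent-encodes every key and value for the query component.
    query = urlencode({"amount": amount, "from": user, "to": target})
    return urlunsplit(("http", "s.xdxa.org", "/pay", query, ""))

url = build_pay_url("100&from=bob&to=eve#", "eve", "bob")
print(url)
# http://s.xdxa.org/pay?amount=100%26from%3Dbob%26to%3Deve%23&from=eve&to=bob

# The service sees the injected text as the (invalid) amount,
# while from and to stay untouched.
print(parse_qs(urlsplit(url).query))
```

The backend's positive-integer check on amount now rejects the request, because the injected text arrives as part of that single value.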
Unfortunately, libraries for RESTful URIs, where arguments may be part of the path, are scarcer. In that circumstance, concatenation plus escaping may be your only option. However, if you do that, isolate that code in its own module, i.e., create a library. The Java library Handy URI Templates is a good model for generating said URIs.
Additionally, be careful in other sections of the URI. Other components of the URI, e.g., the scheme and authority, have different rules. If you want to know more, The Tangled Web has an entire chapter dedicated to the URL.
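If you do end up escaping path segments yourself, the sketch below shows one way to keep that logic in a single function, again with Python 3's urllib.parse. The /users/<username>/balance layout and the name balance_url are assumptions made up for this example.

```python
from urllib.parse import quote, urlunsplit

def balance_url(username):
    # Hypothetical endpoint layout: /users/<username>/balance
    if username in ("", ".", ".."):
        # quote() leaves dots alone, so dot segments are rejected outright.
        raise ValueError("invalid path segment")
    # safe="" makes quote() encode "/" as well, keeping the value in one segment.
    segment = quote(username, safe="")
    path = "/users/" + segment + "/balance"
    return urlunsplit(("http", "s.xdxa.org", path, "", ""))

print(balance_url("bob/../eve?x=1#"))
# http://s.xdxa.org/users/bob%2F..%2Feve%3Fx%3D1%23/balance
```

Every caller goes through the one function, so there is exactly one place to get the escaping right.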
HTML
XSS (Cross-Site Scripting) is a vulnerability where an attacker is able to inject JavaScript into a page, which is then executed in a victim's browser. It's a subset of the attacks against poorly serialized HTML. The attack is made possible by the fact that HTML mingles mark-up and code. In fact, HTML is composed of a number of different data formats within a single document:
- HTML Tags
- JavaScript
- CSS
- URIs
When represented as an object structure, we refer to the page as a DOM (Document Object Model). However, to get that page to the browser, it's serialized as a string and transmitted over HTTP. It is in that process that XSS is made possible.
To make this concrete, let's take a look at an example. What happens if an attacker includes a script tag in the URL arg parameter? (Again, this is a simplistic example. There are significantly more malicious XSS attacks.)
<span><?php echo $_GET['arg']; ?></span>
becomes
<span>Hello! <script>alert(1);</script></span>
The user content altered the structure of the document. To fix this, we could escape the variable prior to output. However, that is simply string concatenation. The output is exactly the same.
<span><?php echo htmlentities($_GET['arg']); ?></span>
is the same as
<? echo '<span>' . htmlentities($_GET['arg']) . '</span>';
This means it falls prey to all the same risks as above. One missed escape, or escaping for the wrong context, and we have a vulnerability.
<a href="<?php echo htmlentities($_GET['arg']); ?>">...</a>
becomes
<a href="javascript:alert(1);">...</a>
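The same context mismatch can be reproduced in Python 3, with html.escape standing in for htmlentities. The allow-list check is_safe_link is a hypothetical helper, and rejecting everything except http and https (including relative links) is an assumption made to keep the sketch short.

```python
import html
from urllib.parse import urlsplit

arg = "javascript:alert(1);"

# HTML escaping has nothing to change here: the payload contains
# no <, >, &, or quote characters.
print(html.escape(arg, quote=True) == arg)  # True

def is_safe_link(url):
    # In the URI context, the scheme is what matters.
    return urlsplit(url).scheme in ("http", "https")

print(is_safe_link(arg))                   # False
print(is_safe_link("https://xdxa.org/"))   # True
```

The escape was applied correctly for HTML text, yet the value sits in a URI, which has its own rules.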
But wait! XSS is really because a programmer failed to check for JavaScript within the user inlined content. That's true, in the implementation of HTML we know and love. If the issue wasn't truly with the failure to preserve the document structure, we could simply introduce a tag which would disable all scripts within the child element. That hasn't happened, because the issue is really about the injection. (As an aside, avoid using HTML for user supplied markup. It's a losing battle.)
So what do I recommend? Templating engines. (Keep in mind, templating engines may not automatically escape variables within your output. Escaping must be configured on a per-engine basis.)
<span>{{ input.arg }}</span>
That having been said, automatic escaping is typically not a panacea. Why? The escaping is normally context-unaware, which means it's performed for HTML tags. If your templating engine allows bad markup, it's doubtful that it understands the context for which it's escaping. Where can this bite you?
- Tag Attributes
- URIs
- CSS
- JavaScript
(It is rarely safe to output content into a script tag. Avoid this whenever possible.)
Another even safer approach is to construct your page as a DOM. This will ensure tags, attributes, and possibly URIs are properly escaped, but probably not CSS or JavaScript. Almost no frameworks actually do this, because it's cumbersome and (probably) more CPU/memory costly.
Tangential to string concatenation, is a perfect templating engine enough? Unfortunately, browsers suck (from a security perspective). You could trace this back to the browser wars, where browsers which refused to render bad markup were labeled at fault. Regardless of the history, browsers have extremely "flexible" parsers. As a result, even if a document looks rightish, it might still contain an XSS. While I generally do not recommend sanitizing user input (it feels too much like a blacklist), input destined for HTML is an exception for the reasons outlined above. See OWASP's Sanitization Recommendations for candidate libraries.
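As a rough sketch of the DOM approach, Python 3's xml.etree.ElementTree can build the span as an object and serialize it afterwards. This is only an illustration using a standard-library tree, not a recommendation of ElementTree as an HTML framework. Its HTML serializer writes the text of script and style elements without escaping, which matches the caveat about CSS and JavaScript.

```python
import xml.etree.ElementTree as ET

arg = "Hello! <script>alert(1);</script>"

# Build the element as an object; user input is only ever assigned
# as text or as an attribute value, never parsed as markup.
span = ET.Element("span")
span.text = arg
span.set("title", 'x" onmouseover="alert(1)')

# Serialization is the library's job, so escaping happens here.
markup = ET.tostring(span, encoding="unicode", method="html")
print(markup)
```

The script tag comes out as escaped text, and the quote in the attribute value can no longer close the attribute.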
You could easily write a book on injection attacks against HTML. For that reason, I'm going to stop here. Want to know more?
- HTML5 Security Cheatsheet
- XSS Filter Evasion Cheat Sheet
- Postcards from the post-XSS world
- The innerHTML Apocalypse
- HTTP cookies, or how not to design protocols
- Path Insecurity
- My XSS/CSRF Slide Deck
JSON
To avoid making this article much longer, I'm not going to break out a JSON example. However, I do want to mention it, because I have frequently seen engineers construct JSON from a string. Don't do this.
Embedded Variables
Note, there is no distinction between embedding a variable and string concatenation. Neither of the following two examples addresses the serialization problem:
<? "$first $last" === $first . ' ' . $last;
Or with sprintf-like substitution:
"%s %s" % (person.first_name, person.last_name)
Logging
Before we conclude, I want to make a special note with regards to logging. In my code, casual string concatenation most frequently occurs in logging statements. Logs are typically loosely formatted, newline delimited files. Loosely structured output is mostly fine when intended for human consumption. However, this casual serialization format is a nightmare for log parsers.
For this reason, one must be careful when logging output which will eventually be consumed by some downstream process. While not strictly the responsibility of the logger, HTML log viewers have been susceptible to XSS attacks in the past.
See CWE-117 and CAPEC-106 for more information.
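For logs that a downstream process will parse, one option is to serialize each entry with a library instead of concatenating fields. The sketch below uses Python 3's json module. The helper format_log_line and the field names are hypothetical, and one JSON object per line is just one possible format.

```python
import json

def format_log_line(event, **fields):
    # One JSON object per line; the serializer escapes newlines
    # and quotes inside every value.
    record = {"event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)

# Attacker-controlled username that tries to forge a second entry.
username = "eve\nlogin_success user=admin"

naive = "login_failure user=" + username
structured = format_log_line("login_failure", user=username)

print(len(naive.splitlines()))       # 2, a forged entry appears
print(len(structured.splitlines()))  # 1
print(json.loads(structured)["user"] == username)  # True
```

The newline survives as data inside the value, but it can no longer act as the record delimiter.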
Clarification
This concept is not new or innovative. The weakness I'm describing is actually CWE-707: Improper Enforcement of Message or Data Structure, and it is not exclusive to string concatenation or even strings. I wag my finger at string concatenation, because it makes these mistakes easy. And consequently, common.
Hyperboles
The title is mostly meant as tongue-in-cheek, but I'm quite serious about the severity of this topic. In my opinion there are only two places string concatenation should occur:
- Serialization/Deserialization Libraries
- Unformatted Data
Many security issues simply do not have an easy solution, e.g. side channel attacks. However, if engineers consistently use libraries for serialization, the majority of injection based attacks will simply go away.
Is it possible to write secure code with concatenation? Yes. However, this code will almost always be brittle. You might understand the security considerations, but the next engineer probably will not. Writing secure code isn't just preventing a certain type of attack. It's following best practice throughout the entire code base to set the standard which avoids careless mistakes.