Validate the encoding before passing strings to libcurl or glibc
Let's start with a simple example in PHP:
    setlocale(LC_ALL, "nl_NL.UTF-8");
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $_GET["url"]);
    curl_exec($ch);
This code is broken; can you tell how?

But it's not just PHP or libcurl; let's try glibc.
This is a slight modification of the example from the man page for getaddrinfo, and it is broken in the exact same way.
The common factor is that both use libidn (well, glibc contains an in-tree copy of libidn, but it is essentially the same code). libidn is a library of various Unicode-related functions. For example, it can convert internationalized domain names (IDNs) to punycode: this turns a Cyrillic domain name into something like xn--d1acpjx3f.xn--p1ai, which contains only characters that can be used safely in the DNS.
The documentation for idna_to_ascii_8z states:
Convert UTF-8 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
As it turns out, the effect of passing a string that is not valid UTF-8 to any of the libidn functions that expect a UTF-8 string can be disastrous. If the passed-in data ends with an unfinished UTF-8 codepoint, then libidn will continue reading past the terminating null-byte. There could be unrelated information just past that byte, which then gets copied into the result. This could leak private information from the server!
For example, the UTF-8 encoding of ф is, in hex, d1 84.
In fact, any valid UTF-8 sequence that starts with d1 must consist of exactly 2 bytes: d1 followed by one continuation byte. But if we instead pass a string that ends with a lone d1, libidn treats the bytes past the terminating null-byte as the rest of that codepoint, and it continues reading whatever is after our input.
Some applications don't use idna_to_ascii_8z but idna_to_ascii_lz instead. The documentation for idna_to_ascii_lz states:
Convert domain name in the locale’s encoding to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.
However, this does not help if the locale is already a UTF-8 locale (which is why the examples needed the setlocale calls): if the source and destination encodings are identical, then no conversion happens, which means the invalid data is not caught.
The effect of the PHP code above, when passed a domain name containing invalid UTF-8, is that a DNS request is started for a domain which contains extra data read from beyond the buffer. It is possible that this data contains passwords or fragments of a key; however, it has to continue to look UTF-8-like to libidn, so the leak is unlikely to run on for as long as Heartbleed's could (for example, multiple successive null-bytes will stop the conversion). But it could easily allow an attacker to bypass ASLR.
The stringprep functions in libidn are affected by the same issue. These are used, for example, to normalize usernames and passwords. Here, it could allow an attacker to reuse parts of a password from a previous login.
The AI_IDN flag of glibc is off by default, and I could not find many applications that ever set it.
So who should check it?
The libidn developers show little motivation to fix this, pointing the blame at applications instead:
Applications should not pass unvalidated strings to stringprep(), it must be checked to be valid UTF-8 first. If stringprep() receives non-UTF8 inputs, I believe there are other similar serious things that can happen.
But the libcurl and glibc developers can pass the blame on to the layer above just as easily. The man page for getaddrinfo says only:
AI_IDN - If this flag is specified, then the node name given in node is converted to IDN format if necessary. The source encoding is that of the current locale.
The documentation for CURLOPT_URL says nothing about the required encoding at all.
This is a very messy situation, and so far nobody has shown much motivation to fix it. The best approach, then, seems to be to fix end-applications to always validate that strings are valid in the current locale before passing them to libraries that require it. But how many PHP developers are likely to do that? How many applications are out there that depend on getaddrinfo? That every one of them gets fixed is unlikely, so I hope the glibc/libcurl/libidn developers figure something out.
Copyright © 2016 - Thijs Alkemade