April 17, 2015

Validate the encoding before passing strings to libcurl or glibc

Let’s start with a simple example in PHP:

setlocale(LC_ALL, "nl_NL.UTF-8");

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $_GET["url"]);
curl_exec($ch); // the host name is only looked up once the request runs


This code is broken. Can you tell how?

But it’s not just PHP or libcurl. Let’s try glibc.

#define _GNU_SOURCE     /* glibc only exposes AI_IDN when this is defined */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <locale.h>

#define BUF_SIZE 500

int main(int argc, char *argv[])
{
    struct addrinfo hints;
    struct addrinfo *result, *rp;
    int sfd, s, j;
    size_t len;
    ssize_t nread;
    char buf[BUF_SIZE];

    setlocale(LC_ALL, "nl_NL.UTF-8");

    if (argc < 3) {
        fprintf(stderr, "Usage: %s host port msg...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Obtain address(es) matching host/port */

    memset(&hints, 0, sizeof(struct addrinfo));
    hints.ai_family = AF_UNSPEC;    /* Allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_DGRAM; /* Datagram socket */
    hints.ai_flags = AI_IDN;        /* Convert the host name to IDN format */
    hints.ai_protocol = 0;          /* Any protocol */

    /* argv[1] is handed to the IDN conversion without any validation */
    s = getaddrinfo(argv[1], argv[2], &hints, &result);
    if (s != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(s));
        exit(EXIT_FAILURE);
    }

    /* ... the rest of the man page example continues unchanged ... */

    freeaddrinfo(result);
    exit(EXIT_SUCCESS);
}

This is a slightly modified excerpt of the example from the man page for getaddrinfo, and it is broken in exactly the same way.


The common factor is that both use libidn (well, glibc contains an in-tree copy of libidn, but the essence of it is the same). libidn is a library with various Unicode-related functions. For example, it can convert internationalized domain names (IDNs) to Punycode. This is what converts яндекс.рф to xn--d1acpjx3f.xn--p1ai, which contains only characters that can be used safely by the DNS.

The idna_to_ascii_8z documentation states:

Convert UTF-8 domain name to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.

As it turns out, the effect of passing a string that is not valid UTF-8 to any of the libidn functions that expect a UTF-8 string can be disastrous. If the passed-in data ends with an incomplete UTF-8 sequence, then libidn will continue reading past the terminating null byte. There could be unrelated information just past that byte, which then gets copied into the result. This could leak private information from the server!

For example, the UTF-8 encoding of ф is, in hex:

d1 84

In fact, any valid UTF-8 sequence that starts with d1 should always consist of 2 bytes. But if we pass:

d1 00

instead, then libidn will interpret this as if it had been passed:

d1 80

and it continues reading whatever is after our input.
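To make the mechanism concrete, here is a simplified sketch of a decoder that does not validate its input. It is not libidn’s actual code; naive_decode and naive_scan are hypothetical names used only for this illustration. The decoder trusts the lead byte to say how many bytes follow and never checks that those bytes are real continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified, NON-validating UTF-8 decoder (hypothetical, not libidn code).
 * It consumes as many bytes as the lead byte announces and never checks
 * that the following bytes have the continuation form 10xxxxxx.
 * Returns the number of bytes consumed and stores the code point in *cp.
 */
size_t naive_decode(const unsigned char *p, unsigned long *cp)
{
    size_t len, i;

    assert(p != NULL && cp != NULL);

    if (p[0] < 0x80)      { *cp = p[0];        len = 1; }
    else if (p[0] < 0xe0) { *cp = p[0] & 0x1f; len = 2; }
    else if (p[0] < 0xf0) { *cp = p[0] & 0x0f; len = 3; }
    else                  { *cp = p[0] & 0x07; len = 4; }

    for (i = 1; i < len; i++)
        *cp = (*cp << 6) | (p[i] & 0x3f);   /* a null byte is swallowed here */

    return len;
}

/*
 * Counts how many bytes the naive decoder walks over before it decodes a
 * code point of zero. For valid UTF-8 this is the same as strlen().
 */
size_t naive_scan(const unsigned char *s)
{
    size_t pos = 0;
    unsigned long cp;

    for (;;) {
        size_t n = naive_decode(s + pos, &cp);
        if (cp == 0)
            return pos;
        pos += n;
    }
}
```

Given the bytes d1 00 followed by the six bytes of "secret" and a final null byte, naive_decode turns d1 00 into U+0440 (the same code point that d1 80 encodes), and naive_scan walks over 8 bytes even though strlen would report a length of 1.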

The locale

Some applications don’t use idna_to_ascii_8z, but idna_to_ascii_lz instead. The documentation for idna_to_ascii_lz states:

Convert domain name in the locale’s encoding to ASCII string. The domain name may contain several labels, separated by dots. The output buffer must be deallocated by the caller.

You would expect the conversion from the locale’s encoding to UTF-8 to catch invalid input. However, it does not if the locale is already a UTF-8 locale (which is why the examples needed the setlocale calls): if the source and target encodings are identical, then no conversion happens, which means the invalid data is not caught.


The effect of the PHP code above when passed a domain name with invalid UTF-8 is that a DNS request is sent for a domain name that contains extra data.

It is possible that this data contains passwords or fragments of a key. However, it has to keep looking like UTF-8 to libidn, so the leak is unlikely to run on for as long as Heartbleed could (for example, multiple successive null bytes will stop the conversion). But it could easily allow an attacker to bypass ASLR.

The stringprep functions in libidn are affected by the same issue. These are used, for example, to normalize usernames and passwords. Here, it could allow an attacker to reuse parts of the password from a previous login.

Luckily, the AI_IDN flag of glibc is off by default, and I could not find many applications that ever set it.

So who should check it?

The libidn developers show little motivation to fix this, placing the blame on applications instead:

Applications should not pass unvalidated strings to stringprep(), it must be checked to be valid UTF-8 first. If stringprep() receives non-UTF8 inputs, I believe there are other similar serious things that can happen.
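Such a check is not hard to write. As a rough sketch (utf8_is_valid is a hypothetical helper, not a function from libidn, libcurl or glibc), an application could reject anything that is not well-formed UTF-8 before handing it to a function that expects UTF-8:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if the null-terminated string s is well-formed UTF-8.
 * Rejects truncated sequences, stray continuation bytes, overlong forms,
 * UTF-16 surrogates and code points above U+10FFFF.
 * Hypothetical helper for illustration only.
 */
bool utf8_is_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);

    while (*p != 0) {
        unsigned long cp, min;
        int extra, i;

        if (*p < 0x80)                { p++; continue; }
        else if ((*p & 0xe0) == 0xc0) { cp = *p & 0x1f; extra = 1; min = 0x80; }
        else if ((*p & 0xf0) == 0xe0) { cp = *p & 0x0f; extra = 2; min = 0x800; }
        else if ((*p & 0xf8) == 0xf0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return false;            /* stray continuation or invalid lead byte */

        for (i = 1; i <= extra; i++) {
            /* The terminating null byte fails this test as well,
               so the loop never reads past the end of the string. */
            if ((p[i] & 0xc0) != 0x80)
                return false;
            cp = (cp << 6) | (p[i] & 0x3f);
        }

        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return false;

        p += extra + 1;
    }
    return true;
}
```

With this sketch the two-byte encoding of ф (d1 84) is accepted, while a lone d1 at the end of the string is rejected before any library sees it.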

But the libcurl and glibc developers can pass on the blame to the layer above just as easily. The man page for getaddrinfo says:

AI_IDN - If this flag is specified, then the node name given in node is converted to IDN format if necessary. The source encoding is that of the current locale.

libcurl’s CURLOPT_URL says nothing about the required encoding.

This is a very messy situation, and so far nobody has shown any motivation to work on fixing it. So the best approach seems to be to fix end-applications to always validate that strings are valid in the current locale before passing them to libraries that require that. But how many PHP developers are likely to do that? And how many applications out there depend on getaddrinfo? It is unlikely that all of them will be fixed, so I hope the glibc/libcurl/libidn developers figure something out.
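For C applications, validating a string against the current locale could look roughly like the sketch below. It relies on the POSIX behaviour of mbstowcs when the destination is NULL; is_valid_in_locale is a hypothetical helper name, and the check only means something after the application has called setlocale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Returns true if s is a valid multibyte string in the encoding of the
 * current LC_CTYPE locale. Hypothetical helper for illustration only.
 * Without a prior setlocale() call the default "C" locale applies.
 */
bool is_valid_in_locale(const char *s)
{
    assert(s != NULL);

    /* With a NULL destination, POSIX mbstowcs only counts the wide
       characters and returns (size_t)-1 on an invalid sequence. */
    return mbstowcs(NULL, s, 0) != (size_t)-1;
}
```

An application would call this on the host name right after setlocale and refuse to call getaddrinfo with AI_IDN when it returns false.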