Security implications of URL parsing differentials

During security research on Apache2’s authentication modules, my team and I identified an issue caused by the Apache2 HTTP server and modern web browsers parsing URLs differently. Although differential URL parsing is a common and publicly documented problem, I think it has not received the attention it deserves. It can affect a wide range of software and introduce vulnerabilities in critical features such as authentication flows and requests to internal services.

In this blog post, I detail how differential URL parsing bugs arise and which URL parser libraries are affected. I’ll use a recent bug discovered in mod_auth_openidc, a popular Apache2 module, as a real-life example of this pattern, and then show how to easily detect similar bugs in your own application through differential testing. With that, I hope to raise awareness of these subtle bugs and add a new item to your toolbox!

Differential URL parsing example

To understand differential URL parsing, let’s look at mod_auth_openidc, a third-party Apache2 module developed by ZmartZone. It acts as an OpenID Connect Relying Party, allowing users to authenticate and authorize against any OpenID Connect Provider.

For example, you can deploy this module in front of your public web properties and only allow users who authenticate with your company’s Google accounts. If you want to learn more about these technologies, Okta has published an illustrated guide to OAuth 2.0 and OpenID Connect.

As the OpenID Connect Provider is very likely to live on a different origin (in the HTTP sense) than the one hosting the applications, users have to be redirected between the two, with important information passed along. This information often includes URLs to redirect clients to, and it is important to validate these values to avoid sending clients to unexpected destinations. This unsafe behavior is called an open redirect.

Open redirect bugs are generally considered to have little security relevance on their own, because they require user interaction to have an impact (for example, phishing). However, when chained with other features of an application such as OAuth flows, they can allow attackers to steal access tokens and gain the victim’s privileges on the application.

CVE-2021-32786: Open redirect in mod_auth_openidc

In this section, I document an open redirect issue I discovered in mod_auth_openidc, caused by a parsing difference between Apache2’s internal URL parsing functions and the ones used by web browsers.

When validating a URL to redirect users to during the refresh token request or logout phases, a function called oidc_validate_redirect_url() is called. Its implementation relies on apr_uri_parse(), at [1], to extract the relevant information from the user-controlled parameter and populate the members of the apr_uri_t structure:

src/mod_auth_openidc.c

static apr_byte_t oidc_validate_redirect_url(request_rec *r, oidc_cfg *c,
       const char *url, apr_byte_t restrict_to_host, char **err_str,
       char **err_desc) {
   apr_uri_t uri;
   const char *c_host = NULL;
   apr_hash_index_t *hi = NULL;
 
   if (apr_uri_parse(r->pool, url, &uri) != APR_SUCCESS) {  // [1]
       *err_str = apr_pstrdup(r->pool, "Malformed URL");
       *err_desc = apr_psprintf(r->pool, "not a valid URL value: %s", url);
       oidc_error(r, "%s: %s", *err_str, *err_desc);
       return FALSE;
   }

Further checks are performed during the call to oidc_validate_redirect_url(), such as:

  • If no explicit allow list of “secure” redirect URLs is configured, the hostname is checked instead (for example, the host of the current request must match the one extracted from the parameter);
  • Reject relative URLs that do not start with a slash, or that start with // or /\, to prevent vulnerabilities such as CVE-2019-3877;
  • Reject CR and LF characters in the parameter to prevent newline injection (and, eventually, open redirects and cross-site scripting bugs).
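
As a rough illustration of the checks listed above, here is a minimal Python sketch. It is not the module's code: the function name validate_redirect_url, its parameters, the exact prefixes being checked, and the use of urllib.parse as the parser are all assumptions made for this example, and the allow-list branch is left out.

```python
from urllib.parse import urlsplit


def validate_redirect_url(url: str, request_host: str) -> bool:
    """Hypothetical re-creation of the checks described above (not the module's code)."""
    # Reject CR and LF before parsing, so newline injection never reaches the parser.
    if "\r" in url or "\n" in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.hostname is None:
        # No host was parsed: only accept a path starting with a single slash.
        if not url.startswith("/"):
            return False
        if url.startswith("//") or url.startswith("/\\"):
            return False
        return True
    # A host was parsed: it must match the host of the current request.
    return parts.hostname == request_host.lower()
```

Like the code it imitates, this sketch trusts whatever host its own parser reports, which is exactly the assumption the rest of this section challenges.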

However, apr_uri_parse() splits URLs according to RFC 2396 and RFC 3986 (with some custom behavior, e.g., in userinfo parsing), while browsers try to follow the WHATWG URL Living Standard. Every URL parser has its own slight implementation quirks, but here we are talking about two different specifications altogether.

As stated in the authority state section of the WHATWG specification, a browser switches to the host state when it encounters a backslash (just as it would for a slash). The function apr_uri_parse(), on the other hand, simply treats the backslash as part of the userinfo, because it sits to the left of the last @:


/* If there's a username:password@host:port, the @ we want is the last @...
 * too bad there's no memrchr()... [...]
 */
do {
    --s;
} while (s >= hostinfo && *s != '@');

Because of this parsing difference, mod_auth_openidc can be tricked into thinking that a URL is “secure” (for example, that it points to the correct domain) while browsers follow the redirection to an unintended host. This behavior can be demonstrated on an endpoint such as /oauth2/callback with the logout parameter set to https://evil.destination.tld\@host.tld/: this value passes all validation steps, and the user is redirected to https://evil.destination.tld. This is not the expected behavior, and attackers can abuse it to perform advanced phishing attacks that rely on the victim’s trust in the domain on which mod_auth_openidc is running.
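
The difference can be reproduced with two deliberately simplified host extractors. Both functions below are hypothetical models written for this example, not the actual code of libapr or of any browser: one ends the authority at /, ? or # and keeps what follows the last @, while the other also ends the authority at a backslash, as the WHATWG specification does for special schemes such as https.

```python
def _host(url: str, terminators: str) -> str:
    """Return the host of `url`, ending the authority at the first terminator."""
    rest = url.split("://", 1)[1]
    end = min((i for i in (rest.find(t) for t in terminators) if i != -1),
              default=len(rest))
    authority = rest[:end]
    hostport = authority.rsplit("@", 1)[-1]  # keep what follows the last '@'
    return hostport.split(":", 1)[0]         # drop an optional port


def host_last_at(url: str) -> str:
    """Simplified RFC-style model: a backslash is an ordinary character."""
    return _host(url, "/?#")


def host_whatwg_like(url: str) -> str:
    """Simplified WHATWG-style model: a backslash ends the authority like '/'."""
    return _host(url, "/\\?#")


url = "https://evil.destination.tld\\@host.tld/"
print(host_last_at(url))      # host.tld
print(host_whatwg_like(url))  # evil.destination.tld
```

The validator sees the first answer and the browser acts on the second, which is all an open redirect needs.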

Patch

Since migrating to a WHATWG-compliant URL parser would require significant changes, the maintainers of mod_auth_openidc decided to add a special case that replaces every backslash with a slash (69cb206):

--- a/src/mod_auth_openidc.c
+++ b/src/mod_auth_openidc.c
@@ -2920,12 +2920,21 @@ static int oidc_handle_logout_backchannel(request_rec *r, oidc_cfg *cfg) {
	 return rc;
 }
 
+#define OIDC_MAX_URL_LENGTH DEFAULT_LIMIT_REQUEST_LINE * 2
+
 static apr_byte_t oidc_validate_redirect_url(request_rec *r, oidc_cfg *c,
-   	 const char *url, apr_byte_t restrict_to_host, char **err_str,
+   	 const char *redirect_to_url, apr_byte_t restrict_to_host, char **err_str,
		 char **err_desc) {
	 apr_uri_t uri;
	 const char *c_host = NULL;
	 apr_hash_index_t *hi = NULL;
+    size_t i = 0;
+    char *url = apr_pstrndup(r->pool, redirect_to_url, OIDC_MAX_URL_LENGTH);
+
+    // replace potentially harmful backslashes with forward slashes
+    for (i = 0; i < strlen(url); i++)
+   	 if (url[i] == '\\')
+   		 url[i] = '/';
 
	 if (apr_uri_parse(r->pool, url, &uri) != APR_SUCCESS) {
		 *err_str = apr_pstrdup(r->pool, "Malformed URL");
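
The same mitigation can be sketched in Python. This illustrates the idea behind the patch and is not a port of it: redirect_host and MAX_URL_LENGTH are hypothetical names, the length limit is an arbitrary value, and urllib.parse stands in for the parser.

```python
from typing import Optional
from urllib.parse import urlsplit

MAX_URL_LENGTH = 16384  # arbitrary limit chosen for this sketch


def redirect_host(redirect_to_url: str) -> Optional[str]:
    """Return the host seen after turning backslashes into forward slashes."""
    url = redirect_to_url[:MAX_URL_LENGTH].replace("\\", "/")
    return urlsplit(url).hostname


print(redirect_host("https://evil.destination.tld\\@host.tld/"))
# evil.destination.tld
```

Once the backslash has become a slash, the validator and the browser agree on where the authority ends, so a plain hostname comparison rejects the malicious value.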

What’s in My Parser?

I looked at the most common URL parsers in each ecosystem and classified them by whether they follow the WHATWG specification or the RFCs (simplified to RFC 3986 in the table below). Keep in mind that even when parsers claim to comply with these standards, their implementations may differ slightly, and the underlying functions may rely on different parsers.

Language   Parser              Claims to follow…               http://a.tld\@b.tld
PHP        curl                RFC 3986 (with additions)       b.tld
PHP        parse_url           RFC 3986, but not completely    b.tld
Node.js    url.parse           WHATWG                          a.tld
Java       java.net.URL        RFC 3986                        b.tld
Go         net/url             RFC 3986                        Error: invalid userinfo
Ruby       URI                 RFC 3986                        Exception
Python 3   urllib              RFC 3986                        a.tld\@b.tld
Python 3   urllib3 / requests  RFC 3986                        a.tld

I was surprised by some of these results:

  • Node.js chose to be WHATWG-compliant for compatibility with browsers, and points developers to its legacy API if they want the “outdated” behavior;
  • Ruby and Go do not accept ambiguous data; they raise an error instead;
  • Python’s urllib and urllib3 disagree with each other.

The risk is even greater in microservice architectures, where components written in different languages may exchange data or be chained together (for example, a Go reverse proxy in front of a Python backend). Thorough validation of the data won’t always help: after all, both parsers see a “valid” URL.

Compare URL Parsers

Let’s try to rediscover this quirk using differential testing, even though the approach is biased here because we already know that we are comparing two parsers that follow different specifications. The idea is to generate random test cases and parse them with our two parsers:

  1. libapr, as used by mod_auth_openidc;
  2. A parser that follows the WHATWG specification, to replicate the behavior of a web browser. For example, the Python package whatwg-url avoids the hassle of interfacing with this component of a browser’s vast code base, at the expense of possibly introducing quirks of its own.

If the two libraries produce different output for the same input, we are facing a parsing differential. The only drawback is that this can yield results that are not always security-relevant, so more precise checks may have to be added progressively to reduce the burden of the triage phase.

I decided to use GitLab’s pythonfuzz to simplify the creation of my test harness. Coverage guidance is not that useful in this case, and a simple for loop over two bytes would have been sufficient.
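
To illustrate how little is needed, here is a minimal sketch of such a for loop in Python. It does not use libapr or whatwg-url: Python's urllib.parse plays the role of the RFC-style parser, and host_whatwg_like is a hypothetical, heavily simplified stand-in for a browser that only models the backslash rule discussed earlier.

```python
import re
from itertools import product
from urllib.parse import urlsplit

MY_DOMAIN = "evil.tld"
VALID_DOMAIN = "good.tld"
PRINTABLE = [chr(c) for c in range(0x21, 0x7F)]  # printable ASCII, no space


def host_whatwg_like(url: str) -> str:
    """Hypothetical browser model: '/', '\\', '?' and '#' all end the authority."""
    authority = re.split(r"[/\\?#]", url.split("://", 1)[1], maxsplit=1)[0]
    return authority.rsplit("@", 1)[-1].split(":", 1)[0]


def find_differentials() -> list:
    """Try every two-character payload and keep the security-relevant mismatches."""
    found = []
    for a, b in product(PRINTABLE, repeat=2):
        payload = a + b
        testcase = "http://" + MY_DOMAIN + payload + VALID_DOMAIN
        try:
            host_rfc = urlsplit(testcase).hostname
        except ValueError:
            continue  # the parser rejected the input: nothing to compare
        # Flag only the dangerous case: the validator sees the valid domain
        # while the browser model would go to the attacker's domain.
        if host_rfc == VALID_DOMAIN and host_whatwg_like(testcase) == MY_DOMAIN:
            found.append(payload)
    return found


print(find_differentials())
```

The condition in the loop is the triage filter mentioned above: it ignores harmless disagreements and reports only the payloads that could turn into an open redirect.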

Testing for parser differential bugs is important in modern architectures, as they often involve multiple parsers for the same specification. For example, a reverse proxy may make a decision based on an incoming request while the application behind it interprets that request differently. A great example of the impact of a similar bug on GitLab was documented by Joern Schneeweisz (“How to exploit parser differentials”).

As you might already expect, libapr is a C library and whatwg-url is written in Python, so my team and I needed to interface both libraries in a single test harness using CFFI. We generated the structures needed by apr_uri_parse using bindgen, then added simple checks that detect any security-relevant discrepancy and raise an exception when one is found.

For example, the team simply inserted a random payload between the intended domain and an unintended domain, and raised an exception if libapr saw the intended one while whatwg-url saw the unintended one:

MY_DOMAIN = b'evil.tld'
VALID_DOMAIN = b'good.tld'

def fuzz(buf):
    for testcase in [
        b'http://' + VALID_DOMAIN + buf + MY_DOMAIN,
        b'http://' + MY_DOMAIN + buf + VALID_DOMAIN,
    ]:
        # [...]
        apr.apr_initialize()
        apr.apr_pool_create_ex(pool_p, ffi.NULL, ffi.NULL, ffi.NULL)
        if apr.apr_uri_parse(pool_p[0], uri, res) == 0 and res.hostname != ffi.NULL:
            res_apr = normalize(ffi.string(res.hostname))
            if res_apr == VALID_DOMAIN.decode('ascii') and MY_DOMAIN.decode('ascii') in res_whatwg and b'\x00' not in testcase:
                print(f"Found! {res_apr=} vs {res_whatwg=}, {testcase=}")
                raise Exception()

Running this harness for a few seconds yields the same sequence that my team found in the first part of this article.

$ python3 ./whatwg_fuzz.py
#0 READ units: 1
#1 NEW     cov: 0 corp: 1 exec/s: 4 rss: 37.83984375 MB
[...]
#1156 NEW     cov: 1844 corp: 14 exec/s: 284 rss: 45.890625 MB
Found! res_apr='good.tld' vs res_whatwg='evil.tld', testcase=b'http://evil.tld\\@good.tld'
sample was written to crash-a5c892850b7fa58987e5a7d039b84c1e0b8a8c2a7e1a5ff4dabd427c182ba81e
sample = 5c40
$ cat crash-a5c892850b7fa58987e5a7d039b84c1e0b8a8c2a7e1a5ff4dabd427c182ba81e
\@

It’s certainly an over-engineered example of fuzzing for parser differentials, but it remains simple enough to be implemented in minutes during development or security research.

Timeline

Date        Action
2021-07-22  We report two bugs to the maintainers of mod_auth_openidc.
2021-07-22  The vendor acknowledges the vulnerabilities.
2021-07-22  mod_auth_openidc 2.4.9 is released, and GitHub assigns CVE-2021-32786 to the issue.

In this article, I presented a real-world example of a parser differential bug, a class of bugs that is very common and easy to identify across applications. I also looked at commonly used URL parser libraries and how bugs like this one affect them. The lesson I take from it is that rejecting ambiguous input is safer than trying to parse it and getting it wrong.
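
One way to apply that lesson is to refuse anything parsers are known to disagree on before parsing at all. The sketch below is a hypothetical helper, not taken from mod_auth_openidc; the set of rejected characters is an assumption made for illustration and is certainly not exhaustive.

```python
def reject_ambiguous(url: str) -> str:
    """Return `url` unchanged, or raise ValueError if it looks ambiguous."""
    if "\\" in url:
        raise ValueError("backslash in URL")
    # Spaces, CR, LF, tabs and other control characters are handled
    # inconsistently across parsers, so refuse them as well.
    if any(ord(ch) <= 0x20 or ord(ch) == 0x7F for ch in url):
        raise ValueError("control character or space in URL")
    return url
```

Raising an error, as Ruby and Go do in the table above, leaves no room for two components to interpret the same value differently.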

I also demonstrated that automating the discovery of such problems is a relatively easy task for developers and security researchers. The \@ sequence is also worth keeping in mind when looking for open redirects and SSRF vulnerabilities during black-box testing! This is just one example, and many more quirks are left to discover as an exercise!

I’d like to thank the maintainers of mod_auth_openidc, who acknowledged our report and fixed the issue in less than 24 hours.
