Constrain regex request-target matching #69

pgjones · 2018-07-30T18:42:35Z

This constrains the matching to the printable ASCII characters that
are not the space ' ' character i.e. between '!' (65, 0x21) and '~'
(126, 0x7e). Doing so will mean that a request containing bytes
outside of this range will not match and instead raise an error.

This is based on my understanding of RFC 3986.

I think I've matched the style of the file, although I'm happy for this to be edited however to fit.

codecov-io · 2018-07-30T18:44:45Z

Codecov Report

Merging #69 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master    #69   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          20     20           
  Lines         905    910    +5     
  Branches      175    175           
=====================================
+ Hits          905    910    +5

Impacted Files	Coverage Δ
h11/tests/test_events.py	`100% <ø> (ø)`	⬆️
h11/tests/test_io.py	`100% <ø> (ø)`	⬆️
h11/_abnf.py	`100% <100%> (ø)`	⬆️
h11/_events.py	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eec3423...3228bd2. Read the comment docs.

njsmith · 2018-07-30T23:50:34Z

Interesting idea. It's tricky to know how much validation to do at this level – there's what the spec says, and then there's what applications expect to work. See #57 for a similar discussion about header values.

For outgoing requests: I would certainly hope that no one depends on sending raw utf8 or anything like that – that's what url-encoding is for. Though I suppose it's possible that there's some server out there that requires non-ascii URLs. But at least in this case I'm pretty sure browsers do reliably do percent encoding, unlike with headers where they happily send and receive totally non-compliant values.

For incoming requests: servers often have a bit more room to enforce requirements. I'm curious what other servers do here. It looks like nginx does allow bytes with the high bit set.

I think tentatively I'm leaning towards allowing this, but with the warning that we might later find we have to relax the checks again if it causes problems.

It would be good to have some tests to make sure this being enforced in both incoming and outgoing requests.

pgjones · 2018-08-02T20:44:15Z

Thanks. I've added testing and validation on the outgoing and testing for the incoming. Looks as if the Python stdlib accepts Latin-1 as does gunicorn. I've re-read RFC 3986 and I think pchar is the relevant abnf (should I call it that in the _abnf file?) and is defined as:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

so I think latin-1 is not the spec correct thing to do. Not sure if I should change though.

njsmith

I'm not quite following your comment about latin-1. Doesn't latin-1, in this context, mean "accept any byte sequence, don't do any validation"?

A few other nitpicks:

njsmith · 2018-08-03T00:39:40Z

h11/_events.py

@@ -55,6 +58,10 @@ def __init__(self, **kwargs):
                if not isinstance(self.status_code, int):
                    raise LocalProtocolError("status code must be integer")

+        if "target" in self.__dict__:
+            validate(request_target_re, self.__dict__["target"],


You can just write self.target here, no need to use __dict__.

Also, this code should move to Request._validate – _EventBundle.__init__ is where we put code that runs on multiple different event types, while fields that only show up in one event type get validated in that type's _validate method. (This is definitely confusing, sorry.)

njsmith · 2018-08-03T00:40:09Z

h11/tests/test_events.py

@@ -94,6 +94,13 @@ def test_events():
            headers=[("Host", "a"), ("Foo", "asd\x01\x02\x7f")],
            http_version="1.0")

+    # Request target is validated
+    for bad_char in b"\x20\x7f\xee":


Can you add \x00 to the list of test characters? That's the one most likely to mess people up if we let it through.

njsmith · 2018-08-03T00:40:21Z

h11/tests/test_io.py

@@ -388,6 +388,14 @@ def test_reject_garbage_in_header_line():
           b"Host: foo\x00bar\r\n\r\n",
           None)

+def test_reject_non_vchar_in_path():
+    for bad_char in b"\x20\x7f\xee":


Same comment here.

sigmavirus24 · 2018-08-03T01:08:05Z

For outgoing requests: I would certainly hope that no one depends on sending raw utf8 or anything like that – that's what url-encoding is for. Though I suppose it's possible that there's some server out there that requires non-ascii URLs.

Requests has absolutely had bug reports because servers returned non-ascii URLs. Further, I've heard of servers putting raw utf8 characters into headers where the header value is JSON. Not to say that you have to support those, but perhaps there needs to be an escape hatch for them?

njsmith · 2018-08-03T02:02:27Z

@sigmavirus24 This PR is specifically about the "target" field in a request (basically the request path, the thing that comes after GET on the request line). So it's only ever generated by the client and parsed by the server.

Headers are a bit different: for headers, the spec actually requires allowing bytes outside the ascii range and only forbids the ascii "control characters" (stuff like NUL and DEL and form-feed). And then as we discovered in #57, browsers actually allow control characters too, and popular software relies on this, so h11 ended up only forbidding NUL and exotic whitespace (#68). So utf8 in Location: or JSON values is fine.

The cases where this PR could cause a problem are:

writing a server that needs to interoperate with a client that sends invalid binary characters (like utf-8) in the target field, without doing proper %-encoding
writing a client that needs to interoperate with a server that insists on getting raw binary characters (like utf8) in the target field, and won't accept the proper %-encoded equivalent

It's entirely possible that something like this will come up of course, because, well, software.

Does requests provide a way to disable %-encoding and pass through invalid characters as part of the target? Have people asked for such a feature?

This constrains the matching to the printable ASCII characters that are not the space ' ' character i.e. between '!' (65, 0x21) and '~' (126, 0x7e). Doing so will mean that a request containing bytes outside of this range will not match and instead raise an error. This is based on my understanding of RFC 3986.

pgjones · 2018-08-03T07:57:05Z

@njsmith Regarding latin-1, yes I was wondering if h11 should follow those other examples. Although I actually prefer h11 for the validation it does and think this is the correct thing to do.

I've updated to fix the nitpicks.

pgjones · 2018-08-04T10:16:58Z

I think I understand the latin-1 usage now, PEP 3333 passes on the encoding issues to the application using latin-1.

njsmith · 2018-08-13T06:52:26Z

OK, code looks good and my comment above has a 👍 from @sigmavirus24, so I guess this is good to go – thanks!

pgjones mentioned this pull request Jul 30, 2018

Nikto Scan Causes ascii decode issues. pgjones/hypercorn#2

Closed

pgjones force-pushed the master branch 2 times, most recently from c86827e to a0afbca Compare August 2, 2018 20:26

njsmith reviewed Aug 3, 2018

View reviewed changes

pgjones force-pushed the master branch from a0afbca to 3228bd2 Compare August 3, 2018 07:55

njsmith merged commit 870fb83 into python-hyper:master Aug 13, 2018

njsmith mentioned this pull request Apr 24, 2019

Ponder this SSRF talk #50

Closed

sethmlarson mentioned this pull request Apr 30, 2019

CRLF injection vulnerability urllib3/urllib3#1553

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Constrain regex request-target matching #69

Constrain regex request-target matching #69

Uh oh!

pgjones commented Jul 30, 2018

Uh oh!

codecov-io commented Jul 30, 2018 •

edited

Loading

Uh oh!

njsmith commented Jul 30, 2018

Uh oh!

pgjones commented Aug 2, 2018

Uh oh!

njsmith left a comment

Uh oh!

njsmith Aug 3, 2018

Uh oh!

njsmith Aug 3, 2018

Uh oh!

njsmith Aug 3, 2018

Uh oh!

sigmavirus24 commented Aug 3, 2018

Uh oh!

njsmith commented Aug 3, 2018

Uh oh!

pgjones commented Aug 3, 2018 •

edited

Loading

Uh oh!

pgjones commented Aug 4, 2018

Uh oh!

njsmith commented Aug 13, 2018

Uh oh!

Uh oh!

Constrain regex request-target matching #69

Constrain regex request-target matching #69

Uh oh!

Conversation

pgjones commented Jul 30, 2018

Uh oh!

codecov-io commented Jul 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

njsmith commented Jul 30, 2018

Uh oh!

pgjones commented Aug 2, 2018

Uh oh!

njsmith left a comment

Choose a reason for hiding this comment

Uh oh!

njsmith Aug 3, 2018

Choose a reason for hiding this comment

Uh oh!

njsmith Aug 3, 2018

Choose a reason for hiding this comment

Uh oh!

njsmith Aug 3, 2018

Choose a reason for hiding this comment

Uh oh!

sigmavirus24 commented Aug 3, 2018

Uh oh!

njsmith commented Aug 3, 2018

Uh oh!

pgjones commented Aug 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pgjones commented Aug 4, 2018

Uh oh!

njsmith commented Aug 13, 2018

Uh oh!

Uh oh!

codecov-io commented Jul 30, 2018 •

edited

Loading

pgjones commented Aug 3, 2018 •

edited

Loading