-
Notifications
You must be signed in to change notification settings - Fork 80
Constrain regex request-target matching #69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #69 +/- ##
=====================================
Coverage 100% 100%
=====================================
Files 20 20
Lines 905 910 +5
Branches 175 175
=====================================
+ Hits 905 910 +5
Continue to review full report at Codecov.
|
Interesting idea. It's tricky to know how much validation to do at this level – there's what the spec says, and then there's what applications expect to work. See #57 for a similar discussion about header values. For outgoing requests: I would certainly hope that no one depends on sending raw utf8 or anything like that – that's what url-encoding is for. Though I suppose it's possible that there's some server out there that requires non-ascii URLs. But at least in this case I'm pretty sure browsers do reliably do percent encoding, unlike with headers where they happily send and receive totally non-compliant values. For incoming requests: servers often have a bit more room to enforce requirements. I'm curious what other servers do here. It looks like nginx does allow bytes with the high bit set. I think tentatively I'm leaning towards allowing this, but with the warning that we might later find we have to relax the checks again if it causes problems. It would be good to have some tests to make sure this being enforced in both incoming and outgoing requests. |
c86827e
to
a0afbca
Compare
Thanks. I've added testing and validation on the outgoing and testing for the incoming. Looks as if the Python stdlib accepts Latin-1 as does gunicorn. I've re-read RFC 3986 and I think
so I think latin-1 is not the spec correct thing to do. Not sure if I should change though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following your comment about latin-1. Doesn't latin-1, in this context, mean "accept any byte sequence, don't do any validation"?
A few other nitpicks:
h11/_events.py
Outdated
@@ -55,6 +58,10 @@ def __init__(self, **kwargs): | |||
if not isinstance(self.status_code, int): | |||
raise LocalProtocolError("status code must be integer") | |||
|
|||
if "target" in self.__dict__: | |||
validate(request_target_re, self.__dict__["target"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just write self.target
here, no need to use __dict__
.
Also, this code should move to Request._validate
– _EventBundle.__init__
is where we put code that runs on multiple different event types, while fields that only show up in one event type get validated in that type's _validate
method. (This is definitely confusing, sorry.)
h11/tests/test_events.py
Outdated
@@ -94,6 +94,13 @@ def test_events(): | |||
headers=[("Host", "a"), ("Foo", "asd\x01\x02\x7f")], | |||
http_version="1.0") | |||
|
|||
# Request target is validated | |||
for bad_char in b"\x20\x7f\xee": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add \x00
to the list of test characters? That's the one most likely to mess people up if we let it through.
h11/tests/test_io.py
Outdated
@@ -388,6 +388,14 @@ def test_reject_garbage_in_header_line(): | |||
b"Host: foo\x00bar\r\n\r\n", | |||
None) | |||
|
|||
def test_reject_non_vchar_in_path(): | |||
for bad_char in b"\x20\x7f\xee": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here.
Requests has absolutely had bug reports because servers returned non-ascii URLs. Further, I've heard of servers putting raw utf8 characters into headers where the header value is JSON. Not to say that you have to support those, but perhaps there needs to be an escape hatch for them? |
@sigmavirus24 This PR is specifically about the "target" field in a request (basically the request path, the thing that comes after Headers are a bit different: for headers, the spec actually requires allowing bytes outside the ascii range and only forbids the ascii "control characters" (stuff like NUL and DEL and form-feed). And then as we discovered in #57, browsers actually allow control characters too, and popular software relies on this, so h11 ended up only forbidding NUL and exotic whitespace (#68). So utf8 in The cases where this PR could cause a problem are:
It's entirely possible that something like this will come up of course, because, well, software. Does requests provide a way to disable %-encoding and pass through invalid characters as part of the target? Have people asked for such a feature? |
This constrains the matching to the printable ASCII characters that are not the space ' ' character i.e. between '!' (65, 0x21) and '~' (126, 0x7e). Doing so will mean that a request containing bytes outside of this range will not match and instead raise an error. This is based on my understanding of RFC 3986.
@njsmith Regarding latin-1, yes I was wondering if h11 should follow those other examples. Although I actually prefer h11 for the validation it does and think this is the correct thing to do. I've updated to fix the nitpicks. |
I think I understand the latin-1 usage now, PEP 3333 passes on the encoding issues to the application using latin-1. |
OK, code looks good and my comment above has a 👍 from @sigmavirus24, so I guess this is good to go – thanks! |
This constrains the matching to the printable ASCII characters that
are not the space ' ' character i.e. between '!' (65, 0x21) and '~'
(126, 0x7e). Doing so will mean that a request containing bytes
outside of this range will not match and instead raise an error.
This is based on my understanding of RFC 3986.
I think I've matched the style of the file, although I'm happy for this to be edited however to fit.