Skip to content

Conversation

pgjones
Copy link
Member

@pgjones pgjones commented Jul 30, 2018

This constrains the matching to the printable ASCII characters that
are not the space ' ' character i.e. between '!' (65, 0x21) and '~'
(126, 0x7e). Doing so will mean that a request containing bytes
outside of this range will not match and instead raise an error.

This is based on my understanding of RFC 3986.

I think I've matched the style of the file, although I'm happy for this to be edited however to fit.

@codecov-io
Copy link

codecov-io commented Jul 30, 2018

Codecov Report

Merging #69 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master    #69   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          20     20           
  Lines         905    910    +5     
  Branches      175    175           
=====================================
+ Hits          905    910    +5
Impacted Files Coverage Δ
h11/tests/test_events.py 100% <ø> (ø) ⬆️
h11/tests/test_io.py 100% <ø> (ø) ⬆️
h11/_abnf.py 100% <100%> (ø) ⬆️
h11/_events.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eec3423...3228bd2. Read the comment docs.

@njsmith
Copy link
Member

njsmith commented Jul 30, 2018

Interesting idea. It's tricky to know how much validation to do at this level – there's what the spec says, and then there's what applications expect to work. See #57 for a similar discussion about header values.

For outgoing requests: I would certainly hope that no one depends on sending raw utf8 or anything like that – that's what url-encoding is for. Though I suppose it's possible that there's some server out there that requires non-ascii URLs. But at least in this case I'm pretty sure browsers do reliably do percent encoding, unlike with headers where they happily send and receive totally non-compliant values.

For incoming requests: servers often have a bit more room to enforce requirements. I'm curious what other servers do here. It looks like nginx does allow bytes with the high bit set.

I think tentatively I'm leaning towards allowing this, but with the warning that we might later find we have to relax the checks again if it causes problems.

It would be good to have some tests to make sure this being enforced in both incoming and outgoing requests.

@pgjones pgjones force-pushed the master branch 2 times, most recently from c86827e to a0afbca Compare August 2, 2018 20:26
@pgjones
Copy link
Member Author

pgjones commented Aug 2, 2018

Thanks. I've added testing and validation on the outgoing and testing for the incoming. Looks as if the Python stdlib accepts Latin-1 as does gunicorn. I've re-read RFC 3986 and I think pchar is the relevant abnf (should I call it that in the _abnf file?) and is defined as:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

so I think latin-1 is not the spec correct thing to do. Not sure if I should change though.

Copy link
Member

@njsmith njsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite following your comment about latin-1. Doesn't latin-1, in this context, mean "accept any byte sequence, don't do any validation"?

A few other nitpicks:

h11/_events.py Outdated
@@ -55,6 +58,10 @@ def __init__(self, **kwargs):
if not isinstance(self.status_code, int):
raise LocalProtocolError("status code must be integer")

if "target" in self.__dict__:
validate(request_target_re, self.__dict__["target"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just write self.target here, no need to use __dict__.

Also, this code should move to Request._validate_EventBundle.__init__ is where we put code that runs on multiple different event types, while fields that only show up in one event type get validated in that type's _validate method. (This is definitely confusing, sorry.)

@@ -94,6 +94,13 @@ def test_events():
headers=[("Host", "a"), ("Foo", "asd\x01\x02\x7f")],
http_version="1.0")

# Request target is validated
for bad_char in b"\x20\x7f\xee":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add \x00 to the list of test characters? That's the one most likely to mess people up if we let it through.

@@ -388,6 +388,14 @@ def test_reject_garbage_in_header_line():
b"Host: foo\x00bar\r\n\r\n",
None)

def test_reject_non_vchar_in_path():
for bad_char in b"\x20\x7f\xee":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

@sigmavirus24
Copy link

For outgoing requests: I would certainly hope that no one depends on sending raw utf8 or anything like that – that's what url-encoding is for. Though I suppose it's possible that there's some server out there that requires non-ascii URLs.

Requests has absolutely had bug reports because servers returned non-ascii URLs. Further, I've heard of servers putting raw utf8 characters into headers where the header value is JSON. Not to say that you have to support those, but perhaps there needs to be an escape hatch for them?

@njsmith
Copy link
Member

njsmith commented Aug 3, 2018

@sigmavirus24 This PR is specifically about the "target" field in a request (basically the request path, the thing that comes after GET on the request line). So it's only ever generated by the client and parsed by the server.

Headers are a bit different: for headers, the spec actually requires allowing bytes outside the ascii range and only forbids the ascii "control characters" (stuff like NUL and DEL and form-feed). And then as we discovered in #57, browsers actually allow control characters too, and popular software relies on this, so h11 ended up only forbidding NUL and exotic whitespace (#68). So utf8 in Location: or JSON values is fine.

The cases where this PR could cause a problem are:

  • writing a server that needs to interoperate with a client that sends invalid binary characters (like utf-8) in the target field, without doing proper %-encoding

  • writing a client that needs to interoperate with a server that insists on getting raw binary characters (like utf8) in the target field, and won't accept the proper %-encoded equivalent

It's entirely possible that something like this will come up of course, because, well, software.

Does requests provide a way to disable %-encoding and pass through invalid characters as part of the target? Have people asked for such a feature?

This constrains the matching to the printable ASCII characters that
are not the space ' ' character i.e. between '!' (65, 0x21) and '~'
(126, 0x7e). Doing so will mean that a request containing bytes
outside of this range will not match and instead raise an error.

This is based on my understanding of RFC 3986.
@pgjones
Copy link
Member Author

pgjones commented Aug 3, 2018

@njsmith Regarding latin-1, yes I was wondering if h11 should follow those other examples. Although I actually prefer h11 for the validation it does and think this is the correct thing to do.

I've updated to fix the nitpicks.

@pgjones
Copy link
Member Author

pgjones commented Aug 4, 2018

I think I understand the latin-1 usage now, PEP 3333 passes on the encoding issues to the application using latin-1.

@njsmith
Copy link
Member

njsmith commented Aug 13, 2018

OK, code looks good and my comment above has a 👍 from @sigmavirus24, so I guess this is good to go – thanks!

@njsmith njsmith merged commit 870fb83 into python-hyper:master Aug 13, 2018
@njsmith njsmith mentioned this pull request Apr 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants