urllib.request.urlopen does not handle non-ASCII characters #48241
Tested on python-3.0rc1 -- Linux Fedora 9.

I wanted to make sure that python3.0 would handle URLs in different encodings. First I tried passing a UTF-8-encoded byte string:

```python
from urllib.request import urlopen
url = 'http://localhost/u/½ñ.html'
urlopen(url.encode('utf-8')).read()
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 350, in open
    req.timeout = timeout
AttributeError: 'bytes' object has no attribute 'timeout'
```

The same thing happens if I give None for the two optional arguments. Next I tried using a raw Unicode string:

```python
>>> urlopen(url).read()
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 359, in open
    response = self._open(req, data)
  File "/usr/lib/python3.0/urllib/request.py", line 377, in _open
    '_open', req)
  File "/usr/lib/python3.0/urllib/request.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.0/urllib/request.py", line 1082, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.0/urllib/request.py", line 1068, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python3.0/http/client.py", line 843, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.0/http/client.py", line 860, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.0/http/client.py", line 751, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
```

So, in python-3.0rc1, this method is badly broken. |
As I read RFC 2396, 1.5: "A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters." 2.4: "Data must be escaped if it does not have a representation using an unreserved character." So your URL string is invalid. You need to escape the characters properly. (RFC 2396 is what the HTTP RFC cites as its authority on URLs.) |
Possibly. This is a change from python-2.x's urlopen(), which escaped such characters for you. Without such a function, a whole lot of code bases will have to reinvent that escaping themselves. |
It's not immediately clear to me how an auto-quote function can be made correct in all cases. Best to think of this as a difference from 2.x. |
The purpose of such a function would be to take something that is not a valid URI and turn it into one. My first, naive thought is that if the input can be parsed by the existing URL parsing machinery, the individual components could be quoted. What are example inputs that you are concerned about? I'll see if I can come up with something. |
I'm not concerned about any particular example inputs; I was just trying to describe the general problem. On the other hand, the IRI spec (RFC 3987) is another thing we might want to consider. |
Oh, that's cool. I've been fine with this being a request for a needed feature. I think IRIs are a distraction here, though. The RFC for IRIs even specifies that they are mapped to URIs before being used in protocols that expect URIs. |
I think Toshio's usecase is important enough to deserve a fix (patches welcome). The failing call chain is urlopen -> _opener -> open -> _open -> _call_chain -> http_open -> do_open. Consider what a newbie sees:

```python
>>> from urllib.request import urlopen
>>> url = 'http://localhost/ñ.html'
>>> urlopen(url).read()
Traceback (most recent call last):
[...]
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in
position 5: ordinal not in range(128)
```

If the newbie isn't completely lost by then, how about:

```python
>>> from urllib.parse import quote
>>> urlopen(quote(url)).read()
Traceback (most recent call last):
[...]
ValueError: unknown url type: http%3A//localhost/%C3%B1.html
``` |
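As a hedged aside (my illustration, not from the original thread): the ValueError above happens because a bare quote() also escapes the ':' after the scheme. Passing safe=':/' keeps the URL structure intact while percent-encoding only the non-ASCII characters:

```python
from urllib.parse import quote

url = 'http://localhost/ñ.html'

# Keep ':' and '/' unescaped so the scheme and path separators survive;
# only the non-ASCII characters are percent-encoded (as UTF-8 by default).
escaped = quote(url, safe=':/')
print(escaped)  # http://localhost/%C3%B1.html

# urlopen(escaped).read() would now build a valid request
# (assuming a server is actually listening on localhost).
```

This is of course exactly the kind of ad-hoc workaround the thread argues callers shouldn't have to reinvent.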
This is a patch against 3.2 adding urllib.parse.quote_uri. It splits the URI into five parts (protocol, authentication, hostname, port and path), then runs urllib.parse.quote on the path and encodes the hostname to Punycode if it is not ASCII. It's not perfect, but should be usable in most cases. |
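The attached patch is not reproduced here; as a rough sketch of the same idea (my own code, not Andreas's patch), one can split the URI, IDNA-encode a non-ASCII host, and percent-quote the path:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_uri_sketch(uri):
    """Illustrative sketch of the quote_uri idea: Punycode the host,
    percent-quote the path.

    This deliberately ignores userinfo, query/fragment quoting subtleties,
    and the empty-component issue raised later in the thread.
    """
    parts = urlsplit(uri)
    host = parts.hostname or ''
    try:
        host.encode('ascii')
    except UnicodeEncodeError:
        # Non-ASCII hostname: encode its labels with IDNA (Punycode).
        host = host.encode('idna').decode('ascii')
    netloc = host if parts.port is None else '%s:%d' % (host, parts.port)
    path = quote(parts.path)  # percent-encode non-ASCII path bytes as UTF-8
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

print(quote_uri_sketch('http://bücher.example/ñ.html'))
# http://xn--bcher-kva.example/%C3%B1.html
```

Splitting first is what avoids mangling the scheme separator, which was the newbie trap shown earlier in the thread.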
bpo-9679 focuses on encoding just the DNS name. Andreas's patch just proposes a new function called quote_uri(); it would need documentation. We already have quote() and quote_plus() functions. Since it sounds like this is for IRIs (https://tools.ietf.org/html/rfc3987), would it be more appropriate to call it quote_iri()? See revision cb09fdef19f5, especially the quote(safe=...) parameter, for how I avoided the double-encoding problem. |
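To illustrate the double-encoding problem mentioned above (my example, not taken from the cited revision): re-quoting an already-encoded URL escapes the '%' signs themselves, whereas adding '%' to safe leaves existing escapes alone:

```python
from urllib.parse import quote

already = 'http://localhost/%C3%B1.html'  # 'ñ' already percent-encoded

# Naive re-quoting escapes '%' itself, corrupting the URL:
print(quote(already, safe=':/'))   # http://localhost/%25C3%25B1.html

# Treating '%' as safe avoids double encoding (at the cost of never
# escaping a literal '%' the caller actually meant):
print(quote(already, safe=':/%'))  # http://localhost/%C3%B1.html
```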
Changed the patch after pointers from vadmium. |
I’m not really an expert on non-ASCII URLs / IRIs. Maybe it is obvious to other people that this is a good general implementation, but for me to thoroughly review it I would need time to research the relevant RFCs, other implementations, suitability for the URL schemes listed at <https://docs.python.org/dev/library/urllib.parse.html>, security implications, etc. One problem with using urlunsplit() is that it strips empty URL components, e.g. quote_iri("http://example/file#") -> "http://example/file". See bpo-22852. This is highlighted by the file:///[. . .] → file:/[. . .] test case. FYI Martin Panter and vadmium are both just me, no need to get too excited. :) I just updated my settings for Rietveld (code review), so hopefully that is more obvious now. |
I believe the last time this subject was discussed the conclusion was that we really needed a full IRI module that conformed to the relevant RFCs, and that putting something on pypi would be one way to get there. Someone should research the existing packages. It might be that we need something simpler than what exists, but whatever we do should be informed by what exists, I think. |
I am suggesting that this be changed to a documentation issue. |
Agreed. The domain name would have to be encoded in Punycode, while the rest of the path uses percent-encoding, which depends on the byte representation that the server expects. It's unclear in general to which bytes, say, "ł" should translate on the server side. Documenting that we're not doing automatic encoding of either domains or path elements is the most reasonable way forward given that it's been 15 years since this issue got created. |
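To make that ambiguity concrete (an illustrative example, not from the thread): the percent-encoding of "ł" depends entirely on which charset the server expects:

```python
from urllib.parse import quote

# The same character maps to different bytes -- and therefore different
# percent-escapes -- depending on the assumed server-side encoding.
print(quote('ł'))                         # '%C5%82'  (UTF-8, the default)
print(quote('ł', encoding='iso-8859-2'))  # '%B3'     (Latin-2)
```

Since only the caller can know which of these the server understands, automatic encoding inside urlopen() would have to guess.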
@ambv Is this enough?

> Open the URL *url*, which can be either a string containing a valid, properly encoded URL, or a Request object.

I am attempting to avoid specifically calling out the encodings to use, as that may be difficult to communicate succinctly and they may change over time. |
Yes, looks good. Let's see it on a PR. |
…d Request (#103855) Co-authored-by: Łukasz Langa <[email protected]>
…pen and Request (pythonGH-103855) (cherry picked from commit 44010d0) Co-authored-by: Michael Blahay <[email protected]> Co-authored-by: Łukasz Langa <[email protected]>
…open and Request (GH-103855) (#103891) (cherry picked from commit 44010d0) Co-authored-by: Michael Blahay <[email protected]> Co-authored-by: Łukasz Langa <[email protected]>
Closes #64758 (duplicate) |