BUG: Gracefully handle all utf-8 characters in json urls #17933

Closed · wants to merge 1 commit
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -998,6 +998,7 @@ I/O
 - Bug in :meth:`DataFrame.to_html` in which there was no validation of the ``justify`` parameter (:issue:`17527`)
 - Bug in :func:`HDFStore.select` when reading a contiguous mixed-data table featuring VLArray (:issue:`17021`)
 - Bug in :func:`to_json` where several conditions (including objects with unprintable symbols, objects with deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (:issue:`14256`)
+- Bug in :func:`read_json` where utf-8 characters were not encoded properly when reading json data from a url (:issue:`17918`)
Contributor:
I suspect this is generally true for urls (not just for read_json), so maybe amend the note to say so. If you can add a test for more readers, that would be great.


Plotting
^^^^^^^^
5 changes: 3 additions & 2 deletions pandas/io/common.py
@@ -29,7 +29,7 @@


 if compat.PY3:
-    from urllib.request import urlopen, pathname2url
+    from urllib.request import urlopen, pathname2url, quote
     _urlopen = urlopen
     from urllib.parse import urlparse as parse_url
     from urllib.parse import (uses_relative, uses_netloc, uses_params,
@@ -38,7 +38,7 @@
     from http.client import HTTPException  # noqa
 else:
     from urllib2 import urlopen as _urlopen
-    from urllib import urlencode, pathname2url  # noqa
+    from urllib import urlencode, pathname2url, quote  # noqa
     from urlparse import urlparse as parse_url
     from urlparse import uses_relative, uses_netloc, uses_params, urljoin
     from urllib2 import URLError  # noqa
@@ -187,6 +187,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
     filepath_or_buffer = _stringify_path(filepath_or_buffer)

     if _is_url(filepath_or_buffer):
+        filepath_or_buffer = quote(filepath_or_buffer, safe=';/?:@&=+$,')
Contributor:
is there a reason the default is not enough here?

Contributor:
if yes, can you expand the test that exercises that (with another case)?

         req = _urlopen(filepath_or_buffer)
         content_encoding = req.headers.get('Content-Encoding', None)
         if content_encoding == 'gzip':
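The reviewer's question about the default can be illustrated with a small sketch (using a hypothetical URL, not one from the patch): `quote`'s default `safe='/'` would also percent-encode the reserved characters that give a URL its structure, whereas the explicit safe set in the patch preserves them and only escapes characters like the space.

```python
from urllib.parse import quote

# Hypothetical URL with a space in a query parameter value.
url = 'https://example.com/data?x=1&y=a b'

# Default safe='/': ':', '?', '&', '=' are percent-encoded too, mangling
# the URL structure.
print(quote(url))
# -> https%3A//example.com/data%3Fx%3D1%26y%3Da%20b

# The safe set used in the patch keeps reserved characters intact; only
# the space becomes %20.
print(quote(url, safe=';/?:@&=+$,'))
# -> https://example.com/data?x=1&y=a%20b
```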
10 changes: 9 additions & 1 deletion pandas/tests/io/json/test_pandas.py
@@ -845,12 +845,20 @@ def test_round_trip_exception_(self):
                                      index=df.index, columns=df.columns), df)

     @network
-    def test_url(self):
+    def test_url_encoded(self):
         url = 'https://api.github.com/repos/pandas-dev/pandas/issues?per_page=5'  # noqa
         result = read_json(url, convert_dates=True)
         for c in ['created_at', 'closed_at', 'updated_at']:
             assert result[c].dtype == 'datetime64[ns]'

+    @network
+    def test_url_unencoded(self):
+        url = ('https://api.github.com/repos/pandas-dev/pandas/issues?per_pag'
+               'e=5&test=fake parameter')
+        result = read_json(url, convert_dates=True)
+        for c in ['created_at', 'closed_at', 'updated_at']:
+            assert result[c].dtype == 'datetime64[ns]'

     def test_timedelta(self):
         converter = lambda x: pd.to_timedelta(x, unit='ms')

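One subtlety that the already-encoded test case touches on: quoting with this safe set is not idempotent, because `%` is not in the safe set. A sketch (with a hypothetical URL) of what happens if an already-escaped URL were quoted again:

```python
from urllib.parse import quote

safe = ';/?:@&=+$,'

# First pass: the space in this hypothetical URL is escaped to %20.
once = quote('https://example.com/a b', safe=safe)
print(once)   # -> https://example.com/a%20b

# Second pass: '%' is not safe, so the existing escape is re-encoded.
twice = quote(once, safe=safe)
print(twice)  # -> https://example.com/a%2520b
```

This is why a test against a URL that is already percent-encoded (and contains a literal `%`) would be a useful additional case.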