
BUG: to_json()/read_json() can't correctly dump/load numbers requiring >15 digits of precision #38437


Closed
mjuric opened this issue Dec 13, 2020 · 5 comments · Fixed by #54100
Labels
Docs IO JSON read_json, to_json, json_normalize

Comments

@mjuric

mjuric commented Dec 13, 2020

Code Sample, a copy-pastable example

Demonstration of the serialization issue:

import pandas as pd

df = pd.DataFrame([0.9884619112598676])
js = df.to_json(double_precision=15)   # 15 is the maximum; double_precision >= 16 raises ValueError

print(f"orig:           { df[0][0]} ({ df[0].dtypes})")
print(f"JSON: {js}")

Output:

orig:           0.9884619112598676 (float64)
JSON: {"0":{"0":0.988461911259868}}

Demonstration that deserialization silently disregards the last digit:

import numpy as np
import pandas as pd

js = '{"0":{"0":0.9884619112598676}}'
df = pd.read_json(js)
flt = np.float64("0.9884619112598676")
print(f"  JSON: {js}")
print(f" numpy:           {flt}")
print(f"Pandas:           {df[0][0]}  ({df[0].dtypes})")

Output:

  JSON: {"0":{"0":0.9884619112598676}}
 numpy:           0.9884619112598676
Pandas:           0.988461911259867  (float64)

Problem description

64-bit floating-point numbers require up to 17 significant decimal digits to be fully round-tripped to a textual representation and back (e.g., see https://stackoverflow.com/questions/6118231/why-do-i-need-17-significant-digits-and-not-16-to-represent-a-double/). Pandas' ujson-based encoder and decoder cut them off at 15 digits, causing loss of precision. This introduces inconsistencies when a pandas DataFrame is transmitted from point A to point B via JSON versus other serializations (in our case, the issue cropped up while validating a REST API for a near-Earth asteroid orbit computation service).
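
As a minimal illustration (plain Python, no pandas involved), 15 significant digits are not enough to round-trip this value, while 17 are:

x = 0.9884619112598676

# 15 significant digits: the reparsed value differs from the original
assert float(f"{x:.15g}") != x      # "0.988461911259868"

# 17 significant digits (or repr, which picks the shortest
# round-trippable form) reproduce the original value exactly
assert float(f"{x:.17g}") == x
assert float(repr(x)) == x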

I traced this down to an old version of ultrajsonenc.c that was imported into the pandas codebase and forces this cut. Modern versions of ujson don't seem to have this limitation (and do away with the double_precision argument to ujson.dump altogether) -- e.g., see the upstream ultrajson repository at https://github.com/ultrajson/ultrajson.

Expected Output

Modern ujson seems to handle this fine, keeping the required precision:

import ujson
print(ujson.__version__)
print(ujson.dumps(0.9884619112598676))

>>> 4.0.1
>>> 0.9884619112598676

A solution may be to update the version shipped with Pandas.
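
Until then, a possible interim workaround (just a sketch, not an official pandas API) is to route the round trip through the standard-library json module, whose encoder and decoder use repr-style shortest round-trip formatting:

import json
import pandas as pd

df = pd.DataFrame([0.9884619112598676])

# Write: to_dict() + stdlib json keeps full float precision
js = json.dumps(df.to_dict())
print(js)  # {"0": {"0": 0.9884619112598676}}

# Read: stdlib json + the DataFrame constructor bypasses pandas' ujson
# reader; note that the keys come back as strings after the round trip
df2 = pd.DataFrame(json.loads(js))
assert df2["0"]["0"] == df[0][0]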

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.8.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.1
setuptools : 49.6.0.post20201009
Cython : None
pytest : 6.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

@mjuric mjuric added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 13, 2020
@mzeitlin11 mzeitlin11 added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 24, 2020
@mzeitlin11 mzeitlin11 added this to the Contributions Welcome milestone Dec 24, 2020
@mzeitlin11
Member

Thanks @mjuric for the detailed report!

Not sure this is an easy fix ... a naive (and probably not smart even if it worked) hope of just changing the header definition of JSON_DOUBLE_MAX_DECIMALS to 17 does not fix this issue. Pulling over an updated version of ujson seems reasonable, but would require reintegration of pandas-specific changes to the code. Plus any API change required by the updated ujson version would require deprecations.

@mzeitlin11
Member

Looks like this improvement in ujson comes from doing the conversion with https://github.com/google/double-conversion, so I think using the updated ujson code would also require taking on another dependency.

@mjuric
Author

mjuric commented Jan 4, 2021

Thanks @mzeitlin11 for looking into this!

I see your point about the update being non-trivial (and requiring deprecations) :(.

Could a temporary workaround be to document the current behavior? It took me a while to track down that the loss of precision on read from JSON comes from pandas -- that isn't documented right now. Setting double_precision too high on write at least throws an exception.

@mzeitlin11 mzeitlin11 added Warnings Warnings that appear or should be added to pandas and removed Bug labels Jan 17, 2021
@mzeitlin11
Member

mzeitlin11 commented Jan 17, 2021

Definitely! A PR to document this limitation would be welcome.

@mzeitlin11 mzeitlin11 added Docs and removed Warnings Warnings that appear or should be added to pandas labels Jan 24, 2021
@sappersapper

sappersapper commented Sep 3, 2022

I encountered a case of this bug with a much larger truncation error:

import pandas as pd

pd.Series([3.1415926535897933e-15]).to_json(double_precision=15)

Output:
'{"0":0.000000000000003}'

pandas version: 1.4.2/1.3.4
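
This looks consistent with the limitation above: double_precision appears to cap the number of digits after the decimal point rather than the number of significant digits, so small-magnitude values lose nearly all of their precision (3.1415926535897933e-15 truncated to 15 decimal places is 0.000000000000003). For comparison, the same value survives the standard-library encoder untouched:

import json

# The stdlib encoder uses the shortest round-trippable representation
print(json.dumps(3.1415926535897933e-15))  # 3.1415926535897933e-15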
