BUG: Series read_json tries to convert all column values to dates even when using keep_default_dates=True, if one column has an na value #49585
Comments
Hey, I'm a student at the University of Michigan and was looking to contribute for a course project. However, this is my first time contributing to an open-source project. Do you think I would be able to take this one?
Hey @efagerberg, I'm a new contributor to this project. Would you be able to walk me through where you think this bug might be located in the codebase?
Sure, I have not contributed to pandas myself but here is the trace I can see:
There is also a test file which you can likely use to replicate the issue and to validate that you have fixed it: pandas/pandas/tests/io/test_common.py, lines 289 to 335 in 0cebd75
Thanks @efagerberg. I have looked at the issue for a little bit and notice that we are able to solve the issue by adding in
I think this issue might actually just be with how read_json works, since its default behavior is to always have Not sure if this change is too drastic, as it might change the default behavior of pandas that is currently expected. Let me know what you think.
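For reference, here is a minimal sketch of the kind of call under discussion. It assumes the workaround is the documented `convert_dates=False` argument of `read_json`. The JSON payload and the key names `id` and `created` are made up for illustration and are not the reporter's original example.

```python
from io import StringIO

import pandas as pd

# Hypothetical payload: one null value and one date-like string.
PAYLOAD = '{"id": null, "created": "2022-01-01"}'

# Default call: convert_dates defaults to True, so pandas may try to
# parse the values as dates. The resulting dtype can differ by version.
default = pd.read_json(StringIO(PAYLOAD), typ="series")
print(default.dtype)

# Assumed workaround: switch date conversion off for the whole call.
unconverted = pd.read_json(StringIO(PAYLOAD), typ="series", convert_dates=False)
print(unconverted.dtype)
print(unconverted["created"])  # the date-like value stays a plain string
```

Note that this turns off date parsing for every value in the call, so it sidesteps the behavior rather than fixing it.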
One tricky part of just changing the default is that people upgrading from older versions may suddenly get string dates where they were previously parsed, so in that way it is not backwards compatible. It may be advisable to do more analysis of the whole series to get more signal on whether the column is a date or not. In my example it is pretty nebulous, so I would expect pandas to make fewer assumptions.
Thanks for the insight @efagerberg. After going through and debugging the code side by side for the two examples, I have noticed that pandas tries to figure out if our code is "nansafe" before going ahead and parsing the JSON string into dates or int64. Since pandas figures out that a "None" is present in the dataset, I am thinking that it goes ahead and disregards that "None", and converts the rest of the JSON. Pandas doesn't get a chance to look over the JSON data passed without the None value present. Instead of reworking the entire logic of how pandas figures out what kind of data is present within some data passed into the Currently, if I understand this may not be backwards compatible, but I think in order to:
This switch is the best solution I can propose.
That seems like a reasonable plan to me.
Sounds good. Would you be able to assign me this task or is there some way I can do it myself?
Hmm, I can't do it on my side. It seems like only maintainers would be able to do it.
take
Thanks for the report. Changing the default doesn't solve the issue, and would need a deprecation cycle anyway.
Thanks @MarcoGorelli. I am a new contributor, so I am looking for a little help on this issue. Would you have ideas or proposals about how I can go about solving it? At this point I am a little stumped. Specifically, I am having a hard time navigating the pandas code base and would really appreciate it if you could point me in the direction of where you think this issue might be located.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When a series has a column that could be parsed as a date, and there is another column with an NA value, read_json will convert all columns to datetimes.
Expected Behavior
Ideally none of the columns would be parsed as dates, unless I set keep_default_dates=False or I do not supply it.
Installed Versions
pandas : 1.5.1
numpy : 1.22.2
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.10.0
gcsfs : None
matplotlib : None
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
tzdata : None