series.apply(pandas.to_datetime, convert_dtype=False) still converts dtype #14559

Closed
radekholy24 opened this issue Nov 2, 2016 · 4 comments
@radekholy24

A small, complete example of the issue

>>> import pandas
>>> s = pandas.Series({'a': '2012-05-01 00:00:00'})
>>> s.apply(pandas.to_datetime, convert_dtype=False)
a   2012-05-01
dtype: datetime64[ns]

Expected Output

a   2012-05-01
dtype: object

Output of pd.show_versions()

pandas: 0.19.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 7.1.0
setuptools: 18.0.1
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
@jorisvandenbossche
Member

Not sure what you are trying to do, as to_datetime acts on a full series as well, so the idiomatic thing to do is pd.to_datetime(s).
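As a minimal sketch of the idiomatic call mentioned above, using the Series from the original report (the variable name `converted` is only illustrative):

```python
import pandas as pd

# The Series from the original report: one string value that looks like a date.
s = pd.Series({'a': '2012-05-01 00:00:00'})

# to_datetime accepts the whole Series, so no apply() call is needed.
converted = pd.to_datetime(s)

print(converted)
print(converted.dtype)
```

The result is a datetime64-typed Series, which is the dtype the report wanted to avoid; the rest of the thread is about whether and how object dtype can be kept.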

The docstring of apply says about convert_dtype:

convert_dtype : boolean, default True

Try to find better dtype for elementwise function results. If
False, leave as dtype=object

So this keyword only applies when the function works elementwise. As mentioned above, pd.to_datetime can act on the full series at once.
If you take an example function that only works elementwise, you can see the effect of the convert_dtype keyword:

In [2]: s = pd.Series(['a', 'b'])

In [3]: s
Out[3]: 
0    a
1    b
dtype: object

In [4]: def f(val):
   ...:     if val == 'a':
   ...:         return 1
   ...:     else:
   ...:         return 2

In [6]: s.apply(f)
Out[6]: 
0    1
1    2
dtype: int64

In [7]: s.apply(f, convert_dtype=False)
Out[7]: 
0    1
1    2
dtype: object

But again, your code does not feel idiomatic, so please clarify what you are trying to achieve. In many cases you don't want to keep this object dtype. Having the series as a datetime64 dtype gives you access to datetime-specific functionality.
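To illustrate what that functionality can look like, here is a small sketch using the `.dt` accessor; the sample dates and variable names are made up for the example:

```python
import pandas as pd

# A datetime64 Series exposes the .dt accessor (year, month, weekday, ...).
dates = pd.to_datetime(pd.Series(['2012-05-01', '2013-07-15']))
print(dates.dt.year.tolist())  # [2012, 2013]

# The same values stored with object dtype do not offer the accessor.
as_objects = dates.astype(object)
try:
    as_objects.dt.year
except AttributeError as exc:
    print("object dtype:", exc)
```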

jorisvandenbossche added this to the No action milestone Nov 2, 2016
@radekholy24
Author

radekholy24 commented Nov 2, 2016

@jorisvandenbossche, I was under the impression that pandas.to_datetime is applied elementwise since:

>>> import numpy
>>> isinstance(pandas.to_datetime, numpy.ufunc)
False

Anyway, s.apply(lambda x: pandas.to_datetime(x), convert_dtype=False) behaves the same and that is applied elementwise for sure.

In my case, my function receives different functions to apply on the series and thus it does not know beforehand whether it will get pandas.to_datetime or anything else. A simplified version of my code looks like:

class Foo:
    def __init__(self, generator):
        self.dataframe = generator.generate()

    def convert(self, name, converter):
        self.dataframe[name] = self.dataframe[name].apply(converter, convert_dtype=False)

In my case, it's much easier to describe the behavior of the convert method as preserving dtype=object than to explain that it applies some smart logic to change the dtype. It's also much easier to unit test the method, since DataFrames with the same values but different dtypes do not compare equal, and it's easier to create an "object dtyped" DataFrame than a DataFrame with each column having different dtype. In my case, code simplicity is preferred over performance optimizations.

Also, regardless of my use case, the behavior of the apply method does not match the documentation [1] and thus it's a bug either in the code or in the documentation.

[1] if not in the case of apply(pandas.to_datetime) then in the case of apply(lambda x: pandas.to_datetime(x)) (or some more complex function that may return pandas.Timestamp) for sure

@jorisvandenbossche
Member

I understand that you don't want to distinguish between elementwise functions or not in your application, and for that the use of apply is appropriate.
But if you only want object dtype, then don't convert your data. I really don't recommend trying to keep everything as object dtype. Once you start doing manipulations with those data, data types will get deduced and you get dtypes anyway.

it's much easier to unit test the method, since DataFrames with the same values but different dtypes do not compare equal

you can specify not to check the dtype
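One way this could look, as a sketch: it assumes the comparison is done with the `assert_frame_equal` testing helper and its `check_dtype` parameter, which the comment does not name explicitly. The helper is imported from `pandas.testing` here; in pandas 0.19 it lived under `pandas.util.testing`. The two frames are invented for the example.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same values, different dtypes: int64 versus object.
as_int = pd.DataFrame({'x': [1, 2]})
as_object = pd.DataFrame({'x': pd.Series([1, 2], dtype=object)})

# Passes: only the values (and labels) are compared.
assert_frame_equal(as_int, as_object, check_dtype=False)

# The default comparison also checks the dtype and therefore fails.
try:
    assert_frame_equal(as_int, as_object)
    strict_comparison_passed = True
except AssertionError:
    strict_comparison_passed = False

print(strict_comparison_passed)  # False
```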

it's easier to create an "object dtyped" DataFrame than a DataFrame with each column having different dtype

that is not true, as when creating a dataframe the default is to deduce the dtypes from the data you pass in

If you want to keep object dtype, you can simply do .astype(object) after the apply call (or astype(self.dataframe[name].dtype) if it is not always object dtype)
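A minimal sketch of that suggestion follows. The helper name `apply_preserving_dtype` is hypothetical, and the input Series is created with an explicit `dtype=object` so that the example does not depend on how a given pandas version infers string data.

```python
import pandas as pd


def apply_preserving_dtype(series, converter):
    """Hypothetical helper: apply `converter`, then cast back to the input dtype."""
    return series.apply(converter).astype(series.dtype)


s = pd.Series({'a': '2012-05-01 00:00:00'}, dtype=object)
result = apply_preserving_dtype(s, pd.to_datetime)

print(result.dtype)       # object
print(type(result['a']))  # the individual value is still a pandas Timestamp
```

The cast happens after pandas has already inferred datetime64, so the values are converted twice; this trades some performance for a predictable dtype, in line with the preference for simplicity stated earlier in the thread.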


As for the specifics, the reason this does not work as documented for datetimes is this:

In [43]: pd.Series(np.array([1, 2], dtype=object))
Out[43]: 
0    1
1    2
dtype: object

In [45]: pd.Series(np.array([pd.Timestamp('2012-01-01'), pd.Timestamp('2012-01-02')], dtype=object))
Out[45]: 
0   2012-01-01
1   2012-01-02
dtype: datetime64[ns]

Under the hood, if convert_dtype=False, an object array is returned, but when this array is put into a Series, the object dtype is kept for numerical values but not for datetimes.
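The inference happens in the Series constructor, so a sketch of how to sidestep it is to pass `dtype=object` explicitly when building the Series. This only illustrates the constructor behavior; it is not a description of what `apply` does internally, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

stamps = np.array(
    [pd.Timestamp('2012-01-01'), pd.Timestamp('2012-01-02')],
    dtype=object,
)

# Without a dtype, the constructor inspects the objects and may infer a
# datetime dtype (datetime64[ns] in the session quoted above).
inferred = pd.Series(stamps)
print(inferred.dtype)

# With an explicit dtype, no inference takes place.
explicit = pd.Series(stamps, dtype=object)
print(explicit.dtype)  # object

# Numerical values in an object array keep object dtype either way.
numbers = pd.Series(np.array([1, 2], dtype=object))
print(numbers.dtype)  # object
```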

@radekholy24
Author

you can specify not to check the dtype

You mean using .astype(object) on both DataFrames before? Good idea, I'll consider that. Thank you.

that is not true, as when creating a dataframe the default is to deduce the dtypes from the data you pass in

In which case, I'm hitting the #14558 issue. I'll retest this idea with Pandas 0.20. Thank you.

@jorisvandenbossche, OK, I think I can use one of the approaches you have suggested. Anyway, may I ask you to reopen this in order to track the mismatch between the behavior and the documentation?