Bug: On Python 3 to_csv() encoding defaults to ascii if the dataframe contains special characters. #17097

Closed
Khris777 opened this issue Jul 27, 2017 · 11 comments · Fixed by #17821
Labels
Bug IO CSV read_csv, to_csv Unicode Unicode strings
Milestone

Comments

@Khris777

Khris777 commented Jul 27, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

L1 = ["AAAAA","BBBBB","TTTTT","77777"]
df1 = pd.DataFrame({"L1":L1})
df1.to_csv("test1.csv")

L2 = ["AAAAA","ÄÄÄÄÄ","ßßßßß","聞聞聞聞聞"]
df2 = pd.DataFrame({"L2":L2})
df2.to_csv("test2.csv")

df2.to_csv("test3.csv",encoding='utf8')

Problem description

The to_csv doc says about encoding:

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

Therefore being on Python 3 I expect test1.csv and test2.csv to be utf8.

However, while test1.csv is encoded in utf8, test2.csv is encoded in ascii. To get the correct encoding I have to explicitly pass the encoding, which produces the correct result in test3.csv.

Correspondingly doing

pd.read_csv("test2.csv")

leads to

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
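
The error message above is consistent with a file written in a single-byte Windows code page and then read as UTF-8. A minimal sketch, assuming cp1252 (the reporter's actual code page is not stated in the thread): `Ä` becomes the single byte 0xC4, which UTF-8 treats as the first byte of a two-byte sequence, so a second 0xC4 directly after it is rejected as an invalid continuation byte.

```python
# Assumption: cp1252 stands in for the reporter's (unstated) Windows code page.
text = "ÄÄÄÄÄ"
legacy_bytes = text.encode("cp1252")  # one byte per character: 0xC4 each

error = None
try:
    # Reading those bytes as UTF-8 fails on the very first byte.
    legacy_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(legacy_bytes)
print(error)
```

This only reproduces the decoding failure with plain byte strings; it does not call pandas.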

EDIT: Additional info from first comment:

The output of the to_csv() calls looks correct:

df1.to_csv()
Out[9]: ',L1\n0,AAAAA\n1,BBBBB\n2,TTTTT\n3,77777\n'

df2.to_csv()
Out[10]: ',L2\n0,AAAAA\n1,ÄÄÄÄÄ\n2,ßßßßß\n3,聞聞聞聞聞\n'

Regarding the read_csv() part it's like this:

I can read test1.csv and test3.csv fine, regardless of specifying encoding='utf8' or not.

Likewise, I cannot read test2.csv at all, regardless of specifying encoding='utf8' or not. The error message is returned in both cases.

The problem is only solved by explicitly specifying encoding='utf8' in to_csv().

EDIT 2:

I can only read test2.csv when I explicitly state encoding='ansi', so read_csv() definitely expects utf-8.

Output of pd.show_versions()

python: 3.6.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64

pandas: 0.20.3
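
The workaround described in the issue, passing the encoding explicitly on both the write and the read, can be sketched as follows. The temporary directory is an assumption made for illustration so the example does not touch the working directory; the file name mirrors test3.csv from the report.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"L2": ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]})

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test3.csv")

# State the encoding explicitly instead of relying on any default.
df.to_csv(path, encoding="utf-8")

# Check the bytes on disk independently of pandas.
with open(path, "rb") as fh:
    decoded = fh.read().decode("utf-8")

# Read the file back with the same explicit encoding.
round_trip = pd.read_csv(path, encoding="utf-8", index_col=0)
print(round_trip)
```

Being explicit on both sides keeps the result independent of the platform and of the pandas version in use.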

@gfyoung
Member

gfyoung commented Jul 27, 2017

@Khris777 : Thanks for the report! I can address the read_csv part in that we don't use a default encoding for that part (per the docs), so you would be expected to explicitly specify UTF8. As for to_csv, that does seem a little strange...

For reference, could you show us the output of your to_csv calls? You can do that by just typing df.to_csv() without a filepath, and it will print out the CSV file to your console / terminal.

@Khris777
Author

Khris777 commented Jul 28, 2017

The output of the to_csv() calls looks correct:

df1.to_csv()
Out[9]: ',L1\n0,AAAAA\n1,BBBBB\n2,TTTTT\n3,77777\n'

df2.to_csv()
Out[10]: ',L2\n0,AAAAA\n1,ÄÄÄÄÄ\n2,ßßßßß\n3,聞聞聞聞聞\n'

Regarding the read_csv() part it's like this:

I can read test1.csv and test3.csv fine, regardless of specifying encoding='utf8' or not.

Likewise, I cannot read test2.csv at all, regardless of specifying encoding='utf8' or not. The error message is returned in both cases.

The problem is only solved by explicitly specifying encoding='utf8' in to_csv().

@gfyoung
Member

gfyoung commented Jul 28, 2017

@Khris777 : Interesting! That's definitely a little unusual. Do me a favor, could you copy + paste that output into your original issue description? That will make reviewing this a little easier.

BTW, you are more than welcome to investigate what's going on here and submit a PR to fix this admittedly strange behavior.

@gfyoung gfyoung added the IO CSV read_csv, to_csv label Jul 28, 2017
@Khris777
Author

@gfyoung : Can you reproduce the problem on your side?

@gfyoung
Member

gfyoung commented Jul 28, 2017

@Khris777 : Yes, I can.

@Licht-T
Contributor

Licht-T commented Oct 8, 2017

@gfyoung When encoding=None, the file is written by the csv library.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1630

The csv library seems to write the file using the system default encoding, at least on Windows.
So I cannot reproduce the example on a Mac whose default language is Japanese, but I can on Windows 10 Japanese edition.
I also found that the file can be read back using the system default encoding.

Input

import pandas as pd

L2 = ["AAAAA","ÄÄÄÄÄ","ßßßßß","聞聞聞聞聞"]
df2 = pd.DataFrame({"L2":L2})
df2.to_csv("test2.csv")

# Japanese Windows default encoding is cp932
print(pd.read_csv("test2.csv", encoding='cp932'))

Output

   Unnamed: 0     L2
0           0  AAAAA
1           1  ?????
2           2  ?????
3           3  聞聞聞聞聞
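
One way the `?????` rows in this output could arise is shown below. This is a sketch under an assumption: that the writer replaced characters cp932 cannot represent instead of raising an error. The thread does not say which error handler was actually in effect.

```python
rows = ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]

# Assumption: errors="replace" models the writer's behavior.
# cp932 has no mapping for Ä or ß, so each becomes "?"; 聞 survives.
round_tripped = [
    row.encode("cp932", errors="replace").decode("cp932") for row in rows
]

for original, result in zip(rows, round_tripped):
    print(original, "->", result)
```

Once a character has been replaced on write, no choice of encoding on read can recover it.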

@gfyoung
Member

gfyoung commented Oct 8, 2017

@Licht-T : That's true, so are you suggesting then that the workaround is specifying the correct encoding, since UTF8 is evidently insufficient?

@Utumno

Utumno commented Feb 19, 2018

This is a docs bug on Windows. pandas.DataFrame.to_csv states:

    encoding : string, optional
        A string representing the encoding to use in the output file,
        defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.

while in fact on Windows, as discussed above, pandas uses the system default encoding (cp1253 on my machine). This can lead to data corruption if someone on Python 3 trusts the docs and does not specify encoding='utf-8', since that is the stated default.
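
To see which encoding a given machine uses when a text file is opened without an explicit `encoding`, Python's standard library can report it directly. This sketch only inspects the built-in `open()`; that pandas 0.20.3 ended up on this path is taken from the discussion above, not verified here. The probe file name is hypothetical.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, e.g. a cp125x code page on many
# Windows installations or UTF-8 on most Linux and macOS systems.
locale_encoding = locale.getpreferredencoding(False)

# Open a throwaway file in text mode without naming an encoding and
# ask the file object which encoding it picked.
path = os.path.join(tempfile.mkdtemp(), "probe.txt")
with open(path, "w") as fh:
    file_encoding = fh.encoding

print("locale preferred encoding:", locale_encoding)
print("open() default encoding:  ", file_encoding)
```

If the reported encoding is not UTF-8, any code that writes text without naming an encoding can produce files that a UTF-8 reader rejects.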

@TomAugspurger
Contributor

TomAugspurger commented Feb 19, 2018 via email

@Utumno

Utumno commented Feb 19, 2018

Will do if needed. I just realized, though, that this may be fixed in pandas 0.21 (I'm still using 0.20.3). Let me test with a newer one.

@Utumno

Utumno commented Feb 19, 2018

Yep no longer an issue on pandas 0.22.0 - sorry for the noise

6 participants