Bug: On Python 3 to_csv() encoding defaults to ascii if the dataframe contains special characters. #17097
Comments
@Khris777 : Thanks for the report! I can address the For reference, could you show us the output of your |
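The output of the to_csv() calls looks correct:
Regarding the read_csv() part it's like this: I can read test1.csv and test3.csv fine, regardless of specifying encoding='utf8' or not. Likewise I can not read test2.csv at all, regardless of specifying encoding='utf8' or not. The error message is returned in both cases. The problem is only solved by explicitly specifying encoding='utf8' in to_csv(). |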
@Khris777 : Interesting! That's definitely a little unusual. Do me a favor, could you copy + paste that output into your original issue description? That will make reviewing this a little easier. BTW, you are more than welcome to investigate what's going on here and submit a PR to fix this admittedly strange behavior. |
@gfyoung : Can you reproduce the problem on your side? |
@Khris777 : Yes, I can. |
@gfyoung When The
Input
import pandas as pd
L2 = ["AAAAA","ÄÄÄÄÄ","ßßßßß","聞聞聞聞聞"]
df2 = pd.DataFrame({"L2":L2})
df2.to_csv("test2.csv")
# Japanese Windows default encoding is cp932
print(pd.read_csv("test2.csv", encoding='cp932'))
Output
   Unnamed: 0     L2
0           0  AAAAA
1           1  ?????
2           2  ?????
3           3  聞聞聞聞聞 |
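The question marks in that output are consistent with characters being replaced because cp932 cannot represent them. The following is a minimal sketch of that effect using only Python's built-in codecs. It assumes the substitution happens at encode time with `errors="replace"`, which the thread does not confirm, and `roundtrip` is a hypothetical helper name.

```python
# Sample values taken from the comment above.
values = ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]


def roundtrip(text, codec):
    """Encode with `codec`, substituting '?' for unencodable characters, then decode."""
    # Assumption: replacement happens while encoding; pandas may behave differently.
    return text.encode(codec, errors="replace").decode(codec)


# cp932 (Japanese Windows) has no mapping for 'Ä' or 'ß', but it does for '聞'.
cp932_result = [roundtrip(v, "cp932") for v in values]

# UTF-8 can represent every character, so nothing is lost.
utf8_result = [roundtrip(v, "utf-8") for v in values]

print(cp932_result)
print(utf8_result)
```

Under this assumption the loss is permanent once the file is written, so choosing a different encoding when reading cannot bring the original characters back.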
@Licht-T : That's true, so are you suggesting then that the workaround is specifying the correct encoding since |
This is a docs bug on Windows. pandas.DataFrame.to_csv (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) states:
encoding : string, optional
A string representing the encoding to use in the output file, defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
while in fact on Windows, as discussed above, pandas uses the system default encoding (cp1253 on my machine). This can lead to data corruption if someone on Python 3 trusts the docs and does not specify encoding='utf-8', since that is the stated default. |
Could you open a new issue for that? (In reply to Utumno, Mon, Feb 19, 2018 at 11:15 AM.) |
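To check which encoding a given machine would fall back to, the standard library can be queried directly. This is a hedged sketch: it assumes the affected pandas version relied on Python's default text-file encoding, which is what the comments above suggest but do not prove. `is_utf8` is a hypothetical helper name.

```python
import codecs
import locale
import sys

# Encoding that open() uses for text files when no encoding is passed.
# On Windows this is typically a legacy code page such as cp1252 or cp932.
default_file_encoding = locale.getpreferredencoding(False)

# Encoding used for implicit str/bytes conversions; always UTF-8 on Python 3.
default_str_encoding = sys.getdefaultencoding()


def is_utf8(name):
    """Return True if `name` is an alias of the UTF-8 codec."""
    return codecs.lookup(name).name == "utf-8"


print("open() default:", default_file_encoding)
print("str default:", default_str_encoding)
print("open() default is UTF-8:", is_utf8(default_file_encoding))
```

If the last line prints `False`, any file written without an explicit `encoding` argument may not be readable as UTF-8.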
Will do if needed. I just realized, though, that this may be fixed in pandas 21 - still using |
Yep no longer an issue on pandas |
Code Sample, a copy-pastable example if possible
Problem description
The to-csv doc says about encoding:
A string representing the encoding to use in the output file, defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
Therefore, being on Python 3, I expect test1.csv and test2.csv to be utf8.
However, while test1.csv is encoded in utf8, test2.csv is encoded in ascii. If I want the correct encoding, I have to explicitly add the encoding to produce the correct result as test3.csv.
Correspondingly doing
leads to
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
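The same error message can be reproduced without pandas, which helps show what the bytes in the file probably look like. This sketch assumes the file was written in a single-byte Windows code page such as cp1252. The issue only says 'ansi', so the exact code page is an assumption.

```python
# 'Ä' is the single byte 0xC4 in cp1252 (assumed code page, see above).
raw = "ÄÄÄÄÄ".encode("cp1252")

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # 0xC4 starts a two-byte UTF-8 sequence, but the next byte (another 0xC4)
    # is not a valid continuation byte, so decoding fails at position 0.
    message = str(exc)

print(message)

# Written as UTF-8, the same text takes two bytes per character and decodes cleanly.
utf8_raw = "ÄÄÄÄÄ".encode("utf-8")
print(utf8_raw.decode("utf-8"))
```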
EDIT: Additional info from first comment:
The output of the to_csv() calls looks correct:
Regarding the read_csv() part it's like this:
I can read test1.csv and test3.csv fine, regardless of specifying encoding='utf8' or not.
Likewise I can not read test2.csv at all, regardless of specifying encoding='utf8' or not. The error message is returned in both cases.
The problem is only solved by explicitly specifying encoding='utf8' in to_csv().
EDIT 2:
I can only read test2.csv when I explicitly state encoding='ansi', so read_csv() definitely expects utf-8.
Output of pd.show_versions()
pandas: 0.20.3
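The workaround described in this issue, passing the encoding explicitly on both the write and the read side, can be sketched as follows. The original code sample is not preserved here, so the data is borrowed from the example in the comments and the file name `roundtrip.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Sample data borrowed from the comment thread, not from the original report.
df = pd.DataFrame({"L2": ["AAAAA", "ÄÄÄÄÄ", "ßßßßß", "聞聞聞聞聞"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name

    # State the encoding explicitly instead of relying on any default.
    df.to_csv(path, encoding="utf-8")

    # Read back with the same encoding; index_col=0 restores the saved index.
    restored = pd.read_csv(path, encoding="utf-8", index_col=0)

print(restored["L2"].tolist())
```

Being explicit on both sides keeps the result independent of the platform's default encoding and of the pandas version in use.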