Saving CSV with backslash escaping is not idempotent. #14122

Open
@deads

@pdbaines and I noticed this bug.

I want pandas to write a CSV file in which every character of field data that has a special interpretation (e.g. quotes, or backslashes themselves) is backslash-escaped. A backslash-escaped quote should then be read as field data rather than as a special character. This is not the behavior I am seeing.
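To make the expected behavior concrete, here is a minimal sketch of the escaping rule described above, written by hand rather than taken from pandas or the csv module. The helper names `escape_field` and `unescape_field` are hypothetical and exist only for illustration.

```python
def escape_field(value, quotechar='"', escapechar="\\"):
    """Quote a field, prefixing every quote and every backslash with a backslash."""
    out = []
    for ch in value:
        if ch in (quotechar, escapechar):
            out.append(escapechar)  # escape the special character
        out.append(ch)
    return quotechar + "".join(out) + quotechar


def unescape_field(text, quotechar='"', escapechar="\\"):
    """Invert escape_field: drop the outer quotes and undo each escape."""
    body = text[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == escapechar and i + 1 < len(body):
            i += 1  # skip the escape character, keep the character it protects
        out.append(body[i])
        i += 1
    return "".join(out)


# Hypothetical usage: a field ending in a backslash survives the round trip.
original = 'Hello! Please "help" me. I cannot quote a csv.\\'
assert unescape_field(escape_field(original)) == original
```

Under this rule a trailing backslash in the data is written as two backslashes, so it can never be mistaken for an escape of the closing quote.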

Consider the following data frame:

import csv
import pandas as pd
df = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""], "zoo": ["1"]})
df.to_csv("out.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

When written to a file, it looks something like this:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\","1"

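The quotes in Please "help" me are properly escaped, but the trailing backslash in the data is written without being escaped itself. As a result it sits directly before the field's closing quote, so the end-quote of the field appears backslash-escaped while the start-quote is not.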

If I read the data frame in again using exactly the same parameters,

df2 = pd.read_csv("out.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I get a data frame in which both fields are concatenated into the first column and the second column is NaN.

>>> print(df2)
                                                text  zoo
0  Hello! Please "help" me. I cannot quote a csv....  NaN
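The write-then-read check can be sketched without touching the filesystem. The example below is an assumption-laden illustration: it uses Python's standard-library `csv` module directly instead of pandas, and the helper name `round_trips` is hypothetical. It mirrors the `quoting`, `escapechar`, and `doublequote` settings used above.

```python
import csv
import io


def round_trips(rows, **fmt):
    """Write rows with csv.writer, read them back with the same options, compare."""
    try:
        buf = io.StringIO(newline="")
        csv.writer(buf, **fmt).writerows(rows)
        reader = csv.reader(io.StringIO(buf.getvalue(), newline=""), **fmt)
        read_back = list(reader)
    except (csv.Error, ValueError):
        # A parse or conversion failure also counts as a failed round trip.
        return False
    return read_back == [list(row) for row in rows]


# Same formatting options as in the report (the encoding does not apply in memory).
FMT = dict(quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", doublequote=False)

# Embedded quotes alone should survive the round trip.
print(round_trips([["text", "zoo"], ['Please "help" me.', "1"]], **FMT))

# The trailing-backslash row from the report; the result may depend on the
# Python version in use, so no expected output is shown here.
print(round_trips([["text", "zoo"], ['I cannot quote a csv.\\', "1"]], **FMT))
```

Comparing the two results separates the embedded-quote case from the trailing-backslash case.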

If I instead do the following:

df3 = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\\""""], "zoo": ["1"]})
df3.to_csv("outB.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)
df4 = pd.read_csv("outB.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I get a file with an odd number of unescaped quote characters:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\\"","1"

and some unescaped quote characters are treated as data.
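The odd-quote observation can be checked mechanically. The following is a minimal sketch, assuming a backslash always escapes the single character that follows it; `count_unescaped_quotes` is a hypothetical helper, not part of pandas or the csv module. The two sample lines are copied from the file contents shown above.

```python
def count_unescaped_quotes(line, quotechar='"', escapechar="\\"):
    """Count quote characters that are not preceded by an active escape character."""
    count = 0
    i = 0
    while i < len(line):
        if line[i] == escapechar:
            i += 2  # skip the escape character and the character it escapes
            continue
        if line[i] == quotechar:
            count += 1
        i += 1
    return count


# Data lines as written to out.csv and outB.csv in the report.
line_a = r'"Hello! Please \"help\" me. I cannot quote a csv.\","1"'
line_b = r'"Hello! Please \"help\" me. I cannot quote a csv.\\"","1"'

print(count_unescaped_quotes(line_a))  # 3
print(count_unescaped_quotes(line_b))  # 5
```

A well-formed line with two quoted fields would contain exactly four unescaped quotes, so an odd count means at least one quote cannot be paired as a field delimiter.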

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
