Saving CSV with backslashed-escaping is not idempotent. #14122
Comments
>>> import csv
>>> from pandas import DataFrame
>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""],
"zoo": ["1"]})
>>> df
text zoo
0 Hello! Please "help" me. I cannot quote a csv.\ 1
Why did you put a backslash character there at the end? Let's remove it:
>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv."""],
"zoo": ["1"]})
>>> df
text zoo
0 Hello! Please "help" me. I cannot quote a csv. 1
>>> print(df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8",
escapechar='\\', doublequote=False))
"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.","1"
You get that backslash at the end because you put it there.
@jreback : This is not a bug and can be closed.
@gfyoung: I put a backslash there to show that backslash escaping does not work for all possible inputs. It seems very reasonable for an implementation to simply backslash the backslash, as is the case with C, Java, the Python shell, JSON, and string literals, as well as the output of Python
A lack of idempotency could be a security concern, as it could affect the availability and integrity of an application.
The |
@gfyoung |
@deads I am not convinced that your example should be lossless. CSV is a pretty lossy format, especially with all of the options you have selected. Can you do this example losslessly with the Python csv reader?
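For reference, the question above can be checked with a small round-trip sketch that uses only the standard library csv module. The helper name roundtrips and the shortened sample rows are made up for illustration; the formatting options mirror the to_csv call quoted earlier in this thread. The result for the row with the trailing backslash may depend on the Python version, so no output is claimed for it here.

```python
import csv
import io


def roundtrips(rows, **fmt):
    """Return True if writing rows and reading them back yields the same rows."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerows(rows)
    buf.seek(0)
    try:
        return list(csv.reader(buf, **fmt)) == [list(row) for row in rows]
    except (csv.Error, ValueError):
        # A malformed file can also make the reader raise instead of misparse.
        return False


# The same options as the to_csv call in this thread.
fmt = dict(quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", doublequote=False)

# Hypothetical sample rows; the second set ends its text field with a backslash.
plain = [["text", "zoo"], ['Please "help" me.', "1"]]
tricky = [["text", "zoo"], ['Please "help" me.\\', "1"]]

print("plain: ", roundtrips(plain, **fmt))
print("tricky:", roundtrips(tricky, **fmt))
```

If the second line prints False, the writer and reader disagree about the trailing backslash, which is the behavior reported in this issue.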
This behavior is present in the csv module (https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4). From first principles, it seems like the offending backslash should be escaped. If I manually edit the file to be
then I fiddled with R and it doesn't seem to do much better
Another example of CSV not being a high fidelity data interchange tool =| Also, I do not think it is fair to say
to someone reporting behavior that looks like a bug. This would seem buggy to me if I ran across it in production (presuming this came out of some kind of real world use). "Just change your input" is easy to say until the data in question is machine-generated (and may contain backslashes).
@wesm : Your comment is a little presumptive because I did not realize that that was his point with the extraneous backslash. Before jumping to conclusions as you did about me "brushing off" this problem as a cop-out, I would suggest that you read the original issue. With regards to your point about the bug, I am not surprised that this issue persists with
I seem to have problems with quoting and escaping, too. Has anything happened since 2016?
@black-snow : I don't believe anything has changed with this issue unfortunately. Given that the escaping and writing is handled by Python
Have a look at the examples and see if they still persist today. Then also post your example code, and we can have a look.
Thanks for the quick reply @gfyoung ! I've already fixed my issue. Apparently pandas needs to be told that quotes inside a quoted field are escaped with a backslash.
Awesome! Mind sharing your code-sample for reference?
Sure. There's no real CSV standard, but I'm used to certain defaults, i.e. delimiter is
Thought this issue would be related but apparently it's not.
@black-snow : You'll need to provide us the file (if possible). Can't run that code-sample as is. 😄
I cannot, sadly, it's business internal stuff. But without the escapechar pandas should already fail on this (not tested):
Confirmation would be nice.
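A minimal sketch of how such a confirmation could be run, using a made-up two-column sample rather than the internal file mentioned above. It relies only on the escapechar parameter of pandas.read_csv. What happens without escapechar is not assumed here: the second call just reports whether the field came back intact or the parser raised.

```python
import io

import pandas as pd

# Made-up sample: the quotes inside the quoted field are backslash-escaped.
raw = 'text,zoo\n"Please \\"help\\" me.",1\n'
expected = 'Please "help" me.'

# Telling the parser about the escape character recovers the original text.
with_escape = pd.read_csv(io.StringIO(raw), escapechar="\\")
print(with_escape.loc[0, "text"] == expected)

# Without escapechar the parser may raise or return a mangled field,
# so this only reports what happened instead of assuming an outcome.
try:
    without_escape = pd.read_csv(io.StringIO(raw))
    print(without_escape.loc[0, "text"] == expected)
except Exception as exc:
    print("read_csv failed:", type(exc).__name__)
```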
I'm also having this problem. Setting escapechar to backslash doesn't fix it.
This results in the following code in the file:
Now, if I try to read the same file using backslash as the escape char, I get an erroneous result: bar = test test ",aa"
I believe setting "\" as the escape character should result in "\" being escaped by "\".
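To illustrate the expectation above, here is a rough workaround sketch that does the escaping by hand instead of relying on the writer: the escape character is doubled first, then quotes are escaped, and the standard library csv.reader is used to read the result back. The helper write_escaped and the sample rows are hypothetical and not part of pandas or the csv module.

```python
import csv
import io


def write_escaped(rows, escapechar="\\", quotechar='"'):
    """Hypothetical helper: quote every field, escaping the escape character first."""
    lines = []
    for row in rows:
        fields = []
        for value in row:
            text = str(value)
            # Order matters: double the escape character before escaping quotes.
            text = text.replace(escapechar, escapechar * 2)
            text = text.replace(quotechar, escapechar + quotechar)
            fields.append(quotechar + text + quotechar)
        lines.append(",".join(fields))
    return "\n".join(lines) + "\n"


# Made-up rows: one field contains quotes and ends with a backslash.
rows = [["foo", "bar"], ['test "test"\\', "aa"]]
data = write_escaped(rows)
print(data)

# Read back with the matching escape settings and compare with the input.
reader = csv.reader(io.StringIO(data), escapechar="\\", doublequote=False)
print(list(reader) == rows)
```

This sketch ignores embedded newlines and non-comma delimiters; it is only meant to show that escaping the escape character itself is enough to make the round trip lossless for this kind of input.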
I guess it's related to this bug: https://bugs.python.org/issue12178 (opened in 2011)
Looks like this is finally getting fixed in Python 3.10?
R's |
@pdbaines and I noticed this bug.
I want Pandas to write a CSV file so that all field data is backslash escaped if the character has a special interpretation (e.g. quotes or backslashes themselves). If a quote is backslashed, it is treated as field data, rather than a special character. This is not the behavior that I am seeing.
Consider the following data frame:
When written to a file, it looks something like this:
The quotes are properly escaped in Please "help" me, but oddly, the end-quote of the field is backslashed while the start-quote of the field is not.
If I read the data frame in again using exactly the same parameters,
I get a data frame with both fields concatenated into the first field and the second field is NaN.
If I instead do the following:
I get a file with an odd number of unescaped quote characters:
and some unescaped quote characters are treated as data.
Output of pd.show_versions()