Fix Byte Order Mark (BOM) handling in markdown display and editor. #6716
Comments
Would be interested in how GitHub solves this (or not). Generally, I think all BOMs should be abolished on save, but I guess some broken software might rely on it being present (cough Excel), so there might be some value in trying to preserve it (but never rendering it).
It's probably through some use of chardet or similar. It's probably reasonable to pass the content through that, but the trouble with the chardet library is that it's far from perfect.
BOM should be easily detectable via regex
Ah but I bet that the next problem will be cp1252 related
Please do not use regexp for this, just check the first two bytes
BOM is actually three bytes in UTF-8 and two bytes in UTF-16. Still think it's best to regex-test the Unicode string instead like
It does not matter, it will still be faster than running a regexp and converting the byte array to a string.
OK so we are passing this content through chardet. So we need to adjust our UTF-8 "decoder" to simply remove the BOM if present.
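For illustration, the BOM removal described in this comment could look roughly like the following. This is a minimal sketch that assumes the content has already been identified as UTF-8; `removeUTF8BOM` and `utf8BOM` are hypothetical names, not functions from the Gitea codebase.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// removeUTF8BOM is a hypothetical helper (not Gitea's actual code).
// It returns content without a leading UTF-8 BOM; content that has
// no BOM is returned unchanged.
func removeUTF8BOM(content []byte) []byte {
	return bytes.TrimPrefix(content, utf8BOM)
}

func main() {
	withBOM := append([]byte{0xEF, 0xBB, 0xBF}, []byte("# README")...)
	fmt.Printf("%q\n", removeUTF8BOM(withBOM)) // prints "# README"
}
```

Comparing the leading bytes directly avoids converting the byte slice to a string or compiling a regular expression, which is the point made earlier in the thread.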
I think the most elegant solution would be:
As I said, BOM preservation is important when dealing with certain Microsoft products which use the BOM as an indicator of whether a file is UTF-8 or ASCII, like Excel does.
@silverwind - OK I've got removal of the BOM on decoding sorted out. In terms of keeping the previous encoding, that's a bit more difficult - we don't currently do that at all - all data created in the editor is assumed to be UTF-8 AFAIU (I certainly didn't write any encoding gadgets - doing it properly is a horrible experience.)
We're making the assumption that everything is UTF-8, so adding it back when committing from the UI shouldn't be too hard. Read the old file, check if its first three bytes match (maybe create a shared hasBOM function to use in the template renderer as well) and if they do, add them to the saved content. Though I won't block on this, stripping the BOM is already an improvement.
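The approach suggested in this comment (check the old file's first three bytes, then add them back on save) might be sketched as follows. `hasBOM` is the name proposed in the comment; its signature here and the `restoreBOM` helper are assumptions made for illustration, and the sketch only handles the UTF-8 BOM.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasBOM reports whether content starts with the UTF-8 BOM.
// The signature is an assumption; the thread only proposes the name.
func hasBOM(content []byte) bool {
	return bytes.HasPrefix(content, utf8BOM)
}

// restoreBOM is a hypothetical helper: if the old file carried a BOM
// and the edited content does not, the BOM is prepended before saving.
func restoreBOM(old, edited []byte) []byte {
	if !hasBOM(old) || hasBOM(edited) {
		return edited
	}
	out := make([]byte, 0, len(utf8BOM)+len(edited))
	out = append(out, utf8BOM...)
	return append(out, edited...)
}

func main() {
	old := []byte("\xEF\xBB\xBFa,b\n")
	edited := []byte("a,b,c\n")
	fmt.Println(hasBOM(restoreBOM(old, edited))) // prints true
}
```

Checking `hasBOM(edited)` as well guards against writing the mark twice if the editor content somehow still contains it.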
OK done.
The attached PR will attempt to re-encode to the detected charset and, upon failure, will default to UTF-8 with or without a BOM as per the original charset.
Thanks, will likely test this tomorrow.
Description
When the README.md file contains a Unicode Byte Order Mark, the first line of the file is
formatted incorrectly. This happens for all BOMs that I have tested - UTF8, Unicode and Unicode
Big Endian. The text of the file loads correctly in all cases, the only problem is with the BOM
being treated as file content and incorrectly displayed in both the rendered view and the editor.
This is a particular problem for developers using Visual Studio and other environments that
insert a BOM by default, a setting that can be difficult to change without add-ons to the environment.
The various Unicode Byte Order Markers are metadata and should not be treated as file content. At a minimum the markdown renderer should detect and discard BOMs in the content. The markdown editor should likewise detect and remove BOMs during load and either replace them during save or save without markers.
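As a rough illustration of detecting the three markers mentioned in this report, the sketch below matches the leading bytes against the UTF-8, UTF-16 little-endian and UTF-16 big-endian BOMs. The `detectBOM` function and `knownBOMs` table are hypothetical names, not part of Gitea, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16 LE.

```go
package main

import (
	"bytes"
	"fmt"
)

// bomInfo pairs an encoding name with its byte order mark.
type bomInfo struct {
	name string
	mark []byte
}

// knownBOMs lists the markers handled by this sketch (UTF-32 is omitted).
var knownBOMs = []bomInfo{
	{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
	{"UTF-16LE", []byte{0xFF, 0xFE}},
	{"UTF-16BE", []byte{0xFE, 0xFF}},
}

// detectBOM is a hypothetical helper. It returns the encoding name and
// the BOM length in bytes, or ("", 0) when no known BOM is present.
func detectBOM(content []byte) (name string, length int) {
	for _, b := range knownBOMs {
		if bytes.HasPrefix(content, b.mark) {
			return b.name, len(b.mark)
		}
	}
	return "", 0
}

func main() {
	data := []byte{0xFF, 0xFE, 'h', 0x00, 'i', 0x00}
	name, n := detectBOM(data)
	fmt.Println(name, n) // prints UTF-16LE 2
}
```

A renderer could drop the first `n` bytes before decoding, while an editor could remember `name` so the same marker is written back on save.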
I don't understand Go so I won't be submitting a pull request. From 10 minutes casting about in the code it looks like `ToUTF8WithErr` and `ToUTF8WithFallback` might be a place to start. It looks like those are the main methods used to massage text file content into UTF8 for both render and edit.
Screenshots
Screenshots taken from try.gitea.io site mentioned above.