Detect content-charset and ensure the presence of correct metadata #361
According to The W3C:
So first, I don't see this as a bug but as a feature request, because Tidy can't determine whether or not the page is delivered without the proper content type. This leaves two points for discussion:
That would address just this request, but opens up a larger can of worms that we would have to address (I'm not saying "no", just pointing out that it's not as simple as limiting ourselves to the above). Right now Tidy doesn't try to interpret
In general we defer to the user's options; but
I think the specific request is fairly simple, but if we're going to start getting involved with character sets, it shouldn't be piecemeal; we really need to have a very good specification for desired behavior given multiple conditions, including how we deal with
@balthisar This affects not just Kannada but any input content encoded using 'utf-8' (in this case). Please note the
Regarding
When compatibility with old browsers is desired we could use both
Oh, I get that. If our output-encoding is UTF8, we could simply add a meta charset, or ensure there's one already there, just like we ensure that <head> is always there, regardless of whether we detect any characters that require UTF8. But thinking of the bigger picture, Tidy is still HTML<5 compatible, and it supports a large number of character sets. I personally believe in an all-UTF world, but I can't force that on all users of Tidy, and I'm not certain if we want Tidy to treat UTF8 as a special, privileged child. Or maybe we do. Mostly my point is that this certainly merits a larger discussion that goes beyond my own personal whims (which would be to support nothing but UTF8 output and force the world to modernize -- see why my own personal whims aren't good enough?). To provoke more discussion (don't worry, others will be by), I might suggest we pair this behavior with a new option:
Still, though, do we do this for UTF8 only, or try to adopt something that will work for all supported character encodings?
I've suitably renamed the issue.
@sria91 thanks for bringing this up. For a long time now I have thought tidy should add the appropriate charset meta to a tidied document, like it adds a
Now, as @balthisar points out, this is not simple to decide, what and when, like your list of possibilities @sria91, and given that tidy already has an option
But I do not think this should be confused with the input
There are actually three options internally, and maybe the comments help -
And be aware, the html5 simple charset meta is rejected by the W3C validator if the document has a legacy doctype. There it expects the html4 meta content form. Unfortunately, tidy does not presently make this distinction, but maybe it should. This may include new options like
So, yes, there is a lot to discuss and decide here...
And @sria91 in your given sample what happens if you add
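The doctype distinction described above can be sketched as follows. This is only an illustration, not Tidy code: `charset_meta` and its `html5` flag are hypothetical names, and the sketch assumes the caller already knows which kind of doctype the document carries.

```python
def charset_meta(encoding: str, html5: bool) -> str:
    """Return a charset declaration suited to the document's doctype.

    html5=True  -> the short HTML5 form
    html5=False -> the HTML4-style http-equiv/content form
    """
    if html5:
        return f'<meta charset="{encoding}">'
    return (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}">'
    )


if __name__ == "__main__":
    print(charset_meta("utf-8", html5=True))
    print(charset_meta("utf-8", html5=False))
```

When compatibility with both kinds of consumers matters, a caller could emit whichever form matches the doctype it is about to write.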
@geoffmcl When I said "BOM is hardly used nowadays" I meant presently available document authoring tools which usually default to 'no BOM'. When I add
What should be the default behaviour of
Just FYI: Opera displays a document 'without BOM and without
Isn't detecting the encoding from the content a versatile option? We are all using
Ah, you've stumbled into the can of worms! This is why we would need a carefully written specification (or to write one ourselves): expectations have to be formalized, and done so in a way that applies to Tidy in general (not just UTF8). We have all of the possible sources of data:
The "simplest" thing to do would be to trust the user and simply ensure that a meta charset or content-type is present, and is appropriate to the
Here's a beginning draft specification:
For versions of Tidy greater than 5.x.y, Tidy will enforce the use of charsets in its output.
This new feature is enabled with the When
Certainly I’m missing something above, but it’s a start. Specifically it makes no mention of "detecting" UTF8. Absent a BOM, any bytes outside of the ASCII character range are just bytes; they can be anything. The
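To illustrate why bytes outside the ASCII range are ambiguous without a declaration, here is a small sketch. The sample word and the choice of Latin-1 as the "wrong" reading are assumptions made for the example; any single-byte encoding would make the same point.

```python
# "Kannada" written in Kannada script, spelled with escapes so the
# source file itself stays pure ASCII.
text = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

raw = text.encode("utf-8")          # the bytes as stored on disk

as_utf8 = raw.decode("utf-8")       # the reading the author intended
as_latin1 = raw.decode("latin-1")   # an equally "valid" reading of the same bytes

# Both decodes succeed; nothing in the bytes says which one is right.
print(len(raw), len(as_utf8), len(as_latin1))
```

Latin-1 maps every byte to a character, so the second decode never fails; it just yields one mojibake character per byte.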
@balthisar The issue that has been raised, as formalized in the 7 points above, is a good starting point: relying on the user to select the proper input encoding format and to specify a proper output encoding format.
@sria91 thanks for testing that the addition of a BOM affects the browser, as it does the W3C validator... this is good to know... Some points about the BOM. Presently tidy only detects 3 BOMs. The
And if it does issue such a warning, it will keep that info in
Now, sites like Wikipedia list 11 BOMs... So the question is, should tidy be extended? We also need to explore what happens if one of the other BOMs is present, which tidy will presently put back into the stream (unget)! What mess will happen in later parsing? And to answer my own question, I would say
@balthisar thanks for starting a specification. But concerning 1 & 2, there is some wording I do not understand. Sort of "found BOM ... a BOM not expected (error)". What spec are you using to decide a BOM is not expected? While I can read here and there that a BOM is not needed, I cannot see this as an
As stated above, on finding a BOM tidy already warns if it does not match the given input encoding. So I do not quite understand the idea in 1 and 2. But I absolutely agree with - For versions of Tidy greater than 5.x.y, Tidy will enforce the use of charsets in its output.
So I see it as split in 3 cases - BOM or NOT or EITHER
Question: If there is an appropriate meta for the doctype, and it matches the BOM, should both be kept? Suggest
And yes, agree we could add an option like
It seems clear to me the W3C feels there MUST be either a BOM or an appropriate charset meta on all html documents, so I would also be ok with only the new behaviour. But as yet I have not found a W3C spec which states exactly this. Of course there are many character encodings that do not have a BOM, so for those it can only be a doctype-appropriate meta charset...
The important issue is that a tidied document should have the best chance possible to pass W3C validation. And hopefully thus be correctly displayed by browsers... This fixes the original issue raised. Or to put it another way - This is
Hmmmm, seems I added more questions than answers ;=))
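For readers following the BOM discussion, a minimal detection sketch is given here. It is not Tidy's implementation. It assumes the three BOMs in question are UTF-8, UTF-16 BE and UTF-16 LE (the thread's own list did not survive), and `detect_bom` and `bom_warning` are hypothetical names.

```python
import codecs

# Assumption: these are the three BOMs being discussed.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads the data."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def bom_warning(data: bytes, input_encoding: str):
    """Return a warning string if a BOM contradicts the configured input encoding."""
    found, _ = detect_bom(data)
    if found is not None and found != input_encoding.lower():
        return f"BOM indicates {found}, but input encoding is {input_encoding}"
    return None
```

One consequence of checking only these three: a UTF-32 LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM, so a check like this would report it as UTF-16 LE.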
Nope, you covered my meaning of the first bullet point with the clarification that Tidy already does BOM detection. The second bullet point has similar meaning. If a BOM is detected but there's a meta charset indicating a non-UTF document, then that's an issue. It's also a valid possibility if someone opens, say, a legacy iso2022-jp file in their text editor, and "converts" it to UTF but neglects to remove the now-incorrect meta element.
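The conflict described here, a UTF BOM alongside a meta element that names a non-UTF charset, could be checked roughly as below. This is a sketch under simplifying assumptions: it only looks for a UTF-8 BOM, it scans the first 1024 bytes with a regular expression instead of parsing, and `declared_charset` and `bom_meta_conflict` are hypothetical names.

```python
import re

# Matches both <meta charset="x"> and the content="text/html; charset=x" form.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def declared_charset(head: bytes):
    """Return the lowercased charset named in a meta element, or None."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


def bom_meta_conflict(data: bytes) -> bool:
    """True if the data starts with a UTF-8 BOM but a meta names another charset."""
    has_utf8_bom = data.startswith(b"\xef\xbb\xbf")
    charset = declared_charset(data[:1024])
    return has_utf8_bom and charset is not None and charset not in ("utf-8", "utf8")
```

For the iso2022-jp scenario above, a file re-saved as UTF-8 with a BOM but still declaring `charset=iso-2022-jp` would make `bom_meta_conflict` return `True`.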
@balthisar yes, as indicated tidy presently decodes 3 BOMs, and I am not suggesting it should do more... at present these seem to be the common possibilities encountered on the www... the middle
And what the user added in config, and the results of that test of the first up to 3 bytes of the file, are used in
This I would say I only fully understand parts of them... having never had say an
And ok I now read your 2 as the same as my 1.ii... warn if that BOM does not match any meta
As you point out the user has maybe converted the file to UTF, but forgot to correct the metadata. I am sure the W3C validator would have strong words to say about such confusion ;=)) need to test that more...
Now
My motivation is that if the document looks like html5, and does not have a meta charset, the W3C validator will raise an error! In simple terms that means what tidy presently outputs will not pass the validator! That is sad!
Lots of ideas and specification have been given above, but we could start simple, If the default config of
Is there anyone interested in working at least on this beginning? I would help where I can... Thanks...
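As a starting point for the "start simple" idea, the following sketch adds an HTML5-style meta charset to a document that has none. It is an illustration only, not Tidy code: it works on an already-decoded string, uses regular expressions instead of a parser, ignores the doctype, and `ensure_meta_charset` is a hypothetical name.

```python
import re

HAS_META_CHARSET = re.compile(r"<meta[^>]*charset\s*=", re.IGNORECASE)
# (\s[^>]*)? keeps the pattern from matching <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def ensure_meta_charset(html: str, encoding: str = "utf-8") -> str:
    """Insert <meta charset=...> right after <head> unless one is already declared."""
    if HAS_META_CHARSET.search(html):
        return html  # leave an existing declaration alone
    meta = f'<meta charset="{encoding}">'
    # If there is no <head> at all, the document is returned unchanged.
    return HEAD_OPEN.sub(lambda m: m.group(0) + "\n" + meta, html, count=1)
```

Running it twice gives the same result as running it once, which matters for a tool whose output is often fed back in as input.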
The following command produces output which is rendered incorrectly by current browsers.
Current
output.html
Required
output.html