Detect content-charset and ensure the presence of correct metadata #361

Closed
sria91 opened this issue Jan 31, 2016 · 14 comments

@sria91
Contributor

sria91 commented Jan 31, 2016

The following command produces output which is rendered incorrectly by current browsers.

$ echo '<!doctype html><html><head><title>ಕನ್ನಡ</title></head><body>ಸವಿಗನ್ನಡ</body></html>' | tidy.exe -i -utf8 -o output.html

Current output.html

<!DOCTYPE html>
<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for HTML5 for Windows version 5.1.34">
  <title>ಕನ್ನಡ</title>
</head>
<body>
  ಸವಿಗನ್ನಡ
</body>
</html>

Required output.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="generator" content=
  "HTML Tidy for HTML5 for Windows version 5.1.x">
  <title>ಕನ್ನಡ</title>
</head>
<body>
  ಸವಿಗನ್ನಡ
</body>
</html>
@balthisar
Member

According to The W3C:

If an HTML document does not start with a BOM, and its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding, and the encoding must be specified using a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state.

So first I don't see that this is a bug, but a feature request, because Tidy can't determine whether or not the page is delivered without the proper content type.

This leaves two points for discussion:

  • Do we check for the existence of a UTF8 BOM before adding this metadata? BOMs are actually not recommended for UTF8, but if the BOM is present then the user agent should know that the file is UTF8, thus no metadata is required. What should Tidy do? (And what do we do if the options specifically state that a file is a different character encoding?)
  • Next, do we use meta charset as requested, or use the also-allowed <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>?

That would address just this request, but opens up a larger can of worms that we would have to address (I'm not saying "no", just pointing out that it's not as simple as limiting ourselves to the above). Right now Tidy doesn't try to interpret http-equiv or meta charset; instead we count on the -encoding options, both of which default to UTF8. This brings us the questions:

  • What do we do if we detect UTF8 via the BOM or an attribute, but the input-encoding and/or output-encoding contradict this?

In general we defer to the user's options; but

  • What happens if we can't output Kannada or another script in the user's options? Should we force everything to UTF8? (YES!, but my opinion doesn't make for a friendly program).
  • Should Tidy generate an attribute for every conceivable output-encoding?

I think the specific request is fairly simple, but if we're going to start getting involved with character sets, it shouldn't be piecemeal; we really need to have a very good specification for desired behavior given multiple conditions, including how we deal with meta charset vs http-equiv in the input stream, how to handle conflicts, etc.

@sria91
Contributor Author

sria91 commented Jan 31, 2016

@balthisar This affects not just Kannada but any input content encoded using 'utf-8' (in this case). Please note the -utf8 switch in command line.

  • BOM is hardly used nowadays. In the case of old HTML documents, when the UA sees the BOM:
    • If the metadata and -encoding also support the claim, the UA has to properly decode the content. Tidy need not touch the encoding.
    • If there exists a conflict between metadata, -encoding and BOM, Tidy then has 4 options:
      1. Strip the BOM and add metadata (which is recommended by the W3C, but risky)
      2. Retain the BOM and remove metadata (which is not recommended by W3C)
      3. Retain the BOM and add metadata (most compatible)
      4. It could raise an error/warning and quit gracefully (safe).
  • When there are no conflicts whatsoever and the input stream is missing the metadata tidy will add the metadata.

Regarding meta charset vs. http-equiv content, W3Schools says:

Using http-equiv is no longer the only way to specify the character set of an HTML document:

HTML4.01: <meta http-equiv="content-type" content="text/html; charset=UTF-8">
HTML5: <meta charset="UTF-8">

When compatibility with old browsers is desired we could use both meta charset and http-equiv, content. When backwards compatibility is not an issue it is good to use meta charset alone.
One thing that must be noted is that

The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document. - W3C
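
The 1024-byte rule quoted above can be checked mechanically. The following is only a rough sketch: the helper name `declaration_within_limit` and the regular expression are assumptions made for illustration, not anything Tidy provides, and a regex is no substitute for a real parser (it would, for instance, also match a declaration inside a comment).

```python
import re

# Hypothetical helper, not part of Tidy: a rough, regex-based check.
# Matches both <meta charset="..."> and the http-equiv/content form.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)[^>]*>",
    re.IGNORECASE,
)


def declaration_within_limit(html_bytes: bytes, limit: int = 1024):
    """Return the declared charset if the first matching <meta> element
    ends within the first `limit` bytes of the document, else None."""
    match = META_CHARSET_RE.search(html_bytes)
    if match is None or match.end() > limit:
        return None
    return match.group(1).decode("ascii").lower()
```

A declaration that is pushed past the limit, for example by a long leading comment, is reported as missing, which is roughly how a user agent that only prescans the start of the document would see it.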

@balthisar
Member

Oh, I get that. If our output-encoding is UTF8, we could simply add a meta charset, or ensure there's one already there, just like we ensure that <head> is always there, regardless of whether we detect any characters that require UTF8.

But thinking of the bigger picture, Tidy is still HTML<5 compatible, and it supports a large number of character sets. I personally believe in an all-UTF world, but I can't force that on all users of Tidy, and I'm not certain if we want Tidy to treat UTF8 as a special, privileged child. Or maybe we do.

Mostly my point is that this certainly merits a larger discussion that goes beyond my own personal whims (which would be to support nothing but UTF8 output and force the world to modernize -- see why my own personal whims aren't good enough?).

To provoke more discussion (don't worry, others will be by), I might suggest we pair this behavior with a new option: add-charset with an enum, "no, http-equiv, charset", with a default dependent on the doctype, i.e., either http-equiv or charset. I don't like these multiple defaults, but we have them with, e.g., indent-with-tabs. This would provide new behavior while preserving old behavior for people that have the use case (begging the question, does anyone have a use case for the old behavior?).

Still, though, do we do this for UTF8 only, or try to adopt something that will work for all supported character encodings?

@sria91 sria91 changed the title Add missing <meta charset="utf-8"> when html contains unicode characters Detect content-charset and ensure the presence of metadata Jan 31, 2016
@sria91
Contributor Author

sria91 commented Jan 31, 2016

I've suitably renamed the issue.

@geoffmcl
Contributor

@sria91 thanks for bringing this up. For a long time now I have thought tidy should add the appropriate charset meta to a tidied document, like it adds a <title> if missing... My reason was solely that the W3C validator blabs an error if there is none, and no BOM!

Error: The character encoding was not declared. Proceeding using windows-1252.
or
Error: The character encoding was not declared. Proceeding using utf-8.

Now, as @balthisar points out, this is not simple to decide, what and when, like your list of possibilities @sria91, and given that tidy already has an option --output-bom yes/no/auto(def) and an --output-encoding raw/ascii/latin0/latin1/utf8(def)/iso2022/mac/win1252/ibm858/utf16le/utf16be/utf16/big5/shiftjis! And I am not sure I agree with "BOM is hardly used nowadays". The validator, tidy, and many editors recognize it!

But I do not think this should be confused with the input --char-encoding similar list option, but it does involve the equivalent short forms like -utf8 which sets both input and output at the same time...

There are actually three options internally, and maybe the comments help -

  TidyCharEncoding,    /**< In/out character encoding */
  TidyInCharEncoding,  /**< Input character encoding (if different) */
  TidyOutCharEncoding, /**< Output character encoding (if different) */

And be aware, the html5 simple charset meta is rejected by the W3C validator if the document has a legacy doctype. There it expects the html4 meta content form. Unfortunately, tidy does not presently make this distinction, but maybe it should.
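
The distinction described here, the short charset form for an html5 doctype and the older content form for a legacy doctype, can be sketched as below. Both function names are hypothetical, and the doctype test is a deliberate simplification (it only treats the bare `<!DOCTYPE html>` as html5); this is not how Tidy currently decides anything.

```python
def is_html5_doctype(doctype: str) -> bool:
    """Simplified assumption: only the bare <!DOCTYPE html> counts as html5.
    Legacy doctypes carry a PUBLIC identifier and so fail this test."""
    return doctype.strip().lower() == "<!doctype html>"


def charset_meta(encoding: str, doctype: str) -> str:
    """Return the encoding declaration appropriate to the doctype."""
    if is_html5_doctype(doctype):
        return f'<meta charset="{encoding}">'
    return (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}">'
    )
```

For example, `charset_meta("utf-8", "<!DOCTYPE html>")` yields the short form, while an HTML 4.01 doctype yields the http-equiv form.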

This may include new options like add-charset, but I would probably see that as a Bool - yes (def) - add meta for current doctype if none, or no - do nothing, like the present current/old behaviour... I too do not like multiple options which force the user to know more than they want... although this happens already...

So, yes, there is a lot to discuss and decide here...

And @sria91 in your given sample what happens if you add --output-bom yes... does the browser then display it correctly?

@sria91
Contributor Author

sria91 commented Feb 1, 2016

@geoffmcl When I said "BOM is hardly used nowadays" I meant presently available document authoring tools which usually default to 'no BOM'.

When I add --output-bom yes to the commandline the browser displays correctly. Thanks 😄.

What should be the default behaviour of tidy so that a readily usable document is generated as output (in cases like this)? i.e. when there is no conflict and meta charset is missing.
Should it add the meta charset? Yes.

Just FYI: Opera displays a document 'without BOM and without meta charset' correctly. What could it be doing?
Could it be detecting the encoding from the content? Yes.
Or could it be defaulting to use utf-8? No.

Isn't detecting the encoding from the content a versatile option? We are all using tidy to tidy up the document. Shouldn't it report when there are conflicts in encoding used and declared?
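
One plausible form of "detecting the encoding from the content" is simply to test whether the bytes are well-formed UTF-8. The sketch below is a heuristic under that assumption, not a description of what Opera or Tidy actually does; the function name is made up for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if `data` decodes as strict UTF-8.

    Pure ASCII also passes, and a True result does not prove the author
    intended UTF-8; it only shows the bytes are consistent with it.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Non-ASCII text saved in a legacy single-byte encoding usually fails this test, which is why the heuristic works reasonably well in practice even though it is not a guarantee.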

@balthisar
Member

@sria91,

Isn't detecting the encoding from the content a versatile option? We are all using tidy to tidy up the document. Shouldn't it report when there are conflicts in encoding used and declared?

Ah, you've stumbled into the can of worms! This is why we would require a carefully written specification (or have to write one ourselves): expectations have to be formalized, and done so in a way that applies to Tidy in general (not just UTF8).

We have all of the possible sources of data:

  • Tidy's input-encoding
  • Tidy's output-encoding
  • Bytes used in the actual file, which may or may not be UTF8.
  • The presence or non-presence of a UTF8 or UTF16 BOM.
  • The presence or non-presence of meta charset.
  • The presence or non-presence of content-type.

The "simplest" thing to do would be to trust the user and simply ensure that a meta charset or content-type is present, and is appropriate to the output-encoding. But even this isn't so simple. What if the file already contains a conflicting meta charset?

Here's a beginning draft specification:

For versions of Tidy greater than 5.x.y, Tidy will enforce the use of charsets in its output.

  • HTML<5 will use the meta content-type attribute.
  • HTML5 and newer will use the meta charset attribute.

This new feature is enabled with the enforce-charset-attribute configuration option, which defaults to YES. Setting this option to NO shall restore previous behavior.

When enforce-charset-attribute is YES, then Tidy shall report warnings/errors in the following circumstances:

  • A UTF BOM is detected when input-encoding does not specify a character encoding for which a BOM is expected (error).
  • A UTF BOM is detected when a charset attribute is detected that does not specify that a BOM is expected (error).
  • A charset attribute is missing (warning, will be added)
  • meta charset is used in HTML<5 (warning, will be changed to content-type)
  • Attribute does not match the declared output-encoding (warning, will be changed to indicate output-encoding)
  • Both meta charset and content-type are found (warning, will drop one depending on HTML level).
  • Edit: content-type is found on HTML5 (info, recommend changing to meta charset).
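
As a rough illustration of the metadata-related items in the list above (the two BOM items are left out), the checks could be expressed as follows. Everything here is hypothetical: `MetaState`, `charset_diagnostics`, and the short message codes are invented for the sketch and do not correspond to existing Tidy options or report codes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MetaState:
    """What was found in <head>; all names here are hypothetical."""
    has_meta_charset: bool = False
    has_content_type: bool = False
    declared: Optional[str] = None  # encoding named by the declaration


def charset_diagnostics(
    is_html5: bool, meta: MetaState, output_encoding: str
) -> List[Tuple[str, str]]:
    """Return (severity, code) pairs for the metadata checks in the draft."""
    if not (meta.has_meta_charset or meta.has_content_type):
        return [("warning", "MISSING")]  # will be added
    found: List[Tuple[str, str]] = []
    if meta.has_meta_charset and meta.has_content_type:
        found.append(("warning", "BOTH_FORMS"))  # drop one per HTML level
    elif meta.has_meta_charset and not is_html5:
        found.append(("warning", "CHARSET_ON_LEGACY"))
    elif meta.has_content_type and is_html5:
        found.append(("info", "CONTENT_TYPE_ON_HTML5"))
    if meta.declared and meta.declared.lower() != output_encoding.lower():
        found.append(("warning", "MISMATCH"))  # rewrite to output-encoding
    return found
```

An empty result means the existing declaration already matches both the HTML level and the output encoding, so nothing needs to be reported or changed.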

Certainly I’m missing something above, but it’s a start. Specifically it makes no mention of "detecting" UTF8. Absent a BOM, any bytes outside of the ASCII character range are just bytes; they can be anything. The input-encoding gives good hints, though, and so I suspect that we'll have to trust the user on that one. Absent a BOM or this guidance, there's no way to know that a particular byte stream represents UTF, iso2022, MacRoman, whatever.
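
The point that non-ASCII bytes "can be anything" is easy to demonstrate: the three bytes that encode one Kannada letter in UTF-8 are also perfectly valid Latin-1 and MacRoman, just with a different meaning. This small example uses only Python's standard codecs.

```python
# The same three bytes, read under three different encodings.
raw = "ಕ".encode("utf-8")            # one Kannada letter, three bytes

as_utf8 = raw.decode("utf-8")        # the intended single character
as_latin1 = raw.decode("latin-1")    # three unrelated characters
as_mac = raw.decode("mac_roman")     # three other unrelated characters

# None of these decodes raises an error, so the bytes alone cannot say
# which reading the author intended.
```

Only outside information, such as a BOM, a declaration, or the user's input-encoding option, can settle which interpretation is the right one.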

@sria91
Contributor Author

sria91 commented Feb 1, 2016

@balthisar The issue that has been raised, as formalized in the above 7 points, is a good starting point, relying on the user to select the proper input encoding format and specify a proper output encoding format.

@sria91 sria91 changed the title Detect content-charset and ensure the presence of metadata Detect content-charset and ensure the presence of correct metadata Feb 1, 2016
@geoffmcl
Contributor

geoffmcl commented Feb 1, 2016

@sria91 thanks for testing how the addition of a BOM affects the browser, as it does for the W3C validator... this is good to know...

Some points about the BOM. Presently tidy only detects 3 BOMs. The in->encoding is set from cfg TidyInCharEncoding, def=UTF8 -

  • UNICODE_BOM_BE=0xFEFF - UTF16BE - warn ENCODING_MISMATCH if in->encoding not UTF16 or UTF16BE
  • UNICODE_BOM_LE=0xFFFE - UTF16LE - warn ENCODING_MISMATCH if in->encoding not UTF16 or UTF16LE
  • UNICODE_BOM_UTF8=0xEFBBBF - UTF8 - warn ENCODING_MISMATCH if in->encoding not UTF8

And if it does issue such a warning will keep that info in doc->badChars |= BC_ENCODING_MISMATCH; to use later in ErrorSummary, but strangely does not output anything more for this BC_ENCODING_MISMATCH flag, despite the comment #define BC_ENCODING_MISMATCH 16 /* fatal error */! Some mysteries here? Should it be changed to an error? Should more summary info be given in this case?
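
The three-BOM check and the mismatch warning described above can be sketched as follows. This is an illustration only, not Tidy's C code: the function names are invented, and the encoding names ("utf8", "utf16", "utf16be", "utf16le") simply follow the option values listed earlier in this thread.

```python
import codecs
from typing import Optional

# The three BOMs the comment says tidy detects, with illustrative names.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf16le"),  # FF FE
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    return None


def bom_mismatch(data: bytes, input_encoding: str) -> bool:
    """True when a BOM is present and contradicts the input encoding."""
    found = detect_bom(data)
    if found is None:
        return False
    if found == "utf8":
        return input_encoding != "utf8"
    # A UTF-16 BOM is compatible with "utf16" or the matching byte order.
    return input_encoding not in ("utf16", found)
```

Any other BOM falls through as "no BOM", which mirrors the behaviour described here of leaving unrecognised leading bytes in the stream.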

Now, sites like Wikipedia list 11 BOMs... So the question is, should tidy be extended? And we need to explore what happens if there exists one of the other BOMs, which tidy will presently put back into the stream (unget)! What mess will happen in later parsing?

And to answer my own question, I would say no at this time... let that come up as a new issue if a use case is ever found.

@balthisar thanks for starting a specification. But concerning 1 & 2, there is some wording I do not understand. Sort of "found BOM ... a BOM not expected (error)". What spec are you using to decide a BOM is not expected? While I can read here and there that a BOM is not needed, I cannot see this as an error if added! Maybe we could warn with, say, "BOM found, but it is not needed", but why? I read a BOM may be added for UTF8... Am I reading something wrong here?

As stated above on finding a BOM tidy already warns if it does not match the given input encoding. So I do not quite understand the idea in 1 and 2?

But absolutely agree with - For versions of Tidy greater than 5.x.y, Tidy will enforce the use of charsets in its output.

  • HTML4-- will use the meta content-type attribute.
  • HTML5++ will use the meta charset attribute.

So I see it as split in 3 cases - BOM or NOT or EITHER

  1. Known BOM found
    1. Warn (as now) or Error? if not matching input encoding.
    2. Warn (or Error?) if any non-matching metadata is found, and none added. New!
    3. Write BOM to output, as now, matching output encoding.
  2. No BOM found
    1. Add BOM only if requested, matching output encoding.
    2. If no BOM requested, Warn and add appropriate doctype metadata, if none found. New!
  3. Either
    1. Warn if current metadata does not match doctype, and fix. New!
    2. If both meta charset and content-type are found. Warn, and drop one depending on doctype. New!
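
Cases 1 and 2 of this outline can be written down as a small decision function. It is only a sketch of the proposal as worded here: the function name, the parameters, and the action strings are all hypothetical, encodings are assumed to be already normalised to comparable names, and case 3 (doctype-related fixes) is omitted for brevity.

```python
from typing import List, Optional


def plan_output(
    bom: Optional[str],            # encoding implied by a known BOM, if any
    input_encoding: str,           # what the user configured
    meta_encoding: Optional[str],  # encoding named in existing metadata
    bom_requested: bool,           # user asked for a BOM on output
) -> List[str]:
    """Return the warnings and actions suggested by cases 1 and 2."""
    actions: List[str] = []
    if bom is not None:                                   # case 1
        if bom != input_encoding:
            actions.append("warn: BOM does not match input encoding")
        if meta_encoding is not None and meta_encoding != bom:
            actions.append("warn: BOM does not match metadata")
        actions.append("write BOM")
    elif bom_requested:                                   # case 2.1
        actions.append("write BOM")
    elif meta_encoding is None:                           # case 2.2
        actions.append("warn: no encoding declared")
        actions.append("add metadata")
    return actions
```

For the document in the original report (no BOM, no metadata, no BOM requested) this returns the warning followed by "add metadata", which is the new behaviour being asked for.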

Question: If an appropriate doctype metadata is found, and it matches the BOM, should both be kept? Suggest yes.

And yes, agree we could add an option like enforce-charset-attribute, defaulting to yes, but it seems that would only be required if we really feel we need to preserve the old behaviour with a no!

It seems clear to me the W3C feels there MUST be either a BOM or an appropriate charset meta on all html documents, so would also be ok with only the new behaviour. But as yet have not found a W3C spec which exactly states this. Of course there are many character encodings that do not have a BOM, so that can only be a doctype appropriate meta charset for those...

The important issue is that a tidied document should have the best chance possible to pass W3C validation. And hopefully thus be correctly displayed by browsers... This fixes the original issue raised.

Or to put it another way - This is granddaddy Tidy! Should Tidy have a no option which renders an output invalid? Allow users to over-ride the good intentions of granddad? Suggest no!

Hmmmm, seems I added more questions than answers ;=))

@balthisar
Member

@geoff,

But concerning 1 & 2, there is some wording I do not understand. Sort of "found BOM ... a BOM not expected (error)". What spec are you using to decide a BOM is not expected? While I can read here and there that a BOM is not needed, I cannot see this as an error if added! Maybe we could warn with, say, "BOM found, but it is not needed", but why? I read a BOM may be added for UTF8... Am I reading something wrong here?

Nope, you covered my meaning of the first bullet point with the clarification that Tidy already does BOM detection.

The second bullet point has a similar meaning. If a BOM is detected but there's a meta charset indicating a non-UTF document, then that's an issue. It's also a valid possibility if someone opens, say, a legacy iso2022-jp file in their text editor, and "converts" it to UTF but neglects to remove the now-incorrect meta element.

@geoffmcl
Contributor

geoffmcl commented Feb 2, 2016

@balthisar yes, as indicated tidy presently decodes 3 BOMs, and not suggesting it should do more... at present these seem to be the common possibilities encountered on the www... the middle w standing for wild in this case ;=))

And what the user added in config, and the results of that test of the first up to 3 bytes of the file, are used in uint TY_(ReadChar)( StreamIn *in ) for every character read from the stream thereafter...

This ReadChar, and its sister ReadCharFromStream, are quite complex functions returning a uint, so they handle multibyte characters of various input encodings... it is a forever loop, for (;;), reading the stream, until EOF, to get the next character, whatever it is... the lexer, the usual character string holder, is thus an array of uints...
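
To make the idea of "one uint per character, however many bytes it took" concrete, here is a minimal UTF-8-only reader. It is not a port of Tidy's ReadChar or ReadCharFromStream, which also cover many other encodings; the function name is hypothetical, and the sketch does not reject overlong forms or surrogates as a production decoder would.

```python
from typing import Iterator


def read_chars(data: bytes) -> Iterator[int]:
    """Yield one code point (an int) per character from UTF-8 bytes."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                 # single-byte (ASCII)
            extra, cp = 0, first
        elif first >> 5 == 0b110:        # two-byte sequence
            extra, cp = 1, first & 0x1F
        elif first >> 4 == 0b1110:       # three-byte sequence
            extra, cp = 2, first & 0x0F
        elif first >> 3 == 0b11110:      # four-byte sequence
            extra, cp = 3, first & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) < extra:
            raise ValueError("truncated sequence at end of input")
        for byte in tail:
            if byte >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte near offset {i}")
            cp = (cp << 6) | (byte & 0x3F)
        i += 1 + extra
        yield cp
```

Collecting the results with `list(read_chars(...))` gives the kind of integer array the comment describes the lexer as holding.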

I would say I only fully understand parts of them... having never had say an iso2022, big5, shiftjis, etc... file to play with, and explore... would welcome those for testing...

And ok I now read your 2 as the same as my 1.ii... warn if that BOM does not match any meta charset found. And I lean towards this being an error, not a warning! The BOM tells tidy one thing, and the metadata contradicts this!!!

As you point out the user has maybe converted the file to UTF, but forgot to correct the metadata. I am sure the W3C validator would have strong words to say about such confusion ;=)) need to test that more...

@geoffmcl
Contributor

geoffmcl commented Mar 1, 2017

Now that release 5.4 is out the door, and we have moved on to developing the next 5.5, I think it is time to give this Feature Request a definitive milestone of 5.5!

My motivation is that if the document looks like html5, and does not have a meta charset, the W3C validator will raise an error! In simple terms that means what tidy presently outputs will not pass the validator! That is sad!

Lots of ideas and specifications have been given above, but we could start simple. If the default config of utf8 remains for both input and output, then tidy could add this missing <meta charset="utf-8">, and more documents would pass validation.

Is there anyone interested in working at least on this beginning? I would help where I can... Thanks...

@geoffmcl geoffmcl modified the milestones: 5.5, Indefinite future Mar 1, 2017
@geoffmcl
Contributor

geoffmcl commented Mar 1, 2017

Oops, but is this not covered by #456, and WIP PR #458?

Maybe this should be closed, and we just continue with the above...

@geoffmcl
Contributor

geoffmcl commented May 4, 2017

As suggested some time back, feel this is fully covered by #456, and WIP PR #458, so closing this...
