-
Notifications
You must be signed in to change notification settings - Fork 429
Feature request: add meta tag that matches actual output character encoding #456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@AthanasiusOfAlex thanks for the It seems a reasonable feature request, as a new option, to add this, styled depending on the document type detected, like Maybe the option could be say There is an OPTIONS.md which details the adding of an option... quite simple... Would certainly consider this addition, if you or others presented code... thanks... |
I'm currently learning c, so I wouldn't mind taking this on (as it seems like a simple enough feature request). @geoffmcl, could you mentor me on it? (i.e., give me some pointers as to where it might be good place to insert this in the code tree, etc.). |
@marcoscaceres, thanks for your offer to take this on, and for sure I will help where I can... As mentioned OPTIONS.md gives the 4 steps to do this... and we should discuss and decide here the values for 1, 2, and 3... my first suggestion is as follows, but am open to other ideas -
As stated, these are just quick, first thoughts, suggestions, and look forward to them being massaged... As to part 4., using it in the code, take a look at AddGenerator, in Note the different styles, based on whether this is html5, xhtml, or a legacy document... and like the Look forward to your code, either as a patch, or make a fork, add say a branch, and present it as a PR... thanks! |
@geoffmcl thanks for all the pointers, this is super helpful! Only additional requirement is to make sure that the meta charset appears in the first 512 bytes, as per HTML5. I'll try to send this in parts over the next week. |
Regarding
Let's see how I go with 4 now :) |
See WIP PR #458 |
Since PR #458 has already been move to milestone 5.5, doing likewise here... I hope you get the time to finish this great WIP soon... please ask if you need further help... thanks... |
Thanks @geoffmcl - doing my best to find time. If anyone wants to jump in and do a bit more hacking on it, I would certainly be supportive. It's nearly there! |
Just noted we have an older #361 which also has some good notes on this... the effort should be combined... I will look deeper into this in the coming days, since as stated, always prefer that tidy does not output invalid html... |
@AthanasiusOfAlex, @marcoscaceres, @balthisar as agreed, further discussion on this new option I have continued with some testing, using tidy-tests and find a few problems...
But these are not big problems, and easily addressed... But this regression testing also showed that I think, at least initially, this option should default to no. Otherwise, when this option is eventually merged a considerable number of test cases would need to be reset... There might be a day when we do default this to And @balthisar, yes I will get around to setting up a @marcoscaceres what is the purpose of the two new files, |
@geoffmcl, I think those are legacy files, given that he based off of |
This pulls the work done by @marcoscaceres WIP #458 into the issue-456 branch, to complete the new add-meta-charset option.
@balthisar yes, thanks for legacy info... have dealt with that and merged @marcoscaceres branch The open issues identified at this stage are the 2 mentioned above, the addition of the two messages, replacing the Also at present I think if it finds a meta charset it does not warn, or change the type, when it conflicts with the current output doctype... but is this wrong anyway... need to research that... So still some work to be done perhaps even before a PR is created... Any help appreciated... thanks... |
@geoffmcl, I'll be happy to look at it, too. I'm working through our outstanding issues by oldest first, so I think this one will come up in the queue soon. If you beat me to, then you're welcome to implement the fixes, otherwise I'm hoping I'll be allowed some time this weekend. |
@balthisar well I started to look at it and do some fixing, but then noted some more small problems... Although I had suggested using a TidyBuffer, it might not be appropriate since the string in the buf.bp is not guaranteed to be a zero terminated C string, so even code like It may be possible to ensure it is zero terminated with something like Anyway, ended up pushing nothing.... So if you get a chance to look at it, at least removing the And then there is the question if an encoding mis-match is detected, should tidy fix it... or just warn... I would like a fix I think... Quite an interesting service as you get into it, with many choices... ;=)) |
@balthisar just a quick ping to say I did find time to work on this, and think I have solved all the problems... and maybe we do not need any new messages... Have prepared some tests, and am preparing a report, with still some questions to decide, but hope to push my results later today... just a heads up in case you also started looking at it... |
@balthisar, @marcoscaceres first it seems a As I prepared some test files, the first thing noted that Tidy has been silently correcting the output encoding of the Of course we still have to correct the But if added here then should also be added if So now I am thinking this new service should always be called, to fix the Just 6 tests files added at this time, covering most situations...
Test 4 is the same as case 1117013, but now no crash... At this stage have coded for no warning on meta addition, if option on, nor on modification, as before, but warn on discards, and depreciated Now set to always calling this new service Also maybe this new service should be moved out of Hope others will get the chance to checkout this |
@geoffmcl, afraid I won't get to it for a few days, unfortunately, in all likelihood. Will try to sneak a peak... |
With latest commits, moved the new |
Arrgh! just got around to Immediate issue if the use of in-place Working on it... all look solvable... |
@balthisar thanks for the #554 reminder, and indeed one of the remaining issues is case related... but this is the case of the Have effected some code changes, but there remains three (3), outstanding
Case 1: 378b: This issue is about a utf-8, e acute, Ok, that change was very recent, #378, so maybe I need to rebase/merge How can I avoid this? What should I do? @balthisar I sometimes see you creating a new branch? It would be really great if someone could checkout Case 2: 586562: This Now this case appears generated by the old As stated, need to recheck what Would appreciate feedback on this... Case 3: 676205: What is the correct case for the meta http-equiv contents Looking at say a IANA REF most are in upper case, but some are mixed. Tidy's internal Testbase input is uppercase - This would not be trivial change in There used to be two The I tend towards using the users input, since it is not wrong, but on the other hand we could use tidy's lowercase table for general consistency... Again seek feedback on this... So far I have not pushed my fixes, in But would really like Case 1: 378b: to drop off my radar first... any help getting there with git magic help would be most appreciated... thanks... diff --git a/src/clean.c b/src/clean.c
index b4e9a38..026bfe4 100644
--- a/src/clean.c
+++ b/src/clean.c
@@ -2309,8 +2309,8 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
Node *prevNode;
TidyBuffer buf;
TidyBuffer charsetString;
- tmbstr httpEquivAttrValue;
- tmbstr lcontent;
+ /* tmbstr httpEquivAttrValue; */
+ /* tmbstr lcontent; */
tmbstr newValue;
/* We can't do anything we don't have a head or encoding is NULL */
if (!head || !enc || !TY_(tmbstrlen)(enc))
@@ -2330,7 +2330,7 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
tidyBufAppend(&charsetString, "charset=", 8);
tidyBufAppend(&charsetString, (char*)enc, TY_(tmbstrlen)(enc));
tidyBufAppend(&charsetString, "\0", 1); /* zero terminate the buffer */
- /* process the children of the head */
+ /* process the children of the head */
for (currentNode = head->content; currentNode; currentNode = currentNode->next)
{
if (!nodeIsMETA(currentNode))
@@ -2339,10 +2339,10 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
httpEquivAttr = attrGetHTTP_EQUIV(currentNode);
if (!charsetAttr && !httpEquivAttr)
continue; /* has no charset attribute */
- /*
- Meta charset comes in quite a few flavors:
- 1. <meta charset="value"> - expected for (X)HTML5.
- */
+ /*
+ Meta charset comes in quite a few flavors:
+ 1. <meta charset="value"> - expected for (X)HTML5.
+ */
if (charsetAttr && !httpEquivAttr)
{
/* we already found one, so remove the rest. */
@@ -2355,8 +2355,8 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
continue;
}
charsetFound = yes;
- /* Fix mismatched attribute value */
- if (TY_(tmbstrcmp)(TY_(tmbstrtolower)(charsetAttr->value), enc) != 0)
+ /* Fix mismatched attribute value - note case insensitive match */
+ if (TY_(tmbstrcasecmp)(charsetAttr->value, enc) != 0)
{
newValue = (tmbstr)TidyDocAlloc(doc, TY_(tmbstrlen)(enc) + 1); /* allocate + 1 for 0 */
TY_(tmbstrcpy)(newValue, enc);
@@ -2367,11 +2367,16 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
TidyDocFree(doc, charsetAttr->value); /* free current value */
charsetAttr->value = newValue;
}
+#if 0 /* 000000000000000000000000000000000 not sure about this
+ but have read certain documents where the charset
+ should be present in the first 1024 bytes of the
+ doc... and what about likewise for the html4 form? */
/* Make sure it's the first element. */
if (currentNode != head->content->next) {
TY_(RemoveNode)(currentNode);
TY_(InsertNodeAtStart)(head, currentNode);
}
+#endif /* 00000000000000000000000000000000 */
continue;
}
/*
@@ -2391,24 +2396,30 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
currentNode = prevNode;
continue;
}
- httpEquivAttrValue = TY_(tmbstrtolower)(httpEquivAttr->value);
- if (TY_(tmbstrcmp)(httpEquivAttr->value, (tmbstr) "content-type") != 0)
- continue; /* is not 'content-type' */
+ /* httpEquivAttrValue = TY_(tmbstrtolower)(httpEquivAttr->value); */
+ if (TY_(tmbstrcasecmp)(httpEquivAttr->value, (tmbstr) "Content-Type") != 0)
+ continue; /* is not 'Content-Type' */
if (!contentAttr->value)
{
+#if 0 /* 00000000000000000000000000000
+ case 1117013 (in_456-4) keeps this 'content=""'! But WHY? */
prevNode = currentNode->prev;
/* maybe need better message here */
TY_(ReportError)(doc, head, currentNode, DISCARDING_UNEXPECTED);
TY_(DiscardElement)(doc, currentNode);
currentNode = prevNode;
+#endif /* 0000000000000000000000000000 */
continue;
}
/* check encoding matches
If a miss-match found here, fix it. previous silently done
in void TY_(VerifyHTTPEquiv)(TidyDocImpl* doc, Node *head)
*/
- lcontent = TY_(tmbstrtolower)(contentAttr->value);
- if (TY_(tmbsubstr)(lcontent, charsetString.bp))
+ /* lcontent = TY_(tmbstrtolower)(contentAttr->value);
+ Note: 'tmbsubstr' uses 'tmbstrcasecmp`, so 'ISO-'
+ will match 'iso-'! Is this desired?
+ see cases 586562 and 676205 */
+ if (TY_(tmbsubstr)(contentAttr->value, charsetString.bp))
{
/* we already found one, so remove the rest. */
if (charsetFound)
@@ -2446,7 +2457,8 @@ Bool TY_(TidyMetaCharset)(TidyDocImpl* doc)
}
/*
3. <meta charset="utf-8" http-equiv="Content-Type" content="...">
- This is generally bad. Discard and warn.
+ This is generally bad. Discard and warn.
+ Not so sure about this, but seems a reasonable idea?
*/
if (httpEquivAttr && charsetAttr)
{ Any and all feedback very welcome... this seems a good option, but important to get it right... thanks |
Updated the But this is still a WIP! Somehow, I am getting stale on this... I just can not seem to get it right... With the last desperate commit, reusing the old service if the option is off, which it is by default, so I still use that service, but should not have to... But still about 3 regression tests fail in the Any help appreciated... maybe a new pair of eyes will quickly see what I am missing ... thanks... |
Updated the Will try to find the |
@geoffmcl, sorry, I've been incommunicado in the great forests of Canada. I'll try to have a look over the weekend. |
@balthisar no problem... hope it was a great adventure... It seems I was shooting myself in the foot with a stupid reversed logic error... BAH! Have fixed that, and now with the default Of course, with this option So this has taken a good step forward... have now 100% replaced the previous service, with the new single service, which can be discussed and refined... As suggested have put a number of comments in the code, and will try to find time to enumerate and discuss them here... trying to make the right choices... so when you get the chance... thanks... |
This issue is moving forward, but there are a numbers of QUESTIONS which I seek feedback on... There seems no problem when there is no existing meta charset in the document. Simply add one of the appropriate type, and encoding value, namely -
Oops, but that does raise a question. I now believe the others should be -
But that's an easy fix in the new single service... The questions pile up when there is an existing meta charset, of one type or the other, or indeed more than one... so -
While have never actually seen this in RL, it could happen, and Tidy needs to choose what to do about it. At present it just discards the 2nd or later meta... just keeping the first... But if we do choose to discard, and warn about the discard, a further improvement would be to discard those that do not match the doctype...
Obvious Tidy should fix this, but this could involve, lead to the next case...
At present nothing is done about this... But to be perfectly correct, Tidy should change the meta to a declaration that matches the doctype...
Presently tidy does what it does in other case - silently fixes problems - As has been addressed, discussed, in several other issues, maybe there should be at least a suppressible Then there is the strange test case 1117013.html which has what is really a mal-formed meta -
The original new service was coded to discard this, but have commented it out for now, due to conforming to the test Probably some of these issues could be decided re-reading the W3C specs, and maybe by presenting a sample to the W3C validator, and choose an action accordingly... As mentioned above I have already pushed 6 test cases - With this option set Maybe more tests need to be added to specifically address the above questions, again matching with W3C specs, and validated... again help and feedback appreciated... thanks... PS: One unrelated minor quibble. Believing that this meta should be declared as early as possible in the document, I now do not like that the |
It also fixes the addition of the constant 'http-equiv="Content-Type" attribute.
@balthisar the above push adds an It also fixes the addition of the Moving forward slowly, hopefully... but look forward to feedback... thanks... |
@balthisar the above push adds an As mentioned in #561, leveraged an existing And while it gives good info to the user, it does not say what it replaced it with... but is ok... I suppose... tests Testing and feedback welcome... thanks... |
To fully maintain backward compatibility, add new option In the past Tidy fixed the This option, default to With this default option This new option As stated, testing and feedback welcome on this |
It's working quite nicely for me. I pushed back changes to address some merge conflicts in light of other PR's that were merged. If no objections or you don't beat me to it, this is a nice fix. |
Tidy currently allows you to output a document in 14 character encodings. The problem is, unless you use ascii, the web browser generally gets confused by it. It would be nice if Tidy had the option of adding the "meta" tag that matches the encoding, e.g.,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
for UTF-8 (which is the default). For the moment, I have to add this line manually.
The text was updated successfully, but these errors were encountered: