Parser too greedy over <script> blocks #65
Comments
Sorry for the long delay in a response... |
Doesn't http://www.w3.org/TR/2014/REC-html5-20141028/scripting-1.html#restrictions-for-contents-of-script-elements say, that you must not use the string '<script…' inside the script-element? |
@HoffmannP, thanks for that link. Yes, it does clearly indicate that However, if that is the case then the W3C validator is also in error by not But then what about the role of tidy as a 'fixer'. My patch could see this It just seems to me the current MESS that tidy outputs in this case is
Or alternatively at least to flag it as an error, thus no output unless What do you think? Will also try to cross-post this to the lists to perhaps |
I absolutely agree, that what tidy currently outputs is unacceptable. Now you might want to simulate what the human eye recognizes at once:
is legit JS inside a script tag. But what happens if the JS is invalid (or it's not JS but whatever else, e.g. JS-fragment or gibber-script):
is that still as obvious? And even valid JS code can be a problem:
I really see no way to know what the user intends. The best™ behaviour would be to flag the content of a script tag if it does not match the ABNF. But how to know the correct intended script-tag content? The snake is biting itself in the tail… |
@HoffmannP, glad we agree the current output is unacceptable ;=)) But tidy has to have some minimal parser of the script tag contents, otherwise it can not report that it does not match the ABNF! It is agreed it could never be perfect, with so many script type options, and given that the script could even contain errors, but without any review of the content, tidy can only output the mess it does now. Glad you eventually worked out the strange markdown needed to include html snippets. Maybe like me you found the |
So I read the specification about five times and seem to have some kind of temporal thinking barrier. I pseudocoded the ABNF:
You are allowed to use as many times |
@HoffmannP, ok, I am now completely LOST! When someone clearly explains why @NoNoNo's original input is valid HTML5 maybe I can do something... I too have read the link more than 5 times, and obviously have more than a 'temporal' barrier ;=)) Even adding To try to briefly explain the tidy internal code process... maybe that will help... From within ParseBody(...), it finds
From within ParseScript(...), the contents are parsed by GetCDATA(...), which is also used for style tags (I think - not checked)... When it finds It notes what has been collected, 'script', matches the container, which is 'script', and bumps nested++;, and returns to CDATA_INTERMEDIATE state, continuing to accumulate until the next Finding the Since tidy is now still LOCKED in one level of 'script' it will consume the rest of the document, looking for the next Now at the EOF, not having found the end of the container, will report missing end tag, and return all that was collected as a text node, and return to ParseScript(...), exiting here again reporting missing Now the only way out I see is to NOT trip over the But without further understanding I will leave this issue for now. Try to catch my breath... |
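The nesting behaviour described above can be sketched in a few lines. This is a minimal model, not tidy's C code: the function name `end_of_script_nested` and the regular expression are hypothetical, and the model only counts literal `<script` and `</script` occurrences in the data that follows the opening tag.

```python
import re

# Hypothetical model of the container counting described above.
# It is NOT tidy's GetCDATA(); it only mimics the "nested" counter.
TAG = re.compile(r"<(/?)script\b", re.IGNORECASE)


def end_of_script_nested(data: str) -> int:
    """Return the index of the '</script' that ends the element.

    Every '<script' found inside the data bumps a nesting counter, and
    every '</script' either decrements it or, at level zero, ends the
    element. Returns -1 if the counter never gets back to zero, which
    corresponds to consuming the rest of the document.
    """
    nested = 0
    for match in TAG.finditer(data):
        if match.group(1):       # looks like a close tag
            if nested == 0:
                return match.start()
            nested -= 1
        else:                    # looks like an open tag inside the data
            nested += 1
    return -1


plain = "var a = 1;\n</script>\n<p>para</p>"
tricky = 'document.write("<script>");\n</script>\n<p>para</p>'
```

With `plain` the close tag is found where expected. With `tricky` the `<script` inside the string literal raises the counter, the real close tag only lowers it again, and the scan runs to the end of the input.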
@NoNoNo @HoffmannP maybe it is time to try to do something about this long outstanding problem in scripts... |
I'll note that this test case is not in the |
The test case in_1642186.html is on Sourceforge at http://sourceforge.net/p/tidy/patches/63/ with a proposed patch. Since 1642186 is not in our set of test cases yet, I'll submit a PR. |
@vielmetti thanks for the test case 1642186, now merged... I hope someone gets the chance to
This bug has been open too long... but personally, at the moment, can not get the enthusiasm to dig back into this... But will try to help if there are some questions... |
@vielmetti glad you are interested in fixing this... look forward to it ;=)) Can maybe help on understanding the lexer, if that is what you need...
Not sure why you think this? Tidy is in a very special parsing state, collecting script element text... Making a small change in this particular, special parser will not affect anything other than whether tidy trips over a And the bulk of the old patch is a single yes/no service, If you can not easily see where this service should be used, can maybe help since I am sure I can find the original patched code somewhere, I hope... But in any case it is where the script parser sees a In simple terms the patch was to not trip over what look like html elements if the text is in javascript comments, or is within a Let's get it done at last... But am really puzzled why this is a case for CI? #269 Once the coding is done, you now have test 1642186, and other samples above, to check your progress... Pass me a patch, and I will test it also... On CI for tidy, I am yet to hear an argument on how this will help tidy??? Simple, 1, 2, 3... |
I need to test every single test case on every single platform, because I'm making a change, and I don't know where I am going to have regressions. Ideally this universal test case coverage covers everything, even the ones I don't expect to fail, and ideally it runs silently and unobtrusively and in the background every time anyone makes a change, and it alerts right in the PR if something is amiss. And it doesn't cost anything except the continuous upkeep of test cases once it's set up. |
Here's the corresponding pull request within my fork (just these changes). https://github.com/vielmetti/tidy-html5/pull/8 I'll work on a proper pull request. |
@vielmetti, you added nothing new, except perhaps more FUD... this is getting like pulling teeth... Sorry, got a little impatient over this, it is my 2007 patch after all ;=)) @vielmetti thank you for your work on this... What you did showed me it was dead simple... especially reminding me where the test needed to be done... So, found my original 2007 code, did a cut and paste job, with some little tweaks on the way, and put the results in an To get the code:
Have already done the Again very sorry. I should have got in there and done this ages ago, but just could not get around to it... But as suggested, it was your efforts that got me motivated. Thanks. |
@NoNoNo after 3 years 25 days (give or take a little) this fix is in After some more testing, have merged the
Of course this new option a. In a javascript comment, either A massive speed increase could be had by keeping some state flags in the lexer. At present it restarts the check from the beginning of the stream stored in the lexer every time... This is no problem if the script is relatively short, but have seen some html with enormous scripts... But remember it is only keyed by finding a And of course it is never called unless the user specifically adds the option... Will close this old bug after some period of testing... |
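The state-flag idea mentioned above could look roughly like the following. It is only a sketch under stated assumptions: the class `ScriptState` and its methods are hypothetical names, it is not taken from the patch, and it ignores regular expression literals and template strings, so it is not a real javascript tokenizer. The point is that each call to `feed` only looks at the new characters instead of rescanning the whole stream.

```python
class ScriptState:
    """Remembers whether the scanner is inside a JS comment or string.

    Hypothetical illustration of keeping state flags between calls.
    Modes: "code", "line_comment", "block_comment", or the quote
    character that opened the current string.
    """

    def __init__(self) -> None:
        self.mode = "code"
        self.prev = ""        # previous character, for two-character tokens
        self.escaped = False  # previous string character was a backslash

    def feed(self, chunk: str) -> None:
        for ch in chunk:
            if self.mode == "code":
                if self.prev == "/" and ch == "/":
                    self.mode, self.prev = "line_comment", ""
                    continue
                if self.prev == "/" and ch == "*":
                    self.mode, self.prev = "block_comment", ""
                    continue
                if ch in "'\"":
                    self.mode = ch
            elif self.mode == "line_comment":
                if ch == "\n":
                    self.mode = "code"
            elif self.mode == "block_comment":
                if self.prev == "*" and ch == "/":
                    self.mode, self.prev = "code", ""
                    continue
            else:  # inside a quoted string
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == self.mode:
                    self.mode = "code"
            self.prev = ch

    def in_comment_or_string(self) -> bool:
        return self.mode != "code"


state = ScriptState()
state.feed("// don't trip here\n")
after_comment = state.in_comment_or_string()
state.feed('var a = "<script>')
inside_string = state.in_comment_or_string()
state.feed('";')
after_string = state.in_comment_or_string()
```

A lone apostrophe inside a comment does not open a string in this model, because comment detection runs before quote detection.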
Have added a test/t1.sh, and t1.bat, to the repo, to be able to quickly do a single test, in our so called 228 Thanks to @vielmetti's assistance we have a test available for this issue, 1642186-1... so can do |
Chis found a problem parsing...
Think I have found the error, and will push a fix shortly... |
Greetings, The parser has trouble with a solitary ' in a comment. |
@NoNoNo @HoffmannP @CMB @hoehrmann and all, the approach I took of trying to parse the script is running into big TROUBLE! Sorry... As someone predicted along the way, this trying to parse So I have switched back to the The simple idea now is to not increment a container count. This would apply only to It seems this would eliminate, or at least reduce the problem count ;=)) To test this version you need to -
Only this branch contains this change... it is experimental!!! Please find the time to check out this
History
This has got nothing to do with supporting HTML5... tidy had this problem before this HTACG fork of SF CVS tidy... As documented it seems this change occurred in a cvs change by @hoehrmann as sf cvs rev 1.85, Apr 9 2003, but there is no clear information that keeping track of the container openers, and expecting the same number of close tags is beneficial, or needed, at least for style and script tags.. That push did fix another bug 443678... there is a test case input/in_443678.html - https://sourceforge.net/p/tidy/bugs/98/ - all now closed... and this version 5.1.14.EXP1 continues to pass that test. Seek verification. But why this addition of keeping track of container openers? And then seeking an equal number of container closes before exit?? There seems to be no old test case or bug number for this??? This is clearly parsing the data in the GetCDATA() service as HTML, instead of just looking for the first exposed Current
|
This applies only if nonested is on: a <script> tag has then not incremented nested, so likewise there is no need to treat an escaped close tag <\/script> as an end tag that decrements nested.
Karl gave me a sample which still has a problem with version 5.1.14.EXP1 -
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>script bug in EXP1</title>
<script>
document.writeln("<script language=JavaScript> var foo = bar; <\/script>");
var a = "<\/script>";
</script>
</head>
<body>
<p>para 1</p>
</body>
</html>
This is because when the nonested option is on, then the Thus when escaped close tag text In fact since nested is already zero, tidy will treat this escaped tag as a real close tag, with catastrophic consequences... The above sample html passes the W3C validator, no problem... Thanks Karl for finding this... Have bumped to version This As stated, please help with that... |
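To make the failure on Karl's sample concrete, here is a small model of the two behaviours. It is an illustration only: `find_script_end` and its `escaped_is_close` flag are hypothetical names, not tidy options or functions, and the model just searches the script data for the first thing it considers a close tag.

```python
import re


def find_script_end(data: str, escaped_is_close: bool) -> int:
    """Return the index where the script data is considered to end.

    With escaped_is_close=True the escaped form '<\\/script' counts as a
    close tag, which mirrors the problem described above. With False,
    only a literal '</script' ends the data. Returns -1 if none found.
    """
    pattern = r"<\\?/script\b" if escaped_is_close else r"</script\b"
    match = re.search(pattern, data, re.IGNORECASE)
    return match.start() if match else -1


# Script data from the sample above, followed by the real close tag.
body = (
    r'document.writeln("<script language=JavaScript> var foo = bar; <\/script>");'
    "\n"
    r'var a = "<\/script>";'
    "\n"
)
data = body + "</script>"

wrong_end = find_script_end(data, escaped_is_close=True)
right_end = find_script_end(data, escaped_is_close=False)
```

Here `wrong_end` falls inside the `document.writeln(...)` string, so everything after it would be handled as HTML, while `right_end` is the position of the real `</script>` at the end of the data.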
I haven't seen any recurrences of this issue in any of its forms for |
@CMB thanks for testing and reporting... I too have continued testing, and this fix is starting to feel very solid ;=)) As @NoNoNo reported, my original patch was 5 plus years ago, and his original report now nearly 3 years ago, so this is not before time... There are some small related matters outstanding -
But will consider closing this shortly... |
Note the Still to address the 2 related matters, but this will now be in |
Added some more tests to 1642186-1, and adjusted Just the name of the option remaining... as previously mentioned, considering Thus changing the enumeration to -
And the XML help text to -
Will effect this change in a few days, unless a better idea is presented... A final consideration is whether to default this option to on. This allows tidy to deal with all types of script and style data format. Other than test 1642186-1, there is no other test of this script and style tags data format, so defaulting to on does not affect any of the other 227 or so tests. As also discussed above this stops tidy rendering a disaster in certain cases on files that usually pass the W3C validator. Why would anyone want such a wrecked document by default? And, in hindsight, why would tidy need to keep track of such nested html while parsing script or style data? Now this seems wrong in principle. There seem all |
I vote for on, and the name sounds better than skip-quotes. |
@NoNoNo @HoffmannP @vielmetti @CMB @balthisar - Effected name change, and default to on... Only what? 8 years in the making ;=)) I think this can be closed in a few days, if no direct feedback... |
Moving my comment to a new issue |
I ran into a known bug, open for 5 years:
http://sourceforge.net/tracker/?func=detail&atid=390963&aid=1642186&group_id=27659
Proposed patch:
http://sourceforge.net/tracker/?func=detail&atid=390965&aid=1644645&group_id=27659
gives me: