Skip to content

Parser too greedy over <script> blocks #65

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
NoNoNo opened this issue Aug 22, 2012 · 37 comments
Closed

Parser too greedy over <script> blocks #65

NoNoNo opened this issue Aug 22, 2012 · 37 comments
Labels
Milestone

Comments

@NoNoNo
Copy link

NoNoNo commented Aug 22, 2012

I ran into a known bug, open for 5 years:
http://sourceforge.net/tracker/?func=detail&atid=390963&aid=1642186&group_id=27659

Proposed patch:
http://sourceforge.net/tracker/?func=detail&atid=390965&aid=1644645&group_id=27659

<!DOCTYPE html>
<html>
<head><title></title>
<body>
<script>
var a = '<script';
</script>
</body>
</html>

gives me:

line 7 column 7 - Warning: '<' + '/' + letter not allowed here
line 8 column 5 - Warning: '<' + '/' + letter not allowed here
line 9 column 5 - Warning: '<' + '/' + letter not allowed here
line 5 column 1 - Warning: missing </script>
line 5 column 1 - Warning: missing </script>
@geoffmcl
Copy link
Contributor

Sorry for the long delay in a response...
Have moved the discussion of this 'feature' to -
https://lists.w3.org/Archives/Public/public-htacg-contrib/2015Jan/0003.html
Thank you for bringing this back to my attention... and hope we can perhaps close it here...

@balthisar balthisar added this to the Indefinite future milestone Jan 31, 2015
@balthisar balthisar added Bug and removed Pretty Print labels Feb 1, 2015
@balthisar balthisar modified the milestones: 5.0.0, Indefinite future Feb 1, 2015
@HoffmannP
Copy link

Doesn't http://www.w3.org/TR/2014/REC-html5-20141028/scripting-1.html#restrictions-for-contents-of-script-elements say, that you must not use the string '<script…' inside the script-element?

@geoffmcl
Copy link
Contributor

geoffmcl commented Mar 9, 2015

@HoffmannP, thanks for that link. Yes, it does clearly indicate that
<script should not be in a script element.

However, if that is the case then the W3C validator is also in error by not
flagging this as 'invalid'! Is this maybe a validator bug?

But then what about the role of tidy as a 'fixer'. My patch could see this
is in inverted commas, and could add the escaped <\script to successfully
fix the document, probably with a warning.

It just seems to me the current MESS that tidy outputs in this case is
quite unacceptable -

<body>
<script>
var a = '<script';
<\/script>
<\/body>
<\/html>
</script>
</body>

Or alternatively at least to flag it as an error, thus no output unless
forced. The idea is to not generate what would be seen as invalid
javascript! That is a seriously compromised document.

What do you think? Will also try to cross-post this to the lists to perhaps
have a wider audience.

@HoffmannP
Copy link

I absolutely agree, that what tidy currently outputs is unacceptable.
I also think, that tidy should not include an javascript-parser (which wouldn't even solve the problem as the script tag can theoretically contain anything else as well, e.g. JS-Fragments).

Now you might want to simulate what the human eye recognizes at once:

…
<script>
var a = '<script';
</script>
…

is legit JS inside a script tag. But what happens if the JS is invalid (or it's not JS but whatever else, e.g. JS-fragment or gibber-script):

…
<script>
var a = '
</script data=''</script>
…

is that still as obviouse? And even valid JS code can be a problem:

…
<script>
var a = '
</script data="</script>"';'</script>
…

I really see no way to know what the user intends. The best™ behaviour would be to flag the content of a script tag if it does not match the ABNF. But how to know the correct intended script-tag content? The snake is biting itself in the tail…

@geoffmcl
Copy link
Contributor

@HoffmannP, glad we agree the current output is unacceptable ;=))

But tidy has to have some minimal parser of the script tag contents, otherwise it can not report that it does not match the ABNF!

It is agreed it could never be perfect, with so many script type options, and given that the script could even contain errors, but without any review of the content, tidy can only output the mess it does now.

Glad you eventually worked out the strange markdown needed to include html snippets. Maybe like me you found the Markdown supported link helpful...

@HoffmannP
Copy link

So I read the specification about five times and seem to have some kind of temporaral thinking barrier. I pseudocoded the ABNF:

"not (<!--)"  *(  <!--          "not (-->|<script>)"  -->            "not (<!--)" )
outer         *(  comment-open  inner                 comment-close  outer        )

You are allowed to use as many times </script> inside the script-tab, just <script> is only allowed if enclosed in HTML-comments?!
So it seems that @NoNoNo 's example is perfectly valid HTML5

@geoffmcl
Copy link
Contributor

@HoffmannP, ok, I am now completely LOST! When someone clearly explains why @NoNoNo original input is valid HTML5 maybe I can do something... I too have read the link more than 5 times, and obviously have more than a 'temporal' barrier ;=))

Even adding var b = '<script>'; passes validation????

To try to briefly explain the tidy internal code process... maybe that will help...

From within ParseBody(...), is finds <script...>, so calls ParseScript(...). There is quite an explicit comment on this function -

  This isn't quite right for CDATA content as it recognises
  tags within the content and parses them accordingly.
  This will unfortunately screw up scripts which include
  < + letter,  < + !, < + ?  or  < + / + letter

From within ParseScript(...), the contents are parsed by GetCDATA(...), which is also used for style tags (I think - not checked)...

When it finds < + letter it switched out of the initial state CDATA_INTERMEDIATE into state CDATA_STARTTAG, and stays there collecting letters, and stops at the next non-letter char...

It notes what has been collected, 'script', matches the container, which is 'script', and bumps nested++;, and returns to CDATA_INTERMEDIATE state, continuing to accumulate until the next </script>.

Finding the </ + letter, switches to CDATA_ENDTAG state, again accumulating letters, and notes it has collected the same name as the container, and since matching, checks nested. Finding nested not yet 0, reports the first warning that </ + letter is not allowed here, and returns to CDATA_INTERMEDIATE state.

Since tidy is now still LOCKED in one level of 'script' it will consume the rest of the document, looking for the next </script> to break out... and will repeat the warning for </body>, and </html>, and add an escape to each... it is just accumulating CDATA text...

Now at the EOF, not having found the end of the container, will report missing end tag, and return all that was collected as a text node, and return to ParseScript(...), exiting here again reporting missing </script>, and return to ParseBody(...).

Now the only way out I see is to NOT trip over the var a = '<script';. Avoid taking this as a new nested container... that is what my very OLD patch did... not perfect, but did the job... and also note it was implemented with a new option, --skip-quotes, which was off by default, so the user had to request this different parsing... it would not change 'standard' tidy parsing at all...

But without further understanding I will leave this issue for now. Try to catch my breath...

@geoffmcl
Copy link
Contributor

@NoNoNo @HoffmannP maybe it is time to try to do something about this long outstanding problem in scripts...

@geoffmcl geoffmcl modified the milestones: 5.1, 5.0.0 Jul 13, 2015
@vielmetti
Copy link
Contributor

I'll note that this test case is not in the tidy test corpus yet.

@vielmetti
Copy link
Contributor

The test case in_1642186.html is on Sourceforge at http://sourceforge.net/p/tidy/patches/63/ with a proposed patch. Since 1642186 is not in our set of test cases yet, I'll submit a PR.

@geoffmcl
Copy link
Contributor

@vielmetti thanks for the test case 1642186, now merged...

I hope someone gets the chance to

  • get the patch I put in sourceforge, so long ago (2007) ;=))
  • or do it another, hopefully better, way...
  • provide a patch against master 5.1.9, or a PR

This bug has been open too long... but personally, at the moment, can not get the enthusiasm to dig back into this...

But will try to help if there are some questions...

@geoffmcl
Copy link
Contributor

@vielmetti glad you are interested in fixing this... look forward to it ;=))

Can maybe help on understanding the lexer, if that is what you need...

it's going to be necessary to test a whole lot of test cases

Not sure why you think this? Tidy is in a very special parsing state, collecting script element text...

Making a small change in this particular, special parser will not effect other than whether tidy trips over a <script piece of text, or not... please avoid spreading FUD!

And the bulk of the old patch is a single yes/no service, static Bool IsInQuotesorComments( Lexer * lexer ) to check where <scrip text appears, so it a simple drop in piece of code...

If you can not easily see where this service should be used, can maybe help since I am sure I can find the original patched code somewhere, I hope...

But in any case it is where the script parser sees a <, checks some more, and closes this script parsing, exiting that parser prematurely...

In simple terms the patch was to not trip over what look like html element if the text is in javascript comments, or is within a "...", or '...' blocks of text...

Let's get it done at last...

But am really puzzled why this is a case for CI? #269

Once the coding is done, you now have test 1642186, and other samples above, to check your progress...

Pass me a patch, and I will test it also...

On CI for tidy, I am yet to hear an argument on how this will help tidy??? Simple, 1, 2, 3...

@vielmetti
Copy link
Contributor

I need to test every single test case on every single platform, because I'm making a change, and I don't know where I are going to have regressions. Ideally this universal test case coverage covers everything, even the ones I don't expect to fail, and ideally it runs silently and unobtrusively and in the background every time anyone makes a change, and it alerts right in the PR if something is amiss. And it doesn't cost anything except the continuous upkeep of test cases once it's set up.

@vielmetti
Copy link
Contributor

Here's the corresponding pull request within my fork (just these changes).

https://github.com/vielmetti/tidy-html5/pull/8

I'll work on a proper pull request.

@geoffmcl
Copy link
Contributor

@vielmetti, you added nothing new, except perhaps more FUD... this is gettting like pulling teeth...

Sorry, got a little impatient over this, it is my 2007 patch afterall ;=))

@vielmetti thanks you for your work on this... What you did showed me it was dead simple... especially reminding me where the test needed to be done...

So, found my original 2007 code, did a cut and paste job, with some little tweaks on the way, and put the results in an issue-65 branch, and added a 1642186-1 test, which passes the W3C validator, and it all works...

To get the code:

$ cd tidy-html5
$ git checkout issue-65
$ cd build/cmake
$ ./build-me.sh
$ ./tidy -config ../../test/input/cfg_1642186-1.txt ../../test/input/in_1642186-1.html
$ # Should exit 0, No warnings or errors were found.

Have aleady done the alltest.cmd in Windows. Will shortly do it all in linux... if you could add ARM... and we still are in much need of a MAC tester...

Again very sorry. I should have got in there and done this ages ago, but just could not get around to it... But as suggested, it was your efforts that got me motivated. Thanks.

@geoffmcl
Copy link
Contributor

@NoNoNo after 3 year 25 days (give or take a little) this fix is in master, v.5.1.10 onwards ;=))

After some more testing, have merged the issue-65 branch into master. Need to do -

$ cd tidy-html5
$ git checkout master
$ git pull
$ cd build/cmake
$ ./build-me.sh

Of course this new option --skip-quotes yes, it default to no, could be expanded. At present the skip logic is very simple. The service returns yes if this <+letter just found in the data stream while parsing CDATA, is -

a. In a javascript comment, either /* ... */ or to end-of-line line form
b. Inside quotes, either single or double.

A massive speed increment could had by keeping some state flags in the lexer. At present it begins the check at the beginning of the stream stored in the lexer at that time each time...

This is no problem if the script is relatively short, but have seen some html with enormous scripts...

But remember it is only keyed by finding a <+letter in the stream. If this sequences does not occur then it is never called...

And of course it is never called unless the user specifically adds the option...

Will close this old bug after some period of testing...

@geoffmcl
Copy link
Contributor

Have added a test/t1.sh, and t1.bat, to the repo, to be able to quickly do a single test, in our so called 228 regression tests...

Thanks to @vielmetti assistance we have a test available for this issue, 1642186-1... so can do
$ ./t1.sh 1642186-1 0 and check the results...

@geoffmcl
Copy link
Contributor

Chis found a problem parsing...

<script>
   document.write('<style>body{font-family:"'+XCOOKIE.gui_font+'",Verdana, Arial;}</style>');
</script>

Think I have found the error, and will push a fix shortly...

geoffmcl added a commit that referenced this issue Sep 19, 2015
@CMB
Copy link
Contributor

CMB commented Oct 5, 2015

Greetings,
I found another recurrence:
http://the-brannons.com/test5.html

The parser has trouble with a solitary ' in a comment.

geoffmcl added a commit that referenced this issue Oct 7, 2015
This is in the GetCDATA function. If the container is script or style and
this option is on, avoid bumping nested.

This addresses issues #65 (1642186) and #280.

All attempts at parsing script data are now abandoned as a bad direction.
geoffmcl added a commit that referenced this issue Oct 7, 2015
@geoffmcl
Copy link
Contributor

geoffmcl commented Oct 7, 2015

@NoNoNo @HoffmannP @CMB @hoehrmann and all, the approach I took of trying to parse the script is running into big TROUBLE! Sorry...

As someone predicted along the way, this trying to parse <script>, and maybe <style> for exceptions is filled with danger... now is the time for them to say I told you so... but let's move on...

So I have switched back to the issue-65 branch to experiment with change before committing them to the stable master...

The simple idea now is to not increment a container count. This would only apply only to <scrip> and <style> tags, and only when the --skip-quotes yes option is enabled... so is very limited in effect...

It seems this would eliminate, or at least reduce the problem count ;=))

To test this version you need to -

$ cd tidy-html5
$ git checkout issue-65
$ git pull
$ cd build/cmake
$ [optional] cmake-clean *
$ ./build-me.sh
$ ./tidy -v
$ HTML Tidy for <OS> version 5.1.14.EXP1
  • this make-clean is a script I have to completely blow away all the last build. You should at least delete CMakeCache.txt... Although most build systems take good care of dependencies, it is a good idea to do this occassionaly...

Only this branch contains this change... it is experimental!!!

Please find the time to checkout this issue-65 branch, and participate in fixing this very old problem in tidy... thanks...

History

This has got nothing to do with supporting HTML5... tidy has this problem before this HTACG fork of SF CVS tidy...

As documented it seems this change occurred in a cvs change by @hoehrmann as sf cvs rev 1.85, Apr 9 2003, but there is no clear information that keeping track of the container openers, and expecting the same number of close tags is beneficial, or needed, at least for style and script tags..

That push did fix another bug 443678... there is a test case input/in_443678.html - https://sourceforge.net/p/tidy/bugs/98/ - all now closed... and this version 5.1.14.EXP1 continues to pass that test. Seek verification.

But why this addition of keeping track of container openers? And then seeking an equal number of container closes before exit?? There seeems no old test case or bug number for this???

This is clearly parsing the data in the GetCDATA() service as HTML, instead of just looking for the first exposed </script> or </style> to stop the parsing, and return a text node to be attached as a child node of the container...

Current issue-65 branch

Essentially any attempt to parse <script> or <style> has been abandoned. PHEW ;=))

Now the only change in GetCDATA() is -

  1. establish a Bool nonested = ((nodeIsSCRIPT(container) || (nodeIsSTYLE(container))) && cfgBool(doc, TidySkipQuotes)) ? yes : no;
  2. if nonested is on, avoid an increment of nested with if (matches && !nonested) nested++;

Now Chris set up 5 test cases... thanks Chris -

http://the-brannons.com/test.html
http://the-brannons.com/test2.html
http://the-brannons.com/test3.html
http://the-brannons.com/test4.html
http://the-brannons.com/test5.html

All these 5 cases now seem no problem, with or without setting the option on. They were all problems introduced by attempting to do parsing of a script. Now all pass. To be verified.

Now there is one test case already in the repo, namely input/in_1642186-1.html. It will fail without setting the --skip-quotes yes, and passes with this option. Need verification.

Also if this turns out to be the solution!, we may want to re-name the skip-quotes option to something more appropriate. Still thinking of a better name. Maybe just skip-nested? Suggestions welcome.

This suggested change, in branch issue-65 needs to be tested!

Please help with that...

geoffmcl added a commit that referenced this issue Oct 8, 2015
This is only if nonested is on, then a <script> tag has not incremented
the nested, so likewise no need to treat an escaped close tag <\/script>
as an end tage to decrement nested.
@geoffmcl
Copy link
Contributor

geoffmcl commented Oct 8, 2015

Karl gave me a sample which still has a problem with version 5.1.14.EXP1 -

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>script bug in EXP1</title>
<script>
document.writeln("<script language=JavaScript> var foo = bar; <\/script>");
var a = "<\/script>";
</script>
</head>
<body>
<p>para 1</p>
</body>
</html>

This is because when the nonested option is on, then the <script...> text found does not increment the nested variable... it stays zero...

Thus when escaped close tag text <\/script> is found, then no longer need to still treat it as a close tag just to do the nested decrement...

In fact since nested is already zero, tidy will treat this escaped tag as a real close tag, with catastophic consequences...

The above sample html passes the W3C validator, no problem...

Thanks Karl for finding this...

Have bumped to version 5.1.14.EXP2, note the 2, for this fix, so make sure you do $ git pull before re-building, and testing...

This issue-65 branch still needs more testing!

As stated, please help with that...

@geoffmcl geoffmcl mentioned this issue Oct 8, 2015
@CMB
Copy link
Contributor

CMB commented Oct 18, 2015

I haven't seen any recurrences of this issue in any of its forms for
the last couple of weeks.
Thank you.

@geoffmcl
Copy link
Contributor

@CMB thanks for testing and reporting...

I too have continued testing, and this fix is starting to feel very solid ;=))

As @NoNoNo reported, my original patch was 5 plus years ago, and his original report now nearly 3 years ago, so this is not before time...

There are some small related matters outstanding -

  • Name of options - the present --skip-quotes yes does not feel exactly right, and
  • Add some more of tests in its various forms...

But will consider closing this shortly...

@geoffmcl
Copy link
Contributor

Note the issue-65 branch has now been merged into master, so this issue-65 branch is now dead and will eventually be removed.

Still to address the 2 related matters, but this will now be in master, so will leave this open for a while longer...

@geoffmcl
Copy link
Contributor

geoffmcl commented Nov 1, 2015

Added some more tests to 1642186-1, and adjusted testbase directory accordingly...

Just the name of the option remaining... as previously mentioned, considering skip-nested...

Thus changing the enumeration to -

  TidySkipNested,      /**< Skip nested tags in script and style CDATA */

And the XML help text to -

  {TidySkipNested,
   "This option specifies that Tidy should skip nested tags "
   "when parsing script and style data. "
  },

Will effect this change in a few days, unless a better idea is presented...

A final consideration is whether to default this option to on. This allows tidy to deal with all types of script and style data format. Other than test 1642186-1, there is no other test of this script and style tags data format, so defaulting to on does not effect any of the other some 227 tests.

As also discussed above this stops tidy rendering a disaster in certain cases on files that usually pass the W3C validator. Why would anyone want such a wrecked document by default?

And, in hind sight, why would tidy need to keep track of such nested html while in parsing script or style data? Now this seems wrong in principal.

There seem all pros for defaulting to on!

@balthisar
Copy link
Member

I vote for on, and the name sounds better than skip-quotes.

@geoffmcl
Copy link
Contributor

geoffmcl commented Nov 5, 2015

@NoNoNo @HoffmannP @vielmetti @CMB @balthisar - Effected name change, and default to on...

Only what? 8 years in the making ;=))

I think this can be closed in a few days, if no direct feedback...

@KevinCarhart
Copy link

Moving my comment to a new issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants