Skip to content

Unescaped & emitted despite using **output-xhtml** key bindings in 5.6.0 in PHP bindings #704

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TysonAndre opened this issue Mar 27, 2018 · 5 comments

Comments

@TysonAndre
Copy link

TysonAndre commented Mar 27, 2018

This occurred when I upgraded from tidy-html5 5.4.0 to 5.6.0. I also upgraded php at the same time to 7.1.14, but php-src/ext/tidy has no recent changes

#207 (comment) seems related. I don't see any discussion about how xhtml would be affected in #207

http://validator.w3.org/check would warn about hello & bye as an XHTML fragment. If XHTML5 is HTML5 represented as XML, shouldn't that be automatically fixed by tidy-html5?

  • If I'm missing any necessary options, please let me know

The below script reproduces this bug with PHP 7.1.9 (Running tidy-html5 5.4.0) and 7.1.14 (Built with tidy-html5 5.6.0)

<?php
// Related to https://github.com/htacg/tidy-html5/issues/526 ?
const CONFIG = [
  'output-xhtml' => true,
  'show-errors' => false,
];
 
function print_tidy(string $html, array $config) {
	$tidy = new tidy();
	$tidy->parseString($html, $config, 'utf8');
	$tidy->cleanRepair();
	// errors are in $tidy->errorBuffer)
	$errors = [];
	if (!empty($tidy->errorBuffer)) {
	    $errors = array_merge($errors, explode("\n",$tidy->errorBuffer));
	}
	$clean = (string)$tidy;
	printf("In %s: %s (errors %s)\nconfig: %s\n", PHP_VERSION, $clean, json_encode($errors), json_encode($config));
}
print_tidy('hello & bye', CONFIG);
// EDIT: I've also tried print_tidy('<body>hello & bye</body>', CONFIG + ['preserve-entities' => true, 'quote-ampersand' => true, 'doctype' => 'strict', 'output-encoding' => 'ascii', 'input-encoding' => 'utf8']);
// but none of those additional options resulted in something different with the tidy-html5 bindings
/*
In 7.1.14: <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
hello & bye
</body>
</html> (errors [])
config: {"output-xhtml":true,"show-errors":false}
*/
/*
In 7.1.9: <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
hello &amp; bye
</body>
</html> (errors [])
config: {"output-xhtml":true,"show-errors":false}
*/

More details: This seems specific to the PHP bindings (Or maybe CLI bindings are setting options that I missed). echo 'hello & bye' | ./tidy -asxhtml works properly for me, generating hello &amp; bye in both tidy-html5 5.4.0 and 5.6.0 (with make clean then rebuilding with git checkouts

When I rebuilt php 7.1.14's tidy extension with tidy-html5-5.4.0, the result was hello &amp; bye, as expected/wanted.

For details on how I built php, see #673 (comment)

@geoffmcl
Copy link
Contributor

@TysonAndre thank you for your issue, but I am sorry, I am a little confused as to what the actual tidy issue is here, as opposed to a possible PHP binding issue...

As you point out passing the simple string hello & bye to tidy, it yields exactly that in default html5 mode, and if you add -asxhtml then it will output hello &amp; bye...

And if I pass the following to current tidy, with or without -asxhtml -

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>Is #704-1</title>
</head>
<body>
hello & bye
</body>
</html>

I get back hello &amp; bye, with the unescaped & ... warning...

Although I have not yet checked any references about & escaping in XHTML, but passing the above file to the W3C validator will not flag any warning or errors. This tends to indicate that tidy is being over agressive in always escaping the ampersand if doing xhtml output...

This suggests the commit 3a524f1, made for #207, should have also extended to XH50 but that seems a new issue...

But that seems the opposite of what you are seeing in the PHP binding 7.1.14 (Tidy 5.6). It is outputting just hello & bye, which from the above seems correct, but you seem to want it to be like 7.1.9 (Tidy 5.4)...

Now I can see some reasons for this, maybe? You are parsing a fragment so without a doctype and a <html xmlns=...> declaration, libtidy will assume default HTML5, not XHTML5...

What do you get if you pass my above sample to the PHP bindings?

I am guessing you should get hello &amp; bye. That is without extending the fix for #207, tidy will choose to add the escaping you want... but as indicated, I now do not see that as correct... it is not an ambiguous ampersand so can stay unescaped...

And I just remembered I have built a unix php 7.3.0-dev on 20180212, installed in /media/Disk2/projects/install/php/bin, my hphpi, which uses my 5.7.3.I673 libtidy, 2018/01/01, when dealing with issue #673, and it still runs...

So I put your above sample <?php ... code into an is704.php file, and ran ./php is704.php, and I get XHTML5 output, which includes hello &amp; bye, so it seems I can not even duplicate the problem... I get what I expect, although as stated now not sure that is correct...

And that further reminded me that I also built it in Windows, using MSVC, installed in C:\php, hphpi.bat, and again setting up an is704.php file, and get -

C:\php>type is704.php
<?php
// Related to https://github.com/htacg/tidy-html5/issues/526 ?
const CONFIG = [
  'output-xhtml' => true,
  'show-errors' => false,
];

function print_tidy(string $html, array $config) {
        $tidy = new tidy();
        $tidy->parseString($html, $config, 'utf8');
        $tidy->cleanRepair();
        // errors are in $tidy->errorBuffer)
        $errors = [];
        if (!empty($tidy->errorBuffer)) {
            $errors = array_merge($errors, explode("\n",$tidy->errorBuffer));
        }
        $clean = (string)$tidy;
        printf("In %s: %s (errors %s)\nconfig: %s\n", PHP_VERSION, $clean, json_encode($errors), json_encode($config));
}
print_tidy('hello & bye', CONFIG);
?>

C:\php>php is704.php
In 7.1.14: <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
hello &amp; bye
</body>
 (errors [])
config: {"output-xhtml":true,"show-errors":false}

Same result as in unix!

But maybe I do not quite understand the issue... please provide more feedback... thanks...

@geoffmcl geoffmcl added this to the 5.7 milestone Mar 28, 2018
@TysonAndre
Copy link
Author

TysonAndre commented Mar 28, 2018

Strange. I was able to reproduce with 3a30f6a (master) cmake ../../; make clean install, then --with-tidy=shared,/path/to/install (Did you have any extra cmake options? Is anything platform dependent?)

I'm still see hello & bye with 2018/01/01 next. (I checked tidy_get_version())


But that seems the opposite of what you are seeing in the PHP binding 7.1.14 (Tidy 5.6). It is outputting just hello & bye, which from the above seems correct, but you seem to want it to be like 7.1.9 (Tidy 5.4)...

Yes. For example, the below script would be rejected by PHP's libxml parsing (If you come across any options to accept unambiguous ampersands, they'd be useful to me)

  • Could you refer me to the XML spec allowing & followed by space/< characters? My impression was that XHTML5 was HTML5 serialized as valid XML, and a quick google search just revealed stack overflow results without specification links
<?php
// Related to https://github.com/htacg/tidy-html5/issues/526 ?

// Using the same as before, but additionally with `'show-body-only' => true,`
// I get '&' instead of '&amp;'
$fragment_from_tidy = 'hello & bye';

// This ensures that we log directly instead of through PHP notices
$oldUseErrors = libxml_use_internal_errors(true);
try {
    $root = new DomDocument();
    $success = $root->loadXML("<body>$fragment_from_tidy</body>");
    $xmlErrors = libxml_get_errors();
    if (count($xmlErrors) > 0) {
        printf("Saw xml errors: %s\n", json_encode($xmlErrors));
    }
    printf("Was successful: %s\n", json_encode($success));
} finally {
    libxml_clear_errors();
    libxml_use_internal_errors($oldUseErrors);
}

/*
Actual output with PHP's libxml due to the fragment received from tidy in my build:
It doesn't accept tidy-html5's "unambiguous ampersand" and expects there to always be a name.

Saw xml errors: [{"level":3,"code":68,"column":14,"message":"xmlParseEntityRef: no name\n","file":"","line":1}]
Was successful: false
*/

but passing the above file to the W3C validator will not flag any warning or errors.

The default setting for the W3C validator is HTML. Did you select XML fragments?

xml_err1
xml_err_2

EDIT: I see "in some contexts". Is it possible to add a flag to force escaping as &amp; because php's libxml doesn't accept that

@geoffmcl
Copy link
Contributor

@TysonAndre thank you for the further feedback...

Let me deal with the validator first. As I am sure you are aware there are two validators, run by different technologies. There is the legacy html4 validator, https://validator.w3.org/, and the later nu, https://validator.w3.org/nu/, for html5 documents.

So if you pass just your fragment, hello & bye, to the legacy validator, and select XHTML 1.0, it is parsed per HTML4 rules, and like tidy, in html4 mode, will warning about the ampersand, &, and escape it... agreed, and understood...

And if you pass just the fragment to nu, you will get errors, since nu has no option to treat the input as an xhtml fragment... but...

If you check the default, Validate Full Document, in the legacy validator, and you pass the full sample I gave above, because it has a HTML5 doctype, <!DOCTYPE html>, it will be passed to nu to parse, as HTML5/XHTML5, with no errors or warnings...

As stated, this implies that ampersand is valid in HTML5, and XHTML5...

Now I can not locate specific W3C that documents what we could call an unambiguous ampersands, but there is an ambiguous ampersand, and since in this case it is not followed by an ASCII letter or digit we can only imply it is not ambiguous...

And as you pointed out in the EDIT even the validator error uses the phrase may be valid in some contexts!

Now I have pushed 3 tests to my repo in_704.html - your initial fragment, in_704-1.html - the above full xhtml5 doc, and in_704-2.html - full html5 document.

These files can be validated using a URL like - https://rawgit.com/geoffmcl/tidy-test/master/test/input5/in_704-1.html, or in_704-2.html, both of which PASS.

And this leads me to the suggestion that the fix for #207 is incomplete, in that at present tidy will warn and escape in_704-1.html, but like I say that can be a separate issue...

So your further feedback leaves 2 things -

  1. Why does your build of using the master branch not escape the ampersand?
  2. When/If tidy is fixed to not escape & in xhtml5, a new option to add the escape.

The first is simple. As discussed in #673, and have now added PR #705, the master, ie. release/5.6 branch, the AdjustConfig logic got changed, and this causes you to see hello & bye, not sure why...

And if I checkout the next 5.7.3 version of tidy and install it I will get the same as you!

It is only if I checkout, build and install the 'issue-673' branch will I get hello &amp; bye... simple...

Now concerning the building of tidy, and php, first you show the tidy build as cmake ../../; make clean install???

Well first the cmake ../.., trailing / not required, since you do not offer an install directory, cmake will choose /usr/local, I think, and if you offer no build type again think cmake defaults to debug...

I use cmake ../.. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr to ensure both of these are set to what I want.

Then I am not sure what make clean install does? What is the clean for? And unless you do all the compiling in root mode, I have to use sudo make install...

That gets me $ whereis libtidy output of libtidy: /usr/lib/libtidy.so.

And it is worth doing something like $ find /usr -name "libtidy*" to see what has been installed, and maybe do some clean up... You may have several version of tidy installed, like me, but...

You should only have a current active link libtidy.so -> libtidy.so.5 -> libtidy.so.5.6.0 in your case. In my case the last is libtidy.so.5.7.3, built using the 'issue-673` branch...

Now concerning the building of php I used $ ./configure --prefix=/media/Disk2/projects/install/php --with-tidy=/usr, then $ make install. Note no sudo required here since I am installing it locally only. This installed a php executable in the above <prefix>/bin.

And in that bin dir, if I run $ readelf -d php, to dump the dynamic section, I get an entry Shared library: [libtidy.so.5].

And as stated this php gives me hello &amp; bye output with the 'output-xhtml' -> true config. BUT that is because I used the issue-673 branch... Of course if I remove that config entry I will get hello & bye...

So this clearly explains why we are seeing something different. That is commit 67eaeb6 is not in release 5.6, only in the issue-673 branch...

And as indicated, only if-and-when tidy is patched to give the same output for XHTML5 as HTML5 would we potentially need 2., a Feature Request for a new option to always escape every &, if it is not a valid entity.

Would need to think about this option a little, and consequences, but it should be possible... if there is a need, a strong use case...

You suggest because php's libxml doesn't accept that, but is that not a case to fix libxml, not add an option to tidy...

Does this add anything? thanks...

@TysonAndre
Copy link
Author

first you show the tidy build as cmake ../../; make clean install???

Sorry. I left out the install prefix assuming it was implied. To be clear, I used an install prefix in all commands. I also checked the tidy version, and I'm also confused about that.

There may be something I missed.

The linked spec appears to be for html5, and has limited references to XHTML. In any case, the library I'm using works with html fragments.

When/If tidy is fixed to not escape & in xhtml5, a new option to add the escape

I would support a new option to do that.

geoffmcl added a commit that referenced this issue Apr 21, 2018
Is #673 - Revert 350f7b4 and 86e62db AdjustConfig logic - Is #704 PR #705
@geoffmcl
Copy link
Contributor

@TysonAndre have now merged PR #705 and hope that closes this... and #673

If I missed something please feel free to re-open, or a new issue... thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants