Skip to content

Test 427664 has different output on ARM, Intel platforms, maybe others... #266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vielmetti opened this issue Sep 15, 2015 · 6 comments
Closed
Labels
Milestone

Comments

@vielmetti
Copy link
Contributor

This is one of two ARM related issues that I have isolated, where the output of tidy is different between two machines.

5 systems under test: 4 on Travis (Mac and Linux, gcc and clang compilers) and 1 on Raspberry Pi 2 (Raspbian/Hypriot with gcc 4.6.3).

The test is 427664 "Missing attr values cause NULL segfault". There is no segfault, but...the output of the tidy command is different, specifically this test case:

@test "427664: Missing attr values cause NULL segfault: msg" {
    run diff -c testbase/msg_427664.txt tmp/msg_427664.txt
    [ "$status" -eq 0 ]
}

On all 4 Intel systems under test, this passes. On the Pi 2, I get this diff:

emv@parviflorus:~/src/tidy-html5/test$ diff -c {testbase,tmp}/*msg_427664*
*** testbase/msg_427664.txt     2015-09-12 23:04:21.515581574 -0400
--- tmp/msg_427664.txt  2015-09-12 23:49:58.993100407 -0400
***************
*** 1,5 ****
  line 1 column 1 - Warning: missing <!DOCTYPE> declaration
! line 5 column 1 - Warning: <body> attribute name "��1/2" (value="xx") is invalid
  line 5 column 1 - Warning: <body> unexpected or duplicate quote mark
  line 5 column 1 - Warning: <body> unexpected or duplicate quote mark
  line 5 column 1 - Warning: <body> attribute "width" has invalid value "align="""
--- 1,5 ----
  line 1 column 1 - Warning: missing <!DOCTYPE> declaration
! line 5 column 1 - Warning: <body> attribute name "Ã1/2" (value="xx") is invalid
  line 5 column 1 - Warning: <body> unexpected or duplicate quote mark
  line 5 column 1 - Warning: <body> unexpected or duplicate quote mark
  line 5 column 1 - Warning: <body> attribute "width" has invalid value "align="""

It looks like a byte ordering issue in the message output. The output is coming from localize.c at https://github.com/htacg/tidy-html5/blob/master/src/localize.c#L85 and it looks like it's triggering it at lexer.c at https://github.com/htacg/tidy-html5/blob/master/src/lexer.c#L3921 .

Again, this all works on all of the 64 bit Intel systems that I have, and fails only on 32 bit ARM. There's nothing obviously wrong in the code, so my next step is to create some more interesting failing test cases.

@geoffmcl
Copy link
Contributor

@vielmetti thanks for reporting this difference...

As you point out, this does not seem to happen in linux (64-bit Intel), nor Windows (32-bit and 64-bit builds)... But will carefully re-test these 3 situation...

And I do not think it happened on linux (32-bit Intel), but that machine went down, so can not test there again...

So not sure it is a 32 vs 64 bit situation... nor memory byte order... although I guess ARM is bigendian? Which is opposite to Intel...

We are talking about a byte sequence after the word 'name', between the double quotes...

First looking at the source - Input: input\in_427664.html

Editor Shows: Ã1/2
And using say $ xxd -g 1 in_427664.html, shows a byte sequence of -
Hex Dump: C3 31 2F 32

But that is strange in that the comment in the source says - <!-- attribute name is 2 bytes hex c3 87 -->. Well the c3 is there, but no 87???

So the first question is why, how, when did this file get changed? I will check its history... Hmmm, it is also C3 31 2F 32 in the 2009 SF CVS source... so it was inherited this way by this repo...

But that puzzle is nothing to do with different tidy text output shown here...

Then looking at testbase: testbase\msg_427664.txt

Editor Show: name "??1/2"
Hex Dump: 6E 61 6D 65 20 22 EF BF BF EF BF BF 31 2F 32 22

WOW, now we are getting " ef bf bf ef bf bf 31 2f 32 "!!!!!!!!!!!!

And when I run this test in Windows and Linux, I get that same strange output... so no diff...

Unfortunatley I think the values show first in this post have been transformed either by the diff you are using, or the cut and paste from your screen... or something else... In the first it is " 3F 3F 31 2F 32 ", and in the second it is " C3 31 2F 32 ", which is intersting since that is what is in the source...

But as stated, I think these are trasformations by the tools used to create the text... so are no use or indication...

So I need you to -

  • confirm this EF BF BF EF BF BF 31 2F 32 is what is in your testbase
  • and show me the hex content in your ARM tmp/msg_427664.txt

Please not all, just the sequence that causes the trouble...

And I do not think you need to build other test cases at this stage. Quite likely, if we can solve this one difference in character encoding, it might solve them all...

Meantime I will try to understand why c3 got ef-bf-ed by tidy???

Welcome to the crazy world of character encodings ;=))

@geoffmcl geoffmcl added this to the 5.1 milestone Sep 17, 2015
@geoffmcl geoffmcl added Bug and removed Distribute labels Sep 17, 2015
@geoffmcl geoffmcl changed the title Test 427664 1 #4 "Missing attr values cause NULL segfault" has different output on ARM, Intel platforms Test 427664 has different output on ARM, Intel platforms, maybe others... Sep 18, 2015
@geoffmcl
Copy link
Contributor

Have not yet had a chance to dig further into this 427664 difference, and while the above is about a difference that seems to be between Intel and ARM CPUs, have now identified two more differences between Linux and Windows, both 64-bit Intel machines...

  1. 431895 - This can NOT be avoided. It is the only case where emacs output is enabled, and in the emacs file name output, naturally the path separator will always be different, aside from the fact that the unix scripts add an unnessary ./ in front.
  2. 676205 - Both the msg and out are different. This does not look like a character encoding issue so perhaps it should be a separate issue, but for now it is open and here.

It seems the only solutions for 1. is to exclude it, or mangle the windows app a little to use unix path seps for this filename output... not too difficult to fix... and fix the unix script to avoid the leading ./... then no diff...

The output in 2. has different line numbers, and even have a different doctype... Very, very strange!!! need to investigate more...

Errata: Above item 2 erroneously had 878205, now fixed.

@geoffmcl
Copy link
Contributor

@vielmetti just an update on these test differences...

test 676205

Found the problem with test 676205. As indicated seemed like different files were being processed, and that was it.

If the file input/in_676205.html, which has no doctype, is deleted, the test input will be the correct one, namely input/in_676205.xhtml, and the output will now exactly match that in testcases/out_676205.html.

I guess this was a difference between how the test file was selected in linux and windows scripts. linux was selecting the HTML, and windows the XHTML, with the correct doctype.

Will push this fix, well actually just a deletion, soonest...

test 431895

As explained above, test 431895 can not be avoided since it is the one case where the filename is output to the message file and thus there can always be differences between the platform path separators. Correct it for one, and the others can show a difference...

The only choices here are -

  1. to remove it from testcases.txt,
  2. be prepared to accept a different message output, but still same markup,
  3. patch the code to always use / for emacs output, and drop any leading ./

Any help on option 3 very welcome... should not be too difficult...

test 427664

As both my machines have Intel CPUs I can not replecate this problem, so can do nothing more to try to understand the difference on an ARM.

However, recently there was a change in the internal moving of characters to the lexer, issue #286, and maybe this would change something to do with this?

All the recent development in the issue-65 branch has now been merged to master, and that branch will eventually be deleted.

So if you get a chance to pull the latest master 5.1.17+ version, build and test again it would be appreciated. Thanks...

@geoffmcl
Copy link
Contributor

Well, fixing 431895 using the option 3. turned out easier than expected...

The TidyEmacsFile name is set for each file in the input -
https://github.com/htacg/tidy-html5/blob/master/console/tidy.c#L1299 -
it's simple to massage it and return a standardised name in an allocate, or even a static buffer...

The following patch should do it!

diff --git a/console/tidy.c b/console/tidy.c
index 430df55..5297d03 100644
--- a/console/tidy.c
+++ b/console/tidy.c
@@ -954,6 +954,34 @@ static void unknownOption( uint c )
     fprintf( errout, "HTML Tidy: unknown option: %c\n", (char)c );
 }

+/* Issue #266 - test 676205 - standardise emacs file name */
+static tmbstr getEmacsFilename( ctmbstr htmlfil )
+{
+    static tmbstr emacsFile = 0;
+    size_t i, out, len = strlen(htmlfil);
+    int c;
+    if (!emacsFile) {
+        emacsFile = (tmbstr)malloc(264);
+        if (!emacsFile) outOfMemory();
+    }
+    out = 0;
+    for (i = 0; i < len; i++) {
+        c = htmlfil[i];
+        if ((i == 0) && (len > 2) && (c == '.') ) {
+            c = htmlfil[i+1];
+            if (( c == '/' ) || ( c == '\\' )) {
+                i++;
+                continue;
+            }
+        }
+        if (c == '\\')
+            c = '/';
+        emacsFile[out++] = c;
+    }
+    emacsFile[out] = 0;
+    return emacsFile;
+}
+
 int main( int argc, char** argv )
 {
     ctmbstr prog = argv[0];
@@ -1295,8 +1323,8 @@ int main( int argc, char** argv )
 #if (!defined(NDEBUG) && defined(_MSC_VER))
             SPRTF("Tidying '%s'\n", htmlfil);
 #endif // DEBUG outout
-            if ( tidyOptGetBool(tdoc, TidyEmacs) )
-                tidyOptSetValue( tdoc, TidyEmacsFile, htmlfil );
+            if ( tidyOptGetBool(tdoc, TidyEmacs) ) /* #266 - use std name */
+                tidyOptSetValue( tdoc, TidyEmacsFile, getEmacsFilename(htmlfil) );
             status = tidyParseFile( tdoc, htmlfil );
         }
         else

If someone wants to pick this up, and test it, maybe improve the getEmacsFilename() function, that would be great... only quickly tested it in windows... thanks...

@mcepl
Copy link

mcepl commented Nov 13, 2015

Also, it would neat if we could get some kind of simple command (make check?) to run the tests.

@geoffmcl
Copy link
Contributor

@mcepl sorry, almost missed this comment... not sure neat qualifies as a use case, but I understand ;=))

We have opened a discussion on the tests and testing procedures - see #330

Once we have all the tests sorted out, fixed, running correctly, testing and showing what we want, yes it would be possible to add a $ make check target. Good idea... thanks...

As a windows developer, I have already scripted that to -

???> htt # get to the test folder
test> alltest # run the tests
test> diff -ua testbase temp-5 # compare with test base

The equivalent windows cmake command to $ make check would be, like

build\cmake> cmake --build . --config Release --target check

So either would work for me... If you want to add that to the CMakeLists.txt, in your fork, then would appreciate a patch or PR, and it will certainly be tried... but as stated, please add that to #330 discussion...

It seems the only other open discussion here is also on those very tests, so will close here, and cross-reference here from there...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants