Consider order of candidate encodings when guessing source encoding of input string #11239

alexdowad · 2023-05-14T12:09:43Z

In PHP 8.0 and before, mb_detect_encoding (and some other functions which attempt to 'detect' input string encoding) would do so essentially just by trying a list of candidates in order and picking the first candidate encoding in which the string was valid. In PHP 8.1, I made mb_detect_encoding use heuristics to try to choose the "most likely" input string source encoding in cases where more than one candidate encoding is valid.

This patch retains those heuristics, but weights the candidate text encodings by their order, so that those listed first are more likely to be chosen. This allows the caller to indicate what input text encoding they believe is more likely to be the correct one, by listing it first. This brings the behavior of mb_detect_encoding in closer harmony with the documentation.
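To make the idea of order weighting concrete, here is a minimal sketch in C. It is not the code from this patch: the `candidate` struct, `pick_candidate`, and the `ORDER_PENALTY` constant are hypothetical names and values chosen for illustration. It assumes each candidate already carries a "demerit" score from the heuristics (lower is better) and scales that score by the candidate's position in the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only; not the php-src implementation. */
typedef struct {
    const char *name;  /* candidate encoding name */
    double demerits;   /* heuristic penalty: higher means "less likely" */
} candidate;

/* Returns the index of the chosen candidate, or -1 if the list is empty.
 * When order_significant is non-zero, later candidates have their demerits
 * scaled up, so an earlier candidate wins unless the evidence for a later
 * one is clearly stronger. */
int pick_candidate(const candidate *c, size_t n, int order_significant)
{
    const double ORDER_PENALTY = 0.25; /* assumed value, for illustration */
    int best = -1;
    double best_score = 0.0;

    assert(c != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        double multiplier = order_significant ? 1.0 + ORDER_PENALTY * (double)i : 1.0;
        double score = c[i].demerits * multiplier;
        if (best < 0 || score < best_score) {
            best = (int)i;
            best_score = score;
        }
    }
    return best;
}
```

With this shape, a near-tie between two candidates goes to the one listed first, while a large difference in demerits still overrides the order.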

Also, use @pakutoma's stricter validation code for ISO-2022-JP and UTF-7 to improve detection accuracy even in "non-strict" detection mode.

Closes GH-10192.

FYA @cmb69 @Girgias @kamil-tekiela @youkidearitai @pakutoma @iluuu1994

alexdowad · 2023-05-14T12:10:09Z

More detailed information is in the commit log messages, for anyone interested.

alexdowad · 2023-05-14T12:35:04Z

Failure on ARM_DEBUG_NTS is spurious.

This will allow us to easily check in other mbstring functions if the list of all supported encodings, returned by mb_list_encodings, is passed in as input to another function.

Co-authored-by: Ilija Tovilo <[email protected]>

…oding

The documentation for mb_detect_encoding says that this function "Detects the most likely character encoding for string `string` from an ordered list of candidates".

Prior to 28b346b, mb_detect_encoding did not really attempt to determine the "most likely" text encoding for the input string. It would just return the first candidate encoding for which the string was valid. In 28b346b, I amended this function so that it uses heuristics to try to guess which candidate encoding is "most likely".

However, the caller did not have any way to indicate which candidate text encoding(s) they consider to be more likely, in case the heuristics applied are inconclusive. In the language of Bayesian probability, there was no way for the caller to indicate their 'prior' assignment of probabilities.

Further, the documentation for mb_detect_encoding also says that the second parameter `encodings` is "a list of character encodings to try, in order". The documentation clearly implies that the order of the `encodings` argument should be significant.

Therefore, amend mb_detect_encoding so that while it still uses heuristics to guess the most likely text encoding for the input string, it favors those which are earlier in the list of candidate encodings.

One complication is that many callers of mb_detect_encoding use it in this way:

    mb_detect_encoding($string, mb_list_encodings());

In a majority of cases, this is bad code; mb_detect_encoding will both be much slower and the results will be less reliable than if a smaller list of candidates is used. However, since such code already exists and people are using it in production, we should not unnecessarily break it.

The order of candidate encodings obviously does not express any prior belief of which candidates are more likely in this case, and treating it as if it did will degrade the accuracy of the result. Since mb_list_encodings now returns a single, immutable array on each call, we can avoid that problem by turning off the new behavior when we receive the array of encodings returned by mb_list_encodings.

This implementation means that if the user does this:

    $a = mb_list_encodings();
    mb_detect_encoding($string, $a);

...then the order of candidate encodings will not be considered. However, if the user explicitly initializes their own array of all supported legacy text encodings, then the order *will* be considered.

The other functions which also follow this new behavior are:

• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are listed)

Other places where "detection" (or really "guessing") of text encoding may be performed include:

• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI parameter is set to a list

In these cases, the new logic based on order of candidate encodings is *not* enabled. It *might* be logical to consider the order of candidate encodings in some or all of these cases, but I'm not sure if that is true, so it seems wiser to avoid more behavior changes than is necessary. Further, ever since the new encoding detection heuristics were implemented in 28b346b, we have not received any complaints of user code being broken in these areas. So I am reluctant to "fix what isn't broken".

Well, some might say that applying the new detection heuristics to mb_send_mail, etc. in 28b346b was "fixing what wasn't broken", but (cough cough) I don't have any comment on that...
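The check for "did the caller pass the array returned by mb_list_encodings?" can be pictured as a pointer-identity test. The following is a simplified sketch, not the actual mbstring code: `list_encodings`, `order_is_significant`, and the three-element list are hypothetical stand-ins, and plain C arrays take the place of PHP arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the single shared array that
 * mb_list_encodings is described as returning on every call. */
static const char *const all_encodings[] = { "UTF-8", "ISO-8859-1", "SJIS" };

const char *const *list_encodings(void)
{
    return all_encodings;
}

/* Order is treated as significant unless the caller handed back the
 * shared list itself. This compares identity (the pointer), not contents,
 * so a caller-built list with the same entries still counts as ordered. */
bool order_is_significant(const char *const *list)
{
    assert(list != NULL);
    return list != list_encodings();
}
```

Because only identity is compared, the check costs one pointer comparison, and a hand-written list of every supported encoding is still treated as an ordered list, which matches the behavior described above.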

…n non-strict mode

In 6fc8d01, pakutoma added specialized validity checking functions for some legacy text encodings like ISO-2022-JP and UTF-7. These check functions perform a more strict validity check than the encoding conversion functions for the same text encodings. For example, the check function for ISO-2022-JP verifies that the string ends in the correct state required by the specification for ISO-2022-JP.

These check functions are already being used to make detection of text encoding more accurate when 'strict' detection mode is enabled. However, since the default is 'non-strict' detection (a bad API design but we're stuck with it now), most users will not benefit from pakutoma's work.

I was previously reluctant to enable this new logic for non-strict detection mode. My intention was to reduce the scope of behavior changes, since almost *any* behavior change may affect *some* user in a way we don't expect. However, we definitely have users whose (production) code was broken by the changes I made in 28b346b, and enabling pakutoma's check functions for non-strict detection mode would un-break it. (See phpGH-10192 as an example.)

The added checks do also make sense. In non-strict detection mode, we will not immediately reject candidate encodings whose validity check function returns false; but they will be much less likely to be selected. However, failure of the validity check function is weighted less heavily than an encoding error detected by the encoding conversion function.
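The relative weighting described in the last paragraph might look roughly like the sketch below. The function name `score_candidate`, the `REJECTED` sentinel, and both weight constants are assumptions made for illustration; the real values and structure in mbstring may differ. The sketch also assumes that strict mode rejects a candidate outright on any conversion error or failed check.

```c
#include <assert.h>
#include <stdbool.h>

/* All names and constants here are hypothetical, for illustration only. */
#define REJECTED (-1.0)
#define CONVERSION_ERROR_DEMERITS 1000.0 /* assumed: heavy penalty */
#define CHECK_FAILURE_DEMERITS 100.0     /* assumed: lighter penalty */

/* Returns the demerit score for one candidate (lower is better),
 * or REJECTED if the candidate is ruled out. */
double score_candidate(double base_demerits, int conversion_errors,
                       bool check_ok, bool strict)
{
    assert(base_demerits >= 0.0 && conversion_errors >= 0);

    /* Assumed strict-mode behavior: any error disqualifies the candidate. */
    if (strict && (conversion_errors > 0 || !check_ok)) {
        return REJECTED;
    }

    double demerits = base_demerits
                    + CONVERSION_ERROR_DEMERITS * (double)conversion_errors;

    /* Non-strict mode: a failed validity check does not reject the
     * candidate, it only makes it much less likely to be selected. */
    if (!check_ok) {
        demerits += CHECK_FAILURE_DEMERITS;
    }
    return demerits;
}
```

The only property the sketch is meant to show is the ordering: a clean candidate scores better than one that fails its check function, which in turn scores better than one with a conversion error.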

alexdowad · 2023-05-14T13:08:25Z

Just regenerated Zend/Optimizer/zend_func_infos.h; mercifully, CI caught the fact that I hadn't done that.

Trying again. Failure on LINUX_X64_RELEASE_ZTS is spurious. Error:

ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
Error: Process completed with exit code 255.

youkidearitai · 2023-05-14T13:49:18Z

ext/mbstring/mbstring.c

@@ -1016,6 +1016,7 @@ ZEND_TSRMLS_CACHE_UPDATE();
 	mbstring_globals->internal_encoding_set = 0;
 	mbstring_globals->http_output_set = 0;
 	mbstring_globals->http_input_set = 0;
+	mbstring_globals->all_encodings_list = NULL;


~~I think ext/mbstring/mbstring.h needs a prototype declaration~~

Sorry, I overlooked it.

youkidearitai · 2023-05-14T14:20:48Z

Looks good to me.

Anyway, #7871 is still not resolved... (Although I'm not looking for accuracy from mb_detect_encoding.)

$ sapi/cli/php -r "var_dump(mb_detect_encoding('🥳', ['UTF-8', 'ISO-8859-1']));"
string(10) "ISO-8859-1"

dstogov

The Optimizer change is fine.

iluuu1994

Looks good from my side!

Girgias

LGTM

alexdowad · 2023-05-16T14:03:42Z

Thanks for the review, all!

Landed on master.

github-actions bot added the Extension: mbstring label May 14, 2023

alexdowad and others added 3 commits May 14, 2023 05:52

alexdowad force-pushed the shared branch from 7a5b4a3 to 156098a May 14, 2023 12:52

alexdowad requested a review from dstogov as a code owner May 14, 2023 12:52

github-actions bot added the Category: Optimizer label May 14, 2023

youkidearitai reviewed May 14, 2023

dstogov reviewed May 15, 2023

iluuu1994 approved these changes May 15, 2023

Girgias approved these changes May 15, 2023

alexdowad closed this May 16, 2023

alexdowad deleted the shared branch May 16, 2023 14:04

nielsdos mentioned this pull request Nov 11, 2023

mb_detect_encoding() results for UTF-7 differ between PHP 8.0 and 8.1 (if UTF-7 is present in the encodings list and the string contains '+' character) #10192

Closed
