Skip to content

wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cristicotet opened this issue Feb 1, 2023 · 4 comments
Closed

Comments

@cristicotet
Copy link

cristicotet commented Feb 1, 2023

Description

The following code:

https://3v4l.org/ehb6U#veol

$str1 = '14';
$str2 = 'DQ';
$encodings1 = ['ASCII', 'UTF-8', 'UTF-16', 'UTF-16LE'];
$encodings2 = ['UTF-8', 'UTF-16', 'UTF-16LE'];
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings1, true)."\n";
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings2, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings1, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings2, true)."\n";

Resulted in this output:

# version 8.1.0 - 8.2.2 - wrong output
3134 - UTF-16
3134 - UTF-16
4451 - UTF-16LE
4451 - UTF-16LE

But I expected this output instead:

# version 5.4.0 - 8.0.27 - correct output
3134 - ASCII
3134 - UTF-8
4451 - ASCII
4451 - UTF-8

PHP Version

PHP 8.2.2

Operating System

Red Hat Enterprise Linux 9.1 / CentOS Linux 7

@stramunin
Copy link

This seems to be a duplicate of #8279

@hormus
Copy link

hormus commented Feb 3, 2023

<?php

$str1 = '14';
$str2 = 'DQ';
$encodings1 = ['UTF-16LE', 'UTF-16', 'UTF-8', 'ASCII'];
$encodings2 = ['UTF-16LE', 'UTF-16', 'UTF-8'];
function is_multiencodings($list, $string) {
$current = '0';
foreach($list as $key => $val) {
$old = $current;
$current = mb_detect_encoding($string, $val, true);
if(!is_string($current)) {
if($old) {
$current = '' . $key;
} else {
$current = '0';
}
}
if($old && $current != $old) {
return true;
}
}
}
var_dump(is_multiencodings($encodings1, $str1));
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings1, true)."\n";
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings2, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings1, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings2, true)."\n";
?>

Read function is_multiencodings. Documentation issue. At least since php 8.1 it recognizes encoding other than a single byte e.x. UTF-16

@alexdowad
Copy link
Contributor

This seems to be a duplicate of #8279

You are very right, it is a duplicate.

The documentation for mb_detect_encoding states that the function "detects the most likely character encoding for string string from an ordered list of candidates." However, before PHP 8.1, the function did not do this at all. In most cases, it would just return the first candidate encoding in the list, if the input string was valid in that encoding. However, it now uses various heuristics to pick the "most likely character encoding", just as the documentation states. (If you prefer the old behavior, then mb_check_encoding is what you need.)

The test strings which @cristicotet kindly provided ("14" and "DQ") are valid in ASCII, UTF-8, UTF-16BE, UTF-16LE, and many, many legacy text encodings. This is to be expected, since the strings are only two bytes long. It is not possible to 'detect' the intended text encoding for such short strings. It just happens that in UTF-16BE, the first string decodes to a character which is commonly used in Korean texts, and in UTF-16LE, the second string decodes to a character which is commonly used in Chinese and Japanese texts.

You might say that your application rarely handles Korean, Chinese, or Japanese text, but mb_detect_encoding doesn't know that. (We don't currently use the value of mbstring.language to help mb_detect_encoding determine which candidate encoding to pick when the input string is valid in more than one of them, and I think this is probably for the best.)

I am open to boosting the estimated likelihood of the first candidate encoding in the list. This would be in harmony with the documentation, which says that mb_detect_encoding takes an ordered list of candidates. However, this would need to be done carefully. Again, if all you want is to pick the first encoding in the list when it works, then what you want is mb_check_encoding.


For any interested persons who come across this thread in the future, please note that mb_detect_encoding is the most over-used, and inappropriately used, function in mbstring. Most applications should not use mb_detect_encoding. However, for the few applications which really need mb_detect_encoding, users should always:

  • Provide mb_detect_encoding with as much input text as possible. Encoding detection is impossible on very short strings.
  • Provide as short a list of candidate encodings as possible. @cristicotet has done well to provide only 3/4 candidate encodings in the sample code; but many users pass mb_list_encodings(), which is always a bad idea.
  • Never perform actions which might cause data corruption based on the return value of mb_detect_encoding (because sometimes it will guess wrongly). If you convert input text based on the value of mb_detect_encoding, always keep a copy of the original text.

For users who want to help improve mbstring, #7871 is still open, but is stalled because we need more input from actual users of mb_detect_encoding.

@Girgias
Copy link
Member

Girgias commented Feb 5, 2023

Closed as duplicate

@Girgias Girgias closed this as not planned Won't fix, can't repro, duplicate, stale Feb 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants