wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481
Comments
This seems to be a duplicate of #8279
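<?php
$str1 = '14';
$str2 = 'DQ';
$encodings1 = ['UTF-16LE', 'UTF-16', 'UTF-8', 'ASCII'];
$encodings2 = ['UTF-16LE', 'UTF-16', 'UTF-8'];

// Returns true as soon as two consecutive candidates give different results
// for $string, i.e. the string is valid in more than one encoding.
function is_multiencodings($list, $string) {
    $current = '0'; // '0' is falsy, so it marks "no result yet"
    foreach ($list as $key => $val) {
        $old = $current;
        // Strict detection against a single candidate encoding.
        $current = mb_detect_encoding($string, $val, true);
        if (!is_string($current)) {
            // No match: record the list index if an earlier result exists.
            if ($old) {
                $current = '' . $key;
            } else {
                $current = '0';
            }
        }
        if ($old && $current !== $old) {
            return true;
        }
    }
    // The original fell off the end here and implicitly returned null.
    return false;
}

var_dump(is_multiencodings($encodings1, $str1));
echo bin2hex($str1) . ' - ' . mb_detect_encoding($str1, $encodings1, true) . "\n";
echo bin2hex($str1) . ' - ' . mb_detect_encoding($str1, $encodings2, true) . "\n";
echo bin2hex($str2) . ' - ' . mb_detect_encoding($str2, $encodings1, true) . "\n";
echo bin2hex($str2) . ' - ' . mb_detect_encoding($str2, $encodings2, true) . "\n";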
?>
See the function is_multiencodings. This is a documentation issue: at least since PHP 8.1, mb_detect_encoding recognizes encodings other than single-byte ones, e.g. UTF-16.
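The question "is this string valid in more than one encoding?" can also be asked directly with mb_check_encoding, which validates a string against one named encoding instead of guessing among several. The following is only a sketch; the helper name valid_encodings is hypothetical and does not appear in the report or in mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): returns every candidate
// encoding in which $string is a valid byte sequence, in list order.
function valid_encodings(string $string, array $candidates): array {
    return array_values(array_filter(
        $candidates,
        fn (string $encoding): bool => mb_check_encoding($string, $encoding)
    ));
}

$candidates = ['UTF-16LE', 'UTF-16', 'UTF-8', 'ASCII'];

// The two-byte string '14' is valid in all four candidates.
$valid = valid_encodings('14', $candidates);
echo implode(', ', $valid), "\n";

// A lone 0xFF byte is valid in neither UTF-8 nor ASCII.
$none = valid_encodings("\xff", ['UTF-8', 'ASCII']);
echo count($none), "\n";
```

If more than one encoding comes back, the string is ambiguous and the caller has to decide which one to prefer, for example by taking the first entry of its own priority list.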
You are very right, it is a duplicate. The documentation for [...]

The test strings which @cristicotet kindly provided ("14" and "DQ") are valid in ASCII, UTF-8, UTF-16BE, UTF-16LE, and many, many legacy text encodings. This is to be expected, since the strings are only two bytes long. It is not possible to 'detect' the intended text encoding for such short strings.

It just happens that in UTF-16BE, the first string decodes to a character which is commonly used in Korean texts, and in UTF-16LE, the second string decodes to a character which is commonly used in Chinese and Japanese texts. You might say that your application rarely handles Korean, Chinese, or Japanese text, but I am open to boosting the estimated likelihood of the first candidate encoding in the list. This would be in harmony with the documentation, which says that [...]

For any interested persons who come across this thread in the future, please note that [...]
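The two alternative readings of the test strings can be reproduced with mb_convert_encoding. This is a small illustrative sketch; the variable names are assumptions made for the example and are not taken from the thread.

```php
<?php
// '14' is the byte pair 0x31 0x34. Read as UTF-16BE, that is the single
// code point U+3134 (a Hangul compatibility jamo).
$fromBE = mb_convert_encoding('14', 'UTF-8', 'UTF-16BE');

// 'DQ' is the byte pair 0x44 0x51. Read as UTF-16LE, that is the single
// code point U+5144 (a CJK ideograph).
$fromLE = mb_convert_encoding('DQ', 'UTF-8', 'UTF-16LE');

// Print the UTF-8 bytes of each decoded character.
echo bin2hex($fromBE), "\n"; // e384b4
echo bin2hex($fromLE), "\n"; // e58584

// Each two-byte input becomes exactly one character.
echo mb_strlen($fromBE, 'UTF-8'), ' ', mb_strlen($fromLE, 'UTF-8'), "\n"; // 1 1
```

Both results are well-formed single characters, which is why a strict validity check cannot rule out UTF-16 for these inputs.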
For users who want to help improve mbstring, #7871 is still open, but is stalled because we need more input from actual users of [...]
Closed as duplicate
Description
The following code:
https://3v4l.org/ehb6U#veol
Resulted in this output:
But I expected this output instead:
PHP Version
PHP 8.2.2
Operating System
Red Hat Enterprise Linux 9.1 / CentOS Linux 7