wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

cristicotet · 2023-02-01T11:45:27Z

Description

The following code:

$str1 = '14';
$str2 = 'DQ';
$encodings1 = ['ASCII', 'UTF-8', 'UTF-16', 'UTF-16LE'];
$encodings2 = ['UTF-8', 'UTF-16', 'UTF-16LE'];
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings1, true)."\n";
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings2, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings1, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings2, true)."\n";

Resulted in this output:

# version 8.1.0 - 8.2.2 - wrong output
3134 - UTF-16
3134 - UTF-16
4451 - UTF-16LE
4451 - UTF-16LE

But I expected this output instead:

# version 5.4.0 - 8.0.27 - correct output
3134 - ASCII
3134 - UTF-8
4451 - ASCII
4451 - UTF-8

PHP Version

PHP 8.2.2

Operating System

Red Hat Enterprise Linux 9.1 / CentOS Linux 7

The text was updated successfully, but these errors were encountered:

stramunin · 2023-02-01T13:22:52Z

This seems to be a duplicate of #8279

hormus · 2023-02-03T16:41:54Z

<?php

$str1 = '14';
$str2 = 'DQ';
$encodings1 = ['UTF-16LE', 'UTF-16', 'UTF-8', 'ASCII'];
$encodings2 = ['UTF-16LE', 'UTF-16', 'UTF-8'];
function is_multiencodings($list, $string) {
$current = '0';
foreach($list as $key => $val) {
$old = $current;
$current = mb_detect_encoding($string, $val, true);
if(!is_string($current)) {
if($old) {
$current = '' . $key;
} else {
$current = '0';
}
}
if($old && $current != $old) {
return true;
}
}
}
var_dump(is_multiencodings($encodings1, $str1));
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings1, true)."\n";
echo bin2hex($str1) . ' - ' .  mb_detect_encoding($str1, $encodings2, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings1, true)."\n";
echo bin2hex($str2) . ' - ' .  mb_detect_encoding($str2, $encodings2, true)."\n";
?>

Read function is_multiencodings. Documentation issue. At least since php 8.1 it recognizes encoding other than a single byte e.x. UTF-16

alexdowad · 2023-02-05T12:52:08Z

This seems to be a duplicate of #8279

You are very right, it is a duplicate.

The documentation for mb_detect_encoding states that the function "detects the most likely character encoding for string string from an ordered list of candidates." However, before PHP 8.1, the function did not do this at all. In most cases, it would just return the first candidate encoding in the list, if the input string was valid in that encoding. However, it now uses various heuristics to pick the "most likely character encoding", just as the documentation states. (If you prefer the old behavior, then mb_check_encoding is what you need.)

The test strings which @cristicotet kindly provided ("14" and "DQ") are valid in ASCII, UTF-8, UTF-16BE, UTF-16LE, and many, many legacy text encodings. This is to be expected, since the strings are only two bytes long. It is not possible to 'detect' the intended text encoding for such short strings. It just happens that in UTF-16BE, the first string decodes to a character which is commonly used in Korean texts, and in UTF-16LE, the second string decodes to a character which is commonly used in Chinese and Japanese texts.

You might say that your application rarely handles Korean, Chinese, or Japanese text, but mb_detect_encoding doesn't know that. (We don't currently use the value of mbstring.language to help mb_detect_encoding determine which candidate encoding to pick when the input string is valid in more than one of them, and I think this is probably for the best.)

I am open to boosting the estimated likelihood of the first candidate encoding in the list. This would be in harmony with the documentation, which says that mb_detect_encoding takes an ordered list of candidates. However, this would need to be done carefully. Again, if all you want is to pick the first encoding in the list when it works, then what you want is mb_check_encoding.

For any interested persons who come across this thread in the future, please note that mb_detect_encoding is the most over-used, and inappropriately used, function in mbstring. Most applications should not use mb_detect_encoding. However, for the few applications which really need mb_detect_encoding, users should always:

Provide mb_detect_encoding with as much input text as possible. Encoding detection is impossible on very short strings.
Provide as short a list of candidate encodings as possible. @cristicotet has done well to provide only 3/4 candidate encodings in the sample code; but many users pass mb_list_encodings(), which is always a bad idea.
Never perform actions which might cause data corruption based on the return value of mb_detect_encoding (because sometimes it will guess wrongly). If you convert input text based on the value of mb_detect_encoding, always keep a copy of the original text.

For users who want to help improve mbstring, #7871 is still open, but is stalled because we need more input from actual users of mb_detect_encoding.

Girgias · 2023-02-05T15:31:28Z

Closed as duplicate

cristicotet added Bug Status: Needs Triage labels Feb 1, 2023

Girgias assigned alexdowad Feb 2, 2023

Girgias added the Extension: mbstring label Feb 2, 2023

Girgias closed this as not planned Won't fix, can't repro, duplicate, stale Feb 5, 2023

Girgias added Status: Duplicate and removed Status: Needs Triage labels Feb 5, 2023

RoSk0 mentioned this issue Aug 8, 2023

PHP > 8.1 fails to detect Māori macrons #11908

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

cristicotet commented Feb 1, 2023 •

edited

Loading

stramunin commented Feb 1, 2023

Uh oh!

hormus commented Feb 3, 2023 •

edited

Loading

Uh oh!

alexdowad commented Feb 5, 2023

Uh oh!

Girgias commented Feb 5, 2023

Uh oh!

wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

wrong mb_detect_encoding since php8.1 for very simple utf-8 strings #10481

Comments

cristicotet commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

PHP Version

Operating System

stramunin commented Feb 1, 2023

Uh oh!

hormus commented Feb 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Feb 5, 2023

Uh oh!

Girgias commented Feb 5, 2023

Uh oh!

cristicotet commented Feb 1, 2023 •

edited

Loading

hormus commented Feb 3, 2023 •

edited

Loading