'royin' engine gives wrong romanization in a lot of cases #415

@bact

Try this test set:

from pythainlp.transliterate import romanize
test_cases = {
    None: "",
    "": "",
    "หมอก": "mok",
    "หาย": "hai",
    "แมว": "maeo",
    "เดือน": "duean",
    "ดำ": "dam",
    "ดู": "du",
    "บัว": "bua",
    "กก": "kok",
    "กร": "kon",
    "กรร": "kan",
    "กรรม": "kam",
    "กรม": "krom",  # failed
    "ฝ้าย": "fai",
    "นพพร": "nopphon",
    "ทีปกร": "thipakon",  # failed
    "ธรรพ์": "than",  # failed
    "ธรรม": "tham",  # failed
    "มหา": "maha",  # failed
    "หยาก": "yak",  # failed
    "อยาก": "yak",  # failed
    "ยมก": "yamok",  # failed
    "กลัว": "klua",  # failed
    "บ้านไร่": "banrai",  # failed
    "ชารินทร์": "charin",  # failed
}
# Compare the engine output with the expected romanization for every word.
for word, expect in test_cases.items():
    actual = romanize(word, engine="royin")
    print(f"{expect == actual} - word: {word} expect: {expect} actual: {actual}")

Nearly half of them fail (11 of 26):

True - word: None expect:  actual: 
True - word:  expect:  actual: 
True - word: หมอก expect: mok actual: mok
True - word: หาย expect: hai actual: hai
True - word: แมว expect: maeo actual: maeo
True - word: เดือน expect: duean actual: duean
True - word: ดำ expect: dam actual: dam
True - word: ดู expect: du actual: du
True - word: บัว expect: bua actual: bua
True - word: กก expect: kok actual: kok
True - word: กร expect: kon actual: kon
True - word: กรร expect: kan actual: kan
True - word: กรรม expect: kam actual: kam
False - word: กรม expect: krom actual: knm
True - word: ฝ้าย expect: fai actual: fai
True - word: นพพร expect: nopphon actual: nopphon
False - word: ทีปกร expect: thipakon actual: thipkon
False - word: ธรรพ์ expect: than actual: thonrop
False - word: ธรรม expect: tham actual: thnnm
False - word: มหา expect: maha actual: ma
False - word: หยาก expect: yak actual: hyak
False - word: อยาก expect: yak actual: ak
False - word: ยมก expect: yamok actual: ymk
False - word: กลัว expect: klua actual: knua
False - word: บ้านไร่ expect: banrai actual: bannai
False - word: ชารินทร์ expect: charin actual: charinthon

This test set will be added to test_transliterate.py.


Consistency Test

# These are two-syllable words, used to test whether the
# transliteration/romanization is consistent, i.e.
# romanize(1+2) == romanize(1) + romanize(2)
_CONSISTENCY_TESTS = [
    # ("กระจก", "กระ", "จก"),  # failed
    # ("ระเบิด", "ระ", "เบิด"),  # failed
    # ("หยากไย่", "หยาก", "ไย่"),  # failed
    ("ตากใบ", "ตาก", "ใบ"),
    # ("จัดสรร", "จัด", "สรร"),  # failed
]

def test_romanize_royin_consistency(self):
    for word, part1, part2 in _CONSISTENCY_TESTS:
        self.assertEqual(
            romanize(word, engine="royin"),
            (
                romanize(part1, engine="royin")
                + romanize(part2, engine="royin")
            ),
        )
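
To see which words break the consistency property without commenting cases out, the check could also be written as a plain function that collects mismatches. This is only a sketch: `find_inconsistent` is a hypothetical helper, and `fake_romanize` returns invented strings (not real engine output) so the example runs without PyThaiNLP.

```python
from typing import Callable, Iterable, List, Tuple


def find_inconsistent(
    romanize_fn: Callable[[str], str],
    cases: Iterable[Tuple[str, str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (word, whole, joined) for each case where the two results differ."""
    mismatches = []
    for word, part1, part2 in cases:
        whole = romanize_fn(word)
        joined = romanize_fn(part1) + romanize_fn(part2)
        if whole != joined:
            mismatches.append((word, whole, joined))
    return mismatches


# Invented outputs for illustration only; they are NOT real engine results.
_FAKE_OUTPUT = {
    "ตากใบ": "takbai", "ตาก": "tak", "ใบ": "bai",
    "กระจก": "krachok", "กระ": "kra", "จก": "chk",
}


def fake_romanize(text: str) -> str:
    return _FAKE_OUTPUT.get(text, "")


cases = [("ตากใบ", "ตาก", "ใบ"), ("กระจก", "กระ", "จก")]
for word, whole, joined in find_inconsistent(fake_romanize, cases):
    print(f"inconsistent: {word} whole={whole} joined={joined}")
```

Against the real engine, `romanize_fn` could be `functools.partial(romanize, engine="royin")`.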

In general, I think we need a more systematic evaluation of different algorithms, including soundex.
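
As a starting point for such an evaluation, a rough sketch could score each engine by the fraction of words it romanizes exactly as expected. The names `accuracy` and `compare_engines` are hypothetical, and the two engines below are stubs: `reported_royin` replays outputs quoted earlier in this issue, and `ideal` simply returns the expected value.

```python
from typing import Callable, Dict, Mapping


def accuracy(romanize_fn: Callable[[str], str], expected: Mapping[str, str]) -> float:
    """Fraction of words whose romanization exactly matches the expected value."""
    if not expected:
        return 0.0
    hits = sum(1 for word, want in expected.items() if romanize_fn(word) == want)
    return hits / len(expected)


def compare_engines(
    engines: Mapping[str, Callable[[str], str]],
    expected: Mapping[str, str],
) -> Dict[str, float]:
    """Score every engine on the same expected-output table."""
    return {name: accuracy(fn, expected) for name, fn in engines.items()}


# Three words from the test set in this issue.
EXPECTED = {"หมอก": "mok", "กรม": "krom", "มหา": "maha"}

# Stub that replays the 'royin' outputs reported in this issue.
_REPORTED = {"หมอก": "mok", "กรม": "knm", "มหา": "ma"}

engines = {
    "reported_royin": lambda word: _REPORTED.get(word, ""),
    "ideal": lambda word: EXPECTED.get(word, ""),
}

for name, score in compare_engines(engines, EXPECTED).items():
    print(f"{name}: {score:.2f}")
```

Exact-match accuracy is only one possible metric; an evaluation of soundex-style algorithms would likely need a different notion of correctness.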

    Labels

    Hacktoberfest (for Hacktoberfest event), bug (bugs in the library)
