Labels: Hacktoberfest (for Hacktoberfest event), bug (bugs in the library)
Description
Try this test set:
from pythainlp.transliterate import romanize

test_cases = {
    None: "",
    "": "",
    "หมอก": "mok",
    "หาย": "hai",
    "แมว": "maeo",
    "เดือน": "duean",
    "ดำ": "dam",
    "ดู": "du",
    "บัว": "bua",
    "กก": "kok",
    "กร": "kon",
    "กรร": "kan",
    "กรรม": "kam",
    "กรม": "krom",  # failed
    "ฝ้าย": "fai",
    "นพพร": "nopphon",
    "ทีปกร": "thipakon",  # failed
    "ธรรพ์": "than",  # failed
    "ธรรม": "tham",  # failed
    "มหา": "maha",  # failed
    "หยาก": "yak",  # failed
    "อยาก": "yak",  # failed
    "ยมก": "yamok",  # failed
    "กลัว": "klua",  # failed
    "บ้านไร่": "banrai",  # failed
    "ชารินทร์": "charin",  # failed
}

# Compare the expected romanization with the output of the "royin" engine.
for word in test_cases:
    expect = test_cases[word]
    actual = romanize(word, engine="royin")
    print(f"{expect == actual} - word: {word} expect: {expect} actual: {actual}")
Nearly half of them fail (11 of 26):
True - word: None expect: actual:
True - word: expect: actual:
True - word: หมอก expect: mok actual: mok
True - word: หาย expect: hai actual: hai
True - word: แมว expect: maeo actual: maeo
True - word: เดือน expect: duean actual: duean
True - word: ดำ expect: dam actual: dam
True - word: ดู expect: du actual: du
True - word: บัว expect: bua actual: bua
True - word: กก expect: kok actual: kok
True - word: กร expect: kon actual: kon
True - word: กรร expect: kan actual: kan
True - word: กรรม expect: kam actual: kam
False - word: กรม expect: krom actual: knm
True - word: ฝ้าย expect: fai actual: fai
True - word: นพพร expect: nopphon actual: nopphon
False - word: ทีปกร expect: thipakon actual: thipkon
False - word: ธรรพ์ expect: than actual: thonrop
False - word: ธรรม expect: tham actual: thnnm
False - word: มหา expect: maha actual: ma
False - word: หยาก expect: yak actual: hyak
False - word: อยาก expect: yak actual: ak
False - word: ยมก expect: yamok actual: ymk
False - word: กลัว expect: klua actual: knua
False - word: บ้านไร่ expect: banrai actual: bannai
False - word: ชารินทร์ expect: charin actual: charinthon
This test set will be added to test_transliterate.py.
Consistency Test
# These are two-syllable words, used to test whether the
# transliteration/romanization is consistent, i.e. whether
# romanize(1+2) == romanize(1) + romanize(2)
_CONSISTENCY_TESTS = [
    # ("กระจก", "กระ", "จก"),  # failed
    # ("ระเบิด", "ระ", "เบิด"),  # failed
    # ("หยากไย่", "หยาก", "ไย่"),  # failed
    ("ตากใบ", "ตาก", "ใบ"),
    # ("จัดสรร", "จัด", "สรร"),  # failed
]

def test_romanize_royin_consistency(self):
    for word, part1, part2 in _CONSISTENCY_TESTS:
        self.assertEqual(
            romanize(word, engine="royin"),
            (
                romanize(part1, engine="royin")
                + romanize(part2, engine="royin")
            ),
        )
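Instead of commenting out the failing cases, the same check could report every mismatch. The following is a minimal sketch under assumptions: `find_inconsistencies` is a hypothetical helper, not part of PyThaiNLP, and `toy_romanize` is a stand-in lookup over placeholder strings, used only so the example runs without the library.

```python
from typing import Callable, Iterable, List, Tuple


def find_inconsistencies(
    romanize_fn: Callable[[str], str],
    cases: Iterable[Tuple[str, str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (word, whole, joined) for every case where
    romanize(word) != romanize(part1) + romanize(part2)."""
    failures = []
    for word, part1, part2 in cases:
        whole = romanize_fn(word)
        joined = romanize_fn(part1) + romanize_fn(part2)
        if whole != joined:
            failures.append((word, whole, joined))
    return failures


# Stand-in romanizer over placeholder strings (NOT real romanization output).
_TOY_TABLE = {"ab": "AB", "a": "A", "b": "B", "cd": "Q", "c": "C", "d": "D"}


def toy_romanize(text: str) -> str:
    return _TOY_TABLE[text]


toy_cases = [("ab", "a", "b"), ("cd", "c", "d")]
toy_failures = find_inconsistencies(toy_romanize, toy_cases)
for word, whole, joined in toy_failures:
    print(f"inconsistent: {word!r} whole={whole!r} joined={joined!r}")
```

With the library installed, the callable would presumably be something like `lambda w: romanize(w, engine="royin")`, and `cases` would be the full `_CONSISTENCY_TESTS` list with the failing entries restored.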
In general, I think we need a more systematic evaluation of different algorithms, including soundex.
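One possible starting point for such an evaluation is to score each algorithm against the same expected outputs. This is a rough sketch, not an existing PyThaiNLP utility: `accuracy_by_engine` is a hypothetical name, and the two engines below are trivial stand-ins over placeholder strings so the example runs on its own.

```python
from typing import Callable, Dict, Mapping


def accuracy_by_engine(
    engines: Mapping[str, Callable[[str], str]],
    expected: Mapping[str, str],
) -> Dict[str, float]:
    """Return the fraction of exact matches for each named engine."""
    scores = {}
    for name, fn in engines.items():
        correct = sum(1 for word, want in expected.items() if fn(word) == want)
        # Avoid dividing by zero when there are no test cases.
        scores[name] = correct / len(expected) if expected else 0.0
    return scores


# Stand-in engines and data for illustration only (assumptions, not real engines).
stand_in_engines = {
    "identity": lambda text: text,
    "upper": str.upper,
}
stand_in_expected = {"a": "A", "b": "B", "C": "C"}

scores = accuracy_by_engine(stand_in_engines, stand_in_expected)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In practice each entry in `engines` would wrap a real call, for example `lambda w: romanize(w, engine="royin")`, and `expected` would be the `test_cases` dictionary from the description. Exact-match accuracy is only one possible metric; comparing soundex-style algorithms would likely need a different criterion.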