Commit 656b498

Implement Cyrillic to Latin conversion
In the process of implementing the reverse direction, several limitations in the rules were addressed. The adapted rules were tested on a small Wikipedia text corpus and a decrease in the error rate was witnessed. Notable changes include the removal of the vertical bar (|) for precedence. Instead, the bi-gram "kh" was introduced. "ъ" is now mapped onto ` instead of " since double quotes are commonly used in Slavic texts. The rules now also handle different cases of capital letters. Closes #2.
1 parent 5ddaa9f commit 656b498

8 files changed

+286
-90
lines changed

README.md

Lines changed: 23 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22
[![Build Status](http://ci.sparse.tech/api/badges/sparsetech/translit-scala/status.svg)](http://ci.sparse.tech/sparsetech/translit-scala)
33
[![Maven Central](https://img.shields.io/maven-central/v/tech.sparse/translit-scala_2.12.svg)](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22tech.sparse%22%20AND%20a%3A%22translit-scala_2.12%22)
44

5-
translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet.
5+
translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet and vice-versa.
66

77
## Compatibility
88
| Back end | Scala versions |
@@ -52,7 +52,7 @@ We decompose letters in their Latin transliteration more consistently than Natio
5252
* Volodymyr (Володимир)
5353
* blyz'ko (близько)
5454

55-
The Latin letter *y* is also the phonetic basis of four letters in the Slavic alphabet: я, є, ї, ю. They get transliterated accordingly:
55+
The Latin letter *y* is also the phonetic basis of four letters (iotated vowels) in the Ukrainian alphabet: я, є, ї, ю. They get transliterated accordingly:
5656

5757
* ya → я
5858
* ye → є
@@ -63,20 +63,23 @@ Unlike National 2010, we always use the same transliteration regardless of the p
6363

6464
The accented counterpart of и is й and is represented by a separate letter, *j*.
6565

66-
*Example:* Zhurs'kyj (Згурський)
66+
*Example:* Zgurs'kyj (Згурський)
6767

6868
#### Soft Signs and Apostrophes
6969
The second change to National 2010 is that we try to restore soft signs and apostrophes:
7070

7171
* Ukrayins'kyj (Український), malen'kyj (маленький)
7272
* m'yaso (м'ясо), matir'yu (матір'ю)
7373

74+
In National 2010, *g* is mapped to *ґ* which is phonetically accurate, though the letter is fairly uncommon in Ukrainian. Therefore, *ґ* is represented by *g'*.
75+
7476
This feature is experimental and can be disabled by setting `apostrophes` to `false`.
7577

7678
#### Convenience mappings
7779
Another modification was to provide the following mappings:
7880

7981
* c → ц
82+
* h → х
8083
* q → щ
8184
* w → ш
8285
* x → ж
@@ -91,9 +94,7 @@ Note that these mappings are phonetically inaccurate. However, using them still
9194
* Another advantage is the proximity on the English keyboard layout:
9295
* *q* and *w* are located next to each other; *ш* and *щ* characters are phonetically close
9396
* *z* and *x* are located next to each other; *з* and *ж* characters are phonetically close
94-
95-
#### Precedence
96-
The replacement patterns are applied sequentially by traversing the input character-by-character. In some cases, a rule spanning multiple characters should not be applied. An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can place a vertical bar between the two characters. The full transliteration then looks as follows: *s|hyl'nist*
97+
* *h* is mapped to *х* since it is a common letter, *kh* is only needed in case *h* is ambiguous
9798

9899
## Russian
99100
The Russian rules are similar to the Ukrainian ones.
@@ -103,13 +104,7 @@ Some differences are:
103104
* *i* corresponds to *и*, whereas *y* to *ы*
104105
* Russian distinguishes between soft and hard signs. It does not have apostrophes. The following mappings are used:
105106
* Soft sign: *'* for ь
106-
* Hard sign: *"* for ъ
107-
108-
### Precedence
109-
As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules from being applied.
110-
111-
* красивые: krasivy|e
112-
* сходить: s|hodit
107+
* Hard sign: *`* for ъ
113108

114109
### Mapping
115110
| Latin | Cyrillic |
@@ -121,7 +116,7 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
121116
| e | е |
122117
| f | ф |
123118
| g | г |
124-
| h | х |
119+
| h, kh | х |
125120
| i | и |
126121
| j | й |
127122
| k | к |
@@ -141,7 +136,7 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
141136
| y | ы |
142137
| z | з |
143138
| ' | ь |
144-
| " | ъ |
139+
| \` | ъ |
145140
| ch | ч |
146141
| sh | ш |
147142
| ya | я |
@@ -151,15 +146,20 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
151146
| yu | ю |
152147
| shch | щ |
153148

154-
#### Examples
155-
| Russian | Transliterated |
156-
|---------|----------------|
157-
| Привет | Privet |
158-
| Съел | S"el |
159-
| Щётка | Shchyotka |
160-
| Льдина | L'dina |
149+
### Examples
150+
| Russian | Transliterated |
151+
|----------|----------------|
152+
| Привет | Privet |
153+
| Съел | S\`el |
154+
| Щётка | Shchyotka |
155+
| Льдина | L'dina |
156+
| красивые | krasivye |
157+
| сходить | skhodit' |
158+
159+
## Internals
160+
The replacement patterns are applied sequentially by traversing the input character-by-character. The functions `latinToCyrillicIncremental` and `cyrillicToLatinIncremental` take the left context which is needed for some rules. The result indicates the number of characters to remove and a replacement string.
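The offset-and-replacement protocol described above can be sketched as follows. This is a minimal illustration and not the library's rule set: the object name `IncrementalSketch`, the function `step`, and the tiny rule tables are assumptions made for the example; only the driver loop follows the shape of `latinToCyrillic`.

```scala
// Toy illustration of the incremental protocol. The rule tables are a
// hypothetical subset, not the library's actual rules.
object IncrementalSketch {
  val uniGrams: Map[Char, Char] =
    Map('s' -> 'с', 'h' -> 'х', 'k' -> 'к', 'a' -> 'а')
  val biGrams: Map[String, Char] = Map("sh" -> 'ш', "kh" -> 'х')

  // Given the Latin text consumed so far and the next character, return
  // (offset, replacement). A negative offset asks the caller to drop that
  // many characters from the end of the output before appending.
  def step(latin: String, append: Char): (Int, String) = {
    val text = latin + append
    val ofs = text.length
    if (ofs >= 2 && biGrams.contains(text.substring(ofs - 2, ofs)))
      (-1, biGrams(text.substring(ofs - 2, ofs)).toString)
    else
      (0, uniGrams.getOrElse(append, append).toString)
  }

  // Driver loop: feed one character at a time and apply each result.
  def convert(text: String): String = {
    val result = new StringBuilder(text.length)
    var offset = 0
    while (offset < text.length) {
      val (length, s) = step(text.take(offset), text(offset))
      if (length < 0) result.setLength(result.length + length)
      result.append(s)
      offset += 1
    }
    result.toString
  }
}
```

With these toy rules, `convert("sha")` first emits *с*, then replaces it with *ш* when the *h* arrives, while `convert("skha")` keeps *с* and turns *kh* into *х*, which is the case the former vertical bar was meant to handle.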
161161

162-
### Credits
162+
## Credits
163163
The rules and examples were adapted from the following libraries:
164164

165165
* [translit-english-ukrainian](https://github.com/MarkovSergii/translit-english-ukrainian)

shared/src/main/scala/translit/Helpers.scala

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
11
package translit
22

33
object Helpers {
4+
def applyCase(str: String, isUpper: Boolean): String =
5+
if (isUpper) str(0).toUpper + str.tail else str
6+
47
def restoreCaseAll(str: String, cyrillic: Char): Char =
58
if (str.forall(_.isUpper)) cyrillic.toUpper else cyrillic
69

shared/src/main/scala/translit/Language.scala

Lines changed: 17 additions & 0 deletions
@@ -15,6 +15,8 @@ trait Language {
1515
latin: String, cyrillic: String, append: Char
1616
): (Int, String)
1717

18+
def cyrillicToLatinIncremental(cyrillic: String, letter: Char): (Int, String)
19+
1820
def latinToCyrillic(text: String): String = {
1921
val result = new StringBuilder(text.length)
2022
var offset = 0
@@ -29,4 +31,19 @@ trait Language {
2931

3032
result.mkString
3133
}
34+
35+
def cyrillicToLatin(text: String): String = {
36+
val result = new StringBuilder(text.length * 2)
37+
var offset = 0
38+
39+
while (offset < text.length) {
40+
val (length, c) = cyrillicToLatinIncremental(
41+
text.take(offset), text(offset))
42+
if (length < 0) result.setLength(result.length + length)
43+
result.append(c)
44+
offset += 1
45+
}
46+
47+
result.mkString
48+
}
3249
}

shared/src/main/scala/translit/Noop.scala

Lines changed: 4 additions & 0 deletions
@@ -4,5 +4,9 @@ object Noop extends translit.Language {
44
override def latinToCyrillicIncremental(
55
latin: String, cyrillic: String, append: Char
66
): (Int, String) = (0, append.toString)
7+
8+
override def cyrillicToLatinIncremental(
9+
cyrillic: String, letter: Char
10+
): (Int, String) = (0, letter.toString)
711
}
812

shared/src/main/scala/translit/Russian.scala

Lines changed: 94 additions & 25 deletions
@@ -29,9 +29,13 @@ object Russian extends Language {
2929
'w' -> 'ш',
3030
'x' -> 'ж',
3131
'y' -> 'ы',
32-
'z' -> 'з',
32+
'z' -> 'з'
33+
)
34+
35+
// Infer case from previous character
36+
val uniGramsSpecial = Map(
3337
'\'' -> 'ь',
34-
'"' -> 'ъ'
38+
'`' -> 'ъ'
3539
)
3640

3741
val biGrams = Map(
@@ -42,9 +46,7 @@ object Russian extends Language {
4246
"zh" -> 'ж',
4347
"yo" -> 'ё',
4448
"yu" -> 'ю',
45-
46-
"y|" -> 'ы', // красивые, выучил
47-
"s|" -> 'с' // сходить
49+
"kh" -> 'х'
4850
)
4951

5052
val triGrams = Map[String, Char]()
@@ -53,31 +55,98 @@ object Russian extends Language {
5355
"shch" -> 'щ'
5456
)
5557

58+
val uniGramsInv = uniGrams.toList.map(_.swap).toMap
59+
val uniGramsSpecialInv = uniGramsSpecial.toList.map(_.swap).toMap
60+
val biGramsInv = biGrams.toList.map(_.swap).toMap
61+
val triGramsInv = triGrams.toList.map(_.swap).toMap
62+
val fourGramsInv = fourGrams.toList.map(_.swap).toMap
63+
64+
// y after m/n/r/t/v will be rendered as ы unless it is iotated
65+
val yLetters = Set("my", "ny", "ry", "ty", "vy")
66+
67+
// If the y is iotated, render it as я, ё or ю
68+
val iotatedLetters = Set("ya", "yo", "yu")
69+
5670
override def latinToCyrillicIncremental(
5771
latin: String, cyrillic: String, append: Char
5872
): (Int, String) = {
5973
val text = latin + append
6074
val ofs = text.length
61-
if (ofs >= 4 &&
62-
fourGrams.contains(text.substring(ofs - 4, ofs).toLowerCase)) {
63-
val chars = text.substring(ofs - 4, ofs)
64-
val cyrillic = fourGrams(chars.toLowerCase)
65-
(-2, restoreCaseFirst(chars, cyrillic).toString)
66-
} else if (ofs >= 3 &&
67-
triGrams.contains(text.substring(ofs - 3, ofs).toLowerCase)) {
68-
val chars = text.substring(ofs - 3, ofs)
69-
val cyrillic = triGrams(chars.toLowerCase)
70-
(-2, restoreCaseFirst(chars, cyrillic).toString)
71-
} else if (ofs >= 2 &&
72-
biGrams.contains(text.substring(ofs - 2, ofs).toLowerCase)) {
73-
val chars = text.substring(ofs - 2, ofs)
74-
val cyrillic = biGrams(chars.toLowerCase)
75-
(-1, restoreCaseFirst(chars, cyrillic).toString)
76-
} else if (uniGrams.contains(text(ofs - 1).toLower)) {
77-
val cyrillic = uniGrams(text(ofs - 1).toLower)
78-
(0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
79-
} else {
80-
(0, text(ofs - 1).toString)
75+
val result =
76+
if (ofs >= 4 &&
77+
fourGrams.contains(text.substring(ofs - 4, ofs).toLowerCase)) {
78+
val chars = text.substring(ofs - 4, ofs)
79+
val cyrillic = fourGrams(chars.toLowerCase)
80+
(-2, restoreCaseFirst(chars, cyrillic).toString)
81+
} else if (ofs >= 3
82+
&& yLetters.contains(text.substring(ofs - 3, ofs - 1).toLowerCase)
83+
&& !iotatedLetters.contains(text.substring(ofs - 2, ofs).toLowerCase)
84+
) {
85+
val cyrillic = uniGrams.getOrElse(text(ofs - 1).toLower, text(ofs - 1))
86+
(0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
87+
} else if (ofs >= 3 &&
88+
triGrams.contains(text.substring(ofs - 3, ofs).toLowerCase)) {
89+
val chars = text.substring(ofs - 3, ofs)
90+
val cyrillic = triGrams(chars.toLowerCase)
91+
(-2, restoreCaseFirst(chars, cyrillic).toString)
92+
} else if (ofs >= 2 &&
93+
biGrams.contains(text.substring(ofs - 2, ofs).toLowerCase)) {
94+
val chars = text.substring(ofs - 2, ofs)
95+
val cyrillic = biGrams(chars.toLowerCase)
96+
(-1, restoreCaseFirst(chars, cyrillic).toString)
97+
} else if (uniGrams.contains(text(ofs - 1).toLower)) {
98+
val cyrillic = uniGrams(text(ofs - 1).toLower)
99+
(0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
100+
} else if (ofs >= 2 && uniGramsSpecial.contains(text(ofs - 1))) {
101+
val result =
102+
if (ofs >= 3 && text(ofs - 2).isUpper && text(ofs - 3).isUpper)
103+
uniGramsSpecial(text(ofs - 1)).toUpper
104+
else uniGramsSpecial(text(ofs - 1))
105+
(0, result.toString)
106+
} else {
107+
(0, text(ofs - 1).toString)
108+
}
109+
110+
if (ofs >= 3 && uniGramsSpecial.contains(text(ofs - 2))) {
111+
val (l, r) = (text(ofs - 3), text(ofs - 1))
112+
val letter = uniGramsSpecial(text(ofs - 2))
113+
val replace = if (l.isUpper && r.isUpper) letter.toUpper else letter
114+
val cyrillicOfs = cyrillic.length - 1
115+
116+
if (replace == cyrillic(cyrillicOfs)) result
117+
else {
118+
val updated = replace + cyrillic.substring(
119+
cyrillicOfs + 1, cyrillic.length + result._1)
120+
(-updated.length + result._1, updated + result._2)
121+
}
122+
} else result
123+
}
124+
125+
private def toLatin(letter: Char): String = {
126+
val isUpper = letter.isUpper
127+
val letterLc = letter.toLower
128+
fourGramsInv.get(letterLc).map(applyCase(_, isUpper))
129+
.orElse(triGramsInv.get(letterLc).map(applyCase(_, isUpper)))
130+
.orElse(biGramsInv.get(letterLc).map(applyCase(_, isUpper)))
131+
.orElse(uniGramsInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
132+
.orElse(uniGramsSpecialInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
133+
.getOrElse(letter.toString)
134+
}
135+
136+
override def cyrillicToLatinIncremental(
137+
cyrillic: String, letter: Char
138+
): (Int, String) = {
139+
val current = toLatin(letter)
140+
141+
val changeCase =
142+
letter.isUpper &&
143+
(cyrillic.length == 1 || cyrillic.lastOption.exists(_.isUpper))
144+
145+
if (!changeCase) (0, current)
146+
else {
147+
val mapped = toLatin(cyrillic.last)
148+
val rest = mapped.tail
149+
(-rest.length, rest.toUpperCase + current.toUpperCase)
81150
}
82151
}
83152
}
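
The case handling in the reverse direction can be hard to follow from the diff alone, so here is a reduced sketch of the idea. It assumes a hypothetical two-entry mapping and simplifies the `changeCase` condition to "current and previous letter are both upper case"; the names `ReverseCaseSketch`, `inverse`, and `step` are illustrative and not part of the library.

```scala
// Sketch of case propagation in the Cyrillic-to-Latin direction. The mapping
// is a hypothetical two-entry subset and the condition is simplified.
object ReverseCaseSketch {
  val inverse: Map[Char, String] = Map('щ' -> "shch", 'и' -> "i")

  // Capitalise only the first Latin letter of a multi-letter replacement.
  def applyCase(str: String, isUpper: Boolean): String =
    if (isUpper && str.nonEmpty) s"${str.head.toUpper}${str.tail}" else str

  def toLatin(letter: Char): String =
    inverse.get(letter.toLower)
      .map(applyCase(_, letter.isUpper))
      .getOrElse(letter.toString)

  // If the current and previous letters are both upper case, upper-case the
  // tail of the previous replacement retroactively ("Shch" becomes "SHCH").
  def step(cyrillic: String, letter: Char): (Int, String) = {
    val current = toLatin(letter)
    val changeCase = letter.isUpper && cyrillic.lastOption.exists(_.isUpper)
    if (!changeCase) (0, current)
    else {
      val rest = toLatin(cyrillic.last).tail
      (-rest.length, rest.toUpperCase + current.toUpperCase)
    }
  }

  def convert(text: String): String = {
    val result = new StringBuilder(text.length * 2)
    var offset = 0
    while (offset < text.length) {
      val (length, s) = step(text.take(offset), text(offset))
      if (length < 0) result.setLength(result.length + length)
      result.append(s)
      offset += 1
    }
    result.toString
  }
}
```

A capital letter on its own yields a title-cased replacement (`Щи` becomes `Shchi`), and only a following capital triggers the retroactive rewrite (`ЩИ` becomes `SHCHI`).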

shared/src/main/scala/translit/Ukrainian.scala

Lines changed: 42 additions & 3 deletions
@@ -14,7 +14,6 @@ class Ukrainian(apostrophes: Boolean) extends Language {
1414
'e' -> 'е',
1515
'f' -> 'ф',
1616
'g' -> 'г',
17-
'h' -> 'х',
1817
'i' -> 'і',
1918
'j' -> 'й',
2019
'k' -> 'к',
@@ -34,6 +33,7 @@ class Ukrainian(apostrophes: Boolean) extends Language {
3433
// Mappings for more convenient typing. Allows us to cover every letter of
3534
// the Latin alphabet
3635
'c' -> 'ц',
36+
'h' -> 'х',
3737
'q' -> 'щ',
3838
'w' -> 'ш',
3939
'x' -> 'ж'
@@ -52,8 +52,7 @@ class Ukrainian(apostrophes: Boolean) extends Language {
5252
"ts" -> 'ц',
5353
"zh" -> 'ж',
5454

55-
// With the vertical bar, transliteration can be disabled.
56-
"s|" -> 'с'
55+
"kh" -> 'х'
5756
)
5857

5958
val triGrams = Map[String, Char]()
@@ -62,6 +61,15 @@ class Ukrainian(apostrophes: Boolean) extends Language {
6261
"shch" -> 'щ'
6362
)
6463

64+
val uniGramsInv = uniGrams.toList.map(_.swap).toMap
65+
val uniGramsSpecialInv = Map(
66+
'ь' -> '\'',
67+
'\'' -> '\''
68+
)
69+
val biGramsInv = biGrams.toList.map(_.swap).toMap
70+
val triGramsInv = triGrams.toList.map(_.swap).toMap
71+
val fourGramsInv = fourGrams.toList.map(_.swap).toMap
72+
6573
val apostrophePatterns = Set(
6674
('b', "ya"),
6775
('b', "ye"),
@@ -157,6 +165,37 @@ class Ukrainian(apostrophes: Boolean) extends Language {
157165
}
158166
} else result
159167
}
168+
169+
private def toLatin(letter: Char): String = {
170+
val isUpper = letter.isUpper
171+
val letterLc = letter.toLower
172+
fourGramsInv.get(letterLc).map(applyCase(_, isUpper))
173+
.orElse(triGramsInv.get(letterLc).map(applyCase(_, isUpper)))
174+
.orElse(biGramsInv.get(letterLc).map(applyCase(_, isUpper)))
175+
.orElse(uniGramsInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
176+
.orElse(uniGramsSpecialInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
177+
.getOrElse(letter.toString)
178+
}
179+
180+
override def cyrillicToLatinIncremental(
181+
cyrillic: String, letter: Char
182+
): (Int, String) = {
183+
val current = toLatin(letter)
184+
185+
val changeCase =
186+
letter.isUpper && {
187+
val withoutApostrophes = cyrillic.filter(_ != '\'')
188+
withoutApostrophes.length == 1 ||
189+
withoutApostrophes.lastOption.exists(_.isUpper)
190+
}
191+
192+
if (!changeCase) (0, current)
193+
else {
194+
val mapped = toLatin(cyrillic.last)
195+
val rest = mapped.tail
196+
(-rest.length, rest.toUpperCase + current.toUpperCase)
197+
}
198+
}
160199
}
161200

162201
object Ukrainian extends Ukrainian(apostrophes = true)
