Commit e84c03c

Merge pull request #12 from sparsetech/feat/to-latin
Implement Cyrillic to Latin conversion
2 parents 5ddaa9f + bd5ce6b commit e84c03c

8 files changed

+297
-90
lines changed

README.md

Lines changed: 34 additions & 23 deletions
@@ -2,7 +2,7 @@
 [![Build Status](http://ci.sparse.tech/api/badges/sparsetech/translit-scala/status.svg)](http://ci.sparse.tech/sparsetech/translit-scala)
 [![Maven Central](https://img.shields.io/maven-central/v/tech.sparse/translit-scala_2.12.svg)](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22tech.sparse%22%20AND%20a%3A%22translit-scala_2.12%22)

-translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet.
+translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet and vice-versa.

 ## Compatibility
 | Back end | Scala versions |
@@ -52,7 +52,7 @@ We decompose letters in their Latin transliteration more consistently than Natio
 * Volodymyr (Володимир)
 * blyz'ko (близько)

-The Latin letter *y* is also the phonetic basis of four letters in the Slavic alphabet: я, є, ї, ю. They get transliterated accordingly:
+The Latin letter *y* forms the phonetic basis of four letters (iotated vowels) in the Ukrainian alphabet: я, є, ї, ю. They get transliterated accordingly:

 * ya → я
 * ye → є
@@ -63,20 +63,23 @@ Unlike National 2010, we always use the same transliteration regardless of the p

 The accented counterpart of и is й and is represented by a separate letter, *j*.

-*Example:* Zhurs'kyj (Згурський)
+*Example:* Zgurs'kyj (Згурський)

 #### Soft Signs and Apostrophes
 The second change to National 2010 is that we try to restore soft signs and apostrophes:

 * Ukrayins'kyj (Український), malen'kyj (маленький)
 * m'yaso (м'ясо), matir'yu (матір'ю)

+In National 2010, *g* gets mapped to *ґ* which is phonetically accurate, though the letter *ґ* is fairly uncommon in Ukrainian. Therefore, we represent *ґ* by the bi-gram *g'*.
+
 This feature is experimental and can be disabled by setting `apostrophes` to `false`.

 #### Convenience mappings
 Another modification was to provide the following mappings:

 * c → ц
+* h → х
 * q → щ
 * w → ш
 * x → ж
@@ -91,9 +94,8 @@ Note that these mappings are phonetically inaccurate. However, using them still
 * Another advantage is the proximity on the English keyboard layout:
 * *q* and *w* are located next to each other; *ш* and *щ* characters are phonetically close
 * *z* and *x* are located next to each other; *з* and *ж* characters are phonetically close
-
-#### Precedence
-The replacement patterns are applied sequentially by traversing the input character-by-character. In some cases, a rule spanning multiple characters should not be applied. An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can place a vertical bar between the two characters. The full transliteration then looks as follows: *s|hyl'nist*
+* *h* is mapped to *х* since it is a common letter, *kh* is only needed in case *h* is ambiguous
+* An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can use the bi-gram *kh* instead to represent *х*. The full transliteration then looks as follows: *skhyl'nist*

 ## Russian
 The Russian rules are similar to the Ukrainian ones.
@@ -103,13 +105,7 @@ Some differences are:
 * *i* corresponds to *и*, whereas *y* to *ы*
 * Russian distinguishes between soft and hard signs. It does not have apostrophes. The following mappings are used:
 * Soft sign: *'* for ь
-* Hard sign: *"* for ъ
-
-### Precedence
-As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules from being applied.
-
-* красивые: krasivy|e
-* сходить: s|hodit
+* Hard sign: *`* for ъ

 ### Mapping
 | Latin | Cyrillic |
@@ -121,7 +117,7 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
 | e | е |
 | f | ф |
 | g | г |
-| h | х |
+| h, kh | х |
 | i | и |
 | j | й |
 | k | к |
@@ -141,7 +137,7 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
 | y | ы |
 | z | з |
 | ' | ь |
-| " | ъ |
+| \` | ъ |
 | ch | ч |
 | sh | ш |
 | ya | я |
@@ -151,15 +147,30 @@ As with the Ukrainian rules, a vertical bar can be placed to avoid certain rules
 | yu | ю |
 | shch | щ |

-#### Examples
-| Russian | Transliterated |
-|---------|----------------|
-| Привет | Privet |
-| Съел | S"el |
-| Щётка | Shchyotka |
-| Льдина | L'dina |
+### Examples
+| Russian | Transliterated |
+|----------|----------------|
+| Привет | Privet |
+| Съел | S\`el |
+| Щётка | Shchyotka |
+| Льдина | L'dina |
+| красивые | krasivye |
+| сходить | skhodit' |
+
+## Internals
+The replacement patterns are applied sequentially by traversing the input character-by-character. The functions `latinToCyrillicIncremental` and `cyrillicToLatinIncremental` take the left context which is needed by some rules, for example to determine the correct case of soft/hard signs. The result of the functions indicates the number of characters to remove on the right as well as their string replacement.
+
+```scala
+def latinToCyrillicIncremental(
+  latin: String, cyrillic: String, append: Char
+): (Int, String)
+
+def cyrillicToLatinIncremental(
+  cyrillic: String, letter: Char
+): (Int, String)
+```

-### Credits
+## Credits
 The rules and examples were adapted from the following libraries:

 * [translit-english-ukrainian](https://github.com/MarkovSergii/translit-english-ukrainian)
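
The Internals text in the README change says the left context is needed to pick the case of soft/hard signs. A minimal sketch of that idea, assuming the rule visible in the Russian.scala hunk (a sign is upper-cased only when both Latin neighbours are upper case). The object `SignCaseSketch` and the helper `signBetween` are hypothetical names for illustration, not part of the library:

```scala
object SignCaseSketch {
  // Sign table as introduced by this commit: ' for ь and ` for ъ.
  val uniGramsSpecial: Map[Char, Char] = Map('\'' -> 'ь', '`' -> 'ъ')

  // Hypothetical helper: returns the Cyrillic sign for a Latin marker,
  // upper-cased only when both neighbouring Latin letters are upper case.
  // Returns None when `sign` is not one of the two markers.
  def signBetween(left: Char, sign: Char, right: Char): Option[Char] =
    uniGramsSpecial.get(sign).map { letter =>
      if (left.isUpper && right.isUpper) letter.toUpper else letter
    }
}
```

Because the right neighbour is not known when the sign is first typed, the incremental function in the diff appears to revisit the sign once the next character arrives; the sketch only covers the final decision.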

shared/src/main/scala/translit/Helpers.scala

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
 package translit

 object Helpers {
+  def applyCase(str: String, isUpper: Boolean): String =
+    if (isUpper) str(0).toUpper + str.tail else str
+
   def restoreCaseAll(str: String, cyrillic: Char): Char =
     if (str.forall(_.isUpper)) cyrillic.toUpper else cyrillic
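
To make the behaviour of the new `applyCase` helper concrete, here is a self-contained sketch. The object name `ApplyCaseSketch` is hypothetical, and the empty-string guard is an assumption added here: the version in the diff indexes `str(0)` directly, so it would throw on an empty string when `isUpper` is true.

```scala
object ApplyCaseSketch {
  // Upper-cases only the first character when `isUpper` is set, so a
  // multi-letter replacement such as "shch" becomes "Shch", not "SHCH".
  // The `nonEmpty` guard is an addition for this sketch.
  def applyCase(str: String, isUpper: Boolean): String =
    if (isUpper && str.nonEmpty) s"${str.head.toUpper}${str.tail}" else str
}
```

Fully upper-case runs are handled separately in the commit: `cyrillicToLatinIncremental` rewrites the tail of the previous replacement when consecutive Cyrillic letters are upper case.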

shared/src/main/scala/translit/Language.scala

Lines changed: 17 additions & 0 deletions
@@ -15,6 +15,8 @@ trait Language {
     latin: String, cyrillic: String, append: Char
   ): (Int, String)

+  def cyrillicToLatinIncremental(cyrillic: String, letter: Char): (Int, String)
+
   def latinToCyrillic(text: String): String = {
     val result = new StringBuilder(text.length)
     var offset = 0
@@ -29,4 +31,19 @@ trait Language {

     result.mkString
   }
+
+  def cyrillicToLatin(text: String): String = {
+    val result = new StringBuilder(text.length * 2)
+    var offset = 0
+
+    while (offset < text.length) {
+      val (length, c) = cyrillicToLatinIncremental(
+        text.take(offset), text(offset))
+      if (length < 0) result.setLength(result.length + length)
+      result.append(c)
+      offset += 1
+    }
+
+    result.mkString
+  }
 }
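
The `(Int, String)` contract used by the loop above can be tried in isolation. The following is a sketch with a hypothetical toy rule set (`z` → з, `h` → х, `zh` → ж) and hypothetical names `IncrementalSketch`, `step` and `convert`; it is not the library's implementation, only the same loop shape applied to a two-argument step function:

```scala
object IncrementalSketch {
  // Hypothetical toy step function. The first component is zero or negative:
  // the number of already emitted output characters to drop. The second
  // component is the text to append afterwards.
  def step(left: String, next: Char): (Int, String) =
    if (next == 'h' && left.endsWith("z")) (-1, "ж") // "zh" replaces the earlier з
    else next match {
      case 'z'   => (0, "з")
      case 'h'   => (0, "х")
      case other => (0, other.toString)
    }

  // Same loop shape as `cyrillicToLatin` in the diff: feed one character at
  // a time together with the left context, then trim and append.
  def convert(text: String): String = {
    val result = new StringBuilder(text.length)
    var offset = 0
    while (offset < text.length) {
      val (length, replacement) = step(text.take(offset), text(offset))
      if (length < 0) result.setLength(result.length + length)
      result.append(replacement)
      offset += 1
    }
    result.toString
  }
}
```

Returning a removal count instead of a whole new string keeps each step independent of how much output has been produced, which is presumably what makes the functions usable for incremental (as-you-type) conversion.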

shared/src/main/scala/translit/Noop.scala

Lines changed: 4 additions & 0 deletions
@@ -4,5 +4,9 @@ object Noop extends translit.Language {
   override def latinToCyrillicIncremental(
     latin: String, cyrillic: String, append: Char
   ): (Int, String) = (0, append.toString)
+
+  override def cyrillicToLatinIncremental(
+    cyrillic: String, letter: Char
+  ): (Int, String) = (0, letter.toString)
 }

shared/src/main/scala/translit/Russian.scala

Lines changed: 94 additions & 25 deletions
@@ -29,9 +29,13 @@ object Russian extends Language {
     'w' -> 'ш',
     'x' -> 'ж',
     'y' -> 'ы',
-    'z' -> 'з',
+    'z' -> 'з'
+  )
+
+  // Infer case from previous character
+  val uniGramsSpecial = Map(
     '\'' -> 'ь',
-    '"' -> 'ъ'
+    '`' -> 'ъ'
   )

   val biGrams = Map(
@@ -42,9 +46,7 @@ object Russian extends Language {
     "zh" -> 'ж',
     "yo" -> 'ё',
     "yu" -> 'ю',
-
-    "y|" -> 'ы', // красивые, выучил
-    "s|" -> 'с' // сходить
+    "kh" -> 'х'
   )

   val triGrams = Map[String, Char]()
@@ -53,31 +55,98 @@ object Russian extends Language {
     "shch" -> 'щ'
   )

+  val uniGramsInv = uniGrams.toList.map(_.swap).toMap
+  val uniGramsSpecialInv = uniGramsSpecial.toList.map(_.swap).toMap
+  val biGramsInv = biGrams.toList.map(_.swap).toMap
+  val triGramsInv = triGrams.toList.map(_.swap).toMap
+  val fourGramsInv = fourGrams.toList.map(_.swap).toMap
+
+  // y after m/n/r/t/v will be rendered as ы unless it is iotated
+  val yLetters = Set("my", "ny", "ry", "ty", "vy")
+
+  // If the y is iotated, render it as я, ё or ю
+  val iotatedLetters = Set("ya", "yo", "yu")
+
   override def latinToCyrillicIncremental(
     latin: String, cyrillic: String, append: Char
   ): (Int, String) = {
     val text = latin + append
     val ofs = text.length
-    if (ofs >= 4 &&
-      fourGrams.contains(text.substring(ofs - 4, ofs).toLowerCase)) {
-      val chars = text.substring(ofs - 4, ofs)
-      val cyrillic = fourGrams(chars.toLowerCase)
-      (-2, restoreCaseFirst(chars, cyrillic).toString)
-    } else if (ofs >= 3 &&
-      triGrams.contains(text.substring(ofs - 3, ofs).toLowerCase)) {
-      val chars = text.substring(ofs - 3, ofs)
-      val cyrillic = triGrams(chars.toLowerCase)
-      (-2, restoreCaseFirst(chars, cyrillic).toString)
-    } else if (ofs >= 2 &&
-      biGrams.contains(text.substring(ofs - 2, ofs).toLowerCase)) {
-      val chars = text.substring(ofs - 2, ofs)
-      val cyrillic = biGrams(chars.toLowerCase)
-      (-1, restoreCaseFirst(chars, cyrillic).toString)
-    } else if (uniGrams.contains(text(ofs - 1).toLower)) {
-      val cyrillic = uniGrams(text(ofs - 1).toLower)
-      (0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
-    } else {
-      (0, text(ofs - 1).toString)
+    val result =
+      if (ofs >= 4 &&
+        fourGrams.contains(text.substring(ofs - 4, ofs).toLowerCase)) {
+        val chars = text.substring(ofs - 4, ofs)
+        val cyrillic = fourGrams(chars.toLowerCase)
+        (-2, restoreCaseFirst(chars, cyrillic).toString)
+      } else if (ofs >= 3
+        && yLetters.contains(text.substring(ofs - 3, ofs - 1).toLowerCase)
+        && !iotatedLetters.contains(text.substring(ofs - 2, ofs).toLowerCase)
+      ) {
+        val cyrillic = uniGrams.getOrElse(text(ofs - 1).toLower, text(ofs - 1))
+        (0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
+      } else if (ofs >= 3 &&
+        triGrams.contains(text.substring(ofs - 3, ofs).toLowerCase)) {
+        val chars = text.substring(ofs - 3, ofs)
+        val cyrillic = triGrams(chars.toLowerCase)
+        (-2, restoreCaseFirst(chars, cyrillic).toString)
+      } else if (ofs >= 2 &&
+        biGrams.contains(text.substring(ofs - 2, ofs).toLowerCase)) {
+        val chars = text.substring(ofs - 2, ofs)
+        val cyrillic = biGrams(chars.toLowerCase)
+        (-1, restoreCaseFirst(chars, cyrillic).toString)
+      } else if (uniGrams.contains(text(ofs - 1).toLower)) {
+        val cyrillic = uniGrams(text(ofs - 1).toLower)
+        (0, (if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic).toString)
+      } else if (ofs >= 2 && uniGramsSpecial.contains(text(ofs - 1))) {
+        val result =
+          if (ofs >= 3 && text(ofs - 2).isUpper && text(ofs - 3).isUpper)
+            uniGramsSpecial(text(ofs - 1)).toUpper
+          else uniGramsSpecial(text(ofs - 1))
+        (0, result.toString)
+      } else {
+        (0, text(ofs - 1).toString)
+      }
+
+    if (ofs >= 3 && uniGramsSpecial.contains(text(ofs - 2))) {
+      val (l, r) = (text(ofs - 3), text(ofs - 1))
+      val letter = uniGramsSpecial(text(ofs - 2))
+      val replace = if (l.isUpper && r.isUpper) letter.toUpper else letter
+      val cyrillicOfs = cyrillic.length - 1
+
+      if (replace == cyrillic(cyrillicOfs)) result
+      else {
+        val updated = replace + cyrillic.substring(
+          cyrillicOfs + 1, cyrillic.length + result._1)
+        (-updated.length + result._1, updated + result._2)
+      }
+    } else result
+  }
+
+  private def toLatin(letter: Char): String = {
+    val isUpper = letter.isUpper
+    val letterLc = letter.toLower
+    fourGramsInv.get(letterLc).map(applyCase(_, isUpper))
+      .orElse(triGramsInv.get(letterLc).map(applyCase(_, isUpper)))
+      .orElse(biGramsInv.get(letterLc).map(applyCase(_, isUpper)))
+      .orElse(uniGramsInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
+      .orElse(uniGramsSpecialInv.get(letterLc).map(x => applyCase(x.toString, isUpper)))
+      .getOrElse(letter.toString)
+  }
+
+  override def cyrillicToLatinIncremental(
+    cyrillic: String, letter: Char
+  ): (Int, String) = {
+    val current = toLatin(letter)
+
+    val changeCase =
+      letter.isUpper &&
+      (cyrillic.length == 1 || cyrillic.lastOption.exists(_.isUpper))
+
+    if (!changeCase) (0, current)
+    else {
+      val mapped = toLatin(cyrillic.last)
+      val rest = mapped.tail
+      (-rest.length, rest.toUpperCase + current.toUpperCase)
     }
   }
 }
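
Several Latin spellings map to the same Cyrillic letter (for example `h` and `kh` both yield х), so the reverse direction has to pick one of them. In `toLatin` above, the inverted longer n-gram tables are consulted before the uni-gram table. A reduced sketch of that lookup order, using a subset of the mappings from the diff and the README table; `InverseLookupSketch` is a hypothetical name and the tri-gram, four-gram and sign tables are left out:

```scala
object InverseLookupSketch {
  // Subsets of the forward tables; note the overlapping targets х, ш and ж.
  val uniGrams: Map[Char, Char] =
    Map('h' -> 'х', 'w' -> 'ш', 'x' -> 'ж', 'z' -> 'з')
  val biGrams: Map[String, Char] =
    Map("kh" -> 'х', "sh" -> 'ш', "zh" -> 'ж')

  // Inverting by swapping key and value, as the commit does.
  val uniGramsInv: Map[Char, Char] = uniGrams.map(_.swap)
  val biGramsInv: Map[Char, String] = biGrams.map(_.swap)

  def applyCase(str: String, isUpper: Boolean): String =
    if (isUpper && str.nonEmpty) s"${str.head.toUpper}${str.tail}" else str

  // Bi-grams are tried first, so the longer, unambiguous spelling wins.
  def toLatin(letter: Char): String = {
    val isUpper = letter.isUpper
    val lower = letter.toLower
    biGramsInv.get(lower).map(applyCase(_, isUpper))
      .orElse(uniGramsInv.get(lower).map(c => applyCase(c.toString, isUpper)))
      .getOrElse(letter.toString)
  }
}
```

This ordering matches the README example in which сходить is rendered as *skhodit'* rather than with a bare *h*. One caveat of swap-based inversion in general: if a single table contained two keys with the same value, the resulting map would keep only one of them.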
