Commit e7d674a

Merge pull request #13 from sparsetech/issue/3
Evaluate performance on 1M+ Wikipedia word corpora
2 parents e84c03c + 55700d2 commit e7d674a

9 files changed: +348 -222 lines changed

.gitignore

Lines changed: 4 additions & 3 deletions
@@ -1,5 +1,3 @@
-.bloop/
-
 *.class
 *.log
 
@@ -17,4 +15,7 @@ project/plugins/project/
 # Scala-IDE specific
 .scala_dependencies
 .worksheet
-.idea
+
+.bloop/
+/.idea/
+/.metals/

README.md

Lines changed: 45 additions & 26 deletions
@@ -4,6 +4,19 @@
 
 translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet and vice-versa.
 
+## Features
+* Supported languages
+  * Russian
+  * Ukrainian
+* Incremental transliteration
+* Transliterations reversible with an accuracy of 99.99%
+* Transliterations optimised for typing and reading
+  * Only letters from the US keyboard are used
+  * All common letters can be typed with a single keystroke
+  * Convenience shortcuts are provided
+* Cross-platform support (JVM, Scala.js)
+* Zero dependencies
+
 ## Compatibility
 | Back end | Scala versions |
 |:-----------|:---------------|
@@ -19,14 +32,7 @@ libraryDependencies += "tech.sparse" %%% "translit-scala" % "0.1.1" // JavaScri
 ## Examples
 ```scala
 translit.Ukrainian.latinToCyrillic("Kyyiv") // Київ
-translit.Russian.latinToCyrillic("pal'ma") // пальма
-```
-
-The transliteration to Cyrillic also restores soft signs (ь) and apostrophes ('):
-
-```scala
-translit.Ukrainian.latinToCyrillic("Mar'yana") // Мар'яна
-translit.Ukrainian.latinToCyrillic("p'yat'") // п'ять
+translit.Russian.latinToCyrillic("pal`ma") // пальма
 ```
 
 ## Ukrainian
@@ -35,7 +41,7 @@ There have been several attempts to standardise transliteration rules. For examp
 * Ukrayins'kyy pravopys (BGN/PCGN 1965)
 * Ukrains'kyi pravopys (National 1996)
 * Ukrainskyi pravopys ([National 2010](http://zakon1.rada.gov.ua/laws/show/55-2010-%D0%BF))
-* Ukrayins'kyj pravopys (*translit-scala*)
+* Ukrayins\`kyj pravopys (*translit-scala*)
 
 Furthermore, there are language-specific transliterations, e.g. in German and French, that use the spelling conventions of the respective language (*sch* in German instead of *sh* in English).
 
@@ -50,7 +56,7 @@ Our transliteration was initially based on National 2010, but modified in the pr
 We decompose letters in their Latin transliteration more consistently than National 2010. The letter и always gets transcribed as *y*:
 
 * Volodymyr (Володимир)
-* blyz'ko (близько)
+* blyz\`ko (близько)
 
 The Latin letter *y* forms the phonetic basis of four letters (iotated vowels) in the Ukrainian alphabet: я, є, ї, ю. They get transliterated accordingly:
 
@@ -63,18 +69,16 @@ Unlike National 2010, we always use the same transliteration regardless of the p
 
 The accented counterpart of и is й and is represented by a separate letter, *j*.
 
-*Example:* Zgurs'kyj (Згурський)
+*Example:* Zgurs\`kyj (Згурський)
 
 #### Soft Signs and Apostrophes
-The second change to National 2010 is that we try to restore soft signs and apostrophes:
+The second change to National 2010 is that we retain soft signs (ь) and apostrophes ('):
 
-* Ukrayins'kyj (Український), malen'kyj (маленький)
+* Ukrayins\`kyj (Український), malen\`kyj (маленький)
 * m'yaso (м'ясо), matir'yu (матір'ю)
 
 In National 2010, *g* gets mapped to *ґ* which is phonetically accurate, though the letter *ґ* is fairly uncommon in Ukrainian. Therefore, we represent *ґ* by the bi-gram *g'*.
 
-This feature is experimental and can be disabled by setting `apostrophes` to `false`.
-
 #### Convenience mappings
 Another modification was to provide the following mappings:
 
@@ -95,17 +99,24 @@ Note that these mappings are phonetically inaccurate. However, using them still
 * *q* and *w* are located next to each other; *ш* and *щ* characters are phonetically close
 * *z* and *x* are located next to each other; *з* and *ж* characters are phonetically close
 * *h* is mapped to *х* since it is a common letter, *kh* is only needed in case *h* is ambiguous
-  * An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can use the bi-gram *kh* instead to represent *х*. The full transliteration then looks as follows: *skhyl'nist*
+  * An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can use the bi-gram *kh* instead to represent *х*. The full transliteration then looks as follows: *skhyl\`nist*
+
+#### Escaping
+You can insert a backslash (\\) if a replacement rule should not be applied, for example:
+
+* stranstvy\e -> странствие
+* pidtry\annya -> підтриання
+* Bash\charshiyi -> Башчаршії
 
 ## Russian
 The Russian rules are similar to the Ukrainian ones.
 
 Some differences are:
 
 * *i* corresponds to *и*, whereas *y* to *ы*
-* Russian distinguishes between soft and hard signs. It does not have apostrophes. The following mappings are used:
-  * Soft sign: *'* for ь
-  * Hard sign: *`* for ъ
+* Russian distinguishes between soft and hard signs. Apostrophes only appear in foreign names. The following mappings are used:
+  * Soft sign: *`* for ь
+  * Hard sign: *~* for ъ
 
 ### Mapping
 | Latin | Cyrillic |
@@ -136,8 +147,8 @@ Some differences are:
 | x | ж |
 | y | ы |
 | z | з |
-| ' | ь |
-| \` | ъ |
+| \` | ь |
+| ~ | ъ |
 | ch | ч |
 | sh | ш |
 | ya | я |
@@ -151,11 +162,11 @@ Some differences are:
 | Russian | Transliterated |
 |----------|----------------|
 | Привет | Privet |
-| Съел | S\`el |
+| Съел | S~el |
 | Щётка | Shchyotka |
-| Льдина | L'dina |
-| красивые | krasivye |
-| сходить | skhodit' |
+| Льдина | L\`dina |
+| красивые | krasivy\e |
+| сходить | skhodit\` |
 
 ## Internals
 The replacement patterns are applied sequentially by traversing the input character-by-character. The functions `latinToCyrillicIncremental` and `cyrillicToLatinIncremental` take the left context which is needed by some rules, for example to determine the correct case of soft/hard signs. The result of the functions indicates the number of characters to remove on the right as well as their string replacement.
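
Reviewer note: the `(Int, String)` contract in the Internals paragraph can be read as an edit on the output produced so far. The sketch below shows one way a caller might apply such results. It assumes, based on return values such as `(-2, ...)` in `Russian.scala`, that a non-positive first component means "drop that many characters from the right, then append the replacement". `IncrementalSketch`, `applyStep` and `toyStep` are hypothetical names, and the toy rule only knows *s* and *sh*; it is not the library's rule set.

```scala
object IncrementalSketch {
  /** Applies one incremental result to the output produced so far. */
  def applyStep(output: String, step: (Int, String)): String = {
    val (offset, replacement) = step
    // Assumption: offset is zero or negative; -offset characters are dropped.
    output.dropRight(-offset) + replacement
  }

  /** Toy rule: 's' becomes с; an 'h' right after 's' rewrites that с into ш. */
  def toyStep(latin: String, append: Char): (Int, String) =
    if (append == 'h' && latin.endsWith("s")) (-1, "ш")
    else if (append == 's') (0, "с")
    else (0, append.toString)

  /** Feeds the input one character at a time, as an editor would while typing. */
  def transliterate(input: String): String =
    input.indices.foldLeft("") { (out, i) =>
      applyStep(out, toyStep(input.take(i), input(i)))
    }
}
```

Typing `s` first yields с; the following `h` returns `(-1, "ш")`, which removes the с and leaves ш. This is why the functions need the left context rather than only the current character.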
@@ -170,8 +181,16 @@ def cyrillicToLatinIncremental(
 ): (Int, String)
 ```
 
+## Performance
+The test suite evaluates whether transliterations are reversible. The accuracy is calculated on words extracted from Wikipedia article dumps for all supported languages. Words are transliterated to Latin and then back to Cyrillic. A word counts as correct if the result of the reversed transliteration matches the original.
+
+| Language | Total | Correct | Accuracy |
+|-----------|-----------|-----------|----------|
+| Ukrainian | 1,811,772 | 1,811,661 | 99.99% |
+| Russian | 1,529,184 | 1,529,043 | 99.99% |
+
 ## Credits
-The rules and examples were adapted from the following libraries:
+The rules and examples were adapted from the following libraries and websites:
 
 * [translit-english-ukrainian](https://github.com/MarkovSergii/translit-english-ukrainian)
 * [translit-ua](https://github.com/dchaplinsky/translit-ua)
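
Reviewer note: the figures in the new Performance table can be checked with a few lines of arithmetic. The sketch below recomputes the accuracy from the `Total` and `Correct` columns and compares it with the minimums asserted in `CorpusSpec` (0.999938 for Ukrainian, 0.999907 for Russian). `AccuracySketch` and its members are hypothetical names used only for this illustration.

```scala
import java.util.Locale

object AccuracySketch {
  /** Fraction of words whose round trip reproduced the original. */
  def accuracy(correct: Long, total: Long): Double = correct.toDouble / total

  /** Accuracy as a percentage with two decimals, independent of the JVM locale. */
  def percent(correct: Long, total: Long): String =
    "%.2f%%".formatLocal(Locale.ROOT, accuracy(correct, total) * 100)

  // Values copied from the Performance table in the README diff.
  val ukrainian: Double = accuracy(1811661L, 1811772L)
  val russian: Double   = accuracy(1529043L, 1529184L)

  // Number of words that did not survive the round trip.
  val ukrainianMismatches: Long = 1811772L - 1811661L
  val russianMismatches: Long   = 1529184L - 1529043L
}
```

Both languages come out just above their asserted minimums, which suggests the thresholds in the test were chosen to pin the currently measured accuracy so that regressions fail the build.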

build.sbt

Lines changed: 6 additions & 0 deletions
@@ -50,6 +50,12 @@ lazy val translit = crossProject.in(file("."))
     /* Use io.js for faster compilation of test cases */
     scalaJSStage in Global := FastOptStage
   )
+  .jvmSettings(
+    libraryDependencies ++= Seq(
+      "com.github.pathikrit" %%% "better-files" % "3.8.0" % "test",
+      "org.scalaj" %%% "scalaj-http" % "2.4.2" % "test"
+    )
+  )
 
 lazy val js = translit.js
 lazy val jvm = translit.jvm

build.toml

Lines changed: 10 additions & 0 deletions
@@ -19,9 +19,19 @@ root = "shared"
 sources = ["shared/src/main/scala"]
 targets = ["js", "jvm"]
 
+[module.translit.jvm]
+root = "jvm"
+
 [module.translit.test]
 sources = ["shared/src/test/scala"]
 targets = ["js", "jvm"]
 scalaDeps = [
   ["org.scalatest", "scalatest", "3.0.8"]
 ]
+
+[module.translit.test.jvm]
+sources = ["jvm/src/test/scala"]
+scalaDeps = [
+  ["com.github.pathikrit", "better-files", "3.8.0"],
+  ["org.scalaj", "scalaj-http", "2.4.2"]
+]
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+package translit
+
+import java.nio.charset.Charset
+
+import better.files._
+import org.scalatest.FunSuite
+import scalaj.http.Http
+
+class CorpusSpec extends FunSuite {
+  def getDump(name: String): File = {
+    val file = File(s"/tmp/$name")
+    if (file.exists) file
+    else {
+      println(s"Downloading $name...")
+      val http = Http(s"http://builds.sparse.tech/wiki/$name").asBytes
+      assert(http.isSuccess)
+      file.writeByteArray(http.body)
+    }
+  }
+
+  def evaluate(dump: String, language: Language): (Long, Long) = {
+    val sentences = getDump(dump)
+
+    var correct = 0L
+    var total = 0L
+
+    sentences.gzipInputStream().foreach { gzip =>
+      val words = gzip.lines(Charset.forName("UTF-8"))
+      val latin = (('a' to 'z') ++ ('A' to 'Z')).toSet
+
+      words.filter(!_.exists(latin.contains)).foreach { original =>
+        val latin = language.cyrillicToLatin(original)
+        val cyrillic = language.latinToCyrillic(latin)
+        if (original == cyrillic) {
+          correct += 1
+        } else {
+          println(s"Mismatch: $original vs $cyrillic")
+        }
+
+        total += 1
+      }
+    }
+
+    (correct, total)
+  }
+
+  def check(perf: (Long, Long), expectedMinimum: Double): Unit = {
+    val (correct, total) = perf
+    val accuracy = correct.toDouble / total
+
+    println(s"Correct : $correct")
+    println(s"Total : $total")
+    println(s"Accuracy: $accuracy")
+
+    assert(accuracy > expectedMinimum)
+  }
+
+  test("Russian") {
+    val perf = evaluate("tokens-ru-1710634.txt.gz", Russian)
+    check(perf, 0.999907)
+  }
+
+  test("Ukrainian") {
+    val perf = evaluate("tokens-uk-2000041.txt.gz", Ukrainian)
+    check(perf, 0.999938)
+  }
+}
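
Reviewer note: the round-trip check in `evaluate` can be expressed without the download step or mutable counters. The following sketch is one possible formulation, not a proposed replacement. `RoundTripSketch` and the deliberately lossy toy mapping are hypothetical and exist only so the example runs without the library or a corpus.

```scala
object RoundTripSketch {
  /** Returns (correct, total), counting a word as correct when
    * toCyrillic(toLatin(word)) reproduces the word exactly. */
  def evaluate(
    words: Iterator[String],
    toLatin: String => String,
    toCyrillic: String => String
  ): (Long, Long) =
    words.foldLeft((0L, 0L)) { case ((correct, total), original) =>
      val restored = toCyrillic(toLatin(original))
      (if (restored == original) correct + 1 else correct, total + 1)
    }

  // Toy mapping: both а and о collapse to 'a', so words with о cannot round-trip.
  private val forward  = Map('д' -> 'd', 'а' -> 'a', 'о' -> 'a')
  private val backward = Map('d' -> 'д', 'a' -> 'а')

  val toyToLatin: String => String    = s => s.map(c => forward.getOrElse(c, c))
  val toyToCyrillic: String => String = s => s.map(c => backward.getOrElse(c, c))
}
```

With the words да, до and ад, the toy mapping restores two of three, so `evaluate` returns `(2, 3)`. The same shape applies to the real corpus, where an ambiguous rule shows up as a mismatch rather than an exception.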

shared/src/main/scala/translit/Russian.scala

Lines changed: 46 additions & 17 deletions
@@ -34,16 +34,16 @@ object Russian extends Language {
 
   // Infer case from previous character
   val uniGramsSpecial = Map(
-    '\'' -> 'ь',
-    '`' -> 'ъ'
+    '`' -> 'ь',
+    '~' -> 'ъ'
   )
 
   val biGrams = Map(
     "ch" -> 'ч',
     "sh" -> 'ш',
+    "zh" -> 'ж',
     "ya" -> 'я',
     "ye" -> 'э',
-    "zh" -> 'ж',
     "yo" -> 'ё',
     "yu" -> 'ю',
     "kh" -> 'х'
@@ -55,6 +55,15 @@ object Russian extends Language {
     "shch" -> 'щ'
   )
 
+  val escape = Map(
+    "ya" -> "ыа",
+    "ye" -> "ые",
+    "yo" -> "ыо",
+    "yu" -> "ыу",
+    "shch" -> "шч"
+  )
+  val escapeCharacter = '\\'
+
   val uniGramsInv = uniGrams.toList.map(_.swap).toMap
   val uniGramsSpecialInv = uniGramsSpecial.toList.map(_.swap).toMap
   val biGramsInv = biGrams.toList.map(_.swap).toMap
@@ -64,20 +73,32 @@ object Russian extends Language {
   // y after m/n/r/t/v will be rendered as ы unless it is iotated
   val yLetters = Set("my", "ny", "ry", "ty", "vy")
 
-  // If the y is iotated, render it as я, ё or ю
-  val iotatedLetters = Set("ya", "yo", "yu")
+  // If the y is iotated, render it as я, э, ё or ю
+  val iotatedLetters = Set("ya", "ye", "yo", "yu")
 
   override def latinToCyrillicIncremental(
     latin: String, cyrillic: String, append: Char
   ): (Int, String) = {
     val text = latin + append
     val ofs = text.length
     val result =
-      if (ofs >= 4 &&
+      if (ofs >= 5 && text.takeRight(5).toLowerCase == "sh\\ch") {
+        val cyrillic = 'ч'
+        val result = if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic
+        (-2, result.toString)
+      } else if (ofs >= 4 &&
         fourGrams.contains(text.substring(ofs - 4, ofs).toLowerCase)) {
         val chars = text.substring(ofs - 4, ofs)
         val cyrillic = fourGrams(chars.toLowerCase)
         (-2, restoreCaseFirst(chars, cyrillic).toString)
+      } else if (ofs >= 3 &&
+        escape.keySet.contains(
+          (text(ofs - 3).toString + text(ofs - 1)).toLowerCase
+        ) && text(ofs - 2) == escapeCharacter
+      ) {
+        val cyrillic = uniGrams.getOrElse(text(ofs - 1).toLower, text(ofs - 1))
+        val result = if (text(ofs - 1).isUpper) cyrillic.toUpper else cyrillic
+        (-1, result.toString)
       } else if (ofs >= 3
         && yLetters.contains(text.substring(ofs - 3, ofs - 1).toLowerCase)
         && !iotatedLetters.contains(text.substring(ofs - 2, ofs).toLowerCase)
@@ -109,8 +130,10 @@ object Russian extends Language {
 
     if (ofs >= 3 && uniGramsSpecial.contains(text(ofs - 2))) {
       val (l, r) = (text(ofs - 3), text(ofs - 1))
-      val letter = uniGramsSpecial(text(ofs - 2))
-      val replace = if (l.isUpper && r.isUpper) letter.toUpper else letter
+
+      val letter = uniGramsSpecial(text(ofs - 2))
+      val replace =
+        if (l.isUpper && r.isUpper) letter.toUpper else letter
       val cyrillicOfs = cyrillic.length - 1
 
       if (replace == cyrillic(cyrillicOfs)) result
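
Reviewer note: the case rule for soft and hard signs in the hunk above is easier to see in isolation. Since `` ` `` and `~` have no upper-case form on the keyboard, the sign is upper-cased only when both neighbouring letters are upper case. The sketch below restates that condition as a standalone function; `SignCaseSketch` and `signFor` are hypothetical names, while the mapping mirrors the new `uniGramsSpecial`.

```scala
object SignCaseSketch {
  // Mirrors uniGramsSpecial after this commit: ` is the soft sign, ~ the hard sign.
  val uniGramsSpecial: Map[Char, Char] = Map('`' -> 'ь', '~' -> 'ъ')

  /** Returns the Cyrillic sign for `sign`, upper-cased only if both
    * neighbours are upper case; None if `sign` is not a sign character. */
  def signFor(left: Char, sign: Char, right: Char): Option[Char] =
    uniGramsSpecial.get(sign).map { letter =>
      if (left.isUpper && right.isUpper) letter.toUpper else letter
    }
}
```

For example, the sign between `L` and `D` becomes Ь, whereas between `L` and `d` it stays ь. The right neighbour is only known one keystroke later, which is why the real code revisits the character at `ofs - 2`.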
@@ -136,17 +159,23 @@ object Russian extends Language {
   override def cyrillicToLatinIncremental(
     cyrillic: String, letter: Char
   ): (Int, String) = {
-    val current = toLatin(letter)
-
-    val changeCase =
-      letter.isUpper &&
-        (cyrillic.length == 1 || cyrillic.lastOption.exists(_.isUpper))
+    val current = toLatin(letter)
+    val toEscape = cyrillic.lastOption
+      .map(_.toLower.toString + letter.toLower)
+      .exists(escape.values.toList.contains)
 
-    if (!changeCase) (0, current)
+    if (toEscape) (0, escapeCharacter + current)
     else {
-      val mapped = toLatin(cyrillic.last)
-      val rest = mapped.tail
-      (-rest.length, rest.toUpperCase + current.toUpperCase)
+      val changeCase =
+        letter.isUpper &&
+          (cyrillic.length == 1 || cyrillic.lastOption.exists(_.isUpper))
+
+      if (!changeCase) (0, current)
+      else {
+        val mapped = toLatin(cyrillic.last)
+        val rest = mapped.tail
+        (-rest.length, rest.toUpperCase + current.toUpperCase)
+      }
     }
   }
 }
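
Reviewer note: the effect of the new escape character can be illustrated with a much simpler, non-incremental transliterator. The sketch below is a toy under stated assumptions, not the code in this commit. The bi-grams are taken from `iotatedLetters` and `biGrams` above, *y* → ы and *i* → и come from the README, and the remaining single-letter mappings (`k`, `r`, `a`, `s`, `v`, `e`) are assumed because the diff does not show `uniGrams`. `EscapeSketch` is a hypothetical name.

```scala
object EscapeSketch {
  // Assumed subset of single-letter rules; only y and i are confirmed by the README.
  val uniGrams: Map[Char, Char] = Map(
    'a' -> 'а', 'e' -> 'е', 'i' -> 'и', 'k' -> 'к',
    'r' -> 'р', 's' -> 'с', 'v' -> 'в', 'y' -> 'ы'
  )
  val biGrams: Map[String, Char] =
    Map("ya" -> 'я', "ye" -> 'э', "yo" -> 'ё', "yu" -> 'ю')
  val escapeCharacter = '\\'

  /** Greedy left-to-right scan: bi-grams win over single letters, and a
    * backslash is dropped while keeping its neighbours from forming a bi-gram. */
  def latinToCyrillic(text: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < text.length) {
      val pair = text.slice(i, i + 2)
      if (biGrams.contains(pair)) { out += biGrams(pair); i += 2 }
      else if (text(i) == escapeCharacter) i += 1
      else { out += uniGrams.getOrElse(text(i), text(i)); i += 1 }
    }
    out.toString
  }
}
```

In this toy, `krasivye` yields красивэ because `ye` is consumed as a bi-gram, whereas `krasivy\e` yields красивые, matching the updated README table. The reverse direction in the commit inserts the backslash whenever the two Cyrillic letters would otherwise read back as one of the `escape` keys.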
