README.md
Lines changed: 45 additions & 26 deletions
@@ -4,6 +4,19 @@

 translit-scala is a transliteration library for Scala and Scala.js. It implements transliteration rules for Slavic languages. It supports converting texts from the Latin to the Cyrillic alphabet and vice-versa.

+## Features
+* Supported languages
+    * Russian
+    * Ukrainian
+* Incremental transliteration
+* Transliterations reversible with an accuracy of 99.99%
+* Transliterations optimised for typing and reading
+    * Only letters from the US keyboard are used
+    * All common letters can be typed with a single keystroke
Furthermore, there are language-specific transliterations, e.g. in German and French, that use the spelling conventions of the respective language (*sch* in German instead of *sh* in English).
@@ -50,7 +56,7 @@ Our transliteration was initially based on National 2010, but modified in the pr
 We decompose letters in their Latin transliteration more consistently than National 2010. The letter и always gets transcribed as *y*:

 * Volodymyr (Володимир)
-* blyz'ko (близько)
+* blyz\`ko (близько)

 The Latin letter *y* forms the phonetic basis of four letters (iotated vowels) in the Ukrainian alphabet: я, є, ї, ю. They get transliterated accordingly:
@@ -63,18 +69,16 @@ Unlike National 2010, we always use the same transliteration regardless of the p

 The accented counterpart of и is й and is represented by a separate letter, *j*.

-*Example:* Zgurs'kyj (Згурський)
+*Example:* Zgurs\`kyj (Згурський)

 #### Soft Signs and Apostrophes
-The second change to National 2010 is that we try to restore soft signs and apostrophes:
+The second change to National 2010 is that we retain soft signs (ь) and apostrophes ('):
In National 2010, *g* gets mapped to *ґ* which is phonetically accurate, though the letter *ґ* is fairly uncommon in Ukrainian. Therefore, we represent *ґ* by the bi-gram *g'*.

-This feature is experimental and can be disabled by setting `apostrophes` to `false`.
-
 #### Convenience mappings
 Another modification was to provide the following mappings:
@@ -95,17 +99,24 @@ Note that these mappings are phonetically inaccurate. However, using them still
 * *q* and *w* are located next to each other; *ш* and *щ* characters are phonetically close
 * *z* and *x* are located next to each other; *з* and *ж* characters are phonetically close
 * *h* is mapped to *х* since it is a common letter, *kh* is only needed in case *h* is ambiguous
-    * An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can use the bi-gram *kh* instead to represent *х*. The full transliteration then looks as follows: *skhyl'nist*
+    * An example is the word: схильність. The transliteration of *сх* corresponds to two separate letters *s* and *h*, which would map to *ш*. To prevent this, one can use the bi-gram *kh* instead to represent *х*. The full transliteration then looks as follows: *skhyl\`nist*
+
+#### Escaping
+You can insert a backslash (\\) if a replacement rule should not be applied, for example:
+
+* stranstvy\e -> странствие
+* pidtry\annya -> підтриання
+* Bash\charshiyi -> Башчаршії
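The escaping behaviour described in this hunk can be illustrated with a small, self-contained sketch. It does not use the library: `EscapeSketch`, `rules` and `toCyrillic` are hypothetical names, the rule table is a lowercase-only subset read off the Ukrainian examples in this README, and splitting the input at each backslash is an assumed way to stop a multi-letter rule from matching across it. The library's actual mechanism may differ.

```scala
object EscapeSketch {
  // Hypothetical subset of the Ukrainian rules (lowercase only), taken from the README examples.
  val rules: Map[String, String] = Map(
    "shch" -> "щ", "sh" -> "ш", "ch" -> "ч", "kh" -> "х",
    "ya" -> "я", "ye" -> "є", "yi" -> "ї", "yu" -> "ю",
    "a" -> "а", "b" -> "б", "v" -> "в", "d" -> "д", "e" -> "е",
    "z" -> "з", "y" -> "и", "i" -> "і", "j" -> "й", "k" -> "к",
    "l" -> "л", "m" -> "м", "n" -> "н", "o" -> "о", "p" -> "п",
    "r" -> "р", "s" -> "с", "t" -> "т", "h" -> "х", "`" -> "ь"
  )

  private val maxLen: Int = rules.keys.map(_.length).max

  // Greedy longest-match transliteration of one segment that contains no backslash.
  private def segment(s: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < s.length) {
      // Try the longest candidate first, then shorter ones.
      val hit = (math.min(maxLen, s.length - i) to 1 by -1)
        .iterator
        .map(n => s.substring(i, i + n))
        .find(rules.contains)
      hit match {
        case Some(key) =>
          out ++= rules(key)
          i += key.length
        case None =>
          // Unknown characters are copied through unchanged.
          out += s.charAt(i)
          i += 1
      }
    }
    out.toString
  }

  // A backslash ends the current segment, so no rule can span it.
  def toCyrillic(latin: String): String =
    latin.split('\\').map(segment).mkString
}
```

With this table, `EscapeSketch.toCyrillic("bash\\charshiyi")` gives `башчаршії`, whereas the unescaped `bashcharshiyi` gives `бащаршії` because *shch* is matched as a single rule.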

 ## Russian
 The Russian rules are similar to the Ukrainian ones.

 Some differences are:

 * *i* corresponds to *и*, whereas *y* to *ы*
-* Russian distinguishes between soft and hard signs. It does not have apostrophes. The following mappings are used:
-    * Soft sign: *'* for ь
-    * Hard sign: *`* for ъ
+* Russian distinguishes between soft and hard signs. Apostrophes only appear in foreign names. The following mappings are used:
+    * Soft sign: *`* for ь
+    * Hard sign: *~* for ъ

 ### Mapping
 | Latin | Cyrillic |
@@ -136,8 +147,8 @@ Some differences are:
 | x | ж |
 | y | ы |
 | z | з |
-|' | ь |
-|\`| ъ |
+|\`| ь |
+|~ | ъ |
 | ch | ч |
 | sh | ш |
 | ya | я |
@@ -151,11 +162,11 @@ Some differences are:
 | Russian | Transliterated |
 |----------|----------------|
 | Привет | Privet |
-| Съел | S\`el|
+| Съел | S~el |
 | Щётка | Shchyotka |
-| Льдина | L'dina|
-| красивые |krasivye |
-| сходить | skhodit' |
+| Льдина | L\`dina |
+| красивые |krasivy\e|
+| сходить | skhodit\`|

 ## Internals
 The replacement patterns are applied sequentially by traversing the input character-by-character. The functions `latinToCyrillicIncremental` and `cyrillicToLatinIncremental` take the left context which is needed by some rules, for example to determine the correct case of soft/hard signs. The result of the functions indicates the number of characters to remove on the right as well as their string replacement.
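The "characters to remove plus replacement" contract described above can be sketched as follows. This is only an illustration of the idea, not the library's code: the exact signatures of `latinToCyrillicIncremental` and `cyrillicToLatinIncremental` are not shown in this README excerpt, so the `step` function, its parameters and the tiny rule tables below are all hypothetical.

```scala
object IncrementalSketch {
  // Hypothetical toy rules; every rule emits exactly one Cyrillic character.
  private val single: Map[Char, String] =
    Map('s' -> "с", 'h' -> "х", 'k' -> "к", 'y' -> "и", 'a' -> "а")
  private val digraph: Map[String, String] =
    Map("sh" -> "ш", "kh" -> "х", "ya" -> "я")

  // Given the Latin text typed so far (left context) and the next character,
  // return how many already-emitted characters to drop on the right
  // and the string that replaces them.
  def step(latinLeft: String, next: Char): (Int, String) = {
    val pair = latinLeft.takeRight(1) + next
    digraph.get(pair) match {
      case Some(cyr) => (1, cyr) // the previous output character is superseded
      case None      => (0, single.getOrElse(next, next.toString))
    }
  }

  // Simulate typing the input one character at a time.
  def run(latin: String): String =
    latin.indices.foldLeft("") { (out, i) =>
      val (remove, replacement) = step(latin.take(i), latin(i))
      out.dropRight(remove) + replacement
    }
}
```

Typing `s` first emits `с`; the following `h` yields `(1, "ш")`, so the `с` is dropped and replaced. Returning an edit rather than a whole string lets an editor or input method patch its buffer in place as the user types.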
The test suite evaluates whether transliterations are reversible. The accuracy is calculated on words extracted from Wikipedia article dumps for all supported languages. Words are transliterated to Latin and then back to Cyrillic. A word counts as correct if the result of the reversed transliteration matches the original.
+
+| Language | Total | Correct | Accuracy |
+|-----------|-----------|-----------|----------|
+| Ukrainian | 1,811,772 | 1,811,661 | 99.99% |
+| Russian | 1,529,184 | 1,529,043 | 99.99% |
+
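The round-trip check described above can be sketched in a few lines. This is not the project's test suite: `RoundTripSketch`, `Result`, `evaluate` and the two toy conversion functions are hypothetical, and the functions are passed in as parameters so that no library signature has to be assumed.

```scala
object RoundTripSketch {
  final case class Result(total: Int, correct: Int) {
    // Fraction of words that survive the round trip unchanged.
    def accuracy: Double = if (total == 0) 0.0 else correct.toDouble / total
  }

  // A word counts as correct if Cyrillic -> Latin -> Cyrillic reproduces it exactly.
  def evaluate(
      words: Seq[String],
      toLatin: String => String,
      toCyrillic: String => String
  ): Result =
    Result(words.size, words.count(w => toCyrillic(toLatin(w)) == w))

  // Deliberately lossy toy converters: "сх" becomes "sh", which reads back as "ш".
  val toyToLatin: String => String =
    _.replace("ш", "sh").replace("с", "s").replace("х", "h")
  val toyToCyrillic: String => String =
    _.replace("sh", "ш").replace("s", "с").replace("h", "х")
}
```

For the word list `Seq("ш", "с", "сх")` the toy converters give `Result(3, 2)`. The same ratio applied to the table gives 1,811,661 / 1,811,772 ≈ 0.99994 for Ukrainian and 1,529,043 / 1,529,184 ≈ 0.99991 for Russian, both reported as 99.99%.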
 ## Credits
-The rules and examples were adapted from the following libraries:
+The rules and examples were adapted from the following libraries and websites: