Data Files
- Special Data Files
- Data Files for Version 4.00
- Data Files for Version 3.04/3.05
- Cube Data Files for Version 3.04/3.05
- Fraktur Data Files
- Data Files for Version 3.02
- Data Files for Version 2.0x
- Format of traineddata files
| Lang Code | Description | 4.0/3.0x traineddata | 
|---|---|---|
| osd | Orientation and script detection | osd.traineddata | 
| equ | Math / equation detection | equ.traineddata | 
Note: These two data files are compatible with older versions of Tesseract. osd is compatible with version 3.01 and up, and equ is compatible with version 3.02 and up.
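The compatibility note above can be expressed as a small check. This is only an illustrative sketch: `MIN_VERSION`, `parse_version` and `is_compatible` are hypothetical names, not part of Tesseract, and version strings are assumed to be dotted numbers such as `3.02`.

```python
# Minimum Tesseract versions, taken from the note above.
MIN_VERSION = {"osd": (3, 1), "equ": (3, 2)}


def parse_version(text):
    """Turn a dotted version string such as '3.02' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(lang_code, tesseract_version):
    """Return True if the special data file works with the given version."""
    return parse_version(tesseract_version) >= MIN_VERSION[lang_code]
```

For example, `is_compatible("equ", "3.01")` is false under these assumptions, while `is_compatible("osd", "3.01")` is true.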
Note: The kur data file is unchanged from version 3.04. For Fraktur, see the section Fraktur Data Files.
| Lang Code | Language | 4.0 traineddata | 
|---|---|---|
| afr | Afrikaans | afr.traineddata | 
| amh | Amharic | amh.traineddata | 
| ara | Arabic | ara.traineddata | 
| asm | Assamese | asm.traineddata | 
| aze | Azerbaijani | aze.traineddata | 
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata | 
| bel | Belarusian | bel.traineddata | 
| ben | Bengali | ben.traineddata | 
| bod | Tibetan | bod.traineddata | 
| bos | Bosnian | bos.traineddata | 
| bul | Bulgarian | bul.traineddata | 
| cat | Catalan; Valencian | cat.traineddata | 
| ceb | Cebuano | ceb.traineddata | 
| ces | Czech | ces.traineddata | 
| chi_sim | Chinese - Simplified | chi_sim.traineddata | 
| chi_tra | Chinese - Traditional | chi_tra.traineddata | 
| chr | Cherokee | chr.traineddata | 
| cym | Welsh | cym.traineddata | 
| dan | Danish | dan.traineddata | 
| deu | German | deu.traineddata | 
| dzo | Dzongkha | dzo.traineddata | 
| ell | Greek, Modern (1453-) | ell.traineddata | 
| eng | English | eng.traineddata | 
| enm | English, Middle (1100-1500) | enm.traineddata | 
| epo | Esperanto | epo.traineddata | 
| est | Estonian | est.traineddata | 
| eus | Basque | eus.traineddata | 
| fas | Persian | fas.traineddata | 
| fin | Finnish | fin.traineddata | 
| fra | French | fra.traineddata | 
| frk | Frankish | frk.traineddata | 
| frm | French, Middle (ca. 1400-1600) | frm.traineddata | 
| gle | Irish | gle.traineddata | 
| glg | Galician | glg.traineddata | 
| grc | Greek, Ancient (-1453) | grc.traineddata | 
| guj | Gujarati | guj.traineddata | 
| hat | Haitian; Haitian Creole | hat.traineddata | 
| heb | Hebrew | heb.traineddata | 
| hin | Hindi | hin.traineddata | 
| hrv | Croatian | hrv.traineddata | 
| hun | Hungarian | hun.traineddata | 
| iku | Inuktitut | iku.traineddata | 
| ind | Indonesian | ind.traineddata | 
| isl | Icelandic | isl.traineddata | 
| ita | Italian | ita.traineddata | 
| ita_old | Italian - Old | ita_old.traineddata | 
| jav | Javanese | jav.traineddata | 
| jpn | Japanese | jpn.traineddata | 
| kan | Kannada | kan.traineddata | 
| kat | Georgian | kat.traineddata | 
| kat_old | Georgian - Old | kat_old.traineddata | 
| kaz | Kazakh | kaz.traineddata | 
| khm | Central Khmer | khm.traineddata | 
| kir | Kirghiz; Kyrgyz | kir.traineddata | 
| kor | Korean | kor.traineddata | 
| kur | Kurdish | kur.traineddata | 
| lao | Lao | lao.traineddata | 
| lat | Latin | lat.traineddata | 
| lav | Latvian | lav.traineddata | 
| lit | Lithuanian | lit.traineddata | 
| mal | Malayalam | mal.traineddata | 
| mar | Marathi | mar.traineddata | 
| mkd | Macedonian | mkd.traineddata | 
| mlt | Maltese | mlt.traineddata | 
| msa | Malay | msa.traineddata | 
| mya | Burmese | mya.traineddata | 
| nep | Nepali | nep.traineddata | 
| nld | Dutch; Flemish | nld.traineddata | 
| nor | Norwegian | nor.traineddata | 
| ori | Oriya | ori.traineddata | 
| pan | Panjabi; Punjabi | pan.traineddata | 
| pol | Polish | pol.traineddata | 
| por | Portuguese | por.traineddata | 
| pus | Pushto; Pashto | pus.traineddata | 
| ron | Romanian; Moldavian; Moldovan | ron.traineddata | 
| rus | Russian | rus.traineddata | 
| san | Sanskrit | san.traineddata | 
| sin | Sinhala; Sinhalese | sin.traineddata | 
| slk | Slovak | slk.traineddata | 
| slv | Slovenian | slv.traineddata | 
| spa | Spanish; Castilian | spa.traineddata | 
| spa_old | Spanish; Castilian - Old | spa_old.traineddata | 
| sqi | Albanian | sqi.traineddata | 
| srp | Serbian | srp.traineddata | 
| srp_latn | Serbian - Latin | srp_latn.traineddata | 
| swa | Swahili | swa.traineddata | 
| swe | Swedish | swe.traineddata | 
| syr | Syriac | syr.traineddata | 
| tam | Tamil | tam.traineddata | 
| tel | Telugu | tel.traineddata | 
| tgk | Tajik | tgk.traineddata | 
| tgl | Tagalog | tgl.traineddata | 
| tha | Thai | tha.traineddata | 
| tir | Tigrinya | tir.traineddata | 
| tur | Turkish | tur.traineddata | 
| uig | Uighur; Uyghur | uig.traineddata | 
| ukr | Ukrainian | ukr.traineddata | 
| urd | Urdu | urd.traineddata | 
| uzb | Uzbek | uzb.traineddata | 
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata | 
| vie | Vietnamese | vie.traineddata | 
| yid | Yiddish | yid.traineddata | 
Note: For Arabic and Hindi you need both the traineddata file and the cube data files.
| Lang Code | Language | 3.04 traineddata | 
|---|---|---|
| afr | Afrikaans | afr.traineddata | 
| amh | Amharic | amh.traineddata | 
| ara | Arabic | ara.traineddata | 
| asm | Assamese | asm.traineddata | 
| aze | Azerbaijani | aze.traineddata | 
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata | 
| bel | Belarusian | bel.traineddata | 
| ben | Bengali | ben.traineddata | 
| bod | Tibetan | bod.traineddata | 
| bos | Bosnian | bos.traineddata | 
| bul | Bulgarian | bul.traineddata | 
| cat | Catalan; Valencian | cat.traineddata | 
| ceb | Cebuano | ceb.traineddata | 
| ces | Czech | ces.traineddata | 
| chi_sim | Chinese - Simplified | chi_sim.traineddata | 
| chi_tra | Chinese - Traditional | chi_tra.traineddata | 
| chr | Cherokee | chr.traineddata | 
| cym | Welsh | cym.traineddata | 
| dan | Danish | dan.traineddata | 
| deu | German | deu.traineddata | 
| dzo | Dzongkha | dzo.traineddata | 
| ell | Greek, Modern (1453-) | ell.traineddata | 
| eng | English | eng.traineddata | 
| enm | English, Middle (1100-1500) | enm.traineddata | 
| epo | Esperanto | epo.traineddata | 
| est | Estonian | est.traineddata | 
| eus | Basque | eus.traineddata | 
| fas | Persian | fas.traineddata | 
| fin | Finnish | fin.traineddata | 
| fra | French | fra.traineddata | 
| frk | Frankish | frk.traineddata | 
| frm | French, Middle (ca. 1400-1600) | frm.traineddata | 
| gle | Irish | gle.traineddata | 
| glg | Galician | glg.traineddata | 
| grc | Greek, Ancient (-1453) | grc.traineddata | 
| guj | Gujarati | guj.traineddata | 
| hat | Haitian; Haitian Creole | hat.traineddata | 
| heb | Hebrew | heb.traineddata | 
| hin | Hindi | hin.traineddata | 
| hrv | Croatian | hrv.traineddata | 
| hun | Hungarian | hun.traineddata | 
| iku | Inuktitut | iku.traineddata | 
| ind | Indonesian | ind.traineddata | 
| isl | Icelandic | isl.traineddata | 
| ita | Italian | ita.traineddata | 
| ita_old | Italian - Old | ita_old.traineddata | 
| jav | Javanese | jav.traineddata | 
| jpn | Japanese | jpn.traineddata | 
| kan | Kannada | kan.traineddata | 
| kat | Georgian | kat.traineddata | 
| kat_old | Georgian - Old | kat_old.traineddata | 
| kaz | Kazakh | kaz.traineddata | 
| khm | Central Khmer | khm.traineddata | 
| kir | Kirghiz; Kyrgyz | kir.traineddata | 
| kor | Korean | kor.traineddata | 
| kur | Kurdish | kur.traineddata | 
| lao | Lao | lao.traineddata | 
| lat | Latin | lat.traineddata | 
| lav | Latvian | lav.traineddata | 
| lit | Lithuanian | lit.traineddata | 
| mal | Malayalam | mal.traineddata | 
| mar | Marathi | mar.traineddata | 
| mkd | Macedonian | mkd.traineddata | 
| mlt | Maltese | mlt.traineddata | 
| msa | Malay | msa.traineddata | 
| mya | Burmese | mya.traineddata | 
| nep | Nepali | nep.traineddata | 
| nld | Dutch; Flemish | nld.traineddata | 
| nor | Norwegian | nor.traineddata | 
| ori | Oriya | ori.traineddata | 
| pan | Panjabi; Punjabi | pan.traineddata | 
| pol | Polish | pol.traineddata | 
| por | Portuguese | por.traineddata | 
| pus | Pushto; Pashto | pus.traineddata | 
| ron | Romanian; Moldavian; Moldovan | ron.traineddata | 
| rus | Russian | rus.traineddata | 
| san | Sanskrit | san.traineddata | 
| sin | Sinhala; Sinhalese | sin.traineddata | 
| slk | Slovak | slk.traineddata | 
| slv | Slovenian | slv.traineddata | 
| spa | Spanish; Castilian | spa.traineddata | 
| spa_old | Spanish; Castilian - Old | spa_old.traineddata | 
| sqi | Albanian | sqi.traineddata | 
| srp | Serbian | srp.traineddata | 
| srp_latn | Serbian - Latin | srp_latn.traineddata | 
| swa | Swahili | swa.traineddata | 
| swe | Swedish | swe.traineddata | 
| syr | Syriac | syr.traineddata | 
| tam | Tamil | tam.traineddata | 
| tel | Telugu | tel.traineddata | 
| tgk | Tajik | tgk.traineddata | 
| tgl | Tagalog | tgl.traineddata | 
| tha | Thai | tha.traineddata | 
| tir | Tigrinya | tir.traineddata | 
| tur | Turkish | tur.traineddata | 
| uig | Uighur; Uyghur | uig.traineddata | 
| ukr | Ukrainian | ukr.traineddata | 
| urd | Urdu | urd.traineddata | 
| uzb | Uzbek | uzb.traineddata | 
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata | 
| vie | Vietnamese | vie.traineddata | 
| yid | Yiddish | yid.traineddata | 
In Tesseract 3.0x, Arabic and Hindi use the Cube OCR engine. Download the cube files and place them in the same folder as the corresponding <ara/hin>.traineddata file.
In Tesseract 4.0 the Cube OCR engine was removed from the codebase, so these files are not needed with version 4.0 or newer.
Hindi:
hin.cube.bigrams,
hin.cube.fold,
hin.cube.lm,
hin.cube.nn,
hin.cube.params,
hin.cube.word-freq,
hin.tesseract_cube.nn
Arabic:
ara.cube.bigrams,
ara.cube.fold,
ara.cube.lm,
ara.cube.nn,
ara.cube.params,
ara.cube.word-freq,
ara.cube.size,
ara.tesseract_cube.nn
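Because the cube files must sit next to the traineddata file, a quick check of the folder can catch an incomplete install. The following is a minimal sketch, not a Tesseract tool: `CUBE_SUFFIXES` and `missing_cube_files` are hypothetical names, and the suffixes are copied from the two lists above.

```python
from pathlib import Path

# Cube component suffixes copied from the file lists above.
CUBE_SUFFIXES = {
    "hin": ["cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
            "cube.params", "cube.word-freq", "tesseract_cube.nn"],
    "ara": ["cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
            "cube.params", "cube.word-freq", "cube.size",
            "tesseract_cube.nn"],
}


def missing_cube_files(tessdata_dir, lang):
    """List the files for `lang` that are absent from `tessdata_dir`."""
    folder = Path(tessdata_dir)
    # The traineddata file itself plus every cube component.
    expected = [f"{lang}.traineddata"]
    expected += [f"{lang}.{suffix}" for suffix in CUBE_SUFFIXES[lang]]
    return [name for name in expected if not (folder / name).is_file()]
```

An empty result from `missing_cube_files(path, "ara")` would suggest that the folder holds everything listed above for Arabic; the folder path itself is whatever your installation uses.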
These data files were prepared by @paalberti for older versions of Tesseract: dan_frak, deu_frak and swe_frak for version 3.00, and slk_frak for version 3.01. Updates to these files are available at paalberti/tesseract-dan-fraktur.
| Lang Code | Language | 3.0x traineddata | 
|---|---|---|
| dan_frak | Danish - Fraktur | dan_frak.traineddata | 
| deu_frak | German - Fraktur | deu_frak.traineddata | 
| slk_frak | Slovak - Fraktur | slk_frak.traineddata | 
| swe_frak | Swedish - Fraktur | swe-frak.traineddata | 
| Lang Code | Language | 2.0x traineddata | 
|---|---|---|
| deu | German | tesseract-2.00.deu.tar.gz | 
| deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz | 
| eng | English | tesseract-2.00.eng.tar.gz | 
| eus | Basque | tesseract-2.04-eus.tar.gz | 
| fra | French | tesseract-2.00.fra.tar.gz | 
| ita | Italian | tesseract-2.00.ita.tar.gz | 
| nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz | 
| por | Portuguese | tesseract-2.01.por.tar.gz | 
| spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz | 
| vie | Vietnamese | tesseract-2.01.vie.tar.gz | 
The traineddata file for each language is an archive in a Tesseract-specific format. It contains several uncompressed component files that the Tesseract OCR process needs. The combine_tessdata program creates a traineddata file from the component files and can also extract them again, as in this example:
combine_tessdata -u test/tessdata/eng.traineddata eng.
Extracting tessdata components from test/tessdata/eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
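The listing above can also be read programmatically, for instance to see whether a traineddata file carries an LSTM component. The sketch below only parses text in the format shown; `extracted_components` is a hypothetical helper, and the `Wrote <prefix><name>` line format is assumed from the example output rather than from any specification.

```python
def extracted_components(output, prefix="eng."):
    """Return component names from 'Wrote <prefix><name>' lines."""
    marker = "Wrote " + prefix
    names = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(marker):
            # Keep only the part after the output prefix.
            names.append(line[len(marker):])
    return names


# A shortened copy of the output shown above.
sample = """Extracting tessdata components from test/tessdata/eng.traineddata
Wrote eng.unicharset
Wrote eng.lstm
Wrote eng.lstm-word-dawg
"""
components = extracted_components(sample)
has_lstm = "lstm" in components
```

In practice the text could be captured from a `combine_tessdata -u` run, for example with Python's `subprocess.run`; whether the tool prints to standard output or standard error may depend on the build, so it is safest to capture both.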
There are proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format as early as 2014, and in 2017 an experimental implementation was provided as a pull request.
These wiki pages are no longer maintained.
All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.