langdata

Source training data for Tesseract, covering many languages.

Want to re-train Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you want to find a language data set to run Tesseract, then look at our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

All the data in the lang directory.
The corresponding unicharset/xheights files for the script(s) used by lang.
All the remaining non-lang-specific files in the top-level directory, such as font_properties.
You also need to obtain the fonts needed to train the language. Some languages were trained with commercially available fonts, so you will need to buy them in order to reproduce the training exactly, or use substitutes.
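
As a rough illustration of the checklist above, the following Python sketch gathers the three groups of files for one language from a local checkout. It is only a sketch under stated assumptions: the helper name `collect_training_inputs` is hypothetical, the caller must supply the script names, and the naming of script files as `<Script>.unicharset` and `<Script>.xheights` in the top-level directory is an assumption rather than something this README states. Fonts are not covered, since they have to be obtained separately.

```python
from pathlib import Path


def collect_training_inputs(root, lang, scripts):
    """Gather the file groups the README lists for re-training one language.

    Assumption (not stated in the README): script files sit in the top-level
    directory and are named "<Script>.unicharset" and "<Script>.xheights".
    """
    root = Path(root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no directory for language {lang!r} under {root}")

    # 1. All the data in the lang directory.
    lang_files = sorted(p for p in lang_dir.rglob("*") if p.is_file())

    # 2. unicharset/xheights files for each script used by lang
    #    (hypothetical naming scheme, see the docstring).
    script_files, missing = [], []
    for script in scripts:
        for suffix in ("unicharset", "xheights"):
            candidate = root / f"{script}.{suffix}"
            if candidate.is_file():
                script_files.append(candidate)
            else:
                missing.append(candidate)

    # 3. Remaining top-level files that are not script files,
    #    such as font_properties.
    script_suffixes = {".unicharset", ".xheights"}
    common_files = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix not in script_suffixes
    )

    return {
        "lang": lang_files,
        "script": script_files,
        "common": common_files,
        "missing": missing,
    }
```

Calling `collect_training_inputs("langdata", "eng", ["Latin"])` on a checkout returns a dictionary of paths. A non-empty `missing` list suggests that either the script name or the assumed naming scheme needs adjusting.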

Language directories:

afr
akk
amh
ara
asm
aze
aze_cyrl
bel
bel_tarask
ben
bih
bod
bos
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cym
dan
deu
deu_latf
div
dzo
ell
eng
enm
epo
est
eus
fas
fin
fra
frm
gle
gle_uncial
glg
grc
guj
hat
heb
hin
hrv
hun
iast
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
kur_ara
lao
lat
lav
lit
mal
mar
mkd
mlt
mri
msa
mya
nep
nld
nor
ori
pan
pol
por
pus
ron
rus
rus_accent
san
sin
slk
slv
snd
spa
spa_old
sqi
srp
srp_latn
swa
swe
