
Spoken corpus of the dialects of Khakas

The Spoken corpus of the dialects of Khakas contains transcribed and annotated texts synchronized with the sound. The texts were recorded during the 21st century from speakers born between 1916 and 1985, on various expeditions from Moscow to the Republic of Khakassia. All texts are translated into Russian. The texts were analyzed by the automatic parser, then edited and synchronized with the sound using the ELAN software.

This corpus is related to the project «Electronic Corpus of Khakas language». More information about the aims and methods of the project is available, but only in Russian for now.



We use the symbols of the Cyrillic Khakas alphabet to transcribe our texts because the parser was made for the analysis of Literary Khakas. Regular phonetic dialect features are mostly ignored, but we have tried to show the morphological and morphonological features.


In the layer where word forms are divided into morphemes, the stems are written in phonological transcription and the affixes in morphonological transcription. For example, туралар ‘houses’ appears as тура-ЛАр when divided into morphemes. The plural marker has the allomorphs лар, лер, нар, нер, тар, тер, and in the glossing layer they are all represented by the single marker ЛАр. Morphonemes without alternations are written with the same symbols as phonemes.

For more details, see the paper: Anna V. Dybo, Philip S. Krylov, Vera S. Maltseva, Aleksandra V. Sheimovich. Segmental rules in the automatic parser for the Khakass corpus. In: Ural-Altaic Studies. N 1 (32), 2019. P. 48-69 (in Russian) https://iling-ran.ru/library/ural-altaic/ua2019_32.pdf

Consonant morphonemes
П: б/п/м
К: ғ/г/х/к
Г: ғ/г/х/к/Ø
Т: т/д
Д: т/д/н
С: с/з
Л: л/н/т
L: л/н
Н: н/т
Ч: ч/ҷ

Vowel morphonemes
А: е/а
Ы: i/ы
О: о/ö
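The relation between a morphonemic marker and its allomorphs can be sketched in a few lines of Python. This is only an illustration built from the alternation table above, not the parser's actual implementation: the function name `expand` is hypothetical, the vowels are written with the Cyrillic letters і and ӧ used in the example words, and the sketch ignores the phonological context that selects the right allomorph, so for most markers it will list more candidates than really occur.

```python
from itertools import product

# Alternation table copied from the list above; each morphoneme maps to
# its possible surface realisations ("" stands for Ø, i.e. no sound).
MORPHONEMES = {
    "П": ["б", "п", "м"],
    "К": ["ғ", "г", "х", "к"],
    "Г": ["ғ", "г", "х", "к", ""],
    "Т": ["т", "д"],
    "Д": ["т", "д", "н"],
    "С": ["с", "з"],
    "Л": ["л", "н", "т"],
    "L": ["л", "н"],          # Latin L, as in the table
    "Н": ["н", "т"],
    "Ч": ["ч", "ҷ"],
    "А": ["е", "а"],
    "Ы": ["і", "ы"],          # assumption: Cyrillic і
    "О": ["о", "ӧ"],          # assumption: Cyrillic ӧ
}


def expand(marker):
    """Return all candidate surface forms of a morphonemic marker.

    Symbols that are not morphonemes (lower-case letters) are kept as they are.
    """
    options = [MORPHONEMES.get(symbol, [symbol]) for symbol in marker]
    return ["".join(combination) for combination in product(*options)]


# The plural marker ЛАр yields the six allomorphs listed in the text.
print(expand("ЛАр"))
```

For ЛАр the candidates coincide with the allomorphs лар, лер, нар, нер, тар, тер given above; choosing among them for a particular stem requires the segmental rules described in the cited paper.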


This corpus belongs to a group of corpora that are built on the search platform tsakorpus. A more general instruction with common technical properties of these corpora can be found in the “Help” section (look for the button marked with a question mark in the top right corner of the search page). The current text describes those rules and conventions that are specific for this Corpus.

Searching for specific word forms («Word»)

In this field, you can enter specific word forms that you want to find.

For example, ибде ‘at home’, килген ‘he came’.

Searching for specific lemmas («Lemma»)

This field should be used if you need to find all forms of a given word (a.k.a. lemma, or lexeme).

For example, if you enter “иб” ‘home’, the search results will show all sentences where any form of this noun is used, e.g., иб ‘home’, ибні [home-ACC] ‘home (direct object)’, ибінде [home-3pos-Loc] ‘at his home’, etc.

Lemmas should be entered in this field in their base form, that is, in the form used in dictionaries. For nouns, adjectives, adverbs, pronouns and numerals, the base form is identical to the stem (e.g., иб ‘house’, кічіг ‘small’, ам ‘now’, син ‘you’, ікі ‘two’). For verbs, in accordance with lexicographic practice, the infinitive form with the suffix ArGA is used, e.g., тоғынарға ‘to work’, килерге ‘to come’, etc.

For the case forms (including locatives) of the pronouns ол ‘that’ and пу ‘this’, the lemmas ол and пу are used, although dictionaries list their case forms as separate lexemes. We treat the substantivized forms with the 3rd person possessive marker, ан(ы)зы, мын(ы)зы, пунызы, as separate lexemes. All forms of a personal pronoun (with the same person and number) also belong to one lemma: мин ‘I’, син ‘you’, олар ‘they’, etc.


The “Grammar” field can be used for building search queries based on part-of-speech tags and grammatical categories. To use it, press the button immediately to the right of the “Grammar” search field; a pop-up window will appear where you can choose from the available grammatical tags. To select a marker, click on it with the left mouse button, and it will be highlighted. To cancel the selection, click on the marker again, and the highlighting will disappear.

The part-of-speech markers used in our corpus are explained in the following table.

Parts of speech

v – verb (including participle and converb), takes all the inflectional markers.

n – nominal (noun, adjective, pronoun, numeral, postposition), does not take negation, tense, aspect or mood markers.

Some nominals do not take case markers but do take personal markers (e.g. осхас ‘resembling’). We intend to unite such lexemes in a separate part of speech after further corpus research.

i1 – invariable, which can combine with endoclitics (including the particle -ох/-ӧх/-ӧк, which absorbs the last vowel of the stem). For example, піди ‘so’. This category covers most adverbs, including the grammaticalized forms of converbs.

i – invariable, which cannot combine with endoclitics (particle, conjunction, interjection)


In both the “Gloss” field and the “Grammar” field you can only search for forms with non-zero markers. The only exception is the imperative singular form, which is the bare form of the verb. It can be found by selecting the meanings “imp” and “2sg” in the “Grammar” field.

Markers with a number 1 or 2 (except “dur1”) are located nearer to the stem than the same markers without numbers, and are used mainly as word-formative markers. (Markers with number 1 are nearer to the stem than markers with number 2.)

Pl | ЛАр | non-predicative plural | иблер ‘homes’, парғаннарына ‘for those who went’
PredPl | ЛАр | predicative plural | парғаннар ‘they went’

Inner cases
Gen1 | НЫң, ДЫң | genitive | пістіңнер ‘ours’
Loc1 | ТА | locative | аалдағылар ‘those who are (living) in village’

“All1” and “Abl1” are very rare; they occur only in some grammaticalized forms combined with other cases.

Synchronically, the combination of “Gen1” and “3pos” is a cumulative marker ни (dialectal variant Ди), so we separate them with a dot rather than a hyphen. Example: сілерни / сілерди ‘yours’.


All the cases have allomorphs that are used with the singular possessive markers.

Most cases have dialectal variants. In some dialects the ablative and the instrumental cases share one morpheme.

Acc | НЫ, ДЫ, н | accusative | суғны / суғды ‘(drink) water’, суғын ‘(drink) his water’
Gen | НЫң, ДЫң, нЫң | genitive | азахтың ‘of leg’, азағының ‘of his leg’
Dat | ГА, (н)А | dative | ирге ‘to a man’, иріме ‘to my husband’
Loc | ТА, (н)ТА | locative | ибде ‘in the house’, ибінде ‘in his house’
All | САр, СА, САрЫ, нСАр, (н)СА, (н)САрЫ | allative | ибзер / ибзері / ибзе ‘towards a house’, ибінзер / ибінзері / ибінзе ‘towards his house’
Abl | ДАң, нАң | ablative | аалнаң / аалдаң ‘from a village’, аалынаң ‘from his village’
Instr | ДАң, НАң, нАң, БАң, (н)БАң, мАң, (н)мАң | instrumental | малтынаң / малтыдаң / малтыбаң ‘by an axe’, абамнаң / абаммаң ‘with my dad’
Prol | ЧА, (н)ЧА | prolative (equative) | чолӌа ‘on a road’, соонӌа ‘following him’
Delib | нАңАр, ДАңАр(Ы) | deliberative | аннаңар ‘because’, кибірлердеңері ‘about the traditions’

Possessive markers
1pos.sg | (Ы)м | 1st person singular possession (‘I’) | хызым ‘my daughter’
1pos.pl | (Ы)ПЫс | 1st person plural possession (‘us’) | хызыбыс ‘our daughter’
2pos.sg | (Ы)ң | 2nd person singular possession (‘you’) |
2pos.pl | (Ы)ңар | 2nd person plural possession (‘you’) | іӌеңер ‘your mother’
3pos | (з)Ы | 3rd person possession (‘he’, ‘she’, ‘it’, ‘they’) | аал пазы ‘village’s beginning’
3pos1 | (з)Ы | 3rd person possession (inner position) | аал пазындағылар ‘those who are (living) at the beginning of the village’

Perf | (Ы)бЫс | perfective | парыбысхан ‘he’s gone’
Perf0 | (Ы)с | perfective near the particle | чоохтаныпласчам ‘I speak almost every time’
Prosp.dial | АК, иК | prospective | парахча ‘is going to go’
Dur | чАт | durative | полчатсын ‘let it be’
Dur1 | А(р), и(р), ит | durative / present for the verbs парарға ‘go’, килерге ‘come’ | кили ‘comes now’
Iter | АдЫр, идЫр | iterative / present | тідирлер ‘they say’

RPast | ТЫ | recent past | килді ‘came (not long ago)’
Pres | чА | present | узупча ‘he sleeps’
Indir | ТЫр | evidential (indirective) | партыр ‘he went (they say)’
Evid | осхас | evidential (analytical form) | тіпчен осхас ‘he says (the speaker didn’t hear it himself)’
Affirm | ЧЫК | affirmative, subjunctive and other meanings | парарӌых ‘would come (if smth happened)’
Imp | | imperative; takes a special set of personal markers | ат ‘shoot’, парим ‘should I go’
Cond | СА | conditional | чатса ‘if it lies’
Opt | ГАй | optative | халғай ‘let it be left’
Simul | (А)АчЫК | simulative, converts the verb to a nominal | талаачых ‘simulating fainting’


We do not distinguish participle and finite forms with the same morphemes.

Past | ГАн | past tense | одырған ‘sat’
PresPt | чАн | present participle | хомай чуртапчан кізілер ‘badly living people’
PresPt1 | ин | present participle with the verbs пар ‘go’ and кил ‘come’ | сӱр парин остар ‘drive (as now)’
Fut | А(р), и(р) | future | килер ‘will come’
Neg.Fut | ПАс | negative future | килбес ‘will not come’
Hab | ЧА(ң) | habitual (past as finite form and present as non-finite form) | тоғынӌаң ‘worked (usually)’
Assum | ГАдАГ | assumptive («it seems that…») | хайтпаадағ ‘won’t happen (normally)’
Cunc | ГАлАК | cunctative («not yet…») | пысхалах ‘is not yet ripe’

ConvP | (Ы)п | consecutive converb | алып алып, парыбысхан ‘having bought, went away’
ConvA | А, и | simultaneous converb | чара парарға ‘to go separating’
Neg.Conv | Пи(н), ПААн | negative form of converb | хурғатпин тартырарға ‘to grind without drying’

Person markers
1sg | (Ы)м, СЫм, ПЫн, им | 1st singular person marker | парам ‘I will go’
1pl | ПЫс, иБЫс | 1st plural person marker | парарбыс ‘we’ll go’
2sg | (Ы)ң, СЫң | 2nd singular person marker | парғаң ‘you went’
2pl | ңар, САр, (Ы)ңАр | 2nd plural person marker | парғазар ‘you (pl) went’
3 | Ø, СЫн | 3rd person marker (overt only in the imperative; a zero marker cannot be distinguished automatically from the absence of a marker) | ползын ‘let it be’
1.incl | Аң | inclusive imperative singular («I and you (sg)») | параң ‘let’s (two of us) go!’
1pl.incl | АңАр, АлАр | inclusive imperative plural («I and you (pl)») | параңар / паралар ‘let’s (all) go!’

Neg | ПА | negation | парба ‘don’t go’

Distr | (К)лА | distributive | тастағлаабыс ‘we threw (many things)’
NF | Ø / (Ы)п | word-formative marker from ConvP, which is used in some synthetic and analytical forms | пар-Ø-ча ‘goes’, сана-п-ча ‘counts’
Compl | тіп | complementizer (separate word) | парғам чаблах одалирға тіп ‘I went to dig potatoes’

Word formation

Word-formative markers are not separated from the stem by a hyphen, so they can be found only by searching in the “Grammar” field.

Nominal word formation
Attr | КЫ | attributivizer (of locative and temporal forms) | аалдағы ‘situated in village’, пурунғы ‘prior’
Adv | Ли | adverbializer | полосали ‘in stripes’
Comit | ЛЫГ | comitative («with…», «having…») | тадылығ ‘tasty’, аттығ ‘on a horse, with a horse’
Dimin | (Ы)ӌАК | diminutive | хызыӌах ‘(small) girl’

Numeral word formation
Coll | ОлАң, АлАң | collective numeral | ікӧлең / ікелең ‘twosome, two together’
Distr | Ар | distributive numeral | пизер ‘by five’

Verb word formation (voices)
Caus | т, тЫр | causative (also used as passive) | итірбе ‘don’t do (with the help of others)’
Pass | (Ы)л | passive | салылған ‘(been) put’
Refl | (Ы)н | reflexive | чоохтанча ‘says’
Rec | (Ы)с | reciprocal | ылғазып ‘crying together’


Many of the endoclitic particles are written as separate words in Khakas orthography, although many of them show regular phonetic alternations, and some of them are used as enclitics.

Q | па, пе, ма, ме, ба, бе | general question particle | парған ма? ‘(he) came?’
qpart | чи | question particle | а тігілер чи? ‘and they?’
Foc | ТЫр | focus particle | адың кемдір? ‘what’s your name?’
Magn | reduplication of the 1st syllable + п | high degree, superlative | тап-тадылығ ‘very tasty’
Emph | за, зе, нооза, нізе, and others | emphatic particle | ылғапча нізе ‘cries indeed’
Confpart | ізе | confirmative particle | “ізе” тіпче ‘says «yes»’
Indef | ТА, тА | indefinite pronoun particle | хайдағ-да / хайдағ-та ‘some’
Ass | ОК | associative | парохтар ‘they are (there) too’
Cont | LA | continuative | хырарлача ‘reddens all the time’
Add | ТАА | additive particle | мин дее ‘even me’
Prec | ТАК | precative particle (polite request in some dialects) | пирдек ‘give, please’


The “Gloss” field allows you to submit search queries that concern the morphemic structure of word forms. In general, this type of search is functionally similar to the search in the “Grammar” field. In particular, the list of markers that can be viewed by clicking the button next to the “Gloss” field largely overlaps with the list given for the “Grammar” field.

The general principle of search by a gloss and the major differences of this type of search from the grammatical search are described in the “Help” section (the button with a question mark in the top right corner of the search window). The key features of the gloss-based search that are specific for this corpus are given below.

All dialectal markers have the label .dial, both the dialectal morpheme variants (Acc.dial) and the markers that are not used in literary Khakas (Prosp.dial).

Gloss-based search does not include the word forms where there is no morphemic border between the marker in question and the stem. For instance, the dative case form of the pronoun син ‘you’ is сегее / сағаа / сее, which is not segmentable into morphemes and glossed “you.DAT”. This form will be among the hits of the grammatical query “dat”, but will not be included in the occurrences corresponding to the gloss-based search “STEM-DAT”.
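The difference between the two kinds of search can be illustrated with a small Python sketch. It is not the code of the search platform: the token records, the English stem glosses (“man”, “you”) and the function names are hypothetical, and the only assumptions taken from the text are that morphemes in a gloss are separated by hyphens and that an unsegmentable form carries its grammatical meaning after a dot.

```python
def has_grammar_tag(tags, tag):
    """Grammatical search: the tag only has to be present in the analysis."""
    return tag.lower() in {t.lower() for t in tags}


def has_separate_morpheme(gloss_line, marker):
    """Gloss search: the marker must be a morpheme of its own after the stem."""
    affixes = gloss_line.split("-")[1:]  # the first part is the stem gloss
    return marker.lower() in (a.lower() for a in affixes)


# Hypothetical token records for a segmentable and an unsegmentable dative.
noun = {"form": "ирге", "gloss": "man-Dat", "tags": {"n", "dat"}}
pronoun = {"form": "сағаа", "gloss": "you.DAT", "tags": {"n", "dat"}}

for token in (noun, pronoun):
    print(
        token["form"],
        has_grammar_tag(token["tags"], "dat"),
        has_separate_morpheme(token["gloss"], "DAT"),
    )
```

Both tokens satisfy the grammatical query, but only the noun satisfies the gloss-based one, because the gloss of the pronoun contains no morpheme border before the dative meaning.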

The gloss-based query can be constructed using either specific glosses or the options CASE, CASE1, POSS, PRTCP, CONV, PERSON. These labels specify a group of morphemes rather than a specific gloss. CASE stands for any case marker, CASE1 stands for any case marker in inner position, POSS stands for any possessive marker, PRTCP stands for any participle marker, CONV stands for any converb marker, and PERSON stands for any person marker.
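How a group label differs from a specific gloss can be sketched as follows. This is a simplified illustration, not the platform's matching algorithm: the membership of each group is an assumption read off the marker tables in this text (PRTCP is left out because the text does not list the participle markers separately), the function names are hypothetical, and a query is assumed here to describe the whole word form, one slot per morpheme.

```python
# Assumed group membership, taken from the marker tables (lower-cased).
GROUPS = {
    "CASE": {"acc", "gen", "dat", "loc", "all", "abl", "instr", "prol", "delib"},
    "CASE1": {"gen1", "loc1", "all1", "abl1"},
    "POSS": {"1pos.sg", "1pos.pl", "2pos.sg", "2pos.pl", "3pos"},
    "CONV": {"convp", "conva", "neg.conv"},
    "PERSON": {"1sg", "1pl", "2sg", "2pl", "3", "1.incl", "1pl.incl"},
}


def slot_matches(slot, gloss):
    """Check one query slot against one morpheme gloss."""
    if slot == "STEM":
        return True                      # STEM accepts any stem gloss
    if slot in GROUPS:
        return gloss.lower() in GROUPS[slot]
    return slot.lower() == gloss.lower()  # a specific gloss must match exactly


def query_matches(query, gloss_line):
    """Check a hyphen-separated query against a hyphen-separated gloss line."""
    slots = query.split("-")
    glosses = gloss_line.split("-")
    return len(slots) == len(glosses) and all(
        slot_matches(s, g) for s, g in zip(slots, glosses)
    )


print(query_matches("STEM-POSS-CASE", "home-3pos-Loc"))
print(query_matches("STEM-CASE", "you.DAT"))
```

The first query finds ибінде [home-3pos-Loc] through the group labels alone, while the second does not find the unsegmentable pronoun form glossed “you.DAT”.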

Corpus Composition

The corpus contains:

- 23 texts in the Askiz dialect, collected in the village of Kazanovka in 2001-2002 during the expedition of the Linguistic Department of the Russian State University for the Humanities, headed by Nina Sumbatova. They contain about 13 000 tokens; the duration is 2 h 18 min.

- 27 texts in the Belty dialect, collected in 2011 in the villages of Butrachty, Chylany and Karagay by Anna Dybo and Elvira Kyrzhinakova. They contain about 45 000 tokens; the duration is 9 h 22 min.

Texts in other dialects (Kacha, Kyzyl, Shor) will be added.

Project participants

The processing of the texts for inclusion in the corpus was conducted in 2017. This work was carried out by Vera Maltseva.

The Spoken corpus of the dialects of Khakas is a project supported by the Linguistic Convergence Laboratory of the National Research University Higher School of Economics. It is conducted within the framework of the Basic Research Program at the National Research University 'Higher School of Economics' (HSE) and supported as part of the Russian Academic Excellence Project '5-100'.

This project is part of a larger project on the documentation of the Khakas language http://khakas.altaica.ru. The following people are involved:

Anna Dybo – project supervisor, automatic parser creation
Elvira Sultrekova (Kyrzhinakova) – transcription and translation of most of the texts (the texts collected in Kazanovka village in 2001-2002 were mostly transcribed and translated by the members of the expedition with the help of the villagers).
Alexandra Sheymovich – dictionary (conversion of the Khakas-Russian dictionary (Novosibirsk, 2006, 220 th. of words) into a Starling-based electronic dictionary)
Phil Krylov – programming of the automatic parser
Vera Maltseva – automatic parser creation, text processing in ELAN (correction of the parser’s results, annotation of sound)
Elena Tenkova – macros for writing the results of automatic parsing of the texts into ELAN files


You may contact us with questions about the Corpus:
Vera Maltseva: malt.wh@gmail.com

Or with questions about the search platform:
Elena Sokur: elena.o.sokur@gmail.com

How to cite the corpus

If you use data from the Spoken corpus of the dialects of Khakas in your research, please cite as follows:

Vera Maltseva, Elena Sokur. Spoken corpus of the dialects of Khakas. Moscow: Institute of Linguistics; Moscow: Linguistic Convergence Laboratory, NRU HSE. (Available online at https://linghub.ru/oral_khakas_corpus/, accessed on .)