
Spoken corpus of the dialects of Khakas

The Spoken corpus of the dialects of Khakas contains transcribed and annotated texts synchronized with the sound. The texts were recorded in the 21st century from speakers born between 1916 and 1985, during several expeditions from Moscow to the Republic of Khakassia. All texts are translated into Russian. The texts were analyzed with the automatic parser, then edited and synchronized with the sound in the ELAN software.

This corpus is related to the project «Electronic Corpus of Khakas language». Follow the link to see more information about the aims and methods of the project (Russian only, for now).



We use the symbols of the Cyrillic Khakas alphabet to transcribe our texts because the parser was built to analyze Literary Khakas. Regular phonetic features of the dialects are mostly ignored, but we tried to reflect morphological and morphonological features.


In the morpheme segmentation line, stems are given in phonological notation (transcription) and affixes in morphonological notation. Each affix is given in a single form that unites its allomorphs with regular alternations. For example, the word form туралар ‘houses’ appears in the morpheme segmentation line as тура‑ЛАр. The plural marker has the allomorphs ‑лар, ‑лер, ‑нар, ‑нер, ‑тар, ‑тер, and in the glossing line a single morpheme ‑ЛАр corresponds to all of them. Morphonemes without alternations are written with the same symbols as phonemes.

See more on the subject in the paper: Anna V. Dybo, Philip S. Krylov, Vera S. Maltseva, Aleksandra V. Sheimovich. Segmental rules in the automatic parser for the Khakass corpus. In: Ural-Altaic studies. N 1 (32), 2019. P. 48-69 (in Russian) https://iling-ran.ru/library/ural-altaic/ua2019_32.pdf

Consonant morphonemes
П: б/п/м
К: ғ/г/х/к
Г: ғ/г/х/к/Ø
Т: т/д
Д: т/д/н
С: с/з
Л: л/н/т
L: л/н
Н: н/т
Ч: ч/ҷ

Vocal morphonemes
А: е/а
Ы: i/ы
О: о/ö
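The relation between a morphonological affix form and its allomorphs can be sketched in a few lines of Python. This is only an illustration, not the project's parser: the function name `candidate_allomorphs` is hypothetical, the vowel realizations are written with the Khakas letters і and ӧ (an assumption about the intended characters), and the sketch lists every combination the table allows without choosing the one that fits a given stem.

```python
from itertools import product

# Morphoneme -> possible surface realizations, copied from the table above.
# "" stands for Ø (no surface segment).
MORPHONEMES = {
    "П": ["б", "п", "м"],
    "К": ["ғ", "г", "х", "к"],
    "Г": ["ғ", "г", "х", "к", ""],
    "Т": ["т", "д"],
    "Д": ["т", "д", "н"],
    "С": ["с", "з"],
    "Л": ["л", "н", "т"],
    "L": ["л", "н"],  # Latin L, as in the table
    "Н": ["н", "т"],
    "Ч": ["ч", "ҷ"],
    "А": ["е", "а"],
    "Ы": ["і", "ы"],  # assumption: Khakas і for the table's "i"
    "О": ["о", "ӧ"],  # assumption: Khakas ӧ for the table's "ö"
}


def candidate_allomorphs(affix: str) -> set[str]:
    """Expand each morphoneme into its realizations; keep other symbols as-is."""
    options = [MORPHONEMES.get(symbol, [symbol]) for symbol in affix]
    return {"".join(combo) for combo in product(*options)}


# The plural marker from the example above.
plural = candidate_allomorphs("ЛАр")
print(sorted(plural))
```

For ЛАр the result is exactly the six plural allomorphs listed in the text. Optional segments written in parentheses, such as (Ы)м, are not handled by this sketch.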


This corpus belongs to a group of corpora built on the search platform tsakorpus. General instructions covering the technical properties these corpora share can be found in the “Help” section (look for the button marked with a question mark in the top right corner of the search page). The current text describes the rules and conventions that are specific to this corpus.

Searching for specific word forms («Word»)

In this field, you can enter specific word forms that you want to find.

For example, ибде ‘at home’, килген ‘he came’.

Searching for specific lemmas («Lemma»)

This field should be used if you need to find all forms of a given word (lexeme or lemma).

For example, if you enter “иб” ‘home’, the search results will show all sentences where any form of this noun is used, e.g., иб ‘home’, ибні [home-ACC] ‘home (direct object)’, ибінде [home-3pos-Loc] ‘at his home’, etc.

Lemmas should be entered in this field in their base form, that is, the form used in dictionaries. For nouns, adjectives, adverbs, pronouns and numerals, the base form is the same as the stem (e.g., иб ‘house’, кічіг ‘small’, ам ‘now’, син ‘you’, ікі ‘two’). For verbs, in accordance with lexicographic practice, the infinitive form with the suffixes Ar GA is used, e.g., тоғынарға ‘to work’, килерге ‘to come’, etc.

For the case forms (including locatives) of the pronouns ол ‘that’ and пу ‘this’, the lemmas ол and пу are used, although dictionaries list their case forms as separate lexemes. The only exceptions are the substantivized forms with the 3rd person possessive marker, ан(ы)зы, мын(ы)зы and пунызы, which we treat as separate lexemes. All forms of a personal pronoun (with the same person and number) likewise belong to one lemma: мин ‘I’, син ‘you’, олар ‘they’, etc.


The “Grammar” field can be used to build search queries based on part-of-speech tags and grammatical categories. To use it, press the button immediately to the right of the “Grammar” search field; a pop-up window opens where you can choose from the available grammatical tags. To select a marker, left-click it, and it will be highlighted. To cancel the selection, click the marker again, and the highlighting will turn off.

The part-of-speech markers used in our corpus are explained in the following table.

Parts of speech

a verb (including participle and converb), takes all the inflectional markers.

a nominal (noun, adjective, pronoun, numeral, postposition), doesn’t take negation, tense, aspect or mood markers.

Some nominals don’t take case markers but do take person markers (e.g., осхас ‘resembling’). We plan to group such lexemes into a separate part of speech after further corpus research.

i1 – an invariable which can combine with endoclitics (including the particle -ох/-ӧх/-ӧк, which absorbs the last vowel of the stem). For example, піди ‘so’. This category unites most adverbs, including the grammaticalized forms of converbs.

i – an invariable which can’t combine with endoclitics (particle, conjunction, interjection)


In both the “Gloss” field and the “Grammar” field you can only search for forms with non-zero markers. The only exception is the imperative singular form, which is a bare form of the verb. You can find it by selecting the options “imp” and “2sg” in the “Grammar” field.
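Selecting several options such as “imp” and “2sg” amounts to requiring that a word form carries all of the selected tags. The sketch below illustrates that idea only; it is not the search platform's code. The function name `matches_grammar`, the comma-separated query string, the sample annotations and the part-of-speech tag `v` are all assumptions made for the example.

```python
def matches_grammar(token_tags: set[str], query: str) -> bool:
    """Return True if the token carries every tag named in the query.

    The query is assumed to be a comma-separated list of tags.
    """
    wanted = {tag.strip().lower() for tag in query.split(",") if tag.strip()}
    return wanted <= {tag.lower() for tag in token_tags}


# Hypothetical annotations, for illustration only.
tokens = {
    "ат": {"v", "imp", "2sg"},   # bare imperative 'shoot'
    "килді": {"v", "rpast"},     # recent past 'came'
}

hits = [word for word, tags in tokens.items() if matches_grammar(tags, "imp,2sg")]
print(hits)
```

Only the bare imperative form is returned, even though it has no overt marker, because the match is made on tags rather than on morphemes.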

Markers with the number 1 or 2 (excluding “dur1”) are placed nearer to the stem than the same markers without numbers, and are used mainly as word-formation markers. (Markers with the number 1 are nearer to the stem than markers with the number 2.)

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Pl | ЛАр | non-predicative plural | иблер ‘homes’, парғаннарына ‘for those who went’ |
| PredPl | ЛАр | predicative plural | парғаннар ‘they went’ |

Inner cases

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Gen1 | НЫң, ДЫң | genitive | пістіңнер ‘ours’ |
| Loc1 | ТА | locative | аалдағылар ‘those who are (living) in village’ |

“All1” and “Abl1” are very rare; they occur only in some grammaticalized forms combined with other cases.

The combination of “Gen1” and “3pos” is synchronically a cumulative marker ни (dialectal variant Ди), so we separate the two tags with a dot rather than a hyphen. Example: сілерни / сілерди ‘yours’.


All cases have allomorphs which are used with the possessive singular markers.

Most cases have dialectal variants. The ablative and the instrumental cases use one morpheme in some dialects.

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Acc | НЫ, ДЫ, н | accusative | суғны / суғды ‘(drink) water’, суғын ‘(drink) his water’ |
| Gen | НЫң, ДЫң, нЫң | genitive | азахтың ‘of leg’, азағының ‘of his leg’ |
| Dat | ГА, (н)А | dative | ирге ‘to a man’, иріме ‘to my husband’ |
| Loc | ТА, (н)ТА | locative | ибде ‘in the house’, ибінде ‘in his house’ |
| All | САр, СА, САрЫ, нСАр, (н)СА, (н)САрЫ | allative | ибзер / ибзері / ибзе ‘towards a house’, ибінзер / ибінзері / ибінзе ‘towards his house’ |
| Abl | ДАң, нАң | ablative | аалнаң / аалдаң ‘from a village’, аалынаң ‘from his village’ |
| Instr | ДАң, НАң, нАң, БАң, (н)БАң, мАң, (н)мАң | instrumental | малтынаң / малтыдаң / малтыбаң ‘by an axe’, абамнаң / абаммаң ‘with my dad’ |
| Prol | ЧА, (н)ЧА | prolative (equative) | чолӌа ‘on a road’, соонӌа ‘following him’ |
| Delib | нАңАр, ДАңАр(Ы) | deliberative | аннаңар ‘because’, кибірлердеңері ‘about the traditions’ |

Possessive markers

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| 1pos.sg | (Ы)м | 1st person singular possession (‘I’) | хызым ‘my daughter’ |
| 1pos.pl | (Ы)ПЫс | 1st person plural possession (‘us’) | хызыбыс ‘our daughter’ |
| 2pos.sg | (Ы)ң | 2nd person singular possession (‘you’) | |
| 2pos.pl | (Ы)ңар | 2nd person plural possession (‘you’) | іӌеңер ‘your mother’ |
| 3pos | (з)Ы | 3rd person possession (‘he’, ‘she’, ‘it’, ‘they’) | аал пазы ‘village’s beginning’ |
| 3pos1 | (з)Ы | 3rd person possession (inner position) | аал пазындағылар ‘those who are (living) in the beginning of the village’ |

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Perf | (Ы)бЫс | perfective | парыбысхан ‘he’s gone’ |
| Perf0 | (Ы)с | perfective near the particle | чоохтаныпласчам ‘I speak almost every time’ |
| Prosp.dial | АК, иК | prospective | парахча ‘is going to go’ |
| Dur | чАт | durative | полчатсын ‘let it be’ |
| Dur1 | А(р), и(р), ит | durative / present for the verbs парарға ‘go’, килерге ‘come’ | кили ‘comes now’ |
| Iter | АдЫр, идЫр | iterative / present | тідирлер ‘they say’ |

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| RPast | ТЫ | recent past | килді ‘came (not long ago)’ |
| Pres | чА | present | узупча ‘he sleeps’ |
| Indir | ТЫр | evidential (indirective) | партыр ‘he went (they say)’ |
| Evid | осхас | evidential (analytical form) | тіпчен осхас ‘he says (the speaker didn’t hear it himself)’ |
| Affirm | ЧЫК | affirmative, subjunctive and other meanings | парарӌых ‘would come (if smth happened)’ |
| Imp | | imperative; takes a special set of person markers | ат ‘shoot’, парим ‘should I go’ |
| Cond | СА | conditional | чатса ‘if it lies’ |
| Opt | ГАй | optative | халғай ‘let it be left’ |
| Simul | (А)АчЫК | simulative, converts the verb to a nominal | талаачых ‘simulating fainting’ |


We do not distinguish participle and finite forms with the same morphemes.

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Past | ГАн | past tense | одырған ‘sat’ |
| PresPt | чАн | present participle | хомай чуртапчан кізілер ‘badly living people’ |
| PresPt1 | ин | present participle with the verbs пар ‘go’ and кил ‘come’ | сӱр парин остар ‘drive (as now)’ |
| Fut | А(р), и(р) | future | килер ‘will come’ |
| Neg.Fut | ПАс | negative future | килбес ‘will not come’ |
| Hab | ЧА(ң) | habitual (past as finite form and present as non-finite form) | тоғынӌаң ‘worked (usually)’ |
| Assum | ГАдАГ | assumptive («it seems that…») | хайтпаадағ ‘won’t happen (normally)’ |
| Cunc | ГАлАК | cunctative («not yet…») | пысхалах ‘is not yet ripe’ |

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| ConvP | (Ы)п | consequative converb | алып алып, парыбысхан ‘having bought, went away’ |
| ConvA | А, и | simultaneous converb | чара парарға ‘to go separating’ |
| Neg.Conv | Пи(н), ПААн | negative form of converb | хурғатпин тартырарға ‘to grind without drying’ |

Person markers

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| 1sg | (Ы)м, СЫм, ПЫн, им | 1st singular person marker | парам ‘I will go’ |
| 1pl | ПЫс, иБЫс | 1st plural person marker | парарбыс ‘we’ll go’ |
| 2sg | (Ы)ң, СЫң | 2nd singular person marker | парғаң ‘you went’ |
| 2pl | ңар, САр, (Ы)ңАр | 2nd plural person marker | парғазар ‘you (pl) went’ |
| 3 | Ø, СЫн | 3rd person marker (marked form only with imperative; it’s not possible to distinguish zero marker and the absence of marker in the word automatically) | ползын ‘let it be’ |
| 1.incl | Аң | inclusive imperative singular («I and you (sg)») | параң ‘let’s (two of us) go!’ |
| 1pl.incl | АңАр, АлАр | inclusive imperative plural («I and you (pl)») | параңар / паралар ‘let’s (all) go!’ |

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Neg | ПА | negation | парба ‘don’t go’ |

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Distr | (К)лА | distributive | тастағлаабыс ‘we threw (many things)’ |
| NF | Ø / (Ы)п | word-formative marker from ConvP, which is used in some synthetic and analytical forms | пар-Ø-ча ‘goes’, сана-п-ча ‘counts’ |
| Compl | тіп | complementizer (separate word) | парғам чаблах одалирға тіп ‘I went to dig potatoes’ |

Word formation

Word-formation markers are not separated from the stem by a hyphen, so they can be found only by searching in the “Grammar” field.

Nominal word formation

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Attr | КЫ | attributivizer (of locative and temporal forms) | аалдағы ‘situated in village’, пурунғы ‘prior’ |
| Adv | Ли | adjectivizer | полосали ‘in stripes’ |
| Comit | ЛЫГ | comitative («with…», «having…») | тадылығ ‘tasty’, аттығ ‘on a horse, with a horse’ |
| Dimin | (Ы)ӌАК | diminutive | хызыӌах ‘(small) girl’ |

Numeral word formation

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Coll | ОлАң, АлАң | collective numeral | ікӧлең / ікелең ‘twosome, two together’ |
| Distr | Ар | distributive numeral | пизер ‘by five’ |

Verb word formation (voices)

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Caus | т, тЫр | causative (also used as passive) | итірбе ‘don’t do (with the help of other)’ |
| Pass | (Ы)л | passive | салылған ‘(been) put’ |
| Refl | (Ы)н | reflexive | чоохтанча ‘says’ |
| Rec | (Ы)с | reciprocal | ылғазып ‘crying together’ |


Many of the endoclitic particles are written as separate words in Khakas orthography, although many of them show regular phonetic alternations, and some of them are used as enclitics.

| Tag | Marker | Meaning | Example |
|---|---|---|---|
| Q | па, пе, ма, ме, ба, бе | general question particle | парған ма? ‘(he) came?’ |
| qpart | чи | question particle | а тігілер чи? ‘and they?’ |
| Foc | ТЫр | focus particle | адың кемдір? ‘what’s your name?’ |
| Magn | reduplication of the 1st syllable + п | high degree, superlative | тап-тадылығ ‘very tasty’ |
| Emph | за, зе, нооза, нізе, and others | emphatic particle | ылғапча нізе ‘cries indeed’ |
| Confpart | ізе | confirmative particle | “ізе” тіпче ‘says «yes»’ |
| Indef | ТА, тА | indefinite pronoun particle | хайдағ-да / хайдағ-та ‘some’ |
| Ass | ОК | associative | парохтар ‘they are (there) too’ |
| Cont | LA | continuative | хырарлача ‘reddens all the time’ |
| Add | ТАА | additive particle | мин дее ‘even me’ |
| Prec | ТАК | precative particle (polite request in some dialects) | пирдек ‘give, please’ |
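The Magn pattern (reduplication of the 1st syllable + п) can be illustrated with a short Python sketch. It rests on two assumptions that the source does not state: that the copied part runs up to and including the first vowel of the word, and that the vowel letters are those listed in `KHAKAS_VOWELS`. The function name `magn` is hypothetical, and the sketch is not a description of how the parser treats such forms.

```python
# Vowel letters assumed for this sketch (Khakas Cyrillic alphabet).
KHAKAS_VOWELS = set("аеёиіоӧуӱыэюя")


def magn(word: str) -> str:
    """Copy the word up to its first vowel, add п, and prefix it with a hyphen."""
    for index, letter in enumerate(word):
        if letter.lower() in KHAKAS_VOWELS:
            return f"{word[:index + 1]}п-{word}"
    raise ValueError(f"no vowel found in {word!r}")


print(magn("тадылығ"))
```

Applied to тадылығ ‘tasty’, the function returns the form тап-тадылығ given in the table.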


The “Gloss” field lets you submit search queries about the morphemic structure of word forms. In general, this type of search is functionally similar to the search in the “Grammar” field. In particular, the list of markers that can be viewed by clicking the button next to the “Gloss” field largely overlaps with the list given for the “Grammar” field.

The general principle of search by a gloss and the major differences of this type of search from the grammatical search are described in the “Help” section (the button with a question mark in the top right corner of the search window). The key features of the gloss-based search that are specific for this corpus are given below.

All the dialectal markers have the .dial label, both the morpheme variants (Acc.dial) and the markers which are not used in literary Khakas (Prosp.dial).

Gloss-based search does not include word forms where there is no morphemic border between the marker in question and the stem. For instance, the dative case form of the pronoun син ‘you’ is сегее / сағаа / сее, which cannot be segmented into morphemes and is glossed “you.DAT”. This form will be among the hits of the grammatical query “dat”, but will not be included in the results of the gloss-based search “STEM-DAT”.

The gloss-based query can be constructed using either specific glosses or the options CASE, CASE1, POSS, PRTCP, CONV, PERSON. These labels specify a group of morphemes rather than a specific gloss. CASE stands for any case marker, CASE1 stands for any case marker in inner position, POSS stands for any possessive marker, PRTCP stands for any participle marker, CONV stands for any converb marker, and PERSON stands for any person marker.
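The difference between a segmented gloss such as “home-3pos-Loc” and a cumulative gloss such as “you.DAT” can be shown with a simplified matcher. This is a sketch of the idea, not the platform's query engine: the function name `gloss_matches` is hypothetical, group membership is filled in from the tag tables on this page (3pos1, PRTCP, CONV and PERSON are left out because their exact membership is not stated here), and the pattern is assumed to describe the whole word, one hyphen-separated segment at a time.

```python
# Group labels, filled in from the tag tables of this page.
# Membership is an assumption of this sketch, not the platform's own list.
GROUPS = {
    "CASE": {"acc", "gen", "dat", "loc", "all", "abl", "instr", "prol", "delib"},
    "CASE1": {"gen1", "loc1", "all1", "abl1"},
    "POSS": {"1pos.sg", "1pos.pl", "2pos.sg", "2pos.pl", "3pos"},
}


def gloss_matches(gloss: str, pattern: str) -> bool:
    """Compare a hyphen-segmented gloss with a pattern, segment by segment."""
    segments = gloss.lower().split("-")
    wanted = pattern.split("-")
    if len(segments) != len(wanted):
        return False
    for segment, want in zip(segments, wanted):
        if want == "STEM":  # placeholder: accepts any segment
            continue
        if want in GROUPS:  # group label: any member of the group
            if segment not in GROUPS[want]:
                return False
        elif segment != want.lower():  # specific gloss: exact match
            return False
    return True


print(gloss_matches("home-3pos-Loc", "STEM-POSS-CASE"))
print(gloss_matches("you.DAT", "STEM-DAT"))
```

The first call succeeds because each marker is a separate segment. The second fails because “you.DAT” is a single segment with no morphemic border, which is why such forms are reachable only through the “Grammar” field.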

Corpus Composition

The corpus contains:

- 23 texts in the Askiz dialect, collected in the village of Kazanovka in 2001-2002 during the expedition of the Linguistic Department of the Russian State University for Humanities, headed by Nina Sumbatova. They contain about 13 000 tokens; the duration is 2 h 18 min.

- 27 texts in the Belty dialect, collected in 2011 in the villages of Butrachty, Chylany and Karagay by Anna Dybo and Elvira Kyrzhinakova. They contain about 45 000 tokens; the duration is 9 h 22 min.

Texts in other dialects (Kacha, Kyzyl, Shor) will be added.
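For orientation, the overall size of the corpus can be derived from the two figures above. The short calculation below uses only the approximate token counts and durations stated in the list; the totals are therefore approximate as well.

```python
# (dialect, approximate tokens, hours, minutes), as stated in the list above
subcorpora = [
    ("Askiz", 13_000, 2, 18),
    ("Belty", 45_000, 9, 22),
]

total_tokens = sum(tokens for _, tokens, _, _ in subcorpora)
total_minutes = sum(h * 60 + m for _, _, h, m in subcorpora)
hours, minutes = divmod(total_minutes, 60)

print(f"about {total_tokens} tokens, {hours} h {minutes} min")
```

This gives about 58 000 tokens and 11 h 40 min of recordings across the 50 texts.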

Project participants

The processing of the texts for inclusion in the corpus was conducted in 2017. This work was carried out by Vera Maltseva.

The Spoken corpus of the dialects of Khakas is a project supported by the Linguistic Convergence Laboratory of the National Research University Higher School of Economics. It is conducted within the framework of the Basic Research Program at the National Research University 'Higher School of Economics' (HSE) and supported as part of the Russian Academic Excellence Project '5-100'.

This project is a part of a large project on the documentation of the Khakas language http://khakas.altaica.ru. The following people are involved:

Anna Dybo – project supervisor, automatic parser creation
Elvira Sultrekova (Kyrzhinakova) – transcription and translation of most texts (texts collected in Kazanovka village in 2001-2002 were mostly transcribed and translated by the members of the expedition with the help of the people of the village).
Alexandra Sheymovich – dictionary (conversion of The Khakas-Russian dictionary
Phil Krylov – programming of the automatic parser
Vera Maltseva – automatic parser creation, text processing in ELAN (correction of the parser’s results, annotation of sound)
Elena Tenkova – macros for writing the results of automatic parsing into ELAN files


You may contact us with questions about the Corpus:
Vera Maltseva: malt.wh@gmail.com

Or with questions about the search platform:
Elena Sokur: elena.o.sokur@gmail.com

How to cite the corpus

If you use data from the Spoken corpus of the dialects of Khakas in your research, please cite as follows:

Vera Maltseva, Elena Sokur. Spoken corpus of the dialects of Khakas. Moscow: Institute of Linguistics; Moscow: Linguistic Convergence Laboratory, NRU HSE. (Available online at https://linghub.ru/oral_khakas_corpus/, accessed on .)