Vaksancayah - Sanskrit Speech Corpus

Vāksañcayaḥ is a Sanskrit speech corpus containing more than 78 hours of audio: recordings of 45,953 sentences sampled at 22 kHz. The content consists mainly of readings of texts spanning many Śāstras (domains) of Sanskrit literature, and also includes contemporary stories, radio programs, extempore discourse, etc., which makes the dataset diverse both chronologically and in domain coverage. The audio files were transcribed with the software oTranscribe, followed by several careful rounds of cleaning to check sentence-boundary alignment, transcript correctness, etc. Content for 18 unique speakers was collected online, and 9 volunteers were involved in the recording. Using this corpus, an Automatic Speech Recognition (ASR) system for Sanskrit was built and published at an A* conference, the 59th Annual Meeting of the Association for Computational Linguistics (ACL), in 2021. References: https://www.cse.iitb.ac.in/~asr/, https://arxiv.org/abs/2106.05852.

Prof. Ramasubramanian

Humanities and Social Sciences

Type Of IP Licensing

Know-how

Software

Industrial Research And Consultancy Centre

Address