Published: 2022-03-14

The use of Russian-language internet news corpora for the purposes of automatic speech recognition systems in the area of the media monitoring

Daniel Borysowski

Website: http://dborysowski.info

Abstract

The author used open internet news corpora (NewsRu and Taiga) to build N-gram language models for automatic speech recognition systems. The models were evaluated comprehensively: perplexity, word error rate (WER), proper-name recognition, and comparison with a baseline model and Google ASR. The author also rescored the N-gram models using recurrent neural networks. Model effectiveness was assessed by recognizing speech from the news channel Россия 24 (37 files with a total length of 1.5 hours were tested). The test data were chosen to match the main goal of the article: speech recognition for media monitoring.

Citation rules

Borysowski, D. (2022). The use of Russian-language internet news corpora for the purposes of automatic speech recognition systems in the area of the media monitoring. Przegląd Rusycystyczny [Russian Studies Review], 1(177). https://doi.org/10.31261/pr.12741

No. 1(177) (2022)

ISSN: 0137-298X

Publisher
Polskie Towarzystwo Rusycystyczne oraz Wydawnictwo Uniwersytetu Śląskiego
