DISCUSSING PERSPECTIVES OF DEVELOPMENT OF AN OFFLINE UKRAINIAN-SPEAKING LLM-BASED ASSISTANT INTEGRATED WITH SPEECH SYNTHESIS & RECOGNITION TECHNOLOGIES

Authors

Іванов В., Гобир Л., Ваврик Т.

DOI:

https://doi.org/10.32689/maup.it.2025.1.10

Keywords:

large language model, speech synthesis, automatic speech recognition, voice assistant

Abstract

The purpose of the article is to analyze modern approaches and technical possibilities for implementing a full-fledged Ukrainian-speaking voice assistant that meets the needs for autonomy, confidentiality and personalization and can be flexibly customized to the specific requirements of target consumers. The paper examines the shortcomings of existing solutions, using well-known cloud platforms as examples, and emphasizes the need to develop independent, autonomous alternatives. Known open-source projects for speech recognition and human-like text-to-speech synthesis were analyzed to identify those that provide high processing quality at relatively low resource cost, support multiple languages including Ukrainian, and can be used to implement a demo application. Methods and techniques for reducing the hardware requirements of the end system are analyzed to ensure its efficient operation in resource-limited environments. Particular attention is paid to optimizing the mathematical operations of model inference by exploiting the hardware acceleration capabilities of individual computing platforms. In addition, the article considers how to integrate each component into a single system based on a microservice architecture, with the ability to adapt these tools to the specific needs of users. The scientific novelty lies in systematizing information on technologies and tools suitable for creating such assistants without cloud services, combined with various optimization techniques for deploying such systems on consumer-level devices. The article demonstrates that if the components are combined with minimal interaction latency, performance and processing quality improve without a significant increase in resource requirements, and with a properly implemented integration layer the result is a competitive autonomous voice assistant that, thanks to an open platform, can be embedded into any system.
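
Illustrative sketch. The article describes an offline pipeline of speech recognition, LLM-based response generation and speech synthesis; a minimal version of that loop is sketched below under assumed tooling: faster-whisper (a CTranslate2 port of the Whisper model cited in the references), llama-cpp-python for quantized LLM inference, and Coqui TTS. The model names, checkpoint paths and parameters are illustrative placeholders, not the authors' actual configuration.

# Minimal offline ASR -> LLM -> TTS loop; all three models run locally,
# so no cloud services are involved. Assumed stack: faster-whisper,
# llama-cpp-python and Coqui TTS.
from faster_whisper import WhisperModel  # CTranslate2 port of OpenAI Whisper
from llama_cpp import Llama              # llama.cpp bindings for GGUF models
from TTS.api import TTS                  # Coqui TTS

# int8 quantization keeps the ASR model within a consumer CPU/RAM budget.
asr = WhisperModel("small", device="cpu", compute_type="int8")

# A 4-bit-quantized GGUF checkpoint (the path is a placeholder); setting
# n_gpu_layers > 0 would offload layers to a GPU where acceleration exists.
llm = Llama(model_path="models/assistant-q4_k_m.gguf", n_ctx=2048, n_gpu_layers=0)

# A Ukrainian Coqui voice; the model name is illustrative.
tts = TTS(model_name="tts_models/uk/mai/glow-tts")

def answer(wav_in: str, wav_out: str) -> str:
    # 1. Speech recognition, forced to Ukrainian.
    segments, _info = asr.transcribe(wav_in, language="uk")
    user_text = " ".join(seg.text for seg in segments).strip()

    # 2. Response generation with the local LLM.
    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_text}],
        max_tokens=256,
    )["choices"][0]["message"]["content"]

    # 3. Speech synthesis of the reply to a WAV file.
    tts.tts_to_file(text=reply, file_path=wav_out)
    return reply

In the microservice architecture the abstract describes, each of the three stages would run as a separate local service behind a lightweight API, so that individual models can be swapped, quantized or scaled independently.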

References

Bernard M., Titeux H. Phonemizer: Text to Phones Transcription for Multiple Languages in Python. Journal of Open Source Software. 2021. Vol. 6, Issue 68. P. 3958. DOI:10.21105/joss.03958.

Coqui TTS – deep learning toolkit for Text-to-Speech, battle-tested in research and production. GitHub. URL: https://github.com/coqui-ai/TTS.

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers. GitHub. URL: https://github.com/mozilla/DeepSpeech.

eSpeak NG is an open source speech synthesizer that supports more than a hundred languages and accents. GitHub. URL: https://github.com/espeak-ng/espeak-ng.

Gemma Team, Google DeepMind. Gemma 3 Technical Report. URL: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf.

Kim S., Shih K. J., Badlani R. et al. P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting. Thirty-seventh Conference on Neural Information Processing Systems (2023). URL: https://openreview.net/forum?id=zNA7u7wtIN.

Mistral AI Models Overview. Mistral AI Documentation. URL: https://docs.mistral.ai/getting-started/models/models_overview.

NVIDIA NeMo Framework ASR Models. NeMo Framework User Guide. URL: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html.

Shih K. J., Valle R., Badlani R. et al. RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis. ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (2021). URL: https://openreview.net/forum?id=0NQwnnwAORi.

Speech Recognition & Synthesis for Ukrainian. GitHub. URL: https://github.com/egorsmkv/speech-recognition-uk.

StyleTTS2 Ukrainian Demo. Hugging Face. URL: https://huggingface.co/spaces/patriotyk/styletts2-ukrainian.

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. OpenAI. URL: https://openai.com/index/whisper.

whisper-small-uk-v2. Hugging Face. URL: https://huggingface.co/nikes64/whisper-small-uk-v2.

Baevski A., Hsu W.-N., Xu Q. et al. Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. URL: https://ai.meta.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language.

Chung Y.-A., Zhang Y., Han W. et al. W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. arXiv, 2021. DOI:10.48550/ARXIV.2108.06209.

Li Y. A., Han C., Raghavan V. S. et al. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv, 2023. DOI:10.48550/ARXIV.2306.07691.

Touvron H., Lavril T., Izacard G. et al. LLaMA: Open and Efficient Foundation Language Models. arXiv, 2023. DOI:10.48550/ARXIV.2302.13971.

Published

2025-05-28

How to Cite

ІВАНОВ, В., ГОБИР, Л., & ВАВРИК, Т. (2025). DISCUSSING PERSPECTIVES OF DEVELOPMENT OF AN OFFLINE UKRAINIAN-SPEAKING LLM-BASED ASSISTANT INTEGRATED WITH SPEECH SYNTHESIS & RECOGNITION TECHNOLOGIES. Information Technology and Society, (1(16)), 80-86. https://doi.org/10.32689/maup.it.2025.1.10