Speech

Speaker Diarization: From Traditional Methods to the Modern Models

Speaker diarization, the task of answering “Who spoke when?”, is a crucial component in many speech processing systems. From meeting transcription to customer service call analysis, diarization segments the signal by speaker, making downstream tasks like speech-to-text, emotion analysis, or intent identification much more effective.

Apr 28, 2025 6 min read Speech, Speaker-Diarization

Why Entropy Matters in Machine Learning?

Low vs High Entropy. Entropy is a powerful and fundamental concept that quietly drives some of the most effective algorithms in machine learning. From decision trees to deep neural networks, entropy plays a central role in helping models navigate uncertainty and make better predictions.

Apr 4, 2025 4 min read NLP, Speech, Machine Learning

LoRA-Whisper: A Scalable and Efficient Solution for Multilingual ASR

1. Background & Motivation: Automatic Speech Recognition (ASR) has made significant strides in recent years, particularly with the rise of large-scale multilingual models like OpenAI’s Whisper, Google USM, and Meta’s MMS.

Mar 15, 2025 3 min read Speech, Automatic Speech Recognition

Vietnamese Voice Conversion

Overview: This thesis develops a voice conversion model for Vietnamese based on the Phoneme Hallucinator model, with two adaptations: (1) a Text2SSL module is added to obtain more context information before the KNN algorithm is applied, and (2) to create a more diverse dataset, the spectrogram-resize (SR) data augmentation idea from the Free-VC model is applied, which distorts speaker information without changing content information and so generates more “speakers”.

Minh Nguyen Le

Mar 9, 2024 1 min read Speech, speech-synthesis, voice-conversion

Generally speaking, the postnet layer receives a mel-spectrogram and predicts another mel-spectrogram with additional information. This makes the output mel-spectrogram more detailed and hence improves the quality of the synthesized audio.

Minh Nguyen Le

Last updated on Aug 25, 2022 2 min read Speech, TTS

Postnet Layer