AI Speech Engineer Roadmap: From Zero to Production in 18 Months
The full roadmap is open-source at github.com/leminhnguyen/ai-speech-engineer-roadmap.

After 6 years working in AI Speech — from Vietnamese TTS research to production ASR/VC systems — I’ve distilled the path into a structured, opinionated roadmap. This post maps the entire journey: what to learn, in what order, and why.
Overview
The roadmap is divided into four phases totalling roughly 18 months of focused learning before entering a continuous research mode:
| Phase | Duration | Focus |
|---|---|---|
| #1 Foundations | 3 months | Math, Python, ML, Deep Learning, Signal Processing |
| #2 Tools & Frameworks | 3 months | Libraries, Audio Tools, Hugging Face |
| #3 Core Technologies | 12 months | ASR, TTS, Voice Conversion, Speaker Verification & Diarization |
| #4 Research Trends | Continuous | Audio-Language Models, LLM-era Speech |
The ordering is intentional. Speech AI is both signal processing and deep learning — skipping either foundation creates gaps that become painful later.
Phase 1 — Foundations (3 months)
Before touching a single audio file or model, these fundamentals are non-negotiable:
Python — Start with a solid Python foundation. Corey Schafer’s YouTube series is beginner-friendly and covers everything from syntax to OOP.
Machine Learning — Andrew Ng’s ML Specialization on Coursera gives the mathematical intuition behind supervised learning, gradient descent, and regularization — all of which reappear constantly in speech models.
Deep Learning — 3Blue1Brown’s Neural Networks playlist is the best visual explanation of how neural networks actually work: backpropagation, gradient flow, and why the math makes sense.
Audio Signal Processing — This is where speech diverges from general ML. Valerio Velardo’s Audio ML series covers the key representations you’ll use everywhere:
- Short-Time Fourier Transform (STFT)
- Mel spectrograms
- MFCCs (Mel-frequency cepstral coefficients)
- Waveform vs. frequency domain tradeoffs
Understanding why we use Mel spectrograms instead of raw waveforms — and when we do use raw waveforms — is fundamental to understanding modern model architectures.
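To make the representations above concrete, here is a minimal numpy sketch of the first step in almost every speech pipeline: framing a waveform, windowing, and taking FFT magnitudes (the STFT). A mel spectrogram is just this matrix projected onto mel-spaced triangular filters; in practice you would use librosa or torchaudio rather than hand-rolling it.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window to each frame, and take the
    FFT magnitude. Returns shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# A 440 Hz tone at 16 kHz should peak near bin 440 / (16000 / 512) ≈ 14.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())
```

The frequency resolution here is sr / n_fft = 31.25 Hz per bin, which is exactly the waveform-vs-frequency tradeoff the list above refers to: a longer window sharpens frequency resolution but blurs timing.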
Phase 2 — Tools & Frameworks (3 months)
With foundations solid, the next step is getting fluent with the ecosystem:
Core Libraries
| Library | Purpose |
|---|---|
| PyTorch | Model training, custom architectures |
| librosa | Audio loading, STFT, MFCCs, feature extraction |
| torchaudio | GPU-accelerated audio transforms, pretrained model wrappers |
| ffmpeg / sox / pydub | Audio conversion, resampling, slicing, format handling |
| noisereduce | Quick noise reduction for data preprocessing |
Audacity — Free, cross-platform audio editor. Invaluable for inspecting audio files, checking sample rates, visualizing spectrograms, and debugging data issues.
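Before reaching for any of these tools, it helps to know what a WAV file actually contains. This stdlib-only sketch writes one second of a sine tone as 16-bit mono PCM, then reads the header back — the same sample-rate and bit-depth sanity check you would otherwise do by eye in Audacity.

```python
import math
import os
import struct
import tempfile
import wave

sr = 16000
path = os.path.join(tempfile.gettempdir(), "tone.wav")

# Write one second of a 440 Hz sine at half amplitude, 16-bit mono PCM.
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 2 bytes = 16-bit samples
    w.setframerate(sr)
    samples = (int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sr))
               for n in range(sr))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Read the header back: (channels, bytes per sample, sample rate, frames).
with wave.open(path, "rb") as r:
    info = (r.getnchannels(), r.getsampwidth(), r.getframerate(), r.getnframes())
```

Mismatched sample rates are one of the most common silent data bugs in speech training; checking headers programmatically like this scales better than opening files one by one.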
Hugging Face — The Hugging Face Audio Course is the fastest way to understand how to load speech datasets, fine-tune models, and push to the Hub. Most modern speech models are hosted here.
Phase 3 — Core Technologies (12 months)
This is the heart of the roadmap — 12 months across 5 domains. The order matters: ASR and TTS share many components (acoustic encoders, vocoders), and Speaker Verification is a prerequisite for Diarization.
Transformers — The Bridge
Before diving into any speech domain, read the original Attention is All You Need paper (Vaswani et al., 2017) and Jay Alammar’s illustrated walkthrough. The Transformer architecture is the backbone of virtually every modern speech model.
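The core of the paper fits in a few lines. This numpy sketch implements single-head scaled dot-product attention — softmax(QKᵀ/√d)·V — which is worth internalizing before reading any Conformer or Whisper internals; real implementations add multiple heads, masking, and projections.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q: (n_q, d), K and V: (n_k, d). Returns (output, weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, dim 8
K = rng.normal(size=(6, 8))   # 6 key/value positions
V = rng.normal(size=(6, 8))
out, w = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with the mixing weights computed from query-key similarity — that single idea is the backbone of every model in the sections below.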
🎙 Automatic Speech Recognition
ASR converts speech audio into text. The learning path follows the historical arc of the field:
CTC (Connectionist Temporal Classification) — The key insight that enables training sequence-to-sequence models without explicit alignment between input frames and output characters. Almost every ASR model uses this or a variant.
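The decoding side of CTC is simple enough to sketch directly: merge consecutive repeats, then drop blanks. Training is the harder half (a forward-backward sum over every frame alignment that collapses to the target), but this greedy decode shows why the blank symbol exists — it is what lets "ll" in "hello" survive the repeat-merge.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse frame-level predictions CTC-style: merge consecutive
    repeats first, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Frame sequence "h h - e - l l - l o" (ids stand in for characters:
# blank=0, h=1, e=2, l=3, o=4) collapses to h e l l o.
seq = [1, 1, 0, 2, 0, 3, 3, 0, 3, 4]
decoded = ctc_greedy_decode(seq)
```

Note that without the blank between the two runs of `3`, the repeat-merge would have produced a single "l" — which is exactly the alignment ambiguity CTC was designed to resolve.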
SpecAugment (2019) — A simple but highly effective data augmentation technique: randomly mask time and frequency bands in the spectrogram. Still widely used today.
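The masking half of SpecAugment (everything except time warping, which most implementations skip anyway) is a few lines of numpy — a sketch, with mask counts and widths as illustrative defaults rather than the paper's exact settings:

```python
import numpy as np

def spec_augment(spec, n_time_masks=1, n_freq_masks=1, max_width=8, rng=None):
    """Zero out random frequency rows and time columns of a (freq, time)
    spectrogram -- the core of SpecAugment."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, n_freq - w + 1))
        spec[f0 : f0 + w, :] = 0.0      # mask a band of mel channels
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, n_time - w + 1))
        spec[:, t0 : t0 + w] = 0.0      # mask a span of frames
    return spec

mel = np.ones((80, 100))                # 80 mel bins, 100 frames
aug = spec_augment(mel, rng=np.random.default_rng(0))
```

Because the masking happens on the spectrogram rather than the waveform, it costs almost nothing at training time — part of why it remains a default augmentation.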
Wav2Vec 2.0 (2020) — Facebook’s self-supervised model that learns speech representations from unlabeled audio, then fine-tunes on a small labeled dataset. Showed that fine-tuning on as little as 10 minutes of labeled data can outperform earlier systems trained with 100× more labels.
Whisper (2022) — OpenAI’s large-scale multilingual ASR model trained on 680,000 hours of weakly-supervised web data. The go-to baseline for any new ASR project. Handles noise, accents, and multilingual transcription remarkably well.
Fast Conformer (2023) — NVIDIA’s efficient Conformer variant using depthwise separable convolutions, achieving near-SOTA accuracy with significantly lower latency.
For practical training, SpeechBrain provides clean, well-documented recipes for common ASR tasks.
🗣 Text-to-Speech
TTS converts text into natural-sounding speech. It’s a richer engineering domain than ASR — involving acoustic model design, vocoder selection, and prosody control.
WaveNet (2016) — DeepMind’s autoregressive waveform generation model. Not practical for production (thousands of autoregressive steps), but foundational as the first neural vocoder to produce natural-sounding speech.
Tacotron (2017) — The first practical end-to-end TTS system: text → spectrogram using seq2seq with attention. Paired with WaveNet as a vocoder, it produced the first convincingly natural TTS.
FastSpeech 2 (2020) — Non-autoregressive TTS using explicit duration, pitch, and energy predictors. Much faster inference than autoregressive models, with controllable prosody.
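The mechanism that makes FastSpeech non-autoregressive is the length regulator: each phoneme's hidden state is repeated by its predicted duration to produce the frame-level sequence. Under the assumption of integer frame counts per phoneme, it is one numpy call:

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """FastSpeech-style length regulator: expand per-phoneme hidden states
    by predicted frame counts. (n_phones, d) + (n_phones,) ->
    (sum(durations), d)."""
    return np.repeat(phoneme_states, durations, axis=0)

states = np.arange(6).reshape(3, 2)              # 3 phonemes, 2-dim states
frames = length_regulate(states, np.array([2, 1, 3]))
```

Because the full frame sequence length is known up front, the decoder can generate every frame in parallel — which is exactly where the inference speedup over Tacotron-style autoregression comes from.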
HiFi-GAN (2020) — A GAN-based vocoder (spectrogram → waveform) that produces high-fidelity audio at real-time speeds. Replaced WaveNet/WaveGlow as the standard vocoder choice.
VITS (2021) — End-to-end TTS combining a VAE with adversarial training and a flow-based decoder. One model handles everything from text to waveform — no separate vocoder needed. Still a strong production baseline.
Kokoro TTS (2024) — A compact 82M-parameter model based on StyleTTS2, achieving near-SOTA quality at a fraction of the computational cost. Excellent for resource-constrained deployment.
For Vietnamese specifically, Viphoneme and Text2PhonemeSequence are essential for grapheme-to-phoneme (G2P) conversion.
🔐 Speaker Verification
Speaker Verification answers “Is this the same person?” — matching or verifying a speaker’s identity from a short audio clip.
X-vector (2017) — TDNN-based speaker embeddings trained with cross-entropy on speaker ID. Became the standard speaker representation for most systems.
ECAPA-TDNN (2020) — Emphasized Channel Attention, Propagation and Aggregation in TDNN. Introduced SE-blocks, multi-scale feature aggregation, and dense connections. Became the dominant architecture for speaker verification benchmarks.
CAM++ (2023) — Context-aware masking with a lightweight backbone. Achieves ECAPA-level accuracy at lower compute cost — a good default for production systems.
ERes2NetV2 / RedimNet (2024) — Current SOTA approaches, combining efficient ResNet-style backbones with more sophisticated pooling and aggregation strategies.
The key dataset for training and benchmarking is VoxCeleb, with over 1M utterances from 7,000+ speakers.
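Whatever the embedding architecture, the verification decision itself is usually just cosine similarity against a threshold. A sketch with synthetic 192-dim embeddings standing in for ECAPA-style vectors (the 0.5 threshold is illustrative; real systems tune it on a development set):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings; a trial is
    accepted when the score exceeds a tuned threshold."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

rng = np.random.default_rng(1)
speaker = rng.normal(size=192)               # enrollment embedding
same = speaker + 0.1 * rng.normal(size=192)  # another clip, same voice
other = rng.normal(size=192)                 # a different speaker
threshold = 0.5
accept_same = cosine_score(speaker, same) > threshold
accept_other = cosine_score(speaker, other) > threshold
```

Moving the threshold trades false acceptances against false rejections, which is why speaker verification results are reported as equal error rate (EER) rather than plain accuracy.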
👥 Speaker Diarization
Speaker Diarization answers “Who spoke when?” — segmenting an audio recording by speaker identity. It is a core component in meeting transcription, call center analytics, and multi-speaker ASR.
pyannote.audio (2019) — The foundational open-source framework for neural diarization. Provides building blocks (VAD, segmentation, embedding, clustering) that are still widely used.
Multi-Scale Diarization / NeMo (2022) — Addresses the single-scale limitation by extracting speaker embeddings at multiple segment lengths and combining them with a learned multi-scale diarization decoder (MSDD). Significantly improves accuracy on short speaker turns.
DiarizationLM (2024) — Uses an LLM as a post-processing step to refine diarization output by leveraging transcript-level coherence. Shows the emerging pattern of using LLMs to correct structured outputs from specialized models.
Sortformer (2024) — Integrates diarization and ASR into a single end-to-end framework. Solves the speaker label permutation problem with Sort Loss: speakers are always ordered by their first appearance time (Arrival Time Order), eliminating expensive permutation-invariant training.
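Arrival Time Order itself is easy to demonstrate. This sketch (not Sortformer's training code, just the relabeling convention behind Sort Loss) maps arbitrary speaker labels to a canonical order by first appearance:

```python
def arrival_time_order(frame_labels):
    """Relabel speakers by first-appearance time: whoever speaks first
    becomes speaker 0, the next new voice speaker 1, and so on."""
    mapping = {}
    for lab in frame_labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
    return [mapping[lab] for lab in frame_labels]

# Arbitrary clustering labels -> canonical arrival order.
relabeled = arrival_time_order(["B", "B", "A", "B", "C", "A"])
```

With a canonical ordering fixed, the model's output can be compared to the reference directly — there is no longer a need to search over all label permutations at training time, which is what made permutation-invariant training expensive.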
Streaming Sortformer (2025) — Extends Sortformer to online/streaming settings, achieving SOTA on real-time diarization benchmarks.
For a deeper dive into the diarization landscape, see my earlier post: Speaker Diarization: From Traditional Methods to the Modern Models.
🎭 Voice Conversion
Voice Conversion (VC) transforms a speaker’s voice to sound like a target speaker while preserving linguistic content. It’s the enabling technology behind voice cloning, dubbing, and privacy-preserving speech.
AutoVC (2019) — Zero-shot style transfer using an autoencoder with a carefully designed information bottleneck: content encoder is forced to discard speaker identity while a separate speaker encoder captures voice characteristics.
kNN-VC (2023) — Surprisingly simple but effective: replace source speech features with nearest-neighbor matches from the target speaker’s feature space. No adversarial training, no complex architecture — just nearest-neighbor matching works remarkably well.
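The entire kNN-VC conversion step fits in a few lines. This sketch uses random vectors in place of the self-supervised features (the paper uses WavLM features, and a vocoder then turns the converted frames back into audio):

```python
import numpy as np

def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest frames
    in the target speaker's feature set."""
    # Pairwise squared distances between frames: (n_src, n_tgt).
    d2 = ((source_feats[:, None, :] - target_feats[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of k closest frames
    return target_feats[nearest].mean(axis=1)    # average the matches

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 16))    # source utterance frames (stand-ins)
tgt = rng.normal(size=(200, 16))   # target speaker's reference frames
converted = knn_convert(src, tgt)
```

The method works because self-supervised features like WavLM's encode phonetic content in a way that is largely shared across speakers, so the nearest target-speaker frames carry the same content in the target's voice.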
Seed-VC (2024) — Robust zero-shot VC and singing voice conversion using a diffusion-based backbone with a content encoder disentangled from speaker identity. Handles challenging cases like singing and cross-lingual conversion.
Phase 4 — Research Trends (Continuous)
The field is now entering an Audio Language Model era, where speech models are either integrated into or built on top of large language models.
Qwen-Audio (2023) — Universal audio understanding through a multimodal LLM that accepts audio and text as joint input.
CosyVoice / CosyVoice 3 (2024–2025) — Alibaba’s multilingual zero-shot TTS system. CosyVoice 3 supports streaming synthesis and voice cloning with high naturalness across 50+ languages.
F5-TTS (2024) — Flow matching-based TTS that achieves high quality without adversarial training, with a cleaner and more stable training procedure than GAN-based systems.
Qwen3-ASR / Qwen3-TTS (2026) — Current SOTA from Alibaba: 52-language ASR and multilingual streaming TTS with voice cloning capability. Represent the convergence of speech and large-scale language modeling.
The pattern across all these models is consistent: speech-specific architectures are being replaced by general sequence models (Transformers, diffusion, flow matching) trained at scale, with speech-specific inductive biases encoded in the data preparation and tokenization rather than the architecture.
Conclusion
The 18-month roadmap boils down to a simple progression:
Signals → Features → Representations → Tasks → Scale
Start with understanding what audio is (signals, spectrograms). Learn to extract useful representations (STFT, MFCCs, embeddings). Study the major task-specific architectures. Then follow the field as it scales up and converges with LLMs.
The materials — including competition papers from VLSP 2021 and 2025 — are all openly available in the roadmap repository. Each section links to the specific papers and tutorials I personally found most valuable.
Speech AI is one of the richest intersections of signal processing, deep learning, and linguistics. The 18-month path is challenging but the field rewards investment — production speech systems are everywhere and the gap between research and real-world application is smaller than in most AI domains.
