A curated 18-month learning roadmap for becoming an AI Speech Engineer — covering foundations, core technologies (ASR, TTS, Speaker Verification, Diarization, Voice Conversion), and the latest Audio Language Models, distilled from 6 years of hands-on experience.
An analysis of why language models hallucinate — hallucinations arise from statistical pressures in training and evaluation procedures that reward guessing over acknowledging uncertainty.
A guide to writing technical content with the Academic theme — highlighting code snippets, rendering math equations, and drawing diagrams from text.
Speaker Diarization answers “Who spoke when?” — covering core concepts, traditional and modern end-to-end approaches, and the latest Sortformer model for speaker segmentation.
Understanding entropy and why it’s a core concept in decision trees, neural networks, and loss functions like cross-entropy.
Exploring LoRA-Whisper, a scalable and efficient approach for multilingual ASR using Low-Rank Adaptation to fine-tune OpenAI’s Whisper model while avoiding catastrophic forgetting across languages.
FlashAttention is a groundbreaking optimization technique for computing attention in Transformer models, drastically improving GPU memory efficiency by restructuring the inner and outer loops of the attention computation.
An overview of adversarial attacks on large language models (LLMs) — how manipulated inputs can deceive models into generating harmful or incorrect outputs, covering key attack types, implications, and defense strategies.
A detailed summary of the GLiNER paper, introducing a lightweight, scalable, and highly effective model for open-type named entity recognition using bidirectional transformers with zero-shot generalization.
A curated collection of bash functions, troubleshooting commands, and performance tweaks that I often use in my daily workflow.