Key Responsibilities:
- Train mel-spectrogram and vocoder models for speech synthesis
- Measure and benchmark model performance across use cases
- Maintain and enhance text-to-speech evaluation systems
- Analyze model accuracy and bias, and recommend improvements
- Improve processes related to speech data preparation, augmentation, and filtering
- Develop and refine training datasets for speech models
- Characterize performance and quality metrics across different platforms
- Collaborate with cross-functional teams to deliver new product features
- Participate in code development, design reviews, and test planning
- Identify issues, propose solutions, and contribute to continuous innovation
Required Qualifications:
- Master’s degree or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, Applied Mathematics, Linguistics, or Computational Linguistics, or equivalent experience
- Minimum of 5 years of relevant experience
- Strong programming skills in Python
- Solid understanding of programming fundamentals and software design
- Deep knowledge of machine learning and deep learning techniques, including CNNs, RNNs, LSTMs, and Transformers
- Experience applying deep learning to speech synthesis, large language models, and speech-to-speech translation
- Hands-on experience with speech technologies such as speech synthesis and voice cloning
- Experience training speech models
- Proficiency with the PyTorch deep learning framework
- Knowledge of speech signal processing techniques, including FFT, MFCC, and mel spectrograms
- Familiarity with version control and code review tools such as Git, Gerrit, or GitLab
- Strong collaboration and communication skills in a matrixed environment
Preferred Qualifications:
- Fluency in one or more languages such as Spanish, Mandarin, German, Japanese, Russian, French, Arabic, Hindi, Korean, Italian, or Portuguese
- Experience with multilingual or code-switched text-to-speech systems
- Experience with voice cloning and cross-lingual voice cloning
- Knowledge of text normalization and inverse text normalization using neural networks or weighted finite-state transducers (WFSTs)
- Experience working with grapheme-to-phoneme systems for multiple languages
- Interest in linguistics, phonetics, and language technologies
- Strong C++ programming skills
- Familiarity with GPU technologies such as CUDA, cuDNN, or TensorRT
- Experience deploying machine learning models to cloud, data center, or embedded systems
