
MaskGCT

MaskGCT is a new approach to text-to-speech (TTS) and voice cloning that simplifies the pipeline by removing the need for explicit alignment between text and speech.

Categories: Voice Cloning, TTS

MaskGCT is a fully non-autoregressive text-to-speech (TTS) model: it generates tokens in parallel rather than one at a time. Unlike earlier non-autoregressive systems, it needs neither explicit alignment between text and speech nor predicted durations for individual speech sounds (phones), a step that can often affect the natural flow of speech.

Key Features:

  • Zero-shot text-to-speech: Generate speech in the voice of a speaker the model was never trained on.

  • Simplified process: MaskGCT eliminates the need for alignment between text and speech, streamlining the training process.

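  • Two-stage model: In the first stage, the model predicts semantic tokens from the text. In the second stage, it predicts acoustic tokens conditioned on those semantic tokens, and the acoustic tokens are used to create the final speech. This allows for efficient speech generation.

  • Mask-and-predict method: During training, the model learns to fill in missing information by predicting masked tokens. During generation, it starts from a fully masked sequence and fills in the tokens in parallel, resulting in high-quality speech generation.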
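
The mask-and-predict idea can be illustrated with a small toy loop. This is only a sketch of the general technique used by masked generative models, not MaskGCT's actual code: the names `MASK`, `toy_predictor`, and `mask_predict_decode`, the cosine unmasking schedule, and the confidence-based commit rule are all assumptions made for illustration, and a random stand-in takes the place of a trained network.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_predictor(tokens, rng, vocab_size):
    """Stand-in for a trained model (assumption): for every masked position,
    propose a token id and a confidence score in [0, 1)."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def mask_predict_decode(length, steps=4, vocab_size=16, seed=0):
    """Start from an all-masked sequence and fill it in over a few parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = toy_predictor(tokens, rng, vocab_size)
        if not proposals:
            break  # nothing left to fill in
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(proposals) - keep_masked)
        # Commit the most confident proposals; the rest are re-predicted next step.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens


if __name__ == "__main__":
    print(mask_predict_decode(length=8))
```

Each step predicts all masked positions at once, which is what makes the procedure non-autoregressive. In a two-stage system of the kind described above, a loop like this would run once to produce semantic tokens from text and again to produce acoustic tokens from the semantic tokens.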

Experiments with 100K hours of real-world speech show that MaskGCT outperforms other zero-shot TTS systems in quality, speaker similarity, and intelligibility.


© 2024 Opendemo