Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbitОпубліковано о 2026-03-30Востаннє оновлено о 2026-03-30

Анотація

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, gaining 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include: - **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification, timestamps, and supports over 50 languages. - **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance. - **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants. The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.

Community developers have already built practical tools based on this model, such as a voice input method named Vibing, which supports macOS and Windows platforms. User feedback indicates that its recognition speed and accuracy perform well, significantly improving daily voice input efficiency.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Currently, developers can obtain model weights on the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Пов'язані питання

QWhat is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

AThe open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

QWhat are the three core models in the VibeVoice family and their primary capabilities?

AThe three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

QWhat is a key feature of the VibeVoice-ASR-7B model regarding its output?

AA key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

QHow does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

AIt uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

QWhat safety measures were implemented in the VibeVoice project to address potential misuse risks?

AThe project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.

Пов'язані матеріали

Eight Years of Entrepreneurship Notes from a16z's AI Partner

An early generative AI entrepreneur reflects on his 8-year journey building Rosebud AI, founded in 2018—a time when the field was still called “synthetic media.” Initially experimenting with models like CycleGAN and StyleGAN, he believed AI could make creation as intuitive as playing a game. Over the years, his team launched multiple products, including the viral app TokkingHeads, which gained 2 million users, learning to design around imperfect model outputs to deliver “good enough” user experiences. The evolution from niche synthetic media to general-purpose AI infrastructure—especially after GPT-4’s release—reshaped product possibilities. Code generation matured enough by 2023 to enable text-to-game prototyping. The author emphasizes that the real differentiator now isn’t just model capability but product design, distribution, and business model innovation. Having stepped down as CEO of Rosebud AI, he joins a16z as a partner focused on investing in the frontier model stack—models, infrastructure, and tools. He remains optimistic about AI-driven progress in creative tools, coding, and scientific domains. The piece concludes with a forward-looking note: the next phase of AI will be less about what’s possible and more about how capabilities are productized and scaled in the real world.

marsbit27 хв тому

Eight Years of Entrepreneurship Notes from a16z's AI Partner

marsbit27 хв тому

How Many Tokens Away Is Yang Zhilin from the 'Moon Chasing the Light'?

The article explores the intense competition between two leading Chinese AI companies, DeepSeek and Kimi (Moon Dark Side), and the mounting pressure on Yang Zhilin, the founder of Kimi. While DeepSeek re-emerged after 15 months of silence with its powerful V4 model—boasting 1.6 trillion parameters and low-cost, long-context capabilities—Kimi has been focusing on long-context processing and multi-agent systems with its K2.6 model. Yang faces a threefold challenge: technological rivalry, commercialization pressure, and investor expectations. Despite Kimi’s high valuation (reaching $18 billion), its revenue heavily relies on a single product with low paid conversion rates, while DeepSeek’s strategic silence and open-source influence have strengthened its market position and valuation prospects, now targeting over $20 billion. Both companies reflect broader trends in China’s AI ecosystem: Kimi aims for global influence through open-source contributions and agent-based advancements, while DeepSeek prioritizes foundational innovation and hardware independence, notably shifting to Huawei’s chips. Their competition is seen as vital for China’s AI progress, with the gap between top Chinese and U.S. models narrowing to just 2.7% on the Elo rating scale. Ultimately, the article argues that this rivalry, though anxiety-inducing for leaders like Zhilin, is essential for driving innovation and solidifying China’s role in the global AI landscape.

marsbit1 год тому

How Many Tokens Away Is Yang Zhilin from the 'Moon Chasing the Light'?

marsbit1 год тому

TechFlow Intelligence Bureau: ChatGPT Helps Amateur Mathematician Crack 60-Year-Old Problem, CFTC Sues New York Regulator Over Coinbase and Gemini

An amateur mathematician, with the assistance of ChatGPT, has solved a combinatorial mathematics puzzle originally proposed by Hungarian mathematician Paul Erdős in the 1960s. This marks another milestone in AI-aided mathematical research, demonstrating the evolving capabilities of large language models in formal reasoning. In other AI developments, OpenAI introduced a new privacy filter tool for enterprise API usage, automatically screening sensitive data. Meanwhile, the Qwen3.6-27B model achieved 100 tokens per second on a single RTX 5090 GPU using quantization, significantly lowering the cost barrier for local AI deployment. In crypto and Web3, the U.S. CFTC sued New York’s financial regulator, challenging its oversight of Coinbase and Gemini—a first-of-its-kind federal-state regulatory clash. Following a vulnerability, KelpDAO and major DeFi protocols established a recovery fund. Tether froze $344 million in assets linked to Iran’s central bank upon U.S. Treasury request, highlighting the centralized control risks in stablecoins. Separately, Litecoin underwent a 3-hour chain reorganization to undo a privacy-layer exploit. In the U.S., former President Trump invoked the Defense Production Act to address power grid bottlenecks affecting AI data centers and dismissed the entire National Science Board, raising concerns over research independence. A retail trader gained 250% on a $600k Intel options bet amid AI-related speculation. Xiaomi announced its first performance electric vehicle, targeting rivals like Tesla. Meanwhile, iPhone users reported devices automatically reinstalling a hidden app daily, suspected to be MDM-related. A Chinese securities report noted that A-share institutional crowding has reached its second-longest streak since 2007, signaling high valuations and potential style rotation. The day’s developments reflect a dual narrative: AI is enabling unprecedented individual breakthroughs, while centralized power structures—whether governmental or corporate—are becoming more assertive, underscoring that decentralization is as much a political-economic challenge as a technical one.

marsbit1 год тому

TechFlow Intelligence Bureau: ChatGPT Helps Amateur Mathematician Crack 60-Year-Old Problem, CFTC Sues New York Regulator Over Coinbase and Gemini

marsbit1 год тому

Iran’s Crypto Lifeline Hit As US Freezes $344 Million In Funds

The US Treasury has frozen $344 million in cryptocurrency tied to Iranian military and political groups, specifically targeting two Tron blockchain wallets connected to the Islamic Revolutionary Guard Corps and Hizballah. This action follows reports that Iran was collecting Bitcoin payments from ships for safe passage through the Strait of Hormuz. The freeze occurred just one day after stablecoin issuer Tether locked the same amount of funds at the request of US law enforcement. The move is part of broader US efforts to disrupt Iran's ability to generate and move funds, highlighting the limitations of using cryptocurrency as a sanctions workaround when centralized issuers comply with enforcement actions.

bitcoinist2 год тому

Iran’s Crypto Lifeline Hit As US Freezes $344 Million In Funds

bitcoinist2 год тому

Amidst the Volatility in the Crypto Market, Who is Buying Against the Trend?

"In a volatile crypto market downturn of Q1 2026, institutional players demonstrated divergent strategies. While Bitcoin fell over 25% and Ethereum dropped 35%, key institutions accumulated: corporate treasuries like Strategy (MicroStrategy) added over $10 billion in BTC, sovereign wealth funds like Mubadala increased IBIT holdings by 46%, and major banks launched crypto services and ETFs. Notably, 26 new crypto ETFs were filed or launched under streamlined SEC rules, including bank-led products from Morgan Stanley and BlackRock’s staking ETH ETF. VC funding saw concentration: total investment reached ~$5 billion, but deal count plummeted 49%. Three mega-deals—BVNK ($1.8B), Kalshi ($1B), and Polymarket ($600M)—accounted for half the quarter’s funding. Investment shifted from DeFi/NFTs to payments, prediction markets, and regulated CeFi infrastructure. Selling pressure came from hedge funds (e.g., Brevan Howard cut IBIT by 85%) and Bitcoin miners liquidating holdings. The U.S. strategic Bitcoin reserve saw zero additions, highlighting that real buying came from corporations and sovereign funds, not governments. The institutional crypto landscape is bifurcating: long-term holders accumulate while short-term traders exit, signaling a foundation for the next cycle."

marsbit2 год тому

Amidst the Volatility in the Crypto Market, Who is Buying Against the Trend?

marsbit2 год тому

Торгівля

Спот

Ф'ючерси

Обговорення

Ласкаво просимо до спільноти HTX. Тут ви можете бути в курсі останніх подій розвитку платформи та отримати доступ до професійної ринкової інформації. Нижче представлені думки користувачів щодо ціни ONE (ONE).

Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

Анотація

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

Пов'язані питання

Пов'язані матеріали

Eight Years of Entrepreneurship Notes from a16z's AI Partner

How Many Tokens Away Is Yang Zhilin from the 'Moon Chasing the Light'?

TechFlow Intelligence Bureau: ChatGPT Helps Amateur Mathematician Crack 60-Year-Old Problem, CFTC Sues New York Regulator Over Coinbase and Gemini

Iran’s Crypto Lifeline Hit As US Freezes $344 Million In Funds

Amidst the Volatility in the Crypto Market, Who is Buying Against the Trend?

Торгівля

Популярні статті

Як купити ONE

Обговорення