Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit | Published on 2026-03-30 | Last updated on 2026-03-30

Abstract

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for automatic speech recognition (ASR) and text-to-speech (TTS). The project, which has gained roughly 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300 ms latency for interactive applications such as voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features such as audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which covers both automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly drawn attention in the developer community for its long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features, and has already gained approximately 27K stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
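To make "structured transcription" concrete, here is a minimal sketch of parsing a segment in a hypothetical `[HH:MM:SS - HH:MM:SS] Speaker N: text` format. The actual output schema of VibeVoice-ASR-7B is defined by the project itself; this format is an assumption for illustration only.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds from start of audio
    end: float    # seconds from start of audio
    text: str

# Hypothetical line format -- NOT the actual VibeVoice output schema.
LINE_RE = re.compile(
    r"\[(\d+):(\d{2}):(\d{2}) - (\d+):(\d{2}):(\d{2})\] (Speaker \d+): (.*)"
)

def parse_line(line: str) -> Segment:
    """Parse one '[HH:MM:SS - HH:MM:SS] Speaker N: text' line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    h1, m1, s1, h2, m2, s2, speaker, text = m.groups()

    def to_sec(h: str, m: str, s: str) -> int:
        return int(h) * 3600 + int(m) * 60 + int(s)

    return Segment(speaker, to_sec(h1, m1, s1), to_sec(h2, m2, s2), text)

example = "[00:01:05 - 00:01:12] Speaker 2: Let's review the roadmap."
seg = parse_line(example)
```

A structured representation like this is what makes downstream tasks such as per-speaker summaries or subtitle generation straightforward for hour-long recordings.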

Community developers have already built practical tools based on this model, such as a voice input method named Vibing, which supports macOS and Windows platforms. User feedback indicates that its recognition speed and accuracy perform well, significantly improving daily voice input efficiency.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
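The efficiency gain from the 7.5 Hz frame rate can be illustrated with simple arithmetic. The 50 Hz baseline below is an assumption standing in for a conventional higher-frame-rate speech tokenizer, used only to show the scale of the reduction for a 90-minute generation.

```python
# Token-count comparison for a 90-minute generation at the 7.5 Hz
# frame rate stated above; 50 Hz is an assumed conventional baseline.
MINUTES = 90
SECONDS = MINUTES * 60  # 5,400 seconds

frames_vibevoice = SECONDS * 7.5  # frames at 7.5 Hz
frames_baseline = SECONDS * 50    # frames at the assumed 50 Hz baseline

reduction = frames_baseline / frames_vibevoice  # how many times fewer frames
```

At 7.5 Hz, 90 minutes of audio corresponds to 40,500 frames instead of 270,000, a roughly 6.7x shorter sequence, which is what makes single-pass long-form generation computationally tractable.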

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
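The streaming pattern behind such low first-chunk latency can be sketched with a generator loop: audio is emitted as soon as each text chunk arrives, rather than after the full text is available. `synthesize_chunk` below is a local placeholder, not part of the VibeVoice API.

```python
import time
from typing import Iterator

def synthesize_chunk(text: str) -> bytes:
    # Placeholder standing in for a real model call -- NOT the VibeVoice API.
    return text.encode("utf-8")

def stream_tts(text_stream: Iterator[str]) -> Iterator[bytes]:
    """Yield an audio chunk as soon as each text chunk arrives."""
    for chunk in text_stream:
        yield synthesize_chunk(chunk)

start = time.monotonic()
chunks = stream_tts(iter(["Hello, ", "world."]))
first = next(chunks)  # first audio chunk, available before later text arrives
first_latency = time.monotonic() - start
rest = list(chunks)
```

The key property is that `first` is produced without waiting for the rest of the input, which is what keeps the time-to-first-audio short in interactive use.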

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down over potential misuse risks but was later re-released with safety mechanisms such as embedded audio watermarks and audible disclaimers, reflecting the principles of responsible AI development. Developers can obtain the model weights from the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related Questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300 ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5 Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.

