Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit | Published on 2026-03-30 | Last updated on 2026-03-30

Abstract

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, which has gained roughly 27K stars on GitHub, offers long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice is released under the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in speech synthesis. The family has three core models, each with its own focus, that together address the pain points of traditional voice AI: long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model that can process audio files up to 60 minutes long in a single pass and directly output a structured transcription. The output covers three things: who is speaking (speaker identification), when they speak (precise timestamps), and what was said (the transcribed content). The model also accepts custom hotwords, which improve recognition accuracy for proper nouns and technical terms. It supports over 50 languages and is suited to complex scenarios such as long meeting recordings and podcast transcription.
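To make the "who, when, what" structure concrete, here is a minimal sketch of how a downstream tool might hold and summarize such a transcript. The `Segment` class, its field names, and the sample data are assumptions made for illustration; they are not the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed turn: who spoke, when, and what was said."""
    speaker: str    # hypothetical speaker label, e.g. "Speaker 1"
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str       # transcribed content


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the speaking time, in seconds, for each speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_s - seg.start_s)
    return totals


# Made-up sample data spanning a 60-minute recording.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the weekly sync."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's start with the roadmap."),
    Segment("Speaker 1", 3590.0, 3600.0, "That's the hour, see you next week."),
]

for seg in segments:
    print(f"[{format_timestamp(seg.start_s)}] {seg.speaker}: {seg.text}")
print(talk_time(segments))
```

Keeping speaker, time span, and text together in one record is what makes follow-up tasks such as per-speaker summaries or subtitle export straightforward.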

Community developers have already built practical tools on top of this model, such as a voice input method named Vibing, which supports macOS and Windows. User feedback indicates that it is fast and accurate, and that it noticeably improves everyday voice input.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.
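A multi-speaker script has to stay within the four-voice limit before it is sent for synthesis. The sketch below checks that ahead of time. The `Speaker N: text` line format, the `parse_script` helper, and the sample script are assumptions for illustration, not a confirmed input specification.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the article for VibeVoice-TTS-1.5B

# Assumed line format: "Speaker <number>: <text>" (illustrative only).
LINE_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Parse a dialogue script into (speaker_id, text) turns.

    Raises ValueError on a malformed line or too many distinct speakers.
    """
    turns = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text'")
        turns.append((int(match.group(1)), match.group(2)))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}")
    return turns


script = """
Speaker 1: Welcome back to the show.
Speaker 2: Glad to be here.
Speaker 1: Let's get started.
"""
turns = parse_script(script)
print(len(turns), "turns from", len({s for s, _ in turns}), "speakers")
```

Validating the script up front is cheaper than discovering a fifth speaker partway through a long generation.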

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
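A quick calculation shows why the low frame rate matters for long audio. At 7.5 frames per second, 90 minutes (5,400 seconds) of audio corresponds to 40,500 frames. The 50 Hz comparison rate below is an assumed reference value chosen only for illustration, not a figure from the project.

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate stated in the article
COMPARISON_RATE_HZ = 50.0  # assumed reference rate, for illustration only


def frame_count(minutes: float, rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio at `rate_hz`."""
    return round(minutes * 60 * rate_hz)


low = frame_count(90, FRAME_RATE_HZ)
high = frame_count(90, COMPARISON_RATE_HZ)
print(f"7.5 Hz: {low} frames, 50 Hz: {high} frames, ratio: {high / low:.2f}x")
```

Under these assumptions, the sequence the model has to handle for a full 90-minute generation is several times shorter than it would be at the higher rate.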

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.
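The ~300 ms figure quoted for the Realtime model refers to the delay before the first audio arrives, not the time to synthesize the whole utterance. The sketch below shows one way to measure that delay for any chunk-yielding stream. `fake_tts_stream` is a hypothetical stand-in that yields placeholder bytes; it does not call the real model.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_tts_stream(text_chunks: Iterable[str], delay_s: float = 0.0) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine.

    Yields one placeholder "audio" chunk per text chunk after an optional delay.
    """
    for chunk in text_chunks:
        time.sleep(delay_s)          # simulate synthesis time
        yield chunk.encode("utf-8")  # placeholder bytes, not real audio


def first_chunk_latency(
    stream: Iterator[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = clock()
    first = next(stream)
    return clock() - start, first


latency, chunk = first_chunk_latency(fake_tts_stream(["Hello", "world"], delay_s=0.05))
print(f"first chunk after {latency * 1000:.0f} ms, {len(chunk)} bytes")
```

Passing the clock in as a parameter keeps the measurement testable; against a real streaming engine the default `time.perf_counter` would be used.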

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Developers can currently obtain the model weights from the GitHub repository and Hugging Face, and try them out quickly on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to see faster adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related Questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.
