Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbitPublished on 2026-03-30Last updated on 2026-03-30

Abstract

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, gaining 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include: - **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification, timestamps, and supports over 50 languages. - **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance. - **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants. The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.

Community developers have already built practical tools based on this model, such as a voice input method named Vibing, which supports macOS and Windows platforms. User feedback indicates that its recognition speed and accuracy perform well, significantly improving daily voice input efficiency.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Currently, developers can obtain model weights on the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Trending Cryptos

Related Questions

QWhat is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

AThe open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

QWhat are the three core models in the VibeVoice family and their primary capabilities?

AThe three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

QWhat is a key feature of the VibeVoice-ASR-7B model regarding its output?

AA key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

QHow does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

AIt uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

QWhat safety measures were implemented in the VibeVoice project to address potential misuse risks?

AThe project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.

Related Reads

Playnance’s $GCOIN Lists on KoinBX Amid Rapid Growth in India

Playnance's native token, $GCOIN, has been listed on the cryptocurrency exchange KoinBX as of June 18. This move aims to enhance accessibility for its rapidly growing community, particularly in India, where the blockchain-powered Web3 iGaming ecosystem has gained significant traction. Over 130 partners in Playnance's "Be the Boss" program have built communities engaging thousands of active players in the region. The "Be the Boss" model allows participants to create and manage their own gaming communities, earning rewards tied to community activity. CEO Pini Peter noted India's high engagement, with community leaders successfully building player networks. One partner, Dr. Nicolas, reported earning over $57,000 through the program in recent months, highlighting both the financial rewards and the opportunity to grow an engaged community. $GCOIN serves as the ecosystem's core utility token, incentivizing participation and aligning the interests of players and community leaders ("Bosses"). The listing on KoinBX is part of Playnance's strategy to expand globally, increasing the token's utility and accessibility by combining community ownership, gamified engagement, and blockchain-based incentives. Founded in 2020, Playnance is a Web3 iGaming infrastructure company focused on creating live, non-custodial, on-chain products to onboard mainstream users. It currently processes approximately one million transactions daily, aiming to simplify the user experience while maintaining full on-chain transparency.

TheNewsCrypto33m ago

Playnance’s $GCOIN Lists on KoinBX Amid Rapid Growth in India

TheNewsCrypto33m ago

STRC Hits Historic Low, Saylor's Perpetual Motion Machine Grinds to a Halt

STRC, the perpetual preferred stock issued by MicroStrategy to fund its Bitcoin purchases, hit a historic low of $85.32, a 17% discount to its $100 par value. Designed as a "digital credit engine" to trade stably near par and enable continuous share issuance for buying Bitcoin, its plunge signals a breakdown in this model. Three key factors drove the decline: 1. Bitcoin's price fell over 50% from its peak, trading around $63,000 amid hawkish Fed signals. 2. MicroStrategy's cash reserves were depleted after a $1.5 billion convertible note repayment, slashing the dividend coverage for STRC's 11.5% yield to ~7 months. The company then sold 32 BTC to cover dividends—Michael Saylor's first Bitcoin sale since 2022—damaging the "never sell" narrative. 3. A competing Bitcoin-backed preferred stock, Strive's SATA, offers a higher yield (~13%) and daily dividends, drawing investors away from STRC. The drop triggers a negative cycle: STRC below par halts ATM share issuances, cutting off a key funding source for Bitcoin buys and potentially forcing more BTC sales for dividends, further eroding confidence. While Saylor argues the model is mathematically sound—needing only 2.3% annual Bitcoin growth to sustain itself—the market is testing the resilience of the leveraged Bitcoin treasury strategy in a bear market. The STRC price now reflects rising skepticism about this financial machinery's durability during downturns.

marsbit54m ago

STRC Hits Historic Low, Saylor's Perpetual Motion Machine Grinds to a Halt

marsbit54m ago

A Guide to Grayscale’s ‘Bottom Fishing’: Using Cash Flow to Assess Cryptocurrency Value

**Title:** Grayscale's Guide to Bottom-Fishing: Valuing Cryptoassets Using Cash Flows **Summary:** This report by Grayscale Research presents a fundamental valuation framework for cryptocurrency assets, moving beyond pure speculation to analyze those with underlying cash flows. It distinguishes between "commodity-like" assets (e.g., Bitcoin) and "cash-flow" assets, primarily within DeFi. Using the leading decentralized lending protocol Aave as a case study, the analysis applies traditional financial methodologies like Discounted Cash Flow (DCF) and Price-to-Earnings (P/E) multiples. Key findings indicate that AAVE tokens are currently undervalued. Despite recent challenges, the protocol's strong revenue growth, ~50% net profit margin, and diversified treasury support a fundamental valuation range of $80-$100 per token (compared to a ~$75 market price at the time of writing). In a base-case scenario driven by stablecoin adoption and regulatory clarity, the fair value could rise to around $175 within a year. The report emphasizes that protocol success does not automatically translate to token value. It critically examines the "value capture" mechanisms—such as buybacks, burns, and staking rewards—that channel protocol profits to token holders. Furthermore, it addresses the legal and governance complexities of Decentralized Autonomous Organizations (DAOs), noting their difference from traditional corporate equity but highlighting how robust, transparent governance can align protocol economics with holder interests. The conclusion is that the crypto market is maturing, with capital increasingly flowing towards projects with demonstrable fundamentals, real adoption, and disciplined capital allocation, creating opportunities for value-based investors.

marsbit2h ago

A Guide to Grayscale’s ‘Bottom Fishing’: Using Cash Flow to Assess Cryptocurrency Value

marsbit2h ago

After semiconductors lead the gains, are funds buying into AI orders or a macroeconomic rebound?

After US-Iran talks led to a temporary ceasefire and framework for reopening the strategic Strait of Hormuz, U.S. stocks rose on June 18, with the Nasdaq gaining 1.9%. The semiconductor and AI hardware sectors outperformed. This rally stemmed primarily from reduced geopolitical risk, which lowered oil prices and inflation expectations, easing discount rate pressure on high-valuation growth stocks like tech. The key question is not whether tech rebounded, but the nature of the rebound. The market appears to be selectively repricing AI infrastructure plays rather than broadly chasing AI narratives. Gains were concentrated in chips, optical interconnects, memory, and domestic manufacturing—segments tied to tangible data center build-outs and capital expenditure. Intel's ~10% surge, fueled by a Trump statement about potential Apple collaboration, exemplifies this mixed dynamic. It reflects policy catalysts and domestic manufacturing sentiment more than confirmed fundamentals. Meanwhile, strong earnings from companies like Astera Labs (revenue up 93% YoY) provided concrete evidence of AI-driven demand in hardware. In essence, the rally represents a risk-premium recalibration. Lower Middle East tensions opened a valuation repair window, and capital flowed first into AI infrastructure segments with visible near-term revenue streams. The sustainability of this move hinges on upcoming Q2 earnings, specifically continued strength in cloud provider capex, AI server orders, and hardware company guidance. Policy hopes alone are insufficient; the cycle needs validation from orders and financials.

marsbit2h ago

After semiconductors lead the gains, are funds buying into AI orders or a macroeconomic rebound?

marsbit2h ago

Trading

Spot
Futures

Hot Articles

How to Buy ONE

Welcome to HTX.com! We've made purchasing Harmony (ONE) simple and convenient. Follow our step-by-step guide to embark on your crypto journey.Step 1: Create Your HTX AccountUse your email or phone number to sign up for a free account on HTX. Experience a hassle-free registration journey and unlock all features.Get My AccountStep 2: Go to Buy Crypto and Choose Your Payment MethodCredit/Debit Card: Use your Visa or Mastercard to buy Harmony (ONE) instantly.Balance: Use funds from your HTX account balance to trade seamlessly.Third Parties: We've added popular payment methods such as Google Pay and Apple Pay to enhance convenience.P2P: Trade directly with other users on HTX.Over-the-Counter (OTC): We offer tailor-made services and competitive exchange rates for traders.Step 3: Store Your Harmony (ONE)After purchasing your Harmony (ONE), store it in your HTX account. Alternatively, you can send it elsewhere via blockchain transfer or use it to trade other cryptocurrencies.Step 4: Trade Harmony (ONE)Easily trade Harmony (ONE) on HTX's spot market. Simply access your account, select your trading pair, execute your trades, and monitor in real-time. We offer a user-friendly experience for both beginners and seasoned traders.

3.9k Total ViewsPublished 2024.03.29Updated 2026.06.02

How to Buy ONE

Discussions

Welcome to the HTX Community. Here, you can stay informed about the latest platform developments and gain access to professional market insights. Users' opinions on the price of ONE (ONE) are presented below.

活动图片