Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit | Published on 2026-03-30 | Last updated on 2026-03-30

Abstract

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, gaining 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification, timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
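To make the "who, when, what" structure concrete, here is a minimal Python sketch of how such a transcript could be represented and rendered. The `Segment` class, its field names, and the speaker labels are hypothetical illustrations; the article does not describe the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry. Field names are hypothetical, not VibeVoice's schema."""
    speaker: str    # "who is speaking"
    start_s: float  # "when they speak": start time in seconds
    end_s: float    # end time in seconds
    text: str       # "what was said"


def fmt_ts(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Render segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda seg: seg.start_s)
    return "\n".join(
        f"[{fmt_ts(seg.start_s)}-{fmt_ts(seg.end_s)}] {seg.speaker}: {seg.text}"
        for seg in ordered
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time in seconds per speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_s - seg.start_s)
    return totals


# Made-up sample data for illustration only.
sample = [
    Segment("Speaker 2", 4.0, 9.5, "Thanks, glad to be here."),
    Segment("Speaker 1", 0.0, 4.0, "Welcome to the show."),
]
print(render(sample))
print(talk_time(sample))
```

Keeping speaker, timing, and text together in one record is what makes downstream tasks such as meeting minutes or per-speaker statistics straightforward.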

Community developers have already built practical tools on this model, such as Vibing, a voice input method for macOS and Windows. Users report that it recognizes speech quickly and accurately, significantly improving the efficiency of everyday voice input.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech. It can produce up to 90 minutes of continuous audio in a single generation and simulate natural dialogue among up to 4 different speakers. The generated speech is expressive and fluent, with realistic pauses, emphasis, and emotional transitions, which makes it well suited to podcasts, long-form audio narratives, audiobooks, and multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
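A rough back-of-the-envelope calculation shows why the low frame rate matters for long audio. The sketch below counts frames per tokenizer stream for a 90-minute recording at 7.5Hz and, for contrast, at 50Hz. The 50Hz figure is an assumption chosen only for comparison and does not come from the article; text tokens and other context are ignored.

```python
def frame_count(minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio at a given frame rate."""
    return round(minutes * 60 * frame_rate_hz)


VIBEVOICE_HZ = 7.5    # frame rate stated in the article
COMPARISON_HZ = 50.0  # assumed rate for a higher-frame-rate tokenizer (illustrative only)

frames_low = frame_count(90, VIBEVOICE_HZ)
frames_cmp = frame_count(90, COMPARISON_HZ)

print(frames_low)               # 40500
print(frames_cmp)               # 270000
print(frames_cmp / frames_low)  # about 6.67 times more frames at the assumed rate
```

At 7.5Hz, a 90-minute session needs 40,500 frames per stream, a sequence length far more tractable than the 270,000 frames the assumed 50Hz rate would require.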

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
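For developers who want to check a first-audio latency figure like this in their own setup, the following sketch measures time to first chunk for any streaming source. `fake_tts` is a hypothetical stand-in generator, not the VibeVoice API; a real test would substitute whatever streaming interface the deployed model exposes.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_chunk(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume a stream of audio chunks.

    Returns (seconds until the first chunk arrived, total number of chunks).
    The latency is None if the stream produced no chunks.
    """
    start = clock()
    first_latency: Optional[float] = None
    chunks = 0
    for _chunk in stream:
        if first_latency is None:
            first_latency = clock() - start
        chunks += 1
    return first_latency, chunks


def fake_tts(text_pieces: Iterable[str], delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in synthesizer: yields one dummy chunk per text piece."""
    for piece in text_pieces:
        time.sleep(delay_s)            # simulate per-chunk synthesis time
        yield piece.encode("utf-8")    # placeholder bytes, not real audio


latency, n_chunks = time_to_first_chunk(fake_tts(["Hello, ", "world."], delay_s=0.01))
print(f"first chunk after {latency * 1000:.0f} ms, {n_chunks} chunks total")
```

Measuring time to the first chunk rather than total generation time matches how the article frames the ~300ms figure: what matters for interactive use is how soon playback can begin.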

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Currently, developers can obtain model weights on the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice
