Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit · Published 2026-03-30 · Last updated 2026-03-30

Summary

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, gaining 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
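To make the "who, when, what" structure concrete, here is a minimal Python sketch of how such a transcription could be represented and rendered downstream. The `Segment` class, its field names, and the "Speaker N" label format are assumptions made for illustration; they are not VibeVoice's actual output schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical schema, not VibeVoice's own)."""
    speaker: str   # who is speaking, e.g. "Speaker 1" (assumed label format)
    start: float   # when the utterance starts, in seconds
    end: float     # when the utterance ends, in seconds
    text: str      # what was said


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: list[Segment]) -> str:
    """Return one line per segment, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the seconds spoken by each speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals
```

With segments in this shape, a meeting tool could print a readable transcript with `render_transcript` or compute per-speaker talk time with `speaking_time`, without re-processing the audio.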

Community developers have already built practical tools on top of this model, such as Vibing, a voice input method for macOS and Windows. User feedback indicates that its recognition is fast and accurate, noticeably improving everyday voice input efficiency.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
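A quick back-of-the-envelope calculation shows why the low frame rate matters for long sequences. The sketch below assumes the 7.5Hz figure means 7.5 tokenizer frames per second of audio; the 50 Hz comparison rate is a hypothetical value chosen only for contrast, not a number from the article.

```python
def token_frames(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover audio of the given length."""
    return round(duration_minutes * 60 * frame_rate_hz)


# 90 minutes at the 7.5 Hz frame rate cited in the article.
vibevoice_frames = token_frames(90, 7.5)

# Same audio at an assumed 50 Hz frame rate (hypothetical comparison point).
comparison_frames = token_frames(90, 50.0)

print(vibevoice_frames)                       # 40500
print(comparison_frames)                      # 270000
print(comparison_frames / vibevoice_frames)   # roughly 6.67x more frames
```

Under these assumptions, a full 90-minute generation spans about 40,500 frames, which is far easier to fit into a single model context than the several hundred thousand frames a higher-rate tokenizer would produce.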

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
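The "initial audio output latency" above is the time between starting a request and receiving the first audio chunk. The following sketch shows one way that metric could be measured for any streaming source. `fake_streaming_tts` is a hypothetical stand-in that yields placeholder bytes, not the VibeVoice API, and the injectable `clock` parameter is an illustrative design choice.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def fake_streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model (not the VibeVoice API).

    Yields one placeholder "audio" chunk per incoming text chunk.
    """
    for chunk in text_chunks:
        yield chunk.encode("utf-8")  # placeholder bytes, not real audio


def measure_first_chunk_latency(
    audio_stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[bytes]]:
    """Consume the stream and return (seconds until first chunk, all chunks).

    The latency is None if the stream produced no audio at all.
    """
    start = clock()
    first_latency: Optional[float] = None
    chunks: List[bytes] = []
    for chunk in audio_stream:
        if first_latency is None:
            first_latency = clock() - start  # time to first audio
        chunks.append(chunk)
    return first_latency, chunks


latency, audio = measure_first_chunk_latency(
    fake_streaming_tts(["Hello, ", "this is ", "a streaming test."])
)
print(f"first chunk after {latency * 1000:.2f} ms, {len(audio)} chunks total")
```

Swapping the stand-in generator for a real streaming model would let the same helper check whether first-chunk latency stays near the roughly 300 ms figure on a given machine.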

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Currently, developers can obtain model weights on the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related Questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.
