Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit | Published 2026-03-30 | Updated 2026-03-30

Summary

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, which has gained 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which encompasses capabilities such as automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly garnered attention in the developer community due to its powerful long-audio processing, multi-speaker natural conversation generation, and real-time low-latency features. It has already gained approximately 27K Stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
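To make the "who, when, what" structure concrete, the sketch below models one transcript segment as a small Python record and totals speaking time per speaker. The field names (`speaker`, `start_s`, `end_s`, `text`) and the sample values are illustrative assumptions, not VibeVoice's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment (hypothetical schema, not VibeVoice's real format)."""
    speaker: str    # who is speaking
    start_s: float  # when the segment starts, in seconds
    end_s: float    # when the segment ends, in seconds
    text: str       # what was said


def speaking_time(segments):
    """Return total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end_s - seg.start_s
    return dict(totals)


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's get started."),
    Segment("Speaker 1", 9.0, 12.0, "First item on the agenda."),
]

totals = speaking_time(segments)
```

With speaker labels and timestamps attached to every segment, downstream tasks such as per-speaker summaries or subtitle export become simple list operations.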

Community developers have already built practical tools based on this model, such as a voice input method named Vibing, which supports macOS and Windows platforms. User feedback indicates that its recognition speed and accuracy perform well, significantly improving daily voice input efficiency.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.
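As a rough illustration of preparing multi-speaker input, the sketch below parses a dialogue script and checks it against the four-speaker limit mentioned above. The `Name: utterance` line format and the helper names `parse_script` and `check_speaker_limit` are assumptions made for this example; the model's actual input format may differ.

```python
MAX_SPEAKERS = 4  # limit stated in the article


def parse_script(script: str):
    """Split 'Name: utterance' lines into (speaker, text) pairs."""
    turns = []
    for line in script.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"Missing speaker label: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def check_speaker_limit(turns, limit=MAX_SPEAKERS):
    """Return distinct speakers in order of first appearance, or raise if over the limit."""
    speakers = list(dict.fromkeys(speaker for speaker, _ in turns))
    if len(speakers) > limit:
        raise ValueError(f"{len(speakers)} speakers exceeds limit of {limit}")
    return speakers


# Made-up sample script for illustration only.
script = """
Alice: Welcome back to the show.
Bob: Glad to be here.
Alice: Let's dive in.
"""
turns = parse_script(script)
speakers = check_speaker_limit(turns)
```

Validating the script before generation avoids spending a long synthesis run on input the model cannot voice consistently.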

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
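A quick calculation shows why the low frame rate matters for long audio. The 7.5Hz figure comes from the article; the 50Hz comparison rate is a hypothetical value chosen only to illustrate the scale of the difference, and treating each frame as one sequence step is a simplifying assumption.

```python
def frame_count(minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio at `frame_rate_hz`."""
    return round(minutes * 60 * frame_rate_hz)


frames_at_7_5_hz = frame_count(90, 7.5)  # frame rate stated in the article
frames_at_50_hz = frame_count(90, 50.0)  # hypothetical comparison rate (assumption)
reduction = frames_at_50_hz / frames_at_7_5_hz
```

At 7.5Hz, 90 minutes of audio corresponds to 40,500 frames, against 270,000 at the hypothetical 50Hz rate, so the sequence the model has to handle is roughly 6.7 times shorter.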

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
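The "initial audio output latency" figure is essentially time-to-first-chunk for a streaming generator. The sketch below shows one way to measure it for any iterable of audio chunks. `fake_tts_stream` is a hypothetical stand-in that sleeps and yields dummy bytes; it does not call VibeVoice and its delay is arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_chunk_latency(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrives, the first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_tts_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine: sleeps, then yields dummy bytes."""
    for i in range(chunks):
        time.sleep(delay_s)
        yield bytes([i]) * 4


latency_s, chunk = first_chunk_latency(fake_tts_stream())
```

Swapping the stand-in for a real streaming client would let the same helper check whether a given deployment actually approaches the ~300ms figure on local hardware.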

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down over potential misuse risks and later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Developers can currently obtain the model weights from the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related Questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.

