Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit · Published 2026-03-30 · Last updated 2026-03-30

Summary

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for automatic speech recognition (ASR) and text-to-speech (TTS). The project, which has gained about 27K stars on GitHub, offers long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive multi-speaker conversations (up to 4 voices) of up to 90 minutes with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced a cutting-edge voice AI model family named VibeVoice, which covers automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly drawn attention in the developer community for its long-audio processing, natural multi-speaker conversation generation, and low-latency real-time features. It has already gained approximately 27K stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in one go, directly outputting structured transcription results. The output includes not only "who is speaking" (speaker identification) and "when they speak" (precise timestamps), but also "what was said" (detailed content), and supports custom hotwords to effectively improve the recognition accuracy of proper nouns or technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
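The "who, when, what" structure described above can be pictured as a list of speaker-attributed segments. The following is a minimal sketch of how a downstream tool might hold and render such output. The `Segment` class, its field names, and the helper functions are hypothetical illustrations, not the model's actual output schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed turn: who spoke, when, and what was said (hypothetical schema)."""
    speaker: str   # speaker label, e.g. "Speaker 1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed content


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript, one line per turn, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the speaking time in seconds for each speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 65.0, 70.5, "Agreed."),
        Segment("Speaker 1", 0.0, 65.0, "Welcome to the meeting."),
    ]
    print(render(demo))
    print(talk_time(demo))
```

Keeping speaker, timing, and text together in one record is what makes tasks such as meeting minutes or per-speaker statistics straightforward once a transcript is available.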

Community developers have already built practical tools on this model, such as a voice input method named Vibing, which supports macOS and Windows. User feedback indicates that it is fast and accurate, noticeably improving everyday voice input.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is the core text-to-speech model, capable of producing up to 90 minutes of continuous audio in a single generation and simulating natural dialogue among up to 4 different speakers. The generated speech is expressive and fluent, with realistic pauses, emphasis, and emotional transitions, which makes it well suited to podcasts, long-form audio narratives, audiobooks, and multi-character dialogue.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS makes significant progress on long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5 Hz), greatly improving computational efficiency for long sequences.
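To see why a low frame rate matters for long audio, a quick back-of-the-envelope calculation helps. The sketch below counts how many frames 90 minutes of audio would occupy at 7.5 Hz. The 50 Hz comparison rate is an illustrative assumption chosen for contrast, not a figure from the article or from any specific model.

```python
FRAME_RATE_HZ = 7.5          # frame rate stated in the article
COMPARISON_RATE_HZ = 50.0    # assumed higher rate, for illustration only


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `minutes` of audio at the given frame rate."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    low = frames_for(90)                       # 90 min * 60 s * 7.5 Hz
    high = frames_for(90, COMPARISON_RATE_HZ)  # 90 min * 60 s * 50 Hz
    print(f"7.5 Hz: {low} frames")
    print(f"50 Hz (assumed): {high} frames")
    print(f"ratio: {high / low:.2f}x")
```

At 7.5 Hz, a full 90-minute generation spans 40,500 frames, whereas the assumed 50 Hz rate would need 270,000, so the sequence the model must handle is several times shorter.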

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
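The "initial audio output latency" mentioned above is the time between sending the first text and receiving the first audio chunk. The following sketch shows one way to measure it for any streaming synthesizer. `fake_streaming_tts` is a hypothetical stand-in that emits silence; it is not the VibeVoice API, and the function names are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model.

    Yields one audio chunk per text chunk: two zero bytes (one 16-bit
    silent sample) per input character.
    """
    for chunk in text_chunks:
        yield b"\x00\x00" * len(chunk)


def first_chunk_latency(
    stream: Iterator[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (seconds until the first audio chunk arrived, that chunk)."""
    start = clock()
    first = next(stream)  # blocks until the synthesizer produces audio
    return clock() - start, first


if __name__ == "__main__":
    latency, chunk = first_chunk_latency(fake_streaming_tts(["Hello", " world"]))
    print(f"first chunk: {len(chunk)} bytes after {latency * 1000:.3f} ms")
```

Passing the clock in as a parameter keeps the measurement testable; with a real model, the same function would wrap whatever iterator the inference code exposes.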

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Developers can currently obtain the model weights from the GitHub repository and Hugging Face, and try them out quickly on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.
