Microsoft Open-Sources Cutting-Edge Voice AI Family VibeVoice: Processes 90-Minute Multi-Speaker Conversations in One Go, Rapidly Gains 27K Stars on GitHub

marsbit · Published 2026-03-30 · Last updated 2026-03-30

Introduction

Microsoft has open-sourced VibeVoice, a cutting-edge family of speech AI models for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The project, which has gained about 27K stars on GitHub, offers powerful long-audio processing, multi-speaker dialogue generation, and real-time capabilities under an MIT license for local deployment. Key models include:

- **VibeVoice-ASR-7B**: Processes up to 60 minutes of audio, outputs structured transcriptions with speaker identification and timestamps, and supports over 50 languages.
- **VibeVoice-TTS-1.5B**: Generates expressive, 90-minute multi-speaker (up to 4 voices) conversations with natural flow and emotional nuance.
- **VibeVoice-Realtime-0.5B**: Enables real-time TTS with ~300ms latency for interactive applications like voice assistants.

The framework addresses limitations in long-sequence processing, speaker consistency, and naturalness. It includes safety features like audio watermarking and has sparked community-developed tools (e.g., a voice input method). Available on GitHub and Hugging Face, VibeVoice aims to advance innovation in content creation, accessibility, and voice interaction.

Microsoft recently open-sourced VibeVoice, a cutting-edge family of voice AI models covering automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly drawn attention in the developer community for its long-audio processing, natural multi-speaker conversation generation, and low-latency real-time output, and it has already gained approximately 27K stars on GitHub.

As an open-source research framework, VibeVoice uses the MIT license, supports local deployment, requires no cloud subscription fees, and aims to promote collaboration and innovation in the field of speech synthesis. The model family mainly includes three core members, each with its own focus, collectively addressing the pain points of traditional voice AI in long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Structured Speech-to-Text Tool for Up to 60 Minutes

VibeVoice-ASR-7B is a unified speech-to-text model that can process audio files up to 60 minutes long in a single pass and directly output structured transcription results. The output covers "who is speaking" (speaker identification), "when they speak" (precise timestamps), and "what was said" (detailed content). Custom hotwords can be supplied to improve recognition accuracy for proper nouns and technical terms. The model supports over 50 languages and is suitable for complex scenarios like long meeting recordings and podcast transcriptions.
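To make the "who / when / what" structure concrete, here is a minimal Python sketch of how such a transcript could be represented and rendered. The `Segment` class, its field names, and the `to_clock` and `format_transcript` helpers are hypothetical and chosen for illustration; they are not VibeVoice's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry (hypothetical schema, not VibeVoice's real format)."""
    speaker: str   # who is speaking
    start: float   # when the segment starts, in seconds
    end: float     # when the segment ends, in seconds
    text: str      # what was said


def to_clock(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_transcript(segments: list[Segment]) -> str:
    """Return one line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{to_clock(s.start)} - {to_clock(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )
```

Keeping speaker, timing, and text in one record per utterance is what makes downstream tasks such as meeting minutes or subtitle export straightforward.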

Community developers have already built practical tools on this model, such as a voice input method named Vibing, which supports macOS and Windows. User feedback indicates that it is fast and accurate, significantly improving the efficiency of everyday voice input.

VibeVoice-TTS-1.5B: Expressive Speech Generation for 90-Minute Multi-Speaker Content

VibeVoice-TTS-1.5B is a core model focused on text-to-speech, capable of producing continuous audio up to 90 minutes long in a single generation, supporting natural dialogue simulation with up to 4 different speakers. The generated speech is expressive, sounds natural and fluent, and can simulate realistic pauses, emphasis, and emotional transitions, making it very suitable for producing podcasts, long-form audio narratives, audiobooks, or multi-character dialogue content.

Compared to many traditional TTS models that only support 1-2 speakers, VibeVoice-TTS has achieved significant breakthroughs in long-form, multi-speaker consistency. Its underlying architecture uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), greatly improving computational efficiency for long-sequence handling.
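The efficiency gain from the low frame rate is easy to see with some arithmetic. The sketch below counts how many frames a model must handle for a given duration. The 7.5 Hz figure comes from the article, while the 50 Hz comparison rate is an assumed, illustrative value rather than a measurement of any particular model.

```python
def frames_for(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover the audio at the given frame rate."""
    return round(duration_minutes * 60 * frame_rate_hz)


VIBEVOICE_RATE_HZ = 7.5    # frame rate stated in the article
COMPARISON_RATE_HZ = 50.0  # assumed rate, for illustration only

low = frames_for(90, VIBEVOICE_RATE_HZ)
high = frames_for(90, COMPARISON_RATE_HZ)
print(f"90 min at {VIBEVOICE_RATE_HZ} Hz: {low} frames")
print(f"90 min at {COMPARISON_RATE_HZ} Hz: {high} frames")
print(f"ratio: {high / low:.2f}x")
```

At 7.5 Hz, 90 minutes of audio corresponds to 40,500 frames. Because the frame count scales linearly with the frame rate, a lower rate directly shortens the sequence the model has to process.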

VibeVoice-Realtime-0.5B: Real-Time TTS with ~300ms Latency

VibeVoice-Realtime-0.5B focuses on real-time scenarios, supporting streaming text input with an initial audio output latency of approximately 300 milliseconds, while also being able to generate long-form speech of about 10 minutes. This model is particularly suitable for interactive applications requiring immediate responses, such as real-time voice assistants or live broadcast dubbing scenarios.
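"Initial audio output latency" here means the time between starting a request and receiving the first audio chunk. The sketch below shows one way to measure it for any streaming generator. `fake_streaming_tts` is a hypothetical stand-in that emits placeholder bytes; it is not the VibeVoice API, and the numbers it produces say nothing about the real model.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Stand-in synthesizer (hypothetical): yields one audio chunk per text chunk."""
    for chunk in text_chunks:
        # A real model would run inference here; we just emit placeholder bytes.
        yield chunk.encode("utf-8")


def first_chunk_latency(
    audio_stream: Iterator[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, bytes]:
    """Return (seconds until the first audio chunk arrives, that chunk)."""
    started = clock()
    first = next(audio_stream)  # blocks until the first chunk is produced
    return clock() - started, first


latency, chunk = first_chunk_latency(fake_streaming_tts(["Hello, ", "world."]))
print(f"first chunk after {latency * 1000:.1f} ms ({len(chunk)} bytes)")
```

The `clock` parameter is injectable so the measurement can be tested deterministically. In an interactive application, this first-chunk figure matters more than total generation time, because playback can begin while later chunks are still being produced.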

Additionally, the project introduces experimental speaker support, including multilingual voices and various English style variants, providing developers with more customization options.

AIbase Review: Microsoft's open-sourcing of VibeVoice not only lowers the barrier to using high-performance voice AI but also provides a complete solution for local deployment. The project was briefly taken down due to potential misuse risks but was later re-released with safety mechanisms such as embedded watermarks and audible disclaimers, reflecting the principles of responsible AI development. Developers can currently obtain the model weights from the GitHub repository and Hugging Face, and quickly try them out on platforms like Colab.

With continued contributions from the open-source community (such as optimized forks for Apple Silicon), VibeVoice is expected to accelerate adoption in fields like content creation, accessibility tools, and voice interaction. Interested developers can visit the official Microsoft project page to explore further.

Project address: https://github.com/microsoft/VibeVoice

Related Questions

Q: What is the name of the open-source voice AI model family recently released by Microsoft, and how many stars has it received on GitHub?

A: The open-source voice AI model family is called VibeVoice, and it has received approximately 27,000 stars on GitHub.

Q: What are the three core models in the VibeVoice family and their primary capabilities?

A: The three core models are: 1) VibeVoice-ASR-7B, which handles automatic speech recognition for up to 60 minutes of audio; 2) VibeVoice-TTS-1.5B, which generates expressive speech for up to 90 minutes with multiple speakers; and 3) VibeVoice-Realtime-0.5B, which provides real-time text-to-speech with about 300ms latency.

Q: What is a key feature of the VibeVoice-ASR-7B model regarding its output?

A: A key feature is its ability to output structured transcriptions that include speaker identification (who is speaking), precise timestamps (when they speak), and the detailed content (what was said).

Q: How does the VibeVoice-TTS-1.5B model achieve efficient long-sequence processing?

A: It uses continuous speech tokenizers (acoustic and semantic tokenizers) combined with a low frame rate design (7.5Hz), which significantly improves computational efficiency for long-sequence processing.

Q: What safety measures were implemented in the VibeVoice project to address potential misuse risks?

A: The project implemented embedded audio watermarks and audible disclaimer mechanisms as safety measures to address potential misuse risks.
