AI brings forth human emotions
The open-source speech model Orpheus gives LLMs the ability to convey human emotion. On an A100 40GB GPU, the 3-billion-parameter model's streaming inference runs faster than real-time audio playback, and it even supports zero-shot voice cloning.
What other capabilities can large language models (LLMs) develop?
This open-source model, Orpheus, has brought human emotion directly to LLMs!
Elias, an open-source developer at Canopy Labs, says that Orpheus, like a human, already has the ability to empathize and can generate subtle cues such as sighs, laughter, and scoffs directly from the text.
As an open-source text-to-speech (TTS) model, Orpheus outperforms open- and closed-source models alike, including those from ElevenLabs and OpenAI!
Orpheus demonstrates the emergent capabilities of LLMs in speech synthesis: it shows human-like emotional intelligence and can even produce latent vocal cues such as sighs, laughter, and chuckles from the text itself.
For a long time, open-source TTS models have been unable to compete with closed-source ones. Today that is beginning to change: Orpheus is shaking up the voice industry!
The newly open-source Orpheus has four major features:
Human-like speech: natural intonation, emotion, and rhythm, outperforming current state-of-the-art (SOTA) closed-source models.
Zero-shot voice cloning: clone a voice without any additional fine-tuning.
Controllable emotion and intonation: the emotion and character of a voice can be adjusted with simple tags.
Low latency: roughly 200 ms streaming latency, which drops to around 100 ms with input stream processing, making it suitable for real-time applications.
Streaming inference outputs audio incrementally while it is being generated, with very low latency, which makes it well suited to real-time applications.
On an A100 40GB GPU, the 3-billion-parameter model's streaming inference runs faster than real-time audio playback.
Project: https://github.com/canopyai/Orpheus-TTS
Models: https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2
Four major models
Orpheus is a family of pre-trained and fine-tuned models; the models released so far have 3 billion parameters.
In the coming days, the developers will release smaller models with 1 billion, 400 million, and 150 million parameters.
All are built on the Llama architecture, and both pre-trained and fine-tuned versions will be offered at four scales:
Medium - 3 billion parameters
Small - 1 billion parameters
Tiny - 400 million parameters
Nano - 150 million parameters
Even at very small scales, the models still produce high-quality, natural-sounding speech.
Fine-tuned models are suitable for dialogue scenarios, while pre-trained models can be used for a variety of downstream tasks, such as speech cloning or speech classification.
Model architecture and design
The pre-trained model uses Llama-3B as its backbone and was trained on over 100,000 hours of English speech data plus billions of text tokens.
Training on text tokens significantly improves performance on TTS tasks and gives the model stronger language understanding.
Because it adopts an LLM architecture, the model offers high accuracy, strong expressiveness, and a high degree of customizability.
The new model supports real-time streaming voice output with latency as low as roughly 200 milliseconds, making it suitable for conversational applications.
To reduce latency further, text can be streamed into the model's KV cache, bringing latency down to roughly 25-50 milliseconds.
Two unconventional approaches were adopted in the real-time voice design: a flattened SNAC token scheme and a non-streaming, CNN-based tokenizer adapted for streaming.
First, SNAC samples tokens at several different frequencies, and these are flattened into a single stream: each frame produces 7 tokens, which are decoded as one flattened sequence rather than with 7 separate LM heads.
This increases the number of generation steps the model must take. However, with a vLLM-based implementation on A100 or H100 GPUs, the model still generates tokens faster than real-time playback, so even long speech sequences can be produced in real time.
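As an illustration of the flattening scheme, here is a minimal sketch. It is illustrative only: the 1/2/4 split reflects SNAC's three code levels per frame, but the exact interleaving order Orpheus uses may differ.

# Illustrative sketch of flattening hierarchical SNAC codes into one token stream.
# SNAC emits codes at three rates per audio frame: 1 coarse, 2 medium, and 4 fine
# codes, i.e. 7 tokens per frame in total. The interleaving order is an assumption.

def flatten_frames(coarse, medium, fine):
    """coarse: N codes, medium: 2N codes, fine: 4N codes -> 7N flat tokens."""
    flat = []
    for i in range(len(coarse)):
        flat.append(coarse[i])
        flat.extend(medium[2 * i : 2 * i + 2])
        flat.extend(fine[4 * i : 4 * i + 4])
    return flat

def unflatten_frames(flat):
    """Inverse: split a flat token stream back into the three code levels."""
    coarse, medium, fine = [], [], []
    for f in range(len(flat) // 7):
        frame = flat[7 * f : 7 * f + 7]
        coarse.append(frame[0])
        medium.extend(frame[1:3])
        fine.extend(frame[3:7])
    return coarse, medium, fine

# One frame of dummy codes round-trips through the flattened representation.
c, m, fn = [10], [20, 21], [30, 31, 32, 33]
assert unflatten_frames(flatten_frames(c, m, fn)) == (c, m, fn)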
Second, Orpheus uses a non-streaming (CNN-based) tokenizer. Other speech LLMs that use SNAC as the decoder suffer from audible "popping" between frames during de-tokenization; Orpheus instead implements de-tokenization over a sliding window, which supports streaming inference while eliminating the popping entirely.
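A rough sketch of the sliding-window idea follows. Here decode is a hypothetical stand-in for the SNAC decoder, and the window and overlap sizes are illustrative rather than Orpheus's actual values: overlapping windows of codes are decoded together, and only the samples belonging to the newly added codes are emitted, so chunk boundaries never appear in the output audio.

def stream_decode(codes, decode, window=28, overlap=7):
    """Yield audio chunks for a growing stream of codec tokens.

    codes:  an iterable of flattened codec tokens (7 per frame).
    decode: hypothetical function mapping a list of codes to a list of samples.
    """
    buffer = []
    first = True
    for code in codes:
        buffer.append(code)
        if len(buffer) < window:
            continue
        audio = decode(buffer)                     # decode the whole current window
        samples_per_code = len(audio) // window
        if first:
            yield audio                            # nothing emitted yet: keep everything
            first = False
        else:
            # The first `overlap` codes were only context; their samples were
            # already emitted by the previous window, so skip them here.
            yield audio[overlap * samples_per_code:]
        buffer = buffer[-overlap:]                 # slide forward, keeping some context
    # A real implementation would also flush any trailing partial window.

# Dummy decoder producing 4 samples per code, just to exercise the loop.
chunks = list(stream_decode(range(100), lambda cs: [c for c in cs for _ in range(4)]))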
Usage Tutorial
This release includes two models, together with data processing scripts and sample datasets that make custom fine-tuning straightforward:
Finetuned Prod: a high-quality fine-tuned model suited to everyday TTS applications.
Pretrained: the base model, trained on over 100,000 hours of English speech; it defaults to conditional generation and can be extended to a broader range of tasks.
Streaming inference
1. Clone the repository:
git clone https://github.com/canopyai/Orpheus-TTS.git
2. Install dependencies:
cd Orpheus-TTS && pip install orpheus-speech # uses vllm under the hood for fast inference
pip install vllm==0.7.3
3. Run a streaming inference example
from orpheus_tts import OrpheusModel
import wave
import time

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")
prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24/7 but somehow people feel more alone than ever. And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''

start_time = time.monotonic()
syn_tokens = model.generate_speech(
    prompt=prompt,
    voice="tara",
)

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(24000)  # 24 kHz output

    total_frames = 0
    chunk_counter = 0
    for audio_chunk in syn_tokens:  # audio is streamed chunk by chunk
        chunk_counter += 1
        frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())
        total_frames += frame_count
        wf.writeframes(audio_chunk)

    duration = total_frames / wf.getframerate()
    end_time = time.monotonic()
    print(f"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio")
Prompt format
Fine-tuned model
The main text prompt format is:
{name}: I went to the ...
Available voices (ordered by conversational naturalness, a subjective assessment): "tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe".
Emotion tags can also be embedded in the text:
<laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
The orpheus-speech Python package and the provided notebooks format prompts automatically, so no manual adjustment is needed.
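For illustration, a prompt that mixes plain text with emotion tags might look like the following. This is a hypothetical example that reuses the model object from the streaming tutorial above; the package prepends the voice name and any special tokens itself.

# Hypothetical prompt text; the <sigh> and <laugh> tags come from the list above.
prompt = "Well <sigh> I guess we're doing this the hard way <laugh> alright, let's go."
syn_tokens = model.generate_speech(prompt=prompt, voice="tara")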
Pre-trained model
The pre-trained model can generate speech from text alone, or conditioned on one or more existing text-speech pairs.
Zero-shot voice cloning: because the model was not explicitly trained for cloning, providing more text-speech pairs as context makes generation of the target voice more reliable.
The following parameter adjustments are applicable to all models:
Standard LLM generation parameters: temperature, top_p, and so on are supported.
Avoiding repetition: setting repetition_penalty >= 1.1 improves stability.
Speed adjustment: increasing repetition_penalty and temperature makes the generated speech faster.
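As a hedged sketch of how these knobs might be applied, reusing the model object from the streaming example and assuming generate_speech forwards standard sampling parameters to the underlying vLLM engine (exact parameter support may vary by orpheus-speech version):

# Assumed keyword arguments; check your installed orpheus-speech version.
syn_tokens = model.generate_speech(
    prompt="Okay, deep breath. Here we go.",
    voice="tara",
    temperature=0.7,         # higher values tend toward livelier, faster delivery
    top_p=0.9,
    repetition_penalty=1.1,  # >= 1.1 is recommended above for stable output
)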
Model fine-tuning
The following is an overview of how to fine-tune the model on any text-speech dataset.
The process is very simple, much like fine-tuning an LLM with the Hugging Face Trainer and Transformers.
High-quality results should begin to appear after roughly 50 examples, but for the best results around 300 examples per speaker are recommended.
Step 1: The dataset should be a Hugging Face dataset with matching audio and text columns (see the minimal dataset sketch after the commands below).
Step 2: Use the provided Colab notebook to prepare the data.
This pushes an intermediate dataset to Hugging Face, which can then be fed into the training script finetune/train.py.
The preprocessing is estimated to take less than one minute for every thousand rows of data.
Step 3: Modify the finetune/config.yaml file to include the new dataset and training attributes, and then run the training script.
Any Hugging Face-compatible approach, such as LoRA, can also be used to further fine-tune the model.
pip install transformers datasets wandb trl flash_attn torch
huggingface-cli login <enter your HF token>
wandb login <wandb token>
accelerate launch train.py
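As a minimal sketch of step 1, a dataset with audio and text columns can be assembled and pushed with the datasets library. The file names, Hub repository name, and 24 kHz sampling rate below are hypothetical placeholders; adjust them to your own data.

from datasets import Dataset, Audio

# Hypothetical local clips and transcripts; replace with your own data.
rows = {
    "audio": ["clips/sample_000.wav", "clips/sample_001.wav"],
    "text":  ["First transcript.", "Second transcript."],
}

ds = Dataset.from_dict(rows)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # decode files as audio
ds.push_to_hub("your-username/orpheus-finetune-data")     # hypothetical repo name

The pushed dataset can then go through the Colab preparation notebook and into finetune/train.py as described above.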
This is just one of the many technologies developed by Canopy Labs.
They believe that in the future, every AI application will transform into a "digital human" capable of interacting with people.
Reference materials
https://canopylabs.ai/model-releases
https://x.com/Eliasfiz/status/1902435597954003174
https://x.com/shao__meng/status/1902504856277189027
This article is from the WeChat official account "Xinzhiyuan".