Hear Me Out
Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely

KTH Royal Institute of Technology, Stockholm, Sweden
Hear Me Out is an interactive evaluation and bias discovery platform for speech-to-speech conversational AI. Speech-to-speech models process spoken language directly from audio, without first converting it to text. They promise more natural, expressive, and emotionally aware interactions by retaining prosody, intonation, and other vocal cues throughout the conversation.
💻 Developing with Moshi using Modal for GPU hosting
1. Clone the Repository
First, you’ll need to get a copy of this project on your local machine. Open a terminal and run:
```bash
git clone https://github.com/shreeharsha-bs/Hear-Me-Out.git
cd Hear-Me-Out
```
2. Set Up Your Development Environment
Requirements

- `modal` installed in your current Python virtual environment (`pip install modal`)
- A Modal account (`modal setup`)
- A Modal token set up in your environment (`modal token new`)
Setting up Voice Conversion (seed-VC)
The voice conversion functionality uses the seed-VC library. To set this up:
1. Install the required dependencies for the local voice conversion server:

   ```bash
   pip install -r local_server_requirements.txt
   ```

2. Start the local voice conversion server in one terminal:

   ```bash
   python local_vc_server.py
   ```

3. In another terminal, start the Modal development server:

   ```bash
   modal serve -m src.app
   ```
This workflow allows the application to use local voice conversion capabilities (which run on your machine) while serving the main application through Modal.
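To make the split concrete, here is a minimal sketch of how an application component might hand audio to the locally running voice conversion server. The endpoint path, port, and payload field names below are assumptions for illustration, not the actual API of `local_vc_server.py`:

```python
# Hypothetical sketch: sending audio to the local VC server over HTTP.
# The URL, route, and payload fields are ASSUMED, not the project's real API.
import json
import urllib.request

LOCAL_VC_URL = "http://localhost:8000/convert"  # assumed address and route


def build_vc_request(audio_b64: str, target_voice: str) -> dict:
    """Assemble a JSON-serializable request for the (assumed) VC endpoint."""
    return {"audio": audio_b64, "target_voice": target_voice}


def send_vc_request(payload: dict) -> bytes:
    """POST the payload to the local server; requires the server to be running."""
    req = urllib.request.Request(
        LOCAL_VC_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    # Build (but do not send) a request, since the server may not be up.
    payload = build_vc_request("UklGR...", "target_speaker_01")
    print(payload["target_voice"])
```

Because voice conversion runs on your machine while the main app runs on Modal's GPUs, only the converted audio crosses the boundary between the two.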
While the `modal serve` process is running, changes to any of the project files are applied automatically; press Ctrl+C to stop the app.
Note that for frontend changes you may need to clear the browser cache, or better yet, use an incognito window for each run.
If you want to deploy the app, see Modal's deployment instructions; Modal currently offers $30 of free credits. You can also deploy entirely locally, but that requires some changes to the code.
Features
Hear Me Out enables users to experience interactions with conversational models in ways that aren’t typically accessible with regular benchmarking systems. Key features include:
- 🎤 Speech-to-Speech Models: Users can choose from a variety of models that retain vocal cues like prosody and intonation.
- 🔄 Real-Time Voice Conversion: Step into someone else’s voice – literally – and investigate how conversational AI systems interpret and respond to various speaker identities and expressions.
- ⚖️ Side-by-Side Comparisons: Ask a question with your own voice, then re-ask using a transformed voice. Compare the AI’s responses to observe differences in tone, phrasing, or behavior.
- 📊 Insights Through Data: Visualize metrics like speech rate, sentiment analysis, and more.
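As a toy illustration of the kind of metric visualized, the snippet below computes speech rate (words per second) for the same question asked in two voices. The `speech_rate` helper is hypothetical, not Hear Me Out's actual implementation:

```python
# Toy metric sketch: speech rate as words per second.
# speech_rate is a HYPOTHETICAL helper for illustration only.

def speech_rate(transcript: str, duration_s: float) -> float:
    """Words per second over an utterance of known duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(transcript.split()) / duration_s


# Comparing the same question asked with the original and a converted voice:
original = speech_rate("could I get access to the server room", 3.2)
converted = speech_rate("could I get access to the server room", 2.8)
print(f"original: {original:.2f} w/s, converted: {converted:.2f} w/s")
```

In the platform itself, such per-response metrics are shown side by side so that differences between voices are easy to spot.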
Through this immersive experience, we hope users will gain insight into identity, voice, and AI behavior. Ultimately, we aim for Hear Me Out to surface meaningful questions and inspire future research that promotes fairness and inclusivity.
Demo Video
In the demo video, we explore the Moshi speech-to-speech model and its responses:
Example 1: Emotional Awareness
Notice how the model disambiguates between inputs with levity and frustration, correctly reflecting the speaker’s emotional state in its responses. This distinction adds a more human-like quality to the interaction.
Example 2: Voice Conversion - Gender Bias When Requesting Unauthorized Access
By applying voice transformations, we simulate how the model might respond to different speaker characteristics. While the differences in these responses are more subtle and inconsistent under repetition, hearing oneself in another voice opens up new perspectives.
Example 3: Voice Conversion - Gender Bias at Work
📄 License
This project is licensed under the terms specified in the LICENSE file.
🤝 Collaborations
We welcome contributions and collaboration. If you're in HCI, please reach out.
Explore Empathy and Conversational AI with Hear Me Out