Hear Me Out

Interactive evaluation and bias discovery platform for speech-to-speech conversational AI

Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely

KTH Royal Institute of Technology, Stockholm, Sweden

🎙️ Click here to try Hear Me Out Live

Hear Me Out is an interactive evaluation and bias discovery platform for speech-to-speech conversational AI. Speech-to-speech models process spoken language directly from audio, without first converting it to text. They promise more natural, expressive, and emotionally aware interactions by retaining prosody, intonation, and other vocal cues throughout the conversation.


Hear Me Out Block Diagram

💻 Developing with Moshi using Modal for GPU hosting

1. Clone the Repository

First, you’ll need to get a copy of this project on your local machine. Open a terminal and run:

git clone https://github.com/shreeharsha-bs/Hear-Me-Out.git
cd Hear-Me-Out

2. Set Up Your Development Environment

Requirements
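The repository doesn't spell these out, so the following is an assumption based on the commands used below: a recent Python 3 with pip, the Modal CLI, and a Modal account. A virtual environment keeps the local server's dependencies isolated:

    python -m venv .venv
    source .venv/bin/activate   # on Windows: .venv\Scripts\activate
    pip install modal
    modal setup   # authenticate with your Modal account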

Setting up Voice Conversion (seed-VC)

The voice conversion functionality uses the seed-VC library. To set this up:

  1. Install the required dependencies for the local voice conversion server:

    pip install -r local_server_requirements.txt
    
  2. Start the local voice conversion server in one terminal:

    python local_vc_server.py
    
  3. In another terminal, start the Modal development server:

    modal serve -m src.app
    

This workflow allows the application to use local voice conversion capabilities (which run on your machine) while serving the main application through Modal.
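As a quick sanity check that both halves are up, a curl probe can help. The port and path below are assumptions rather than values taken from the repo; check local_vc_server.py for the actual address, and use the URL that modal serve prints for the Modal side:

    # Hypothetical address -- check local_vc_server.py for the real port/path
    curl -sf http://localhost:8000/ && echo "VC server reachable" || echo "VC server not responding"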

While the modal serve process is running, changes to any of the project files will be automatically applied. Ctrl+C will stop the app.

Note that frontend changes may require clearing the browser cache; better yet, use an incognito window for each run.

If you want to deploy the app, follow the deployment instructions in Modal's documentation. Modal currently offers $30 of free credits. You can also deploy entirely locally, but that would require some changes to the code.
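As a sketch, assuming modal deploy accepts the same -m module flag that modal serve uses above, a persistent deployment would be:

    modal deploy -m src.app

Unlike modal serve, this keeps the app running after you close the terminal.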


Features

Hear Me Out enables users to experience interactions with conversational models in ways that aren't typically accessible through standard benchmarking systems. Key features are illustrated in the interface below:

Hear Me Out Interface Screenshot

Through this immersive experience, we hope users will gain insights into identity, voice, and AI behavior. Ultimately, we aim for Hear Me Out to surface meaningful questions and inspire future research that promotes fairness and inclusivity.


Demo Video

In the demo video, we explore the Moshi speech-to-speech model and its responses:

Example 1: Emotional Awareness

Notice how the model distinguishes between light-hearted and frustrated inputs, correctly reflecting the speaker's emotional state in its responses. This distinction adds a more human-like quality to the interaction.

Example 2: Voice Conversion - Gender Bias When Requesting Unauthorized Access

By applying voice transformations, we simulate how the model might respond to different speaker characteristics. While the differences in these responses are subtler and less consistent across repeated trials, hearing oneself in another voice opens up new perspectives.

Example 3: Voice Conversion - Gender Bias at Work

📄 License

This project is licensed under the terms specified in the LICENSE file.

🤝 Collaborations

We welcome contributions and collaboration. If you work in HCI, please reach out.


Explore Empathy and Conversational AI with Hear Me Out

🎙️ Try it now