Data Specification

The CPC3 dataset is derived from the first three Clarity Enhancement Challenges (CEC1, CEC2, and CEC3). It includes processed signals representing the outputs of participant-submitted systems, along with listener responses collected during system evaluations.

info

To download the data, please visit the download page.

Data Distribution and Installation

The data is distributed as two gzipped tar archives:

clarity_CPC3_data.v1_0.tar.gz (labelled training data, 7GB)
clarity_CPC3_data.dev.v1_0.tar.gz (unlabelled development set, 765MB)

Installation Instructions:

Download both .tar.gz files.
Unpack each archive using the following commands:
```
tar -xvzf clarity_CPC3_data.v1_0.tar.gz  # For training data
tar -xvzf clarity_CPC3_data.dev.v1_0.tar.gz # For development data
```
Alternatively, you can unpack them in two steps:
```
gunzip clarity_CPC3_data.v1_0.tar.gz
tar -xvf clarity_CPC3_data.v1_0.tar
```
Note: You might encounter warnings about file name conflicts during unpacking. These can be safely ignored.

Directory Structure After Unpacking:

clarity_CPC3_data/
├── clarity_data/
│   ├── metadata/  # Listener responses and characteristics
│   ├── train/
│   │   ├── references/  # Reference signals for intelligibility prediction
│   │   └── signals/     # Hearing aid output signals
│   └── dev/
│       ├── references/
│       └── signals/
└── manifest/

Dataset Overview

The dataset is divided into two parts:

Training Data: Contains audio signals and corresponding listener responses, enabling model training.
Development Data: Provides audio signals without listener responses. This set can be used for system development and evaluation - we will be opening a leaderboard to which you can submit development set predictions.

Key Differences Between CECs:

The training data is derived from the 1st and 2nd Clarity Enhancement Challenges (CEC1, CEC2). Development and final evaluation will be on data from the 3rd Clarity Enhancement Challenge (CEC3). The CECs differ in the following ways:

CEC1: Focused on simple scenes with a single interferer.
CEC2: Involved more complex scenes with multiple interferers.
CEC3: Based on CEC2 but a mix of simulated scenes and real-world recordings (see the CEC3 website for more details).

Training Data Details

Hearing Aid Output Signals (`clarity_data/train/signals/`)

Stored as 16-bit stereo WAV files at 32 kHz.
Filenames follow the format: <CEC_ID>_<SYSTEM_ID>_<SCENE_ID>_<LISTENER_ID>.wav
- <CEC_ID>: CEC1 or CEC2.
- <SYSTEM_ID>: Hearing aid system identifier.
- <SCENE_ID>: Scene identifier.
- <LISTENER_ID>: Listener identifier.
- Example: CEC1_E010_S09673_L0217.wav

Reference Signals (`clarity_data/train/references/`)

Represent non-reverberant target speech, aligned to match the hearing aid input.
Filenames: <CEC_ID>_<SCENE_ID>_ref.wav
- <CEC_ID>: CEC1 or CEC2.
- <SCENE_ID>: Scene identifier.
One reference signal corresponds to multiple hearing aid outputs from the same scene.
Note: Slight misalignment can occur due to hearing aid processing.
Note: When using the reference signals do not rely on them being scaled to the same level as the target signal at the hearing aid input. This won't be the case for the development and final evaluation sets.

Metadata (`clarity_data/metadata/`)

Contains listener responses (CPC3.train.json) and listener characteristics (listeners.csv).

Listener Responses (`CPC3.train.json`)

JSON file with a list of dictionaries, each representing a listener's response.
Fields:
- signal: Hearing aid output filename.
- prompt: Original target sentence.
- response: Listener's transcribed response ("#" indicates no response/understanding).
- n_words: Number of words in prompt.
- words_correct: Number of correctly identified words
- correctness: Percentage of correctly identified words (the target variable).
  CPC3.train.json
```
[
    // ...
    {
        "signal": "CEC1_E001_S08594_L0231",
        "prompt": "I can't keep my cool and he does rattle me",
        "response": "I can't keep my ",
        "n_words": 10,
        "words_correct": 4,
        "correctness": 40.0
    },
    // ...
]
```

Listener Characteristics (`listeners.csv`)

CSV file with listener IDs and hearing impairment severity (WHO 2021 standard).
listeners.csv
```
listener_id,severity
L0200,Moderate
L0201,Mild
L0202,Moderate
L0207,Moderate
// ...
```

Development Data Details

Derived from CEC3, containing 926 audio signals for system development.
Smaller preview of the final evaluation set (approximately 7,000 signals).

Metadata (CPC3.dev.json)

CPC3.dev.json
[
    {
        "signal": "fb65e85e504ae136db0184d4",
        "prompt": "you don't like coming to the refuge do you",
        "n_words": 9,
        "hearing_loss": "Mild"
    },
    // ...
]

signal: Encoded identifier for hearing aid output and reference.
prompt: Original target sentence.
n_words: Number of words in prompt.
hearing_loss: Listener's hearing impairment severity.

Audio Signals (clarity_data/dev/signals/ and clarity_data/dev/references/)

16-bit stereo WAV files at 32 kHz.

Leaderboard and Submission

A leaderboard will open in mid-April.
Submission instructions will be provided then.
The final evaluation set will be used for final ranking.

Important Notes:

signal identifiers are encoded to prevent data leakage.
hearing_loss is provided directly in CPC3.dev.json.
The development set is significantly smaller than the final evaluation set.

Data Distribution and Installation​

Dataset Overview​

Training Data Details​

Hearing Aid Output Signals (clarity_data/train/signals/)​

Reference Signals (clarity_data/train/references/)​

Metadata (clarity_data/metadata/)​

Listener Responses (CPC3.train.json)​

Listener Characteristics (listeners.csv)​

Development Data Details​