Task 1 Data
The Task 1 data consists of simulated hearing aid inputs that have been constructed using a set of real high-order ambisonic impulse responses that were recorded for the challenge. The scenes follow the construction used in the 2nd Clarity Enhancement Challenge (CEC2) and consist of a target sentence and either two or three interferers. The interferers can be speech, music or noise from domestic appliances, in any combination. We have published a 2,500-scene development set. A further 1,500 scenes were recorded and have been set aside for evaluation. The training data from CEC2 is available for use in training.
The data is organised into the following directories, and can be obtained from the download page.
clarity_CEC3_data
├── manifest
├── task1
│   └── clarity_data
│       ├── metadata
│       ├── train (use CEC2)
│       └── dev
│           ├── scenes
│           └── speaker_adapt
├── task2
└── task3
The sections below describe the impulse response recording setup, the signal mixing and the format of the audio files and metadata provided for each scene.
Recording setup
The task uses a novel set of impulse responses that were recorded at the University of Salford using an mh acoustics em64 Eigenmike. These responses can be used in place of the 6th-order ambisonic impulse responses that were generated by room simulation in the previous challenge.
Recordings were made for 32 random configurations of a listener, a target and up to three interferers. Configurations were randomised in advance and marked out on the floor of the recording room. For each configuration, the microphone was placed at the position of the listener and a loudspeaker was placed, in turn, at each of the sound source positions and directed towards the microphone. The sine-sweep method was then used to estimate the impulse response. The process was repeated for all 32 configurations and for the target and three interferers, i.e. 32 x (3 + 1) = 128 impulse responses were recorded in total.
The room is an acoustically treated recording room with approximate dimensions of 5m x 5m x 2m. Some images are provided below for context.
- The recording room
- ...another view
- ... and another.



The room configurations were randomly generated by independently selecting the x and y coordinates of the listener, target and interferers. The positions were chosen uniformly at random within the dimensions of the room, excluding a 1 m border around the walls. In addition, no sound source was allowed within 1 m of the listener: samples were rejected and redrawn if this was the case. As with CEC2, for each configuration a height (z) was randomly chosen to be either 1.2 m (simulating sitting) or 1.6 m (simulating standing), and the microphone and loudspeakers were placed at this height.
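The sampling rule described above can be sketched in a few lines of Python. This is only an illustration, not the script used to plan the recordings: the function names are hypothetical, the 5 m x 5 m footprint is the approximate figure quoted above, and it assumes that only the offending source is redrawn when it falls too close to the listener (the text does not say whether the whole configuration was redrawn instead).

```python
import math
import random

ROOM_X, ROOM_Y = 5.0, 5.0  # approximate room footprint in metres
BORDER = 1.0               # excluded border around the walls (m)
MIN_DIST = 1.0             # minimum listener-to-source distance (m)
HEIGHTS = (1.2, 1.6)       # sitting or standing height (m)


def random_position(rng):
    """Draw an (x, y) position uniformly inside the room, excluding the border."""
    return (
        rng.uniform(BORDER, ROOM_X - BORDER),
        rng.uniform(BORDER, ROOM_Y - BORDER),
    )


def sample_source(rng, listener):
    """Rejection sampling: redraw until the source is far enough from the listener."""
    while True:
        source = random_position(rng)
        if math.dist(listener, source) >= MIN_DIST:
            return source


def sample_configuration(rng, n_interferers=3):
    """Draw one listener/target/interferer layout with a shared height."""
    listener = random_position(rng)
    return {
        "listener": listener,
        "target": sample_source(rng, listener),
        "interferers": [sample_source(rng, listener) for _ in range(n_interferers)],
        "height": rng.choice(HEIGHTS),
    }


if __name__ == "__main__":
    print(sample_configuration(random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which matters when the layout has to be marked out on a floor in advance.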
The figure below shows the layouts of the 16 rooms used in the development data. The rooms are identified by the room number (R20001 to R20016) and the precise positions of the listener, target and interferers are given in the metadata files (see Section Metadata Formats).

The hearing aid signal simulation
The hearing aid input signals are simulated using the same processes that were previously used in the 2nd Clarity Enhancement Challenge. The only difference is that we replaced the simulated impulse responses with the real recordings. Key details are repeated here for completeness, and more information can be found on the CEC2 data pages.
Head rotation
The listener is initially oriented away from the target and turns to roughly face the target talker around the time when the target speech starts.
- Orientation of listener at start of the sample is ~25° from facing the target (standard deviation = 5°), limited to ±2 standard deviations.
- Start of rotation is between -0.635 s and 0.865 s (rectangular probability).
- The rotation lasts for 200 ms (standard deviation = 10 ms).
- Orientation after rotation is 0-10° (random with rectangular probability distribution).
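As a rough illustration of the parameters listed above, the following Python sketch draws one set of head-rotation values. It is not the challenge's scene generator. The function and key names are hypothetical, the rotation duration is assumed to be normally distributed, and "limited to ±2 standard deviations" is interpreted as redrawing out-of-range values (clipping would be the other plausible reading). The direction of the offset (left or right) and the time reference for the rotation start are not stated in this section, so the sketch returns magnitudes and raw times only.

```python
import random


def sample_head_rotation(rng):
    """Draw one set of head-rotation parameters (angles in degrees, times in seconds)."""
    mean_deg, sd_deg = 25.0, 5.0

    # Initial offset from the target: normal, restricted to mean +/- 2 SD.
    # Assumption: values outside the range are redrawn rather than clipped.
    while True:
        start_angle = rng.gauss(mean_deg, sd_deg)
        if abs(start_angle - mean_deg) <= 2.0 * sd_deg:
            break

    return {
        "start_angle_deg": start_angle,
        # Rectangular (uniform) distribution for the rotation start time.
        "rotation_start_s": rng.uniform(-0.635, 0.865),
        # Assumption: duration is normal with mean 200 ms and SD 10 ms.
        "rotation_duration_s": rng.gauss(0.200, 0.010),
        # Residual offset from the target once the rotation has finished.
        "end_angle_deg": rng.uniform(0.0, 10.0),
    }


if __name__ == "__main__":
    print(sample_head_rotation(random.Random(0)))
```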
Signal-to-noise ratio (SNR)
The SNRs of the mixtures are engineered to achieve a suitable range of speech intelligibility values. A desired signal-to-noise ratio, SNR (dB), is chosen at random. This is generated with a uniform probability distribution between limits determined by pilot listening tests. The better ear SNR (BE_SNR) models the better ear effect in binaural listening. It is calculated for the reference channel (channel 1, which corresponds to the front microphone of the hearing aid). This value is used to scale all interferer channels. The procedure is described below.
For the reference channel,
- The segment of the summed interferers that overlaps with the target (without padding), $i'$, and the target (without padding), $t'$, are extracted.
- Speech-weighted SNRs are calculated for each ear, $\mathrm{SNR}_L$ and $\mathrm{SNR}_R$:
  - Signals $i'$ and $t'$ are separately convolved with a speech-weighting filter, h (specified below).
  - The rms is calculated for each convolved signal.
  - $\mathrm{SNR}_L$ and $\mathrm{SNR}_R$ are calculated as the ratio of these rms values.
- The BE_SNR is selected as the maximum of the two SNRs: $\mathrm{BE\_SNR} = \max(\mathrm{SNR}_L, \mathrm{SNR}_R)$.
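The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the challenge's mixing code: all function names are hypothetical, the SNRs are expressed in dB (the text only says "ratio of these rms values"), the speech-weighting filter is passed in as a list of coefficients because its definition is given elsewhere, and the gain formula at the end is one plausible way to scale the interferers so that the better-ear SNR matches the desired value. In practice a library routine such as `numpy.convolve` would replace the hand-written convolution.

```python
import math


def convolve(x, h):
    """Full linear convolution of two sequences (pure-Python, for illustration)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def speech_weighted_snr_db(target, interferer, h):
    """SNR in dB for one ear after filtering both signals with h."""
    return 20.0 * math.log10(rms(convolve(target, h)) / rms(convolve(interferer, h)))


def better_ear_snr_db(target_lr, interferer_lr, h):
    """Maximum of the left-ear and right-ear speech-weighted SNRs."""
    snr_left = speech_weighted_snr_db(target_lr[0], interferer_lr[0], h)
    snr_right = speech_weighted_snr_db(target_lr[1], interferer_lr[1], h)
    return max(snr_left, snr_right)


def interferer_gain(be_snr_db, desired_snr_db):
    """Linear gain for all interferer channels (assumed scaling rule)."""
    return 10.0 ** ((be_snr_db - desired_snr_db) / 20.0)


if __name__ == "__main__":
    h = [1.0]  # placeholder filter: identity, NOT the challenge's speech weighting
    target = ([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
    interferer = ([0.5, -0.5, 0.5, -0.5], [0.5, -0.5, 0.5, -0.5])
    be = better_ear_snr_db(target, interferer, h)
    print(be, interferer_gain(be, desired_snr_db=0.0))
```

Applying a single gain to every interferer channel preserves the spatial cues between channels and ears, which is why the scaling is derived from one reference value rather than per channel.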