An explanation of the time and computational limits for the first round of the enhancement challenge.
Enhancement challenge 2021
For a hearing aid to work well for users, the processing needs to be quick. The output of the hearing aid should be produced with a delay of less than about 10 ms. Many audio processing techniques are non-causal, i.e., the output of the system depends on samples from the future. Such processing is useless for hearing aids and therefore our rules include a restriction on the use of future samples.
The rules state the following:
Systems must be causal; the output at time t must not use any information from input samples more than 5 ms into the future (i.e., no information from input samples >t+5ms).
There is no limit on computational cost.
Mathematically this is:
\[ y_n = f(x_{m+N}, L) \]
where y_n is the output from your hearing aid for sample n.
x is the audio input signal from a hearing aid microphone.
N = 0.005 f_s, where f_s is the sampling frequency in Hz.
m is a sample number where m <= n.
L is the listener characteristics.
f() is the hearing aid function. There is no limitation on how long this takes to compute.
You can use multiple microphones; only a single input signal x is shown here just for simplicity.
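The rule can also be checked numerically. The following is a minimal sketch, not part of the challenge software: the function names (lookahead_samples, centred_average, respects_limit) and the toy sampling rate are assumptions made for illustration. It computes N from the sampling frequency, then tests whether an output sample y_n changes when only input samples later than n + N are altered. A system that obeys the rule must give the same y_n either way.

```python
def lookahead_samples(fs, limit_ms=5):
    """Whole samples of permitted look-ahead: N = 0.005 * fs, rounded down."""
    return (int(fs) * limit_ms) // 1000


def centred_average(x, half_width):
    """Toy stand-in for f(): y[n] is the mean of x[n-half_width .. n+half_width]."""
    y = []
    for n in range(len(x)):
        window = x[max(0, n - half_width): n + half_width + 1]
        y.append(sum(window) / len(window))
    return y


def respects_limit(process, x, fs, n):
    """Return True if y[n] is unchanged when samples after n + N are altered."""
    N = lookahead_samples(fs)
    perturbed = list(x)
    for k in range(n + N + 1, len(perturbed)):
        perturbed[k] += 1.0
    return process(x)[n] == process(perturbed)[n]


fs = 1000  # assumed toy rate, chosen so that N = 5 samples
x = [float(i % 7) for i in range(50)]
N = lookahead_samples(fs)

# Look-ahead of exactly N samples: within the limit.
print(respects_limit(lambda s: centred_average(s, N), x, fs, n=10))
# Look-ahead of N + 1 samples: one sample too far into the future.
print(respects_limit(lambda s: centred_average(s, N + 1), x, fs, n=10))
```

A check like this only probes the sample indices you test, so it can reveal a violation but cannot prove that a system is compliant.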
Here it is illustrated as a diagram.
We have chosen a limit of 5 ms because in a real hearing aid there will be other sources of delay (e.g., analogue-to-digital and digital-to-analogue conversion).
Why is there no limitation of how long f() takes to compute?
We’re trying to foster new approaches to hearing aid processing and decided that at this stage we will drive more innovation if we don’t restrict computation time for round one. Such restrictions will be considered in future rounds.
Why haven’t you talked about latency?
In discussions, it is apparent that this term is used in different ways by different people, so to avoid confusion we’re not using it!
Do algorithms have to be real-time?
The above limitations mean that the algorithms could in theory be made real-time if a powerful enough computer was available, but your entry can take as long as it needs to process the signals.
Improving hearing aid processing using DNNs blog. A suggested approach to overcome the non-differentiable loss function.
The aim of our Enhancement Challenge is to get people producing new algorithms for processing speech signals through hearing aids. We expect most entries to replace the classic hearing aid processing of Dynamic Range Compressors (DRCs) with deep neural networks (DNN) (although all approaches are welcome!). The first round of the challenge is going to be all about improving speech intelligibility.
Setting up a DNN structure and training regime for the task is not as straightforward as it might first appear. Figure 1 shows an example of a naive training regime. An audio example of Speech in Noise (SPIN) is randomly created (audio sample generation, bottom left), and a listener is randomly selected with particular hearing loss characteristics (random artificial listener generation, top left). The DNN Enhancement model (represented by the bright yellow box) then produces improved speech in noise. (Audio signals in pink are two-channel, left and right because this is for binaural hearing aids.)
Next, the improved speech in noise is passed to the Prediction Model in the lime green box, which gives an estimate of the Speech Intelligibility (SI). Our baseline system will include algorithms for this. We’ve already blogged about the Hearing Loss Simulation. Our current thinking is that the intelligibility model will use a binaural form of the Short-Time Objective Intelligibility Index (STOI) [1]. The dashed line going back to the enhancement model shows that the DNN will be updated based on the reciprocal of the Speech Intelligibility (SI) score. By minimising (1/SI), the enhancement model will be maximising intelligibility.
The difficulty here is that updating the Enhancement Model DNN during training requires the error to be known at the DNN’s output (the point labelled “improved SPIN”). But we don’t know this, we only know the error on the output of the prediction model at the far right of the diagram. This wouldn’t be a problem if the prediction model could be inverted, because we could then run the 1/SI error backwards through the inverse model.
As the inverse of the prediction model isn’t available, one solution is to train another DNN to mimic its behaviour (Figure 2). As this new Prediction Model is a DNN, the 1/SI error can be passed backwards through it using standard neural network training formulations.
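Writing out the gradient shows where the surrogate comes in. The symbols below are assumed for illustration and do not come from the challenge documentation: \theta are the weights of the Enhancement Model E, \hat{s} = E_\theta(x) is the improved SPIN, L is the listener characteristics, P is the original prediction model, and \hat{P}_\phi is the DNN trained to mimic it.

\[
SI = P(\hat{s}, L), \qquad
\frac{\partial}{\partial \theta}\left(\frac{1}{SI}\right)
= -\frac{1}{SI^{2}} \, \frac{\partial P}{\partial \hat{s}} \, \frac{\partial \hat{s}}{\partial \theta}
\]

The factor \partial P / \partial \hat{s} is the one that cannot be computed when P is a black box. Substituting the surrogate gives an approximation in which every factor can be obtained by backpropagation:

\[
\frac{\partial}{\partial \theta}\left(\frac{1}{\widehat{SI}}\right)
= -\frac{1}{\widehat{SI}^{2}} \, \frac{\partial \hat{P}_\phi}{\partial \hat{s}} \, \frac{\partial \hat{s}}{\partial \theta},
\qquad \widehat{SI} = \hat{P}_\phi(\hat{s}, L)
\]

The approximation is only as good as the match between \hat{P}_\phi and P on the signals the Enhancement Model actually produces.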
This DNN prediction model could be trained first using knowledge distillation (this is something I’ve previously done for a speech intelligibility model), and then the weights frozen while the Enhancement Model is trained. But there is a ‘chicken and egg’ problem here. The difficulty is generating all the training data for the prediction model. Until you train the enhancement model, you won’t have representative examples of “improved SPIN” to train the prediction model. But without the prediction model, you can’t train the enhancement model.
One solution is to train the two DNNs in tandem, with an approach analogous to how pairs of networks are trained in a Generative Adversarial Network (GAN). iMetricGAN, developed by Li et al. [2], is an example of this being done for speech enhancement, although the authors weren’t trying to include hearing loss simulation. They aren’t the only ones trying to solve problems where a non-differentiable or black-box evaluation function is in the way of DNN training [3, 4].
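The alternating idea can be illustrated with a deliberately tiny toy. This is a sketch under heavy assumptions, not iMetricGAN and not the Clarity baseline: the "enhancement model" is a single gain parameter, the "prediction model" is a made-up score (black_box_si) that is rounded so its own gradient is useless, and the surrogate is a straight line rather than a DNN. All names are hypothetical. Each step refits the surrogate on outputs of the current enhancement setting, then updates the gain by descending 1/SI through the surrogate.

```python
def black_box_si(gain):
    """Stand-in for a non-differentiable intelligibility score in (0, 1].

    Rounding makes it piecewise constant, so it offers no usable gradient.
    The peak at gain = 2.0 is an arbitrary choice for this toy.
    """
    return round(1.0 / (1.0 + (gain - 2.0) ** 2), 2)


def fit_linear_surrogate(gains, scores):
    """Least-squares line s(g) = a + b * g through the sampled scores."""
    n = len(gains)
    g_mean = sum(gains) / n
    s_mean = sum(scores) / n
    sxx = sum((g - g_mean) ** 2 for g in gains)
    sxy = sum((g - g_mean) * (s - s_mean) for g, s in zip(gains, scores))
    b = sxy / sxx
    a = s_mean - b * g_mean
    return a, b


def train_in_tandem(gain=0.5, steps=200, lr=0.1):
    """Alternate between refitting the surrogate and updating the gain."""
    offsets = [-0.2, -0.1, 0.0, 0.1, 0.2]
    for _ in range(steps):
        # 1) Refit the surrogate on scores for the *current* enhancement setting,
        #    so its training data stays representative.
        gains = [gain + d for d in offsets]
        a, b = fit_linear_surrogate(gains, [black_box_si(g) for g in gains])
        # 2) Descend the loss 1 / s(g) using the surrogate's gradient:
        #    d(1/s)/dg = -b / s**2.
        s = a + b * gain
        gain -= lr * (-b / s ** 2)
    return gain


final_gain = train_in_tandem()
print(final_gain, black_box_si(final_gain))
```

Refitting the surrogate at every step is what sidesteps the ‘chicken and egg’ problem in this toy: the surrogate never needs to be accurate everywhere, only near the enhancement model's current output.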
We hope our entrants will come up with lots of other ways of overcoming this problem. How would you tackle it?
[1] Andersen, A.H., Haan, J.M.D., Tan, Z.H. and Jensen, J., 2015. A binaural short time objective intelligibility measure for noisy and enhanced speech. In the Sixteenth Annual Conference of the International Speech Communication Association.
[2] Li, H., Fu, S.W., Tsao, Y. and Yamagishi, J., 2020. iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning. arXiv preprint arXiv:2004.00932.
[3] Gillhofer, M., Ramsauer, H., Brandstetter, J., Schäfl, B. and Hochreiter, S., 2019. A GAN based solver of black-box inverse problems. Proceedings of the NeurIPS 2019 Workshop.
[4] Kawanaka, M., Koizumi, Y., Miyazaki, R. and Yatabe, K., 2020, May. Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7524-7528). IEEE.
We’ll be challenging our contestants to find innovative ways of making speech more audible for hearing impaired listeners when there is noise getting in the way. But what noises should we consider? To aid us in choosing sounds and situations that are relevant to people with hearing aids, we held a focus group.
We wanted to know about:
Everyday background noises that make having a conversation difficult.
The characteristics of speech after it has been processed by a hearing-aid that hearing aid listeners would value.
A total of eight patients (four males, four females) attended the meeting, six of whom were recruited from the Nottingham Biomedical Research Centre’s patient and public involvement contact list. Two attendees were recruited from a local lip reading class organised by the Nottinghamshire Deaf Society. Hearing loss within the group ranged from mild to severe. All of them regularly use bilateral hearing aids.
Our focus was on the living room because that is the scenario for round one of the challenges.
Everyday background noises that interfere with understanding of speech
A long and varied list of sounds causes problems. These lists are in no particular order.
An overview of why machine learning challenges have potential to improve hearing aid signal processing.
The Clarity Project is based around the idea that machine learning challenges could improve hearing aid signal processing. After all, this has happened in other areas, such as automatic speech recognition (ASR) in the presence of noise. The improvements in ASR have happened because of:
Machine learning (ML) at scale – big data and raw GPU power.
Benchmarking – research has developed around community-organised evaluations or challenges.
Collaboration – these challenges have enabled work across communities such as signal processing, acoustic modelling, language modelling and machine learning.
We’re hoping that these three mechanisms can drive improvements in hearing aids.
Components of a challenge
There needs to be a common task based on a target application scenario to allow communities to gain from benchmarking and collaboration. The Clarity Project’s first enhancement challenge will be about hearing speech from a single talker in a typical living room, where there is one source of noise and a little reverberation.
Entrants are then given training data and development (dev) test data along with a baseline system that represents the current state-of-the-art. You can find a post and video on the current thinking on the baseline here. We’re still working on the rules stipulating what is and what is not allowed (for example, whether entrants will be allowed to use data from outside the challenge).
Clarity’s first enhancement challenge is focussed on maximising the speech intelligibility (SI) score. We will evaluate this first through a prediction model that is based on a hearing loss simulation and an objective metric for speech intelligibility. Simulation has been hugely important for generating training data in the CHiME challenges and so we intend to use that approach in Clarity. But results from simulated test sets cannot be trusted on their own, and hence a second evaluation will come through perceptual tests on hearing impaired subjects. However, one of our current problems is that we can’t bring listeners into our labs because of COVID-19.
We’ll actually be running two challenges roughly in parallel, because we’re also going to task the community with improving our prediction model for speech intelligibility.
We’re running a series of challenges over five years. What other scenarios should we consider? What speech? What noise? What environment? Please comment below.