## Latency, computation time and real-time operation

An explanation of the time and computational limits for the first round of the enhancement challenge.

## Enhancement challenge 2021

For a hearing aid to work well for users, the processing needs to be quick. The output of the hearing aid should be produced with a delay of less than about 10 ms. Many audio processing techniques are non-causal, i.e., the output of the system depends on samples from the future. Such processing is useless for hearing aids and therefore our rules include a restriction on the use of future samples.

The rules state the following:

• Systems must be causal; the output at time t must not use any information from input samples more than 5 ms into the future (i.e., no information from input samples >t+5ms).
• There is no limit on computational cost.

Mathematically this is:

$$y_n = f(x_{m+N}, L)$$

• where $y_n$ is the output from your hearing aid for sample $n$.
• $x$ is the audio input signal from a hearing aid microphone.
• $N = 0.005 f_s$, where $f_s$ is the sampling frequency.
• $m$ is a sample number where $m \le n$.
• $L$ is the listener characteristics.
• $f()$ is the hearing aid function. There is no limitation on how long this takes to compute.
• You can use multiple microphones; only a single input signal $x$ is shown here for simplicity.

Here it is illustrated as a diagram.

We have chosen a limit of 5 ms because in a real hearing aid there will be other sources of delay (e.g., analogue-to-digital and digital-to-analogue conversion).
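As a rough illustration of the rule, the sketch below converts the 5 ms limit into a sample count and checks a processor against it by perturbing "too far in the future" input samples. Everything here is an assumption for illustration: the function names are hypothetical, rounding N = 0.005 fs down to a whole sample is our choice, and the toy look-ahead averager merely stands in for a real hearing aid function f().

```python
import random

def lookahead_samples(fs_hz, limit_ms=5):
    """Whole number of future samples allowed: N = 0.005 * fs, rounded down (assumption)."""
    return (fs_hz * limit_ms) // 1000

def lookahead_mean(x, k):
    """Toy processor: y[n] is the mean of x[n] .. x[n+k], zero-padded at the end."""
    padded = list(x) + [0.0] * k
    return [sum(padded[n:n + k + 1]) / (k + 1) for n in range(len(x))]

def respects_limit(process, x, fs_hz, n, tol=1e-12):
    """Perturb every input sample after n + N and check that y[0..n] is unchanged."""
    n_future = lookahead_samples(fs_hz)
    rng = random.Random(0)
    x_perturbed = list(x)
    for i in range(n + n_future + 1, len(x)):
        x_perturbed[i] += rng.uniform(0.5, 1.0)  # guaranteed non-zero change
    y_ref, y_new = process(x), process(x_perturbed)
    return all(abs(a - b) <= tol for a, b in zip(y_ref[:n + 1], y_new[:n + 1]))

fs = 16000
N = lookahead_samples(fs)
signal_rng = random.Random(1)
x = [signal_rng.uniform(-1.0, 1.0) for _ in range(1000)]

# Uses exactly N future samples: allowed.
ok = respects_limit(lambda s: lookahead_mean(s, N), x, fs, n=300)
# Uses N + 1 future samples: one sample too many.
bad = respects_limit(lambda s: lookahead_mean(s, N + 1), x, fs, n=300)
```

Passing this check for one value of n is necessary but not sufficient; a more thorough test would repeat it for many values of n across the signal.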

## Why is there no limit on how long f() takes to compute?

We’re trying to foster new approaches to hearing aid processing and decided that at this stage we will drive more innovation if we don’t restrict computation time for round one. Such restrictions will be considered in future rounds.

## Why haven’t you talked about latency?

In discussions, it is apparent that this term is used in different ways by different people, so to avoid confusion we’re not using it!

## Do algorithms have to be real-time?

The above limitations mean that the algorithms could in theory be made real-time if a powerful enough computer were available, but your entry can take as long as it needs to process the signals.

## One approach to our enhancement challenge

A blog post on improving hearing aid processing using DNNs, with a suggested approach to overcoming the non-differentiable loss function.

The aim of our Enhancement Challenge is to get people producing new algorithms for processing speech signals through hearing aids. We expect most entries to replace the classic hearing aid processing of Dynamic Range Compressors (DRCs) with deep neural networks (DNNs), although all approaches are welcome! The first round of the challenge is going to be all about improving speech intelligibility.

Setting up a DNN structure and training regime for the task is not as straightforward as it might first appear. Figure 1 shows an example of a naive training regime. An audio example of Speech in Noise (SPIN) is randomly created (audio sample generation, bottom left), and a listener is randomly selected with particular hearing loss characteristics (random artificial listener generation, top left). The DNN Enhancement model (represented by the bright yellow box) then produces improved speech in noise. (Audio signals in pink are two-channel, left and right because this is for binaural hearing aids.)

Next the improved speech in noise is passed to the Prediction Model in the lime green box, and this gives an estimate of the Speech Intelligibility (SI). Our baseline system will include algorithms for this. We’ve already blogged about the Hearing Loss Simulation. Our current thinking is that the intelligibility model will use a binaural form of the Short-Time Objective Intelligibility Index (STOI) [1]. The dashed line going back to the enhancement model shows that the DNN will be updated based on the reciprocal of the Speech Intelligibility (SI) score. By minimising (1/SI), the enhancement model will be maximising intelligibility.

The difficulty here is that updating the Enhancement Model DNN during training requires the error to be known at the DNN’s output (the point labelled “improved SPIN”). But we don’t know this; we only know the error at the output of the prediction model at the far right of the diagram. This wouldn’t be a problem if the prediction model could be inverted, because we could then run the 1/SI error backwards through the inverse model.
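Writing out the chain rule makes the obstacle explicit. This is a sketch in assumed notation that does not appear in the original post: $\theta$ denotes the enhancement model's weights, $\hat{s} = g_\theta(x)$ is the improved SPIN, $P(\hat{s}, L)$ is the prediction model's SI score for listener characteristics $L$, and $J$ is the training loss.

$$
J = \frac{1}{\mathrm{SI}}, \qquad
\frac{\partial J}{\partial \theta}
= -\frac{1}{\mathrm{SI}^2}\,
\frac{\partial P(\hat{s}, L)}{\partial \hat{s}}\,
\frac{\partial \hat{s}}{\partial \theta}
$$

The first factor follows directly from the score and the last comes from ordinary backpropagation through the enhancement DNN. The middle factor is the one that a black-box or non-differentiable prediction model cannot supply.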

As the inverse of the prediction model isn’t available, one solution is to train another DNN to mimic its behaviour (Figure 2). As this new Prediction Model is a DNN, the 1/SI error can be passed backwards through it using standard neural network training formulations.

This DNN prediction model could be trained first using knowledge distillation (this is something I’ve previously done for a speech intelligibility model), and then the weights frozen while the Enhancement Model is trained. But there is a ‘chicken and egg’ problem here. The difficulty is generating all the training data for the prediction model. Until you train the enhancement model, you won’t have representative examples of “improved SPIN” to train the prediction model. But without the prediction model, you can’t train the enhancement model.

One solution is to train the two DNNs in tandem, with an approach analogous to how pairs of networks are trained in a Generative Adversarial Network (GAN). iMetricGAN, developed by Li et al. [2], is an example of this being done for speech enhancement, although the authors weren’t trying to include hearing loss simulation. They aren’t the only ones trying to solve problems where a non-differentiable or black-box evaluation function is in the way of DNN training [3][4].
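The alternating idea can be shown with a deliberately tiny toy, which is not iMetricGAN and not our baseline. All of it is assumed for illustration: the "enhancement model" is a single gain parameter, `black_box_si` is a made-up score that can only be evaluated (never differentiated), and a least-squares line stands in for the DNN prediction model.

```python
def black_box_si(gain):
    """Stand-in intelligibility score in (0, 1]; hypothetical, peaks at gain = 2."""
    return 1.0 / (1.0 + (gain - 2.0) ** 2)

def fit_linear_surrogate(gains, scores):
    """Least-squares line score ~ a + b * gain: a differentiable mimic of the black box."""
    n = len(gains)
    g_mean = sum(gains) / n
    s_mean = sum(scores) / n
    sxx = sum((g - g_mean) ** 2 for g in gains)
    sxy = sum((g - g_mean) * (s - s_mean) for g, s in zip(gains, scores))
    b = sxy / sxx
    a = s_mean - b * g_mean
    return a, b

def train_in_tandem(gain=0.0, steps=200, lr=0.05, spread=0.1):
    """Alternate between refitting the surrogate and updating the enhancer."""
    for _ in range(steps):
        # Step 1: score outputs near the *current* enhancer with the black box,
        # so the surrogate is trained on representative examples.
        gains = [gain - spread, gain, gain + spread]
        scores = [black_box_si(g) for g in gains]
        a, b = fit_linear_surrogate(gains, scores)
        # Step 2: hold the surrogate fixed and take a gradient step on 1/SI_hat.
        si_hat = a + b * gain
        grad = -b / si_hat ** 2  # d(1/SI_hat)/d(gain) via the chain rule
        gain -= lr * grad
    return gain

final_gain = train_in_tandem()
final_score = black_box_si(final_gain)
```

Because the surrogate is refitted around wherever the enhancer currently is, neither model needs to be fully trained before the other, which is the point of training in tandem.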

We hope our entrants will come up with lots of other ways of overcoming this problem. How would you tackle it?

## References

[1] Andersen, A.H., de Haan, J.M., Tan, Z.H. and Jensen, J., 2015. A binaural short time objective intelligibility measure for noisy and enhanced speech. In Sixteenth Annual Conference of the International Speech Communication Association.

[2] Li, H., Fu, S.W., Tsao, Y. and Yamagishi, J., 2020. iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning. arXiv preprint arXiv:2004.00932.

[3] Gillhofer, M., Ramsauer, H., Brandstetter, J., Schäfl, B. and Hochreiter, S., 2019. A GAN based solver of black-box inverse problems. Proceedings of the NeurIPS 2019 Workshop.

[4] Kawanaka, M., Koizumi, Y., Miyazaki, R. and Yatabe, K., 2020, May. Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7524-7528). IEEE.

## Sounds for round one

We’ll be challenging our contestants to find innovative ways of making speech more audible for hearing-impaired listeners when there is noise getting in the way. But what noises should we consider? To aid us in choosing sounds and situations that are relevant to people with hearing aids, we held a focus group. We asked the group about two things:

• Everyday background noises that make having a conversation difficult.
• The characteristics of speech, after it has been processed by a hearing aid, that hearing aid listeners would value.

A total of eight patients (four males, four females) attended the meeting. Six were recruited from the Nottingham Biomedical Research Centre’s patient and public involvement contact list, and two from a local lip reading class organised by the Nottinghamshire Deaf Society. Hearing loss within the group ranged from mild to severe, and all attendees regularly use bilateral hearing aids.

Our focus was on the living room because that is the scenario for round one of the challenges.

## Everyday background noises that interfere with understanding of speech

A long and varied list of sounds causes problems. These lists are in no particular order.

### Living room or space

• Clocks ticking
• Crisp packets rustling
• Taps running
• Kettles boiling
• Dishwasher
• Microwave
• Washing machine
• Phone ringing (or receiving texts – unknown beeps/tones)
• Newspapers rustling
• Air-conditioning and oven extractor fans
• Vacuum cleaner
• Doorbell ringing
• Dog barking
• Rain on window

### Family and friends

• Cutlery/crockery banging/clanging
• Doors opening/closing (to rooms and cupboards)
• Music
• People walking around the room
• Children playing with toys
• Laughing
• People talking from another room
• Speakers from a different conversation in close proximity (i.e. beside you) when you are trying to converse
• Traffic outside
• Chewing/chomping
• Steam pipes/ coffee machines
• Chairs being moved

### Outside

• Church bells
• Market noise
• Footsteps on different types of ground, i.e. heels on hard floors but also wellingtons in mud
• Clothes rustling (such as waterproof coats or hat on hearing aid)
• Wind (even with HA on ‘wind setting’)
• Pigeons/birds
• Sirens
• Traffic noise (especially at junctions)
• Music
• Laughter
• Phones ringing
• Tills
• Children playing outside or running around (in shops, on the street and at parks)
• Beeping signal at crossings
• Garden centres – high glass ceilings, open plan, trolleys
• Road/ tyre and traffic noise when in a car or on the bus
• Participants also mentioned that the people you speak to in a car may be in front of or behind you
• Trains and the tube
• Aeroplanes and airports (suitcases rolling)
• Tannoys

### Characteristics of processed speech to consider

• Clarity (clearness) or quality
• Rhythm of speech
• ‘Inflection’ (intonation)
• Similarity to original speaker
• The group agreed that where the voice cannot be processed clearly, e.g. outside with many noise sources, it is fine if it does not sound like the original speaker.

• Speed of speech; it was suggested that we have sentences read at different speeds as faster talkers are often harder to understand.
• Participants stated that emphasis on key words is useful for following conversation; perhaps key words, when marked in the sentence, should be given higher value.
• There were lots of comments on room acoustics, i.e., ceiling heights, furnishings, floorings, windows etc., which have a big impact on how difficult it is to have a conversation with background noise.
• Talkers’ accents can make conversation more difficult, including when speakers with different accents are in the background.

We’re now working out what sounds to use. But are there other sounds we should consider?

## Why use machine learning challenges for hearing aids?

An overview of why machine learning challenges have potential to improve hearing aid signal processing.

The Clarity Project is based around the idea that machine learning challenges could improve hearing aid signal processing. After all, this has happened in other areas, such as automatic speech recognition (ASR) in the presence of noise. The improvements in ASR have happened because of:

• Machine learning (ML) at scale – big data and raw GPU power.
• Benchmarking – research has developed around community-organised evaluations or challenges.
• Collaboration – enabled by these challenges, which allow working across communities such as signal processing, acoustic modelling, language modelling and machine learning.

We’re hoping that these three mechanisms can drive improvements in hearing aids.

## Components of a challenge

There needs to be a common task based on a target application scenario to allow communities to gain from benchmarking and collaboration. The Clarity Project’s first enhancement challenge will be about hearing speech from a single talker in a typical living room, where there is one source of noise and a little reverberation.

We’re currently developing simulation tools to allow us to generate our living room data. The room acoustics will be simulated using RAVEN and the Hearing Device Head-related Transfer Functions will come from Denk’s work. We’re working on getting better, more ecologically valid speech than is often used in speech intelligibility work.

Entrants are then given training data and development (dev) test data along with a baseline system that represents the current state-of-the-art. You can find a post and video on the current thinking on the baseline here. We’re still working on the rules stipulating what is and what is not allowed (for example, will entrants be allowed to use data from outside the challenge).

Clarity’s first enhancement challenge is focussed on maximising the speech intelligibility (SI) score. We will evaluate this first through a prediction model that is based on a hearing loss simulation and an objective metric for speech intelligibility. Simulation has been hugely important for generating training data in the CHiME challenges and so we intend to use that approach in Clarity. But results from simulated test sets cannot be trusted on their own, and hence a second evaluation will come through perceptual tests on hearing-impaired subjects. However, one of our current problems is that we can’t bring listeners into our labs because of COVID-19.

We’ll actually be running two challenges roughly in parallel, because we’re also going to task the community to improve our prediction model for speech intelligibility.

We’re running a series of challenges over five years. What other scenarios should we consider? What speech? What noise? What environment? Please comment below.

## Acknowledgements

Much of this text is based on Jon Barker’s 2020 SPIN keynote.