## Latency, computation time and real-time operation

An explanation of the time and computational limits for the first round of the enhancement challenge.

## Enhancement challenge 2021

For a hearing aid to work well for users, the processing needs to be quick. The output of the hearing aid should be produced with a delay of less than about 10 ms. Many audio processing techniques are non-causal, i.e., the output of the system depends on samples from the future. Such processing is useless for hearing aids and therefore our rules include a restriction on the use of future samples.

The rules state the following:

• Systems must be causal; the output at time t must not use any information from input samples more than 5 ms into the future (i.e., no information from input samples >t+5ms).
• There is no limit on computational cost.

Mathematically this is:

$$y_n = f(x_{m+N}, L)$$

• where $y_n$ is the output from your hearing aid for sample $n$.
• $x$ is the audio input signal from a hearing aid microphone.
• $N = 0.005 f_s$, where $f_s$ is the sampling frequency in Hz, so $N$ is the 5 ms look-ahead expressed in samples.
• $m$ is a sample number where $m \le n$.
• $L$ is the listener characteristics.
• $f()$ is the hearing aid function. There is no limitation on how long this takes to compute.
• You can use multiple microphones; only a single input signal x is shown here just for simplicity.

Here it is illustrated as a diagram.

We have chosen a limit of 5 ms because in a real hearing aid there will be other sources of delay (e.g., analogue-to-digital and digital-to-analogue conversion).
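As a rough illustration of the rule, the sketch below computes the look-ahead N for a given sampling rate and runs a toy hearing aid function that is only ever shown input samples up to n + N. The names `lookahead_samples`, `run_hearing_aid` and `smooth_latest` are made up for this example and are not part of the challenge software; the listener characteristics L are left out to keep it short.

```python
import numpy as np

def lookahead_samples(fs_hz: int, limit_ms: int = 5) -> int:
    """N = 0.005 * fs, rounded down to a whole number of samples."""
    return (limit_ms * fs_hz) // 1000

def run_hearing_aid(x: np.ndarray, fs_hz: int, f) -> np.ndarray:
    """Compute y[n] = f(visible input), where only x[0 .. n+N] is visible."""
    n_look = lookahead_samples(fs_hz)
    y = np.zeros(len(x))
    for n in range(len(x)):
        visible = x[: n + n_look + 1]  # slicing past the end is safe in NumPy
        y[n] = f(visible)
    return y

def smooth_latest(visible: np.ndarray) -> float:
    """Toy f(): mean of the most recent (up to) 8 visible samples."""
    return float(visible[-8:].mean())

if __name__ == "__main__":
    fs = 1000                       # illustrative rate, so N = 5 samples
    x = np.random.default_rng(1).standard_normal(40)
    y = run_hearing_aid(x, fs, smooth_latest)

    # Changing samples later than n + N must leave y[n] untouched.
    x_changed = x.copy()
    x_changed[16:] += 1.0           # for n = 10, index 16 onwards is "too far ahead"
    y_changed = run_hearing_aid(x_changed, fs, smooth_latest)
    print(np.array_equal(y[:11], y_changed[:11]))
```

Perturbing the forbidden future samples and checking that earlier outputs do not move is one simple way to test whether a system respects the limit.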

## Why is there no limitation of how long f() takes to compute?

We’re trying to foster new approaches to hearing aid processing and decided that at this stage we will drive more innovation if we don’t restrict computation time for round one. Such restrictions will be considered in future rounds.

## Why haven’t you talked about latency?

In discussions, it is apparent that this term is used in different ways by different people, so to avoid confusion we’re not using it!

## Do algorithms have to be real-time?

The above limitations mean that the algorithms could in theory be made real-time if a powerful enough computer was available, but your entry can take as long as it needs to process the signals.

## Clarity Challenge Pre-announcement

Although age-related hearing loss affects 40% of 55 to 74 year-olds, the majority of adults who would benefit from hearing aids don’t use them. A key reason is simply that hearing aids don’t provide enough benefit.

Picking out speech from background noise is a critical problem even for the most sophisticated devices. The purpose of the Clarity Challenges is to catalyse new work to radically improve the speech intelligibility provided by hearing aids.

The series of challenges will consider increasingly complex listening scenarios. The first round, launching in January 2021, will focus on speech in indoor environments in the presence of a single interferer. It will begin with a challenge involving improving hearing aid processing. Future challenges on how to model speech-in-noise perception will be launched at a later date.

You will be provided with simulated scenes, each including a target speaker and interfering noise. For each scene, there will be signals that simulate those captured by a behind-the-ear hearing aid with three channels at each ear and those captured at the eardrum without a hearing aid present. The target speech will be a short sentence and the interfering noise will be either speech or domestic appliance noise.

The task will be to deliver a hearing aid signal processing algorithm that can improve the intelligibility of the target speaker for a specified hearing-impaired listener. Initially, entries will be evaluated using an objective speech intelligibility measure we will provide. Subsequently, up to twenty of the most promising systems will be evaluated by a panel of listeners.

We will provide a baseline system so that teams can choose to focus on individual components or to develop their own complete pipelines.

## What will be provided

• Evaluation of the best entries by a panel of hearing-impaired listeners.
• Speech + interferer scenes for training and evaluation.
• An entirely new database of 10,000 spoken sentences.
• Listener characterisations including audiograms and speech-in-noise testing.
• Software including tools for generating training data, a baseline hearing aid algorithm, a baseline model of hearing impairment, and a binaural objective intelligibility measure.

## Important Dates

• January 2021 – Challenge launch and release of software and data
• April 2021 – Evaluation data released
• May 2021 – Submission deadline
• June-August 2021 – Listening test evaluation period
• September 2021 – Results announced at a Clarity Challenge Workshop in conjunction with Interspeech 2021

Challenge and workshop participants will be invited to contribute to a journal Special Issue on the topic of Machine Learning for Hearing Aid Processing that will be announced next year.

## Organisers

Prof. Jon P. Barker, Department of Computer Science, University of Sheffield
Prof. Michael A. Akeroyd, Hearing Sciences, School of Medicine, University of Nottingham
Prof. Trevor J. Cox, Acoustics Research Centre, University of Salford
Prof. John F. Culling, School of Psychology, Cardiff University
Prof. Graham Naylor, Hearing Sciences, School of Medicine, University of Nottingham
Dr Simone Graetzer, Acoustics Research Centre, University of Salford
Dr Rhoddy Viveros Muñoz, School of Psychology, Cardiff University
Eszter Porter, Hearing Sciences, School of Medicine, University of Nottingham

Funded by the Engineering and Physical Sciences Research Council (EPSRC), UK.

Supported by RNID (formerly Action on Hearing Loss), Hearing Industry Research Consortium, Amazon TTS Research, Honda Research Institute Europe.

## Acknowledgement

The image copyright is owned by the University of Nottingham.

## One approach to our enhancement challenge

A blog post on improving hearing aid processing using DNNs, with a suggested approach to overcoming the non-differentiable loss function.

The aim of our Enhancement Challenge is to get people producing new algorithms for processing speech signals through hearing aids. We expect most entries to replace the classic hearing aid processing of Dynamic Range Compressors (DRCs) with deep neural networks (DNNs) (although all approaches are welcome!). The first round of the challenge is going to be all about improving speech intelligibility.

Setting up a DNN structure and training regime for the task is not as straightforward as it might first appear. Figure 1 shows an example of a naive training regime. An audio example of Speech in Noise (SPIN) is randomly created (audio sample generation, bottom left), and a listener is randomly selected with particular hearing loss characteristics (random artificial listener generation, top left). The DNN Enhancement model (represented by the bright yellow box) then produces improved speech in noise. (Audio signals in pink are two-channel, left and right because this is for binaural hearing aids.)

Next the improved speech in noise is passed to the Prediction Model in the lime green box, and this gives an estimate of the Speech Intelligibility (SI). Our baseline system will include algorithms for this. We’ve already blogged about the Hearing Loss Simulation. Our current thinking is that the intelligibility model will use a binaural form of the Short-Time Objective Intelligibility (STOI) measure [1]. The dashed line going back to the enhancement model shows that the DNN will be updated based on the reciprocal of the Speech Intelligibility (SI) score. By minimising (1/SI), the enhancement model will be maximising intelligibility.

The difficulty here is that updating the Enhancement Model DNN during training requires the error to be known at the DNN’s output (the point labelled “improved SPIN”). But we don’t know this; we only know the error on the output of the prediction model at the far right of the diagram. This wouldn’t be a problem if the prediction model could be inverted, because we could then run the 1/SI error backwards through the inverse model.
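One way to see where the training gets stuck is to write out the chain rule for the 1/SI loss. The symbols here are introduced purely for illustration: $\theta$ stands for the enhancement DNN weights and $\hat{s}$ for the improved SPIN signal at its output.

$$\frac{\partial}{\partial \theta}\left(\frac{1}{SI}\right) = -\frac{1}{SI^{2}}\,\frac{\partial SI}{\partial \hat{s}}\,\frac{\partial \hat{s}}{\partial \theta}$$

The last factor comes from ordinary backpropagation through the enhancement DNN. The middle factor, how the intelligibility score changes with the improved SPIN, is the part a black-box prediction model cannot supply.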

As the inverse of the prediction model isn’t available, one solution is to train another DNN to mimic its behaviour (Figure 2). As this new Prediction Model is a DNN, the 1/SI error can be passed backwards through it using standard neural network training formulations.

This DNN prediction model could be trained first using knowledge distillation (this is something I’ve previously done for a speech intelligibility model), and then the weights frozen while the Enhancement Model is trained. But there is a ‘chicken and egg’ problem here. The difficulty is generating all the training data for the prediction model. Until you train the enhancement model, you won’t have representative examples of “improved SPIN” to train the prediction model. But without the prediction model, you can’t train the enhancement model.

One solution is to train the two DNNs in tandem, with an approach analogous to how pairs of networks are trained in a Generative Adversarial Network (GAN). iMetricGAN, developed by Li et al. [2], is an example of this being done for speech enhancement, although the authors weren’t trying to include hearing loss simulation. They aren’t the only ones looking at trying to solve problems where a non-differentiable or black-box evaluation function is in the way of DNN training [3][4].
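To make the tandem idea concrete, here is a deliberately tiny toy, not iMetricGAN and not the Clarity baseline. The "enhancement model" is a single gain, the "black-box" metric is a made-up quantised score that peaks at a gain of 2, and the "prediction model" is a local quadratic fit standing in for a DNN. All function names are hypothetical. On every step the surrogate is refitted on outputs near the enhancement model's current operating point, and then the gain takes a gradient step on 1/SI through the surrogate.

```python
import numpy as np

def blackbox_si(gain: float) -> float:
    """Toy stand-in for a non-differentiable intelligibility metric.

    Peaks at gain = 2 and is rounded to two decimals, so it offers
    no usable gradient of its own.
    """
    return float(np.round(1.0 / (1.0 + (gain - 2.0) ** 2), 2))

def fit_surrogate(gain: float, half_width: float = 0.5) -> np.ndarray:
    """Fit a differentiable surrogate (a quadratic) around the current gain."""
    probes = gain + np.linspace(-half_width, half_width, 5)
    scores = np.array([blackbox_si(p) for p in probes])
    return np.polyfit(probes, scores, deg=2)

def train_in_tandem(gain: float = 0.0, lr: float = 0.05, steps: int = 300) -> float:
    """Alternate surrogate updates with gradient steps on the gain."""
    for _ in range(steps):
        coeffs = fit_surrogate(gain)                     # update "prediction model"
        si_hat = np.polyval(coeffs, gain)                # surrogate SI estimate
        dsi_hat = np.polyval(np.polyder(coeffs), gain)   # surrogate d(SI)/d(gain)
        grad = -dsi_hat / si_hat ** 2                    # d(1/SI)/d(gain)
        gain = gain - lr * grad                          # update "enhancement model"
    return float(gain)

if __name__ == "__main__":
    print(train_in_tandem())
```

In this toy the gain should settle close to 2, the peak of the black-box score. Refitting the surrogate at each step mirrors the chicken-and-egg point above: the prediction model only needs to be accurate on the kind of output the enhancement model is currently producing.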

We hope our entrants will come up with lots of other ways of overcoming this problem. How would you tackle it?

## References

[1] Andersen, A.H., Haan, J.M.D., Tan, Z.H. and Jensen, J., 2015. A binaural short time objective intelligibility measure for noisy and enhanced speech. In the Sixteenth Annual Conference of the International Speech Communication Association.

[2] Li, H., Fu, S.W., Tsao, Y. and Yamagishi, J., 2020. iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning. arXiv preprint arXiv:2004.00932.

[3] Gillhofer, M., Ramsauer, H., Brandstetter, J., Schäfl, B. and Hochreiter, S., 2019. A GAN based solver of black-box inverse problems. Proceedings of the NeurIPS 2019 Workshop.

[4] Kawanaka, M., Koizumi, Y., Miyazaki, R. and Yatabe, K., 2020, May. Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7524-7528). IEEE.

## Sounds for round one

We’ll be challenging our contestants to find innovative ways of making speech more audible for hearing impaired listeners when there is noise getting in the way. But what noises should we consider? To aid us in choosing sounds and situations that are relevant to people with hearing aids, we held a focus group. We asked about two things:

• Everyday background noises that make having a conversation difficult.
• The characteristics of speech after it has been processed by a hearing-aid that hearing aid listeners would value.

A total of eight patients (four males, four females) attended the meeting, six of whom were recruited from the Nottingham Biomedical Research Centre’s patient and public involvement contact list. Two attendees were recruited from a local lip reading class organised by the Nottinghamshire Deaf Society. Hearing loss within the group ranged from mild to severe. They all regularly use bilateral hearing aids.

Our focus was on the living room because that is the scenario for round one of the challenges.

## Everyday background noises that interfere with understanding of speech

A long and varied list of sounds causes problems. These lists are in no particular order.

### Living room or space

• Clocks ticking
• Crisp packets rustling
• Taps running
• Kettles boiling
• Dishwasher
• Microwave
• Washing machine
• Phone ringing (or receiving texts – unknown beeps/tones)
• Newspapers rustling
• Air-conditioning and oven extractor fans
• Vacuum cleaner
• Doorbell ringing
• Dog barking
• Rain on window

### Family and friends

• Cutlery/crockery banging/clanging
• Doors opening/closing (to rooms and cupboards)
• Music
• People walking around the room
• Children playing with toys
• Laughing
• People talking from another room
• Speakers from a different conversation in close proximity (i.e. beside you) when you are trying to converse
• Traffic outside
• Chewing/chomping
• Steam pipes/ coffee machines
• Chairs being moved

### Outside

• Church bells
• Market noise
• Footsteps on different types of ground, i.e. heels on hard floors but also wellingtons in mud
• Clothes rustling (such as waterproof coats or hat on hearing aid)
• Wind (even with HA on ‘wind setting’)
• Pigeons/birds
• Sirens
• Traffic noise (especially at junctions)
• Music
• Laughter
• Phones ringing
• Tills
• Children playing outside or running around (in shops, on the street and at parks)
• Beeping signal at crossings
• Garden centres – high glass ceilings, open plan, trolleys
• Road/ tyre and traffic noise when in a car or on the bus
• Also mentioned how people you speak to in the car may be in front or behind you
• Trains and the tube
• Aeroplanes and airports (suitcases rolling)
• Tannoys

### Characteristics of processed speech to consider

• Clarity (clearness) or quality
• Rhythm of speech
• ‘Inflection’ (intonation)
• Similarity to original speaker
• Agreed that in situations where the voice would not be processed clearly, i.e. outside with many noise sources, not sounding like the original speaker is fine.

• Speed of speech; it was suggested that we have sentences read at different speeds as faster talkers are often harder to understand.
• Stated that emphasis on key words is useful for following conversation; perhaps key words in the sentence when marked should be given higher value.
• Lots of comments on room acoustics, i.e., ceiling heights, furnishings, floorings, windows etc., which have a big impact on how difficult it is to have a conversation with background noise.
• Different accents of talkers can make conversation more difficult; including speakers with different accents in the background.

We’re now working out what sounds to use. But are there other sounds we should consider?

# What is your role on the Clarity Project?

My main role is focused upon the recruitment of participants (both with healthy hearing and hearing loss) to assess how well the simulated hearing aids work. The participants will listen to sentences of speech in noise and write down what words they hear. These participants will help show which hearing aid model is showing the most promise.

# How did you end up working in this area?

I had recently completed an MSc in Neuroimaging methods when I began to seek research-based jobs as a way to gain more experience before pursuing a PhD. My MSc thesis was a brain stimulation project looking at speech perception and production processes which led me down the route of hearing sciences. I saw this job as an ideal environment to build upon my skills and better prepare me for my future career.

# What is exciting about the Clarity project?

An exciting aspect of the project is that it promotes open source code and materials, to encourage the development of science as a challenge that can be taken on by anyone, even if they’re not in the world of hearing and sound.

# What would success look like for the project?

To me, success for the project would be seeing contestants from different backgrounds and fields, whether they are from an academic or industry background, collaborating to achieve the best creative solutions to improve hearing aid technology.

# How will it make a difference to people’s lives?

If the project successfully leads to a new approach to hearing aid technology that improves people’s ability to hear speech in background noise it will be a huge leap forward. It may encourage more people to make use of their hearing aids and ease the feeling of social isolation by allowing them to follow a conversation with more ease.

# What’s the best thing you’ve read, seen or listened to over the last year?

The Everything Hertz podcast. It is an entertaining way to listen to some critical discussions about scientific life and research hosted by 2 friends who completed their PhDs together. (Warning, there’s adult language on the podcast.)

# Tell us a sound fact that will blow our minds!

It is believed that the animal that can perceive the highest sound frequency is the wax moth, not a bat!

## Why use machine learning challenges for hearing aids?

An overview of why machine learning challenges have potential to improve hearing aid signal processing.

The Clarity Project is based around the idea that machine learning challenges could improve hearing aid signal processing. After all, this has happened in other areas, such as automatic speech recognition (ASR) in the presence of noise. The improvements in ASR have happened because of:

• Machine learning (ML) at scale – big data and raw GPU power.
• Benchmarking – research has developed around community-organised evaluations or challenges.
• Collaboration – these challenges have enabled working across communities such as signal processing, acoustic modelling, language modelling and machine learning.

We’re hoping that these three mechanisms can drive improvements in hearing aids.

## Components of a challenge

There needs to be a common task based on a target application scenario to allow communities to gain from benchmarking and collaboration. The Clarity project’s first enhancement challenge will be about hearing speech from a single talker in a typical living room, where there is one source of noise and a little reverberation.

We’re currently working on developing simulation tools to allow us to generate our living room data. The room acoustics will be simulated using RAVEN and the Hearing Device Head-related Transfer Functions will come from Denk’s work. We’re working on getting better, more ecologically valid speech than is often used in speech intelligibility work.

Entrants are then given training data and development (dev) test data along with a baseline system that represents the current state-of-the-art. You can find a post and video on the current thinking on the baseline here. We’re still working on the rules stipulating what is and what is not allowed (for example, will entrants be allowed to use data from outside the challenge).

Clarity’s first enhancement challenge is focussed on maximising the speech intelligibility (SI) score. We will evaluate this first through a prediction model that is based on a hearing loss simulation and an objective metric for speech intelligibility. Simulation has been hugely important for generating training data in the CHiME challenges and so we intend to use that approach in Clarity. But results from simulated test sets cannot be trusted and hence a second evaluation will come through perceptual tests on hearing impaired subjects. However, one of our current problems is that we can’t bring listeners into our labs because of COVID-19.

We’ll actually be running two challenges roughly in parallel, because we’re also going to task the community to improve our prediction model for speech intelligibility.

We’re running a series of challenges over five years. What other scenarios should we consider? What speech? What noise? What environment? Please comment below.

## Acknowledgements

Much of this text is based on Jon Barker’s 2020 SPIN keynote.

## The baseline

An overview of the current state of the baseline we’re developing for the machine learning challenges.

We’re currently developing the baseline processing that challenge entrants will need. This takes a random listener and a random audio sample of speech in noise (SPIN) and passes that through a simulated hearing aid (the Enhancement Model). This improves the speech in noise. We then have an algorithm (the Prediction Model) to estimate the Speech Intelligibility that the listener would perceive (SI score). This score can then be used to drive machine learning to improve the hearing aid.
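The data flow described above can be sketched with placeholder stages. Everything in this block is a stand-in invented for illustration: the function names, the six-frequency audiogram, the flat-gain "hearing aid" and the correlation-based score are not the real baseline. It also assumes the prediction model is given the clean target as a reference, as STOI-style measures require.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_listener() -> dict:
    """Hypothetical listener: audiogram thresholds (dB HL) at six frequencies."""
    return {"audiogram_db_hl": rng.uniform(10, 70, size=6)}

def random_spin(n_samples: int = 1600):
    """Hypothetical two-channel speech in noise: returns (noisy, clean target)."""
    target = rng.standard_normal((2, n_samples))
    noise = 0.5 * rng.standard_normal((2, n_samples))
    return target + noise, target

def enhancement_model(spin: np.ndarray, listener: dict) -> np.ndarray:
    """Placeholder hearing aid: a flat gain set from the mean hearing loss."""
    gain_db = 0.5 * listener["audiogram_db_hl"].mean()
    return spin * 10 ** (gain_db / 20)

def prediction_model(enhanced: np.ndarray, target: np.ndarray, listener: dict) -> float:
    """Placeholder SI score in [0, 1]: correlation with the clean target.

    The listener is accepted but unused here; a real model would apply a
    hearing loss simulation before scoring.
    """
    r = np.corrcoef(enhanced.ravel(), target.ravel())[0, 1]
    return float(np.clip(r, 0.0, 1.0))

def score_one_scene() -> float:
    """Random listener + random SPIN -> enhancement -> prediction -> SI score."""
    listener = random_listener()
    noisy, target = random_spin()
    enhanced = enhancement_model(noisy, listener)
    return prediction_model(enhanced, target, listener)

if __name__ == "__main__":
    print(score_one_scene())
```

A flat gain cannot change a correlation-based score, which is a reminder that the placeholder stages only show how the pieces connect, not how a real system would behave.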

The first machine learning challenge is to improve the enhancement model, in other words, to produce a better processing algorithm for the hearing aid. The second challenge is to improve the prediction model using perceptual data we’ll provide.