One approach to our enhancement challenge

A blog on improving hearing aid processing using DNNs: a suggested approach to overcoming the non-differentiable loss function.

The aim of our Enhancement Challenge is to get people producing new algorithms for processing speech signals through hearing aids. We expect most entries to replace the classic hearing aid processing of Dynamic Range Compressors (DRCs) with deep neural networks (DNNs), although all approaches are welcome! The first round of the challenge will be all about improving speech intelligibility.

Setting up a DNN structure and training regime for the task is not as straightforward as it might first appear. Figure 1 shows an example of a naive training regime. An audio example of Speech in Noise (SPIN) is randomly created (audio sample generation, bottom left), and a listener with particular hearing loss characteristics is randomly selected (random artificial listener generation, top left). The DNN Enhancement Model (the bright yellow box) then produces improved speech in noise. (Audio signals in pink are two-channel, left and right, because this is for binaural hearing aids.)

Figure 1

Next, the improved speech in noise is passed to the Prediction Model (the lime green box), which gives an estimate of the Speech Intelligibility (SI). Our baseline system will include algorithms for this. We’ve already blogged about the Hearing Loss Simulation. Our current thinking is that the intelligibility model will use a binaural form of the Short-Time Objective Intelligibility (STOI) measure [1]. The dashed line going back to the enhancement model shows that the DNN will be updated based on the reciprocal of the SI score. By minimising (1/SI), the enhancement model will be maximising intelligibility.
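To make the 1/SI idea concrete, here is a minimal sketch using the pystoi package, which implements the standard monaural STOI and needs the clean reference signal; the binaural measure in [1] would replace that call, and the function name here is just illustrative.

```python
# A minimal sketch, assuming the pystoi package (monaural STOI only);
# the binaural STOI variant in [1] would replace the stoi() call.
from pystoi import stoi

def one_over_si(clean, enhanced, fs=16000, eps=1e-6):
    """Reciprocal-SI cost: minimising this maximises intelligibility."""
    si = stoi(clean, enhanced, fs, extended=False)  # score roughly in [0, 1]
    return 1.0 / (si + eps)
```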

The difficulty here is that updating the Enhancement Model DNN during training requires the error to be known at the DNN’s output (the point labelled “improved SPIN”). But we don’t know this; we only know the error at the output of the prediction model, at the far right of the diagram. This wouldn’t be a problem if the prediction model could be inverted, because we could then run the 1/SI error backwards through the inverse model.

As the inverse of the prediction model isn’t available, one solution is to train another DNN to mimic its behaviour (Figure 2). As this new Prediction Model is a DNN, the 1/SI error can be passed backwards through it using standard neural network training formulations.

This DNN prediction model could be trained first using knowledge distillation (this is something I’ve previously done for a speech intelligibility model), and then the weights frozen while the Enhancement Model is trained. But there is a ‘chicken and egg’ problem here. The difficulty is generating all the training data for the prediction model. Until you train the enhancement model, you won’t have representative examples of “improved SPIN” to train the prediction model. But without the prediction model, you can’t train the enhancement model.
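As a sketch of the second stage of that two-step recipe (freeze the distilled prediction DNN, then train the enhancement model through it), something like the PyTorch-style loop below could be used; the model, data-loader and argument names are assumptions for illustration, not the challenge baseline.

```python
# A hedged sketch of stage two: train the enhancement DNN through a frozen
# DNN surrogate of the prediction model. All names are illustrative only.
import torch

def train_enhancement(enhancement_model, surrogate_predictor, loader, epochs=10):
    # Freeze the surrogate so only the enhancement model is updated.
    for p in surrogate_predictor.parameters():
        p.requires_grad = False
    surrogate_predictor.eval()

    opt = torch.optim.Adam(enhancement_model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for noisy_spin, listener in loader:               # binaural SPIN + listener traits
            improved = enhancement_model(noisy_spin, listener)
            si = surrogate_predictor(improved, listener)  # differentiable SI estimate
            loss = (1.0 / (si + 1e-6)).mean()             # minimise 1/SI
            opt.zero_grad()
            loss.backward()
            opt.step()
```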

One solution is to train the two DNNs in tandem, with an approach analogous to how pairs of networks are trained in a Generative Adversarial Network (GAN). iMetricGAN, developed by Li et al. [2], is an example of this being done for speech enhancement, although the authors weren’t trying to include hearing loss simulation. They aren’t the only ones trying to solve problems where a non-differentiable or black-box evaluation function stands in the way of DNN training [3][4].
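To give a feel for the tandem approach, the sketch below alternates between fitting the prediction DNN to the black-box metric and updating the enhancement DNN through it. This is only an illustration of the general idea under assumed names; it is not the iMetricGAN algorithm itself.

```python
# A rough sketch of one alternating update. `black_box_si` stands for the
# non-differentiable metric (e.g. hearing loss simulation plus STOI), assumed
# to return a tensor of per-example SI scores; all names are illustrative.
import torch
import torch.nn.functional as F

def tandem_step(enh, pred, opt_enh, opt_pred, noisy, listener, black_box_si):
    # 1) Fit the prediction DNN to the black-box metric on current outputs.
    with torch.no_grad():
        improved = enh(noisy, listener)
    target_si = black_box_si(improved, listener)        # no gradients needed here
    pred_loss = F.mse_loss(pred(improved, listener), target_si)
    opt_pred.zero_grad()
    pred_loss.backward()
    opt_pred.step()

    # 2) Update the enhancement DNN through the (now updated) predictor.
    improved = enh(noisy, listener)
    si = pred(improved, listener)
    enh_loss = (1.0 / (si + 1e-6)).mean()
    opt_enh.zero_grad()
    enh_loss.backward()
    opt_enh.step()
```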

We hope our entrants will come up with lots of other ways of overcoming this problem. How would you tackle it?

References

[1] Andersen, A.H., Haan, J.M.D., Tan, Z.H. and Jensen, J., 2015. A binaural short time objective intelligibility measure for noisy and enhanced speech. In the Sixteenth Annual Conference of the International Speech Communication Association.

[2] Li, H., Fu, S.W., Tsao, Y. and Yamagishi, J., 2020. iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning. arXiv preprint arXiv:2004.00932.

[3] Gillhofer, M., Ramsauer, H., Brandstetter, J., Schäfl, B. and Hochreiter, S., 2019. A GAN based solver of black-box inverse problems. Proceedings of the NeurIPS 2019 Workshop.

[4] Kawanaka, M., Koizumi, Y., Miyazaki, R. and Yatabe, K., 2020, May. Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7524-7528). IEEE.

Sounds for challenge 1

We’ll be challenging our contestants to find innovative ways of making speech more audible for hearing impaired listeners when there is noise getting in the way. But what noises should we consider? To aid us in choosing sounds and situations that are relevant to people with hearing aids, we held a focus group.

We wanted to know about:

  • Everyday background noises that make having a conversation difficult.
  • The characteristics that hearing aid users would value in speech after it has been processed by a hearing aid.

A total of eight patients (four males, four females) attended the meeting, six of whom were recruited from the Nottingham Biomedical Research Centre’s patient and public involvement contact list. Two attendees were recruited from a local lip-reading class organised by the Nottinghamshire Deaf Society. The hearing losses within the group ranged from mild to severe, and all of the attendees regularly use bilateral hearing aids.

Our focus was on the living room because that is the scenario for round one of the challenge.

Photo by Gustavo Fring from Pexels

Everyday background noises that interfere with understanding of speech

A long and varied list of sounds causes problems. The lists below are in no particular order.

Living room or space

  • Clocks ticking
  • Crisp packets rustling
  • Taps running
  • Kettles boiling
  • Dishwasher
  • Microwave
  • Washing machine
  • TV, music, radio
  • Phone ringing (or receiving texts – unknown beeps/tones)
  • Newspapers rustling
  • Air-conditioning and oven extractor fans
  • Vacuum cleaner
  • Doorbell ringing
  • Dog barking
  • Rain on window

Family and friends

  • Cutlery/crockery banging/clanging
  • Doors opening/closing (to rooms and cupboards)
  • Music
  • People walking around the room
  • Children playing with toys
  • Laughing
  • People talking from another room
  • Speakers from a different conversation in close proximity (i.e. beside you) when you are trying to converse
  • Traffic outside
  • Chewing/chomping
  • Steam pipes/ coffee machines
  • Chairs being moved

Outside

  • Church bells
  • Market noise
  • Footsteps on different types of ground, i.e. heels on hard floors but also wellingtons in mud
  • Clothes rustling (such as waterproof coats or hat on hearing aid)
  • Wind (even with HA on ‘wind setting’)
  • Pigeons/birds
  • Sirens
  • Traffic noise (especially at junctions)
  • Music
  • Laughter
  • Phones ringing
  • Tills
  • Children playing outside or running around (in shops, on the street and at parks)
  • Beeping signal at crossings
  • Garden centres – high glass ceilings, open plan, trolleys
  • Road/ tyre and traffic noise when in a car or on the bus
  • It was also mentioned that people you speak to in the car may be in front of or behind you
  • Trains and the tube
  • Aeroplanes and airports (suitcases rolling)
  • Tannoys

Characteristics of processed speech to consider

  • Clarity (clearness) or quality
  • Rhythm of speech
  • ‘Inflection’ (intonation)
  • Similarity to original speaker
  • It was agreed that in situations where the voice cannot be processed cleanly, e.g. outside with many noise sources, the output not sounding like the original speaker is acceptable.

Other comments

  • Speed of speech: it was suggested that we have sentences read at different speeds, as faster talkers are often harder to understand.
  • It was stated that emphasis on key words is useful for following a conversation; perhaps marked key words in a sentence should be given greater weight.
  • There were lots of comments on room acoustics (ceiling heights, furnishings, floorings, windows, etc.), which have a big impact on how difficult it is to hold a conversation against background noise.
  • Different talker accents can make conversation more difficult; this includes background speakers with different accents.

We’re now working out what sounds to use. But are there other sounds we should consider?

Credits

Eszter Porter Q&A

What is your role on the Clarity Project?

My main role is focused upon the recruitment of participants (both with healthy hearing and hearing loss) to assess how well the simulated hearing aids work. The participants will listen to sentences of speech in noise and write down what words they hear. These participants will help show which hearing aid model is showing the most promise.

How did you end up working in this area?

I had recently completed an MSc in Neuroimaging methods when I began to seek research-based jobs as a way to gain more experience before pursuing a PhD. My MSc thesis was a brain stimulation project looking at speech perception and production processes which led me down the route of hearing sciences. I saw this job as an ideal environment to build upon my skills and better prepare me for my future career.

What is exciting about the Clarity Project?

An exciting aspect of the project is that it promotes open-source code and materials, encouraging science as a challenge that can be taken on by anyone, even if they’re not in the world of hearing and sound.

What would success look like for the project?

To me, success for the project would be seeing contestants from different backgrounds and fields, whether they are from an academic or industry background, collaborating to achieve the best creative solutions to improve hearing aid technology.

How will it make a difference to people’s lives?

If the project successfully leads to a new approach to hearing aid technology that improves people’s ability to hear speech in background noise, it will be a huge leap forward. It may encourage more people to make use of their hearing aids and reduce feelings of social isolation by allowing them to follow conversations more easily.

What’s the best thing you’ve read, seen or listened to over the last year?

The Everything Hertz podcast. It is an entertaining way to listen to some critical discussions about scientific life and research, hosted by two friends who completed their PhDs together. (Warning: there’s adult language on the podcast.)

Tell us a sound fact that will blow our minds!

It is believed that the animal that can perceive the highest sound frequency is the wax moth – not a bat!

The wax moth can hear up to 300 kHz! Photo: Sarefo, CC BY-SA 3.0

Why use machine learning challenges for hearing aids?

An overview of why machine learning challenges have the potential to improve hearing aid signal processing.

The Clarity Project is based around the idea that machine learning challenges could improve hearing aid signal processing. After all, this has happened in other areas, such as automatic speech recognition (ASR) in the presence of noise. The improvements in ASR have happened because of:

  • Machine learning (ML) at scale – big data and raw GPU power.
  • Benchmarking – research has developed around community-organised evaluations or challenges.
  • Collaboration – these challenges have enabled work across communities such as signal processing, acoustic modelling, language modelling and machine learning.

We’re hoping that these three mechanisms can drive improvements in hearing aids.

Components of a challenge

There needs to be a common task based on a target application scenario to allow communities to gain from benchmarking and collaboration. The Clarity Project’s first enhancement challenge will be about hearing speech from a single talker in a typical living room, where there is one source of noise and a little reverberation.

We’re currently working on developing simulation tools to allow us to generate our living room data. The room acoustics will be simulated using RAVEN, and the Hearing Device Head-Related Transfer Functions will come from Denk’s work. We’re working on getting better, more ecologically valid speech than is often used in speech intelligibility work.

Entrants are then given training data and development (dev) test data, along with a baseline system that represents the current state of the art. You can find a post and video on the current thinking on the baseline here. We’re still working on the rules stipulating what is and what is not allowed (for example, whether entrants will be allowed to use data from outside the challenge).

Clarity’s first enhancement challenge is focussed on maximising the speech intelligibility (SI) score. We will evaluate this first through a prediction model that is based on a hearing loss simulation and an objective metric for speech intelligibility. Simulation has been hugely important for generating training data in the CHiME challenges, and so we intend to use that approach in Clarity. But results from simulated test sets cannot be fully trusted, and hence a second evaluation will come through perceptual tests on hearing impaired subjects. However, one of our current problems is that we can’t bring listeners into our labs because of COVID-19.

We’ll actually be running two challenges in roughly parallel, because we’re also going to task the community to improve our prediction model for speech intelligibility.

We’re running a series of challenges over five years. What other scenarios should we consider? What speech? What noise? What environment? Please comment below.

Acknowledgements

Much of this text is based on Jon Barker’s 2020 SPIN keynote

The baseline

An overview of the current state of the baseline we’re developing for the machine learning challenges

We’re currently developing the baseline processing that challenge entrants will need. This takes a random listener and a random audio sample of speech in noise (SPIN) and passes that through a simulated hearing aid (the Enhancement Model). This improves the speech in noise. We then have an algorithm (the Prediction Model) to estimate the Speech Intelligibility that the listener would perceive (SI score). This score can then be used to drive machine learning to improve the hearing aid.
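Schematically, one pass through this pipeline looks like the sketch below; the function names are placeholders for illustration, not the actual baseline code.

```python
# A schematic sketch of the baseline flow described above. All names are
# illustrative placeholders rather than the real Clarity baseline API.
def evaluate_once(sample_listener, sample_spin, enhance, predict_si):
    listener = sample_listener()           # random artificial listener (hearing loss traits)
    spin = sample_spin()                   # random speech-in-noise (SPIN) sample
    improved = enhance(spin, listener)     # simulated hearing aid (Enhancement Model)
    return predict_si(improved, listener)  # estimated SI score (Prediction Model)
```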

A talk through the baseline model we’re developing

The first machine learning challenge is to improve the enhancement model, in other words, to produce a better processing algorithm for the hearing aid. The second challenge is to improve the prediction model using perceptual data we’ll provide.