Baseline speech intelligibility model in round one

Some comments on signal alignment and level-insensitivity

Our baseline binaural speech intelligibility measure in round one is the Modified Binaural Short-Time Objective Intelligibility measure, or MBSTOI. This short post explains why you need to correct for any delays that your hearing aid processing algorithm introduces into the audio signals, so that MBSTOI can estimate speech intelligibility accurately. It also explains why the audibility of signals needs to be considered before evaluation with MBSTOI.

Evaluation

In stage one, entries will be ranked according to the average MBSTOI score across all samples in the evaluation test set. In the second stage, entries will be evaluated by the listening panel. There will be prizes for both stages. See this post for more information.

Signal alignment in time and frequency

If your hearing aid processing introduces a significant delay, you should correct for this delay before submitting your entry. This is necessary because MBSTOI requires the clean speech “reference” and the processed signal to be aligned in time and frequency. This needs to be done for both ear signals.

MBSTOI downsamples signals to 10 kHz, uses a Discrete Fourier Transform to decompose the signal into one-third octave bands, and performs envelope extraction and short-time segmentation into 386 ms regions. Each region consists of 30 frames. These choices are motivated by what is known about which frequencies and modulation frequencies are most important for intelligibility. For each frequency band and frame (over the region for which it is the last frame), an intermediate correlation coefficient is calculated between the clean reference and processed power envelopes for each ear. These coefficients are averaged to obtain the MBSTOI index. The index usually lies between 0 and 1 and rises monotonically with measured intelligibility scores, so higher values indicate greater speech intelligibility. Alignment is therefore required at the level of the one-third octave bands and short-time regions.
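As an illustration of how the short-time regions are used, the following is a minimal sketch of STOI-style intermediate correlations computed from band envelopes. It is not the baseline implementation, which additionally includes better-ear and equalisation-cancellation stages:

```python
import numpy as np

def intermediate_correlation(clean_env, proc_env, n_frames=30):
    """Average STOI-style intermediate correlations over short-time regions.

    clean_env and proc_env are arrays of shape (n_bands, n_total_frames)
    holding one-third octave band power envelopes. Simplified sketch only,
    not the full MBSTOI baseline.
    """
    n_bands, n_total = clean_env.shape
    scores = []
    for m in range(n_frames, n_total + 1):   # m is the last frame of each 30-frame region
        for j in range(n_bands):
            x = clean_env[j, m - n_frames:m] - clean_env[j, m - n_frames:m].mean()
            y = proc_env[j, m - n_frames:m] - proc_env[j, m - n_frames:m].mean()
            denom = np.linalg.norm(x) * np.linalg.norm(y)
            if denom > 0:
                scores.append(np.dot(x, y) / denom)
    return float(np.mean(scores))            # higher values indicate greater intelligibility
```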

Our baseline corrects for the broadband delay introduced into each ear signal by the hearing loss model. (The delay is measured by running a Kronecker delta function through the model for each ear.) However, the baseline software will not correct for delays created by your hearing aid processing.

Consequently, when submitting your hearing aid output signals, you are responsible for correcting for any delays introduced by your hearing aid. Note that this must be done blindly; the clean reference signals will not be supplied for the test/evaluation set.
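One possible approach, sketched below, is to estimate the broadband delay by cross-correlating each hearing aid output with its own noisy input (so no clean reference is needed) and then advancing the output by the lag of the correlation peak. This assumes the processing delay is well approximated by a single broadband delay per ear; the names are illustrative:

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_delay(aid_input, aid_output, max_delay=4800):
    """Estimate the broadband delay (in samples) of aid_output relative to
    aid_input from the peak of their cross-correlation. Using the noisy
    input as the reference means no clean speech signal is required."""
    xcorr = correlate(aid_output, aid_input, mode="full")
    lags = correlation_lags(len(aid_output), len(aid_input), mode="full")
    keep = (lags >= 0) & (lags <= max_delay)      # search only plausible positive delays
    return int(lags[keep][np.argmax(np.abs(xcorr[keep]))])

def remove_delay(aid_output, delay):
    """Advance the output by `delay` samples, zero-padding the end."""
    return np.concatenate([np.asarray(aid_output)[delay:], np.zeros(delay)])

# Apply independently per ear before writing your submission signals, e.g.:
# delay_left = estimate_delay(input_left, output_left)
# output_left = remove_delay(output_left, delay_left)
```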

Level insensitivity

MBSTOI is level-independent: because it is computed using a cross-correlation method, it is broadly insensitive to the level of the processed signal. This can be a problem because sounds that fall below the auditory thresholds of a hearing-impaired listener may still appear highly intelligible to MBSTOI.

To overcome this, the baseline experimental code mbstoi_beta, in conjunction with the baseline hearing loss model, can be used to approximate hearing-impaired auditory thresholds. Specifically, mbstoi_beta adds internal noise that can be used to approximate normal hearing auditory thresholds. This noise, in combination with the attenuation of signals by the hearing loss model to simulate raised auditory thresholds, makes MBSTOI level-sensitive.

The noise is created by filtering white noise using pure tone threshold filter coefficients with one-third octave weighting, approximating the shape of a typical auditory filter (from Moore 2012, based on Patterson’s method, 1976). This noise is added to the processed signal. Note that standard MBSTOI also adds internal noise to parameters in its equalisation-cancellation stage, but that is an independent process.
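The sketch below gives a flavour of the idea: white noise is spectrally shaped to follow a threshold-in-quiet curve and added to the processed signal, so that components below threshold no longer correlate with the reference. The frequency points, threshold values and filter design are illustrative placeholders rather than the actual mbstoi_beta coefficients:

```python
import numpy as np
from scipy.signal import firwin2

def add_threshold_noise(signal, fs=10000, seed=0):
    """Add spectrally shaped noise intended to mask content near threshold.

    Conceptual sketch only: the frequency points and threshold values below
    are rough placeholders for a normal-hearing threshold-in-quiet curve, and
    the noise would need calibrating to the absolute presentation level.
    The baseline mbstoi_beta instead derives its noise from pure tone
    threshold filter coefficients with one-third octave weighting.
    """
    freqs = np.array([0, 125, 250, 500, 1000, 2000, 4000, fs / 2])
    thresholds_db = np.array([40, 25, 15, 8, 5, 5, 10, 20])
    gains = 10.0 ** (thresholds_db / 20.0)          # shape the noise like the threshold curve
    taps = firwin2(257, freqs / (fs / 2), gains)    # FIR filter approximating that shape
    rng = np.random.default_rng(seed)
    noise = np.convolve(rng.standard_normal(len(signal)), taps, mode="same")
    return np.asarray(signal) + noise
```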

MBSTOI

The method was developed by Asger Heidemann Andersen, Jan Mark de Haan, Zheng-Hua Tan and Jesper Jensen (Andersen et al., 2018). It builds on the Short-Time Objective Intelligibility (STOI) metric created by Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen (Taal et al., 2011). MBSTOI includes a better ear stage and an equalisation-cancellation stage. For simplicity, the latter stage is not discussed here; see Andersen et al. (2018) for details.

References

Andersen, A. H., de Haan, J. M., Tan, Z. H., & Jensen, J. (2018). Refinement and validation of the binaural short time objective intelligibility measure for spatially diverse conditions. Speech Communication, 102, 1-13.

Moore, B. C. (2012). An introduction to the psychology of hearing. Brill.

Patterson, R. D. (1976). Auditory filter shapes derived with noise stimuli. The Journal of the Acoustical Society of America, 59(3), 640-654.

Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2125-2136.

Hearing aid simulation

What our baseline hearing aid simulates, with examples.

Our challenge entrants are going to use machine learning to develop better hearing aid processing for listening to speech in noise (SPIN). We’ll provide a baseline hearing aid model for entrants to improve on. The figure below shows our baseline system, where the yellow box to the left is where the simulated hearing aid sits (labelled “Enhancement model”).

The draft baseline system (where SPIN is speech in noise, HL is hearing loss, SI is speech intelligibility, and L & R are Left and Right).

We decided to base our simulated hearing aid on the open Master Hearing Aid (openMHA), which is an open-source software platform for real-time audio signal processing. This was developed by the University of Oldenburg, HörTech gGmbH, Oldenburg, and the BatAndCat Corporation, USA. The original version was developed as one of the outcomes of the Cluster of Excellence Hearing4all project. The openMHA platform includes:

  • a software development kit (C/C++ SDK) including an extensive signal processing library for algorithm development and a set of Matlab and Octave tools to support development and off-line testing
  • real-time runtime environments for standard PC platforms and mobile ARM platforms
  • a set of baseline reference algorithms that forms a complete hearing aid system, including multi-band dynamic compression and amplification, directional microphones, binaural beamformers and coherence filters, single-channel noise reduction, and feedback control.

We have written a Python wrapper for the core openMHA system for ease of use within machine learning frameworks. We developed a simple generic hearing aid configuration and implemented the Camfit compressive fitting: the prescription that takes a listener’s audiogram and determines the appropriate settings for the hearing aid, based on Moore et al. (1999) and encoded by openMHA.

Some aspects of modern digital hearing aids that we’ve decided to simulate are:

  • differential microphones, and
  • a multiband compressor for dynamic compression (a simplified sketch is given below).
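To illustrate the second item, here is a minimal sketch of multiband dynamic range compression. It is not the openMHA implementation, and the band edges, thresholds, ratios and time constants are placeholders; a real fitting such as Camfit would derive the parameters from the listener's audiogram.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def multiband_compressor(x, fs, band_edges=(0, 1000, 2000, 4000, 7000),
                         thresholds_db=(45, 40, 35, 30), ratios=(2.0, 2.5, 3.0, 3.0),
                         attack_ms=5.0, release_ms=50.0):
    """Minimal sketch of a multiband compressor (illustrative parameters,
    not the openMHA configuration or the Camfit prescription)."""
    out = np.zeros(len(x), dtype=float)
    alpha_a = np.exp(-1.0 / (attack_ms * 1e-3 * fs))
    alpha_r = np.exp(-1.0 / (release_ms * 1e-3 * fs))
    for lo, hi, thr, ratio in zip(band_edges[:-1], band_edges[1:], thresholds_db, ratios):
        # Split the input into bands with Butterworth filters
        if lo == 0:
            sos = butter(4, hi, btype="low", fs=fs, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        env, gains = 0.0, np.empty(len(band))
        for n, s in enumerate(np.abs(band)):
            # One-pole attack/release envelope follower
            alpha = alpha_a if s > env else alpha_r
            env = alpha * env + (1.0 - alpha) * s
            level_db = 20.0 * np.log10(env + 1e-9) + 94.0   # crude full-scale-to-SPL offset
            gain_db = min(0.0, (thr - level_db) * (1.0 - 1.0 / ratio))
            gains[n] = 10.0 ** (gain_db / 20.0)
        out += band * gains
    return out
```

A real hearing aid fitting would additionally apply frequency-dependent linear gain derived from the audiogram; the sketch only shows the compressive element.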

We’ve decided not to simulate the following, because these features tend to be implemented in proprietary forms that we cannot replicate exactly in our open-source algorithm:

  • coordination of gross processing parameters across ears,
  • binaural processing involving some degree of signal exchange between left and right devices,
  • gain changes influenced by speech-to-noise ratio estimators,
  • frequency shifting or scaling, and
  • dual or adaptive time-constant wide dynamic range compression.

We are using the Oldenburg Hearing Device (OlHeaD) Head Related Transfer Function (HRTF) Database (Denk et al. 2018) to replicate the signals that would be received by the front and rear microphones of the hearing aid and also at the eardrums of the wearer.
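In practice this amounts to convolving each source signal with the appropriate head related impulse responses (HRIRs). The sketch below assumes the HRIR arrays have already been selected from the OlHeaD-HRTF database for the chosen device style and source direction:

```python
from scipy.signal import fftconvolve

def spatialise(source, hrir_front_l, hrir_front_r, hrir_rear_l, hrir_rear_r,
               hrir_eardrum_l, hrir_eardrum_r):
    """Render a mono source at the hearing aid microphones and at the eardrums
    by convolution with head related impulse responses. Minimal sketch: the
    HRIR arguments stand in for responses taken from the OlHeaD-HRTF database."""
    mics = {
        "front_l": fftconvolve(source, hrir_front_l),
        "front_r": fftconvolve(source, hrir_front_r),
        "rear_l": fftconvolve(source, hrir_rear_l),
        "rear_r": fftconvolve(source, hrir_rear_r),
    }
    eardrums = {
        "l": fftconvolve(source, hrir_eardrum_l),
        "r": fftconvolve(source, hrir_eardrum_r),
    }
    return mics, eardrums
```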

Audio examples of hearing aid processing

Here is an example of speech in noise processed by the simulated hearing aid for a moderate level of hearing loss. We can hear that the shape of the frequency spectrum has been modified to suit the listener’s specific pattern of hearing loss.

Here’s the original noisy signal where the noise is generated by a washing machine.
Here’s the same signal processed by the simulated hearing aid for a listener with a moderate level of hearing loss (Pure Tone Average of 38 dB). For illustration purposes, this is presented here at an overall level that is similar to that of the original signal.
Here’s the noisy signal as it would be perceived by the listener wearing the hearing aid. Without the aid, the original noisy signal would be near inaudible.

Information about our hearing loss model can be found here.

The target speech comes from our new 40-speaker British English speech database, while the speech interferer noise comes from the SLR83 database, which comprises recordings of male and female speakers of English from various parts of the UK and Ireland.

Acknowledgements

We are grateful to the developers of the openMHA platform for the use of their software. Special thanks are due to Hendrik Kayser and Tobias Herzke. We are also grateful to Brian Moore, Michael Stone and colleagues for the Camfit compressive prescription, and to the people involved in the preparation of the OlHead HRTF (particularly Florian Denk) and SLR83 databases. The feature image is taken from Denk et al. (2018).

References

Demirsahin, I., Kjartansson, O., Gutkin, A., & Rivera, C. E. (2020). Open-source Multi-speaker Corpora of the English Accents in the British Isles. Available at http://www.openslr.org/83/

Denk, F., Ernst, S. M., Ewert, S. D., & Kollmeier, B. (2018). Adapting hearing devices to the individual ear acoustics: Database and target response correction functions for various device styles. Trends in Hearing, 22, 2331216518779313.

Moore, B. C. J., Alcántara, J. I., Stone, M. A., & Glasberg, B. R. (1999). Use of a loudness model for hearing aid fitting: II. Hearing aids with multi-channel compression. British Journal of Audiology, 33(3), 157-170.

Hearing loss simulation

What our hearing loss algorithms simulate, with audio examples to illustrate hearing loss.

Our challenge entrants are going to use machine learning to develop better processing of speech in noise (SPIN) for hearing aids. For a machine learning algorithm to learn new ways of processing audio for the hearing impaired, it needs to estimate how the sound will be degraded by any hearing loss. Hence, we need an algorithm to simulate hearing loss for each of our listeners. The diagram below shows our draft baseline system, which was detailed in a previous blog. The hearing loss simulation is part of the prediction model. The Enhancement Model to the left is effectively the hearing aid, and the Prediction Model to the right estimates how intelligible someone will find the speech in noise.

The draft baseline system (where SPIN is speech in noise, DRC is Dynamic Range Compression, HL is Hearing Loss, SI is Speech Intelligibility and L & R are Left and Right).

There are different causes of hearing loss, but we’re concentrating on the most common type that happens when you age (presbycusis). RNID (formerly Action on Hearing Loss) estimate that more than 40% of people over the age of 50 have a hearing loss, and this rises to 70% of people who are older than 70.

The aspects of hearing loss we’ve decided to simulate are:

  1. The loss of ability to sense the quietest sounds (increase in absolute threshold).
  2. How, once a sound is audible, its loudness grows more rapidly with increasing level than it would for normal hearing (loudness recruitment) (Moore et al. 1996). A rough sketch of this component is given after this list.
  3. The reduced ability of the ear to discriminate the frequencies of sounds (impaired frequency selectivity).
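The sketch below illustrates the second of these components: the envelope in one frequency band is expanded so that the level range between the listener's raised threshold and a fixed uncomfortable level is stretched relative to normal, making sounds below the raised threshold inaudible. It is only a conceptual illustration with assumed calibration and parameter values, not the Cambridge model (Nejime and Moore, 1997) used in the baseline:

```python
import numpy as np
from scipy.signal import hilbert

def simulate_recruitment_band(band, elevated_threshold_db, normal_threshold_db=0.0,
                              ucl_db=100.0):
    """Rough sketch of loudness recruitment in one frequency band.

    Levels between the listener's elevated threshold and a fixed uncomfortable
    level (UCL) are expanded onto the normal dynamic range, so sounds below
    the elevated threshold fall below the normal threshold and audible sounds
    grow in level faster than normal. Assumes the signal is calibrated so that
    envelope dB values correspond to dB SPL. Illustration only, not the
    baseline hearing loss model.
    """
    env = np.abs(hilbert(band)) + 1e-9
    env_db = 20.0 * np.log10(env)
    expansion = (ucl_db - normal_threshold_db) / max(ucl_db - elevated_threshold_db, 1e-3)
    out_db = ucl_db - (ucl_db - env_db) * expansion   # expand the range below the UCL
    gain = 10.0 ** ((out_db - env_db) / 20.0)
    return band * gain
```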

Audio examples of hearing loss

Here are two samples of speech in noise processed through the simulator. In each audio example there are three versions of the same sentence:

  1. Unimpaired hearing
  2. Mild hearing impairment
  3. Moderate to severe hearing impairment
0 dB signal-to-noise ratio

And here is an example where the noise is louder:

Noisier: -10 dB signal-to-noise ratio

Acknowledgements

The hearing loss model we’re using was generously supplied by Michael Stone at the University of Manchester as MATLAB code and translated by us into Python. The original code was written by members of the Auditory Perception Group at the University of Cambridge, ca. 1991-2013, including Michael Stone, Brian Moore, Brian Glasberg and Thomas Baer. Information about the model can be found primarily in Nejime and Moore (1997), but also in Nejime and Moore (1998), Baer and Moore (1993 and 1994), and Moore and Glasberg (1993).

The original speech recordings come from the ARU corpus, University of Liverpool (Hopkins et al. 2019). This corpus is freely available at the link in the reference below.

References

Baer, T., & Moore, B. C. (1993). Effects of spectral smearing on the intelligibility of sentences in noise. The Journal of the Acoustical Society of America, 94(3), 1229-1241.

Baer, T., & Moore, B. C. (1994). Effects of spectral smearing on the intelligibility of sentences in the presence of interfering speech. The Journal of the Acoustical Society of America, 95(4), 2277-2280.

Hopkins, C., Graetzer, S., & Seiffert, G. (2019). ARU adult British English speaker corpus of IEEE sentences (ARU speech corpus) version 1.0 [data collection]. Acoustics Research Unit, School of Architecture, University of Liverpool, United Kingdom. DOI: 10.17638/datacat.liverpool.ac.uk/681. Retrieved from http://datacat.liverpool.ac.uk/681/.

Moore, B. C., & Glasberg, B. R. (1993). Simulation of the effects of loudness recruitment and threshold elevation on the intelligibility of speech in quiet and in a background of speech. The Journal of the Acoustical Society of America, 94(4), 2050-2062.

Moore, B. C., Glasberg, B. R., & Vickers, D. A. (1996). Factors influencing loudness perception in people with cochlear hearing loss. In B. Kollmeier (Ed.), World Scientific, Singapore, 7-18.

Nejime, Y., & Moore, B. C. (1997). Simulation of the effect of threshold elevation and loudness recruitment combined with reduced frequency selectivity on the intelligibility of speech in noise. The Journal of the Acoustical Society of America, 102(1), 603-615.

Nejime, Y., & Moore, B. C. (1998). Evaluation of the effect of speech-rate slowing on speech intelligibility in noise using a simulation of cochlear hearing loss. The Journal of the Acoustical Society of America, 103(1), 572-576.

The baseline

An overview of the current state of the baseline we’re developing for the machine learning challenges

We’re currently developing the baseline processing that challenge entrants will need. This takes a random listener and a random audio sample of speech in noise (SPIN) and passes it through a simulated hearing aid (the Enhancement Model), which enhances the speech in noise. An algorithm (the Prediction Model) then estimates the speech intelligibility the listener would perceive (the SI score). This score can then be used to drive machine learning to improve the hearing aid.
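In pseudocode, one pass through the baseline looks roughly like the sketch below; every function and field name is a placeholder for illustration, not part of the actual baseline code:

```python
def baseline_pass(scene, listener, enhance, hearing_loss_model, mbstoi):
    """Hypothetical sketch of one pass through the baseline; every callable and
    dictionary key here is a placeholder to be supplied, not the baseline API."""
    spin_l, spin_r = scene["spin_left"], scene["spin_right"]        # speech in noise at the aid microphones
    ref_l, ref_r = scene["ref_left"], scene["ref_right"]            # clean reference (training data only)
    aided_l, aided_r = enhance(spin_l, spin_r, listener)            # Enhancement Model: the simulated hearing aid
    hl_l = hearing_loss_model(aided_l, listener["audiogram_left"])  # Prediction Model, stage 1: hearing loss simulation
    hl_r = hearing_loss_model(aided_r, listener["audiogram_right"])
    return mbstoi(ref_l, ref_r, hl_l, hl_r)                         # stage 2: SI score used to drive learning
```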

A talk through the baseline model we’re developing.

The first machine learning challenge is to improve the enhancement model, in other words, to produce a better processing algorithm for the hearing aid. The second challenge is to improve the prediction model using perceptual data we’ll provide.