This pages contains some background information on the topics of speech intelligibility, hearing loss and objective measures. We will also be updating it with answers to any challenge specific questions that we receive.
What is Speech Intelligibility?
The term Speech Intelligibility is generally used in two different ways. It can refer to how much speech is understood by a listener, or to the number of words correctly identified by a listener as a proportion or percentage of the total number of words. In the Clarity project, we are using the latter definition, i.e., the percentage of words in a sentence that a listener identified correctly. This percentage is the target for your prediction models.
Speech intelligibility captures how a listener's ability to participate in conversation is changed when the speech signal is degraded, e.g., by background noise and room reverberation, or is processed, e.g., by a hearing aid. Your prediction model will need to incorporate a model of the hearing abilities of each listener.
How is Speech Intelligibility measured with listeners?
In the Clarity project, a set of listeners listen to a sentence and then say what words they heard. In this project, speech intelligibility is measured as the number of words identified correctly as a percentage of the total number of words in a sentence.
You might consider looking at other metrics, such as Word Error Rate (WER), which picks up on, e.g., where listeners insert words not in the original sentence. You might do this if you think that an estimate of WER or other metrics would help your system to estimate speech intelligibility, as defined in the Clarity project.
How is Speech Intelligibility objectively measured by a computer?
When fitting a hearing aid, it would be beneficial for an audiologist to be able to use an objective measure of speech intelligibility to determine what signal processing algorithm(s) should be used to compensate for the listener's hearing impairment. Objective measures are also useful when measured speech intelligibility scores are unavailable, such as when developing a machine learning-based hearing aid algorithm or some other speech enhancement method. Another advantage of non-intrusive measures is that they do not require time-alignment of processed and reference signals.
Objective measures - or metrics - of speech intelligibility are used to allow a computer to estimate the likely performance of humans in listening tests. The main goal of entries to the prediction challenge is to produce one of these measures that performs well for listeners with hearing loss. There are two broad classes of speech intelligibility models:
- Intrusive metrics (also known as double-ended) are most common. This is where the intelligibility is estimated by comparing the degraded or processed speech signal with the original clean speech signal.
- Non-intrusive metrics (also known as single-ended or blind) are less well developed. This is where intelligibility is estimated from the degraded or processed speech signal alone.
In the Clarity project, both types of metrics are of interest. Intrusive metrics will be more accurate in many cases. However, there are hearing aid processes where the speech content is shifted in frequency, which will defeat most current intrusive speech intelligibility metrics. We also hypothesise that there might be issues with intrusive metrics and machine learning approaches in hearing aids that revoice the original speech.
What speech intelligibility models already exist and what are they used for?
There aren't many speech intelligibility models that consider hearing impairment, but one that does is HASPI by Kates and Arehart. In this seminar from the first Clarity workshop, James Kates discusses speech intelligibility models with a focus on the ones he has developed. He also discusses the speech quality metric HASQI. If you're interested in using HASPI or HASQI for the challenge, James Kates has kindly made the MATLAB code and user guide available for download.
Click arrow to see synopsis.
Signal degradations, such as additive noise and nonlinear distortion, can reduce the intelligibility and quality of a speech signal. Predicting intelligibility and quality for hearing aids is especially difficult since these devices may contain intentional nonlinear distortion designed to make speech more audible to a hearing-impaired listener. This speech processing often takes the form of time-varying multichannel gain adjustments. Intelligibility and quality metrics used for hearing aids and hearing-impaired listeners must therefore consider the trade-offs between audibility and distortion introduced by hearing-aid speech envelope modifications. This presentation uses the Hearing Aid Speech Perception Index (HASPI) and the Hearing Aid Speech Quality Index (HASQI) to predict intelligibility and quality, respectively. These indices incorporate a model of the auditory periphery that can be adjusted to reflect hearing loss. They have been trained on intelligibility scores and quality ratings from both normal-hearing and hearing-impaired listeners for a wide variety of signal and processing conditions. The basics of the metrics are explained, and the metrics are then used to analyse the effects of additive noise on speech, to evaluate noise suppression algorithms, and to measure differences among commercial hearing aids.
There are many types of hearing loss, but the focus of the Clarity project is the hearing loss that happens with ageing. This is a form of sensorineural hearing loss.
How does hearing loss affect the perception of audio signals, and how do modern hearing aids process sound to help with this?
In this seminar from the first Clarity workshop, Karolina Smeds from ORCA Europe and WS Audiology discusses the effects of hearing loss and the hearing aid processing strategies that are typically used to counter the sensory deficits.
Click arrow to see synopsis.
Hearing loss leads to several unwanted effects. Loss of audibility for soft sounds is one effect, but also when amplification is used to create audibility for soft sounds, many suprathreshold deficits remain. The most common type of hearing loss is a cochlear hearing loss, where haircells or nerve synapses in the cochlea are damaged. Ageing and noise exposure are the most common causes of cochlear hearing loss. This type of hearing loss is associated with atypical loudness perception and difficulties in noisy situations. Background noise masks for instance speech to a higher degree than for a person with healthy hair cells. This explains why listening to speech-in-noise (SPIN) is such an important topic to work on. A brief introduction to signal processing in hearing aids will be presented. With the use of frequency-specific amplification and compression (automatic gain control, AGC), hearing aids are usually doing a good job in compensating for reduced audibility and for atypical suprathreshold loudness perception. However, it is more difficult to compensate for the increased masking effect. Some examples of strategies will be presented. Finally, natural conversations in noise will be discussed. The balance between being able to have a conversation with a specific communication partner in a group of people and being able to switch attention if someone else starts to talk will be touched upon.
Do I have to use a separate hearing loss model?
No is the short answer! In the baseline, we've used the Cambridge hearing loss model and a separate binaural speech intelligibility model. Another approach would be to create a single model that encapsulates the combined effects of hearing loss and speech perception.
What should the output of my prediction model be?
The output should include a predicted speech intelligibility score per input signal, specifically, an estimate of the number of words correct as a percentage of the total number of words in the signal.
Do you have suggestions for expanding the training data?
The prediction challenge data is limited by having to get the ground truth from listening tests on people with a hearing loss. We look forward to seeing what approaches teams use to help overcome this limitation, such as using unsupervised models, data augmentation or generating additional ground truth data using a pre-existing model. The baseline model includes a hearing loss and speech intelligibility model that could be used for creating additional pre-training data. There are other models that you might consider where code is available. None has been checked by the Clarity team.
- Katerina Zmolikova has made her Pytorch version of the baseline hearing impairment and speech intelligibility model available. Both model fit a neural network framework, are faster but more approximate (see graphs on github).
- HASQI and HASPI are quality and speech intelligibility metrics designed to work for people with a hearing impairment. James Kates explains more about these above. MATLAB code HASPI v2 and HASQI v2 are available, along with the user guide.
- STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model by Ryandhimas Zezario et al. is monaural and non-intrusive using Python, Keras and TensorFlow. It doesn't model the effect of hearing loss. An alternative is Asger Heidemann Andersen's MATLAB code.
We have audiograms for all our listening panel. But for other characterisations of hearing, only some of the panel have provided data. Therefore there is missing data that has to be dealt with.
One approach to the missing data is to just ignore it and just use the audiograms. The problem with this approach is that audiograms only quantify the hearing threshold, and our speech in noise audio samples were not played that quietly. Nevertheless, audiograms are the most common way of characterising hearing loss.
Alternatively, a method to use the partial data could be developed, and we expect this would help with speech intelligibility prediction. You will find plenty of data science blog posts, e.g. towards data science discussing different approaches.
A key question is whether the missing data is 'missing at random' i.e. is the distribution of the missing data expected to be the same as that of the present data? For the prediction challenge, this would mean the missing triple-digit-test values are coming from some random sample of the listeners, who'd be no different from the listeners who did complete the triple-digital-test. Unfortunately, this might not be true, because the failure to complete the triple-digit-tests could well correlate with hearing loss (e.g. maybe older people with more hearing loss were less likely to do the test). The Clarity data is probably 'missing not at random'.
One simple solution is to delete examples with missing data, but the loss of so much data probably makes this undesirable.
A more sophisticated approach is to fill gaps in data via imputation i.e. first estimate values for the missing data and then treat the dataset as complete. A couple of simple approaches for imputation are: (i) use the mean value from the dataset for missing values, and (ii) create a model to estimate the missing data from the audiograms. There are other approaches in data science that could be exploited such as coding the missing values into a 'N/A' category within the input data.