
Modelling the scenario

The scenario is that of a listener listening to a target speaker in a room in which two or three interfering sound sources are also active. The scenes are described by a large number of randomized parameters:

  • The room size and materials.
  • Who the target talker is.
  • The sentence being uttered by the target talker.
  • The listener, target talker and noise interferer locations.
  • The head orientation of the listener, including the rotation during the audio.
  • The interferer sound samples.
  • The onset and offset times of the speech.

Below is a detailed description of how these are used to generate each scenario. But it is also possible to work from our database of pre-mixed hearing aid microphone signals without worrying too much about all the details of how these were created.

Brief overview of random scenario generation

The scenarios are as follows:

  • Small rooms that have low to moderate reverberation with randomized dimensions and locations of materials such as carpets.
  • The locations of the listener, target and interferer are randomized.
  • The target talker is selected from our set of 40 speakers.
  • The target talker produces a randomly chosen 7-10 word utterance.
  • There are two or three interferer sounds running throughout the audio. Each can be a:
    • stream of competing speech;
    • continuous domestic noise source (e.g., a washing machine); or
    • music source.
  • The target speech starts about one second into the scene.
  • The listener is initially not looking at the target talker; around the time the target speech starts, the listener rotates their head to approximately face the target.

An example scenario is shown in Figure 1. It also defines the coordinate system and origin for the room generation.

Figure 1, An example scenario with two noise interferers.

Room geometry

  • Cuboid rooms with dimensions length $L$ by width $W$ by height $H$.
  • Length $L$ set using a uniform probability distribution random number generator with $3 < L \le 8$ m.
  • Height $H$ set using a Gaussian distribution random number generator with a mean of 2.7 m and standard deviation of 0.8 m.
  • Area $L \times W$ set using a Gaussian distribution random number generator with a mean of 17.7 m² and standard deviation of 5.5 m².
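
For illustration, here is a minimal sketch of how the room dimensions could be drawn from these distributions. The `sample_room_dimensions` helper, the use of NumPy, and the redraw-on-implausible-values guard are assumptions for the example, not part of the challenge code.

```python
import numpy as np

rng = np.random.default_rng()

def sample_room_dimensions():
    """Draw (L, W, H) in metres following the distributions described above.

    Assumption: implausible draws are simply rejected and redrawn, and the
    width is derived from the sampled floor area as W = area / L.
    """
    length = rng.uniform(3.0, 8.0)   # L: uniform, 3 m < L <= 8 m
    area = rng.normal(17.7, 5.5)     # floor area L x W: Gaussian (m^2)
    height = rng.normal(2.7, 0.8)    # ceiling height H: Gaussian (m)
    width = area / length
    if width <= 0.0 or height <= 0.0:    # guard against nonsensical Gaussian draws
        return sample_room_dimensions()
    return length, width, height
```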

Room materials

One of the walls of the room is randomly selected as the location of the door. The door can be at any position along that wall, with the constraint that it is at least 20 cm from the corners of the wall.

A window is placed on one of the other three walls. The window can be at any horizontal position on that wall, but is placed at a height of 1.9 m and at least 0.4 m from any corner. Curtains are simulated to the side of the window. For larger rooms, a second window and curtains are simulated following a similar methodology.

A sofa is simulated at a random position as a layer on the wall and the floor. Finally, a rug is simulated at a random location on the floor.
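
A rough sketch of the door-placement constraint is given below; the wall choice, the 0.8 m door width and the `place_door` helper are illustrative assumptions, and only the 20 cm corner margin comes from the text.

```python
import random

def place_door(wall_length_m: float, door_width_m: float = 0.8) -> float:
    """Return a random door-centre offset (m) along a wall of the given length."""
    margin = 0.2 + door_width_m / 2   # keep the whole door at least 20 cm from each corner
    return random.uniform(margin, wall_length_m - margin)

door_wall = random.choice(["north", "east", "south", "west"])   # randomly selected wall
door_offset = place_door(wall_length_m=5.0)
```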

The listener (receiver)

The listener has position $\vec{r} = (x_r, y_r, z_r)$.

The listener is positioned within the room using uniform probability distribution random number generators for the x and y coordinates (see Figure 1 for the origin location). There are constraints to ensure that the receiver is not too close to the walls:

  • $-W/2 + 1 \le x_r \le W/2 - 1$
  • $1 \le y_r \le L - 1$
  • $z_r$ either 1.2 m (sitting) or 1.6 m (standing).
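
As a sketch, the listener position could be drawn as below; the coordinate convention (x across the width, centred on 0; y along the length from the wall at y = 0) follows Figure 1, and the `sample_listener` helper is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng()

def sample_listener(length: float, width: float) -> np.ndarray:
    x_r = rng.uniform(-width / 2 + 1.0, width / 2 - 1.0)   # at least 1 m from the side walls
    y_r = rng.uniform(1.0, length - 1.0)                   # at least 1 m from the end walls
    z_r = rng.choice([1.2, 1.6])                           # sitting or standing ear height (m)
    return np.array([x_r, y_r, z_r])
```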

Head rotation

The listener is initially oriented away from the target and will turn to roughly face the target talker around the time when the target speech starts.

  • Orientation of the listener at the start of the sample is ~25° away from facing the target (standard deviation = 5°), limited to ±2 standard deviations.
  • Start of rotation is between -0.635 s and 0.865 s (rectangular probability distribution).
  • The rotation lasts for 200 ms (standard deviation = 10 ms).
  • Orientation after rotation is 0-10° away from facing the target (random with rectangular probability distribution).
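
A minimal sketch of drawing these head-rotation parameters is given below. It assumes the ±2 standard deviation limit is enforced by clipping (the text does not say whether values are clipped or redrawn), and the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def sample_head_rotation():
    start_angle = np.clip(rng.normal(25.0, 5.0), 15.0, 35.0)  # degrees off-target at the start
    start_time = rng.uniform(-0.635, 0.865)                   # s, rectangular distribution
    duration = rng.normal(0.200, 0.010)                       # s, roughly a 200 ms turn
    end_angle = rng.uniform(0.0, 10.0)                        # degrees off-target after the turn
    return start_angle, start_time, duration, end_angle
```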

The target talker

The target talker has position $\vec{t} = (x_t, y_t, z_t)$.

The target talker is positioned within the room using uniform probability distribution random number generators for the coordinates. Constraints ensure the target is not too close to the wall or receiver. It is set to have the same height as the receiver.

  • $-W/2 + 1 \le x_t \le W/2 - 1$
  • $1 \le y_t \le L - 1$
  • $|\vec{r} - \vec{t}| > 1$
  • $z_t = z_r$

A speech directivity pattern is used, which is directed at the listener. The target speech starts between 1.0 and 1.5 seconds into the mixed sound files (rectangular probability distribution).
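
A sketch of target placement and speech onset, reusing the listener position sampled above, is shown below. Reading $|\vec{r} - \vec{t}| > 1$ as a rejection-sampling step is an assumption, as is the `sample_target` name.

```python
import numpy as np

rng = np.random.default_rng()

def sample_target(length: float, width: float, listener: np.ndarray) -> np.ndarray:
    while True:
        x_t = rng.uniform(-width / 2 + 1.0, width / 2 - 1.0)
        y_t = rng.uniform(1.0, length - 1.0)
        target = np.array([x_t, y_t, listener[2]])        # z_t = z_r
        if np.linalg.norm(target - listener) > 1.0:       # keep the target > 1 m from the listener
            return target

speech_onset = rng.uniform(1.0, 1.5)                      # target speech start time (s)
```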

The interferers

The interferers have position $\vec{i}_{1,2,3} = (x_i, y_i, z_i)$.

Each interferer is modelled as an omnidirectional point source radiating speech, noise or music. The interferers are placed within the room using uniform probability distribution random number generators for the coordinates. The following constraints ensure that an interferer is not too close to the walls or the listener; however, interferers are positioned independently, with no constraint on their positions relative to each other. They are set to be at the same height as the listener. Note that this means the interferers can be at any angle relative to the listener.

  • $-W/2 + 1 \le x_i \le W/2 - 1$
  • $1 \le y_i \le L - 1$
  • $|\vec{r} - \vec{i}| > 1$
  • $z_i = z_r$

The interferers are present over the whole mixed sound file.
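
The interferers can be placed with the same rejection-sampling idea, independently for each source; the sketch below (with the illustrative `sample_interferers` helper) assumes the number of interferers is simply drawn as two or three.

```python
import numpy as np

rng = np.random.default_rng()

def sample_interferers(length: float, width: float, listener: np.ndarray, n=None) -> list:
    n = rng.integers(2, 4) if n is None else n            # two or three interferers
    positions = []
    for _ in range(n):
        while True:
            x_i = rng.uniform(-width / 2 + 1.0, width / 2 - 1.0)
            y_i = rng.uniform(1.0, length - 1.0)
            pos = np.array([x_i, y_i, listener[2]])       # z_i = z_r
            if np.linalg.norm(pos - listener) > 1.0:      # > 1 m from the listener
                positions.append(pos)
                break
    return positions
```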

Signal-to-noise ratio (SNR)

The SNR of each mixture is engineered to achieve a suitable range of speech intelligibility values. A desired signal-to-noise ratio, SNR$_D$ (dB), is chosen at random. This is generated with a uniform probability distribution between limits determined by pilot listening tests. The better ear SNR (BE_SNR) models the better ear effect in binaural listening. It is calculated for the reference channel (channel 1, which corresponds to the front microphone of the hearing aid). This value is used to scale all interferer channels. The procedure is described below.

For the reference channel,

  • The segment of the summed interferers that overlaps with the target (without padding), $i'$, and the target (without padding), $t'$, are extracted.
  • Speech-weighted SNRs are calculated for each ear, SNR$_L$ and SNR$_R$:
    • Signals $i'$ and $t'$ are separately convolved with a speech-weighting filter, $h$ (specified below).
    • The rms is calculated for each convolved signal.
    • SNR$_L$ and SNR$_R$ are calculated as the ratio of these rms values.
  • The BE_SNR is selected as the maximum of the two SNRs: BE_SNR = max(SNR$_L$, SNR$_R$).

Then per channel,

  • The summed interferer signal, $i$, is scaled by the BE_SNR:
    • $i = i \times \mathrm{BE\_SNR}$
  • Finally, $i$ is scaled as follows:
    • $i = i \times 10^{-\mathrm{SNR}_D / 20}$
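
The scaling procedure could be sketched as follows, assuming the reference-channel segments are stereo arrays of shape (n_samples, 2) and that `h` is the speech-weighting FIR described next; the function and variable names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def scale_interferer(target_ref, interferer_ref, interferer, snr_d_db, h):
    """Scale the summed interferer signal to achieve the desired better-ear SNR.

    target_ref, interferer_ref: reference-channel segments t' and i' (columns = left, right)
    interferer: the full summed interferer signal to be scaled
    snr_d_db: the randomly chosen desired SNR, SNR_D, in dB
    """
    snrs = []
    for ear in (0, 1):                                   # left ear, right ear
        t_w = lfilter(h, 1.0, target_ref[:, ear])        # speech-weighted target
        i_w = lfilter(h, 1.0, interferer_ref[:, ear])    # speech-weighted interferer
        snrs.append(rms(t_w) / rms(i_w))
    be_snr = max(snrs)                                   # better-ear SNR (linear)
    interferer = interferer * be_snr                     # i = i * BE_SNR
    return interferer * 10 ** (-snr_d_db / 20)           # i = i * 10^(-SNR_D / 20)
```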

The speech-weighting filter is an FIR designed using the host window method [2, 3]. The frequency response is shown in Figure 2. The specification is:

  • Frequency (Hz) = [0, 150, 250, 350, 450, 4000, 4800, 5800, 7000, 8500, 9500, 22050]
  • Magnitude of transfer function at each frequency = [0.0001, 0.0103, 0.0261, 0.0419, 0.0577, 0.0577, 0.046, 0.0343, 0.0226, 0.0110, 0.0001, 0.0001]
Figure 2, Speech weighting filter transfer function graph.
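
The challenge filter itself is designed with the host window method [2, 3]; as a rough stand-in, the same magnitude specification can be approximated with scipy's `firwin2`, as sketched below (the tap count and the 44.1 kHz sample rate are assumptions).

```python
from scipy.signal import firwin2

freqs = [0, 150, 250, 350, 450, 4000, 4800, 5800, 7000, 8500, 9500, 22050]   # Hz
gains = [0.0001, 0.0103, 0.0261, 0.0419, 0.0577, 0.0577,
         0.046, 0.0343, 0.0226, 0.0110, 0.0001, 0.0001]

# Odd tap count so the filter type places no constraint on the gain at Nyquist.
h = firwin2(numtaps=511, freq=freqs, gain=gains, fs=44100)
```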

References

  1. Schröder, D. and Vorländer, M., 2011, January. RAVEN: A real-time framework for the auralization of interactive virtual environments. In Proceedings of Forum Acusticum 2011 (pp. 1541-1546). Denmark: Aalborg.
  2. Abed, A.H.M. and Cain, G.D., 1978. Low-pass digital filtering with the host windowing design technique. Radio and Electronic Engineer, 48(6), pp.293-300.
  3. Abed, A.E. and Cain, G., 1984. The host windowing technique for FIR digital filter design. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(4), pp.683-694.