clarity.predictor.torch_stoi module

This implementation is from https://github.com/mpariente/pytorch_stoi; please cite and star that repository. Note that the pip version of torch_stoi does not include EPS in lines 127 and 128, which can lead to sqrt(0).
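For illustration, a minimal sketch of why the EPS term matters (the constant below is illustrative; the value used in this module may differ):

    import torch

    EPS = 1e-8  # illustrative small constant, not necessarily the value used in the module

    energies = torch.zeros(3, requires_grad=True)
    torch.sqrt(energies + EPS).sum().backward()   # finite gradients
    # torch.sqrt(energies).sum().backward()       # gradient of sqrt at 0 is infinite -> inf/NaN in backprop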

class clarity.predictor.torch_stoi.NegSTOILoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Negated Short Term Objective Intelligibility (STOI) metric, to be used as a loss function. Inspired by [1, 2, 3] but not exactly the same: it cannot be used to compute the STOI metric directly (use pystoi instead). See Notes.

Parameters:
  • sample_rate (int) – sample rate of audio input

  • use_vad (bool) – Whether to use simple VAD (see Notes)

  • extended (bool) – Whether to compute extended version [3].

  • do_resample (bool) – Whether to resample the audio input to the internal sample rate FS

Shapes:

(time,) -> (1,)
(batch, time) -> (batch,)
(batch, n_src, time) -> (batch, n_src)

Returns:

torch.Tensor of shape (batch, *), where only the time dimension has been reduced.

Warning

This function cannot be used to compute the “real” STOI metric, as we applied some changes to speed up loss computation. See Notes section.

Notes

In the NumPy version, some kind of simple VAD was used to remove the silent frames before chunking the signal into short-term envelope vectors. We don’t do the same here because removing frames in a batch is cumbersome and inefficient. If use_vad is set to True, instead we detect the silent frames and keep a mask tensor. At the end, the normalized correlation of short-term envelope vectors is masked using this mask (unfolded) and the mean is computed taking the mask values into account.
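A hedged usage sketch as a training loss (the model and data below are placeholders; only NegSTOILoss and its constructor arguments come from this module):

    import torch
    from clarity.predictor.torch_stoi import NegSTOILoss

    sample_rate = 16000
    loss_func = NegSTOILoss(sample_rate=sample_rate, use_vad=True)

    # Placeholder enhancement model and a batch of (batch, time) signals.
    model = torch.nn.Conv1d(1, 1, kernel_size=1)
    noisy = torch.randn(2, sample_rate)
    clean = torch.randn(2, sample_rate)

    est = model(noisy.unsqueeze(1)).squeeze(1)  # (batch, time) estimates
    loss = loss_func(est, clean).mean()         # one value per batch item, reduced to a scalar
    loss.backward()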

References

[1] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, ‘A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech’, ICASSP 2010, Dallas, Texas.

[2] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, ‘An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech’, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[3] J. Jensen and C. H. Taal, ‘An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers’, IEEE Transactions on Audio, Speech and Language Processing, 2016.

static detect_silent_frames(x, dyn_range, framelen, hop)[source]

Detects silent frames in the input tensor. A frame is excluded if its energy is lower than max(energy) - dyn_range.

Parameters:
  • x (torch.Tensor) – batch of original speech waveforms, shape (batch, time)

  • dyn_range – Energy range (in dB) used to decide which frames are silent

  • framelen – Window size for energy evaluation

  • hop – Hop size for energy evaluation

Returns:

torch.BoolTensor, framewise mask.
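A minimal sketch of the energy-thresholding idea described above (illustrative only; the windowing details of the actual implementation are omitted):

    import torch

    def energy_mask(x, dyn_range=40.0, framelen=256, hop=128):
        # Keep frames whose energy is within dyn_range dB of the loudest frame.
        frames = x.unfold(-1, framelen, hop)                     # (batch, n_frames, framelen)
        energies = 20 * torch.log10(frames.norm(dim=-1) + 1e-8)  # frame energy in dB
        return energies > energies.max(dim=-1, keepdim=True).values - dyn_range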

forward(est_targets: torch.Tensor, targets: torch.Tensor) → torch.Tensor[source]

Compute negative (E)STOI loss.

Parameters:
  • est_targets (torch.Tensor) – Tensor containing target estimates.

  • targets (torch.Tensor) – Tensor containing clean targets.

Shapes:

(time,) -> (1,)
(batch, time) -> (batch,)
(batch, n_src, time) -> (batch, n_src)

Returns:

torch.Tensor, the batch of negative STOI losses.
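To illustrate the shape contract only (random tensors stand in for real audio; the commented shapes follow the Shapes section above):

    import torch
    from clarity.predictor.torch_stoi import NegSTOILoss

    loss_func = NegSTOILoss(sample_rate=16000)

    loss_func(torch.randn(16000), torch.randn(16000)).shape              # (1,)
    loss_func(torch.randn(4, 16000), torch.randn(4, 16000)).shape        # (4,)
    loss_func(torch.randn(4, 2, 16000), torch.randn(4, 2, 16000)).shape  # (4, 2)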

static rowcol_norm(x, mask=None)[source]

Mean/variance normalize axes 2 and 1 of the input tensor.

static stft(x, win, fft_size, overlap=4)[source]
clarity.predictor.torch_stoi.masked_mean(x, dim=-1, mask=None, keepdim=False)[source]
clarity.predictor.torch_stoi.masked_norm(x, p=2, dim=-1, mask=None, keepdim=False)[source]
clarity.predictor.torch_stoi.meanvar_norm(x, mask=None, dim=-1)[source]
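The helpers above accept an optional boolean mask. A hedged sketch of the intended masked-mean semantics (a reference implementation under assumptions, not the module’s actual code):

    import torch

    def masked_mean_sketch(x, dim=-1, mask=None, keepdim=False):
        # Average only over entries where mask is True; plain mean when no mask is given.
        if mask is None:
            return x.mean(dim=dim, keepdim=keepdim)
        mask = mask.to(x.dtype)
        return (x * mask).sum(dim=dim, keepdim=keepdim) / mask.sum(dim=dim, keepdim=keepdim)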