Programme
Clarity-2023 will be a one-day workshop with a single track.
The morning will focus on hearing aid speech intelligibility prediction and will present the outcomes of the 2nd Clarity Prediction Challenge. The afternoon will focus on hearing aid speech enhancement, including a presentation of plans for the 3rd Clarity Enhancement Challenge.
All sessions will be in the McNabb Lecture Theatre, which is downstairs from the main reception. The first session will start at 9:00am; please arrive in good time to collect your name badge.
Timings and session details are provided below. All times are in Dublin local time (UTC+1).
9:00 | Welcome |
9:10 | Keynote 1 - Theme: Intelligibility Prediction - Fei Chen (SUSTech) |
10:00 | The Clarity Prediction Challenge Overview |
10:20 | Break - Coffee/Tea |
10:40 | Clarity Prediction Challenge Systems |
12:40 | Prizes and conclusions |
12:50 | Lunch |
13:30 | Hearing Aid Speech Enhancement - A user's perspective |
13:50 | Keynote 2 - Theme: Speech Enhancement - DeLiang Wang (Ohio State University) |
14:50 | Plans for the 3rd Clarity Enhancement Challenge |
15:10 | Discussion |
15:30 | Break - Coffee/Tea |
15:50 | Hearing Aid Speech Enhancement - Invited Talks |
17:30 | Close |
Keynote 1
Fei Chen, SUSTech, China
Objective speech intelligibility prediction: Insights from human speech perception
Abstract
Speech intelligibility assessment plays an important role in speech and hearing studies. A computational speech intelligibility model can significantly facilitate work in areas such as speech enhancement and speech coding. While many objective speech intelligibility prediction models are available, there are still challenges in improving the prediction performance of intelligibility indices. Human speech perception studies provide not only knowledge about how various factors (e.g., acoustic, linguistic) affect speech understanding in different listening environments, but also insights into designing reliable objective intelligibility prediction indices. In this talk, I will first introduce studies on acoustic cues that are important for human speech perception. Then, I will review the design of some existing intelligibility prediction models and efforts to improve their predictive power. Finally, I will briefly introduce new developments in objective speech intelligibility prediction, e.g., machine learning and neurophysiological measurement methods.
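To give a flavour of what an intrusive (reference-based) intelligibility index looks like, the toy sketch below correlates the short-time band-energy envelopes of clean and degraded speech. It is purely illustrative and is not STOI or any published metric; all function and parameter names are invented for this example.

```python
import numpy as np

def envelope_correlation_index(clean, degraded, frame=512, hop=256, n_bands=15):
    """Toy intrusive intelligibility index: mean Pearson correlation between
    short-time band-energy envelopes of clean and degraded speech.
    Illustrative only -- not STOI or any published metric."""
    def band_envelopes(x):
        # Short-time FFT magnitudes
        n_frames = 1 + (len(x) - frame) // hop
        win = np.hanning(frame)
        spec = np.stack([np.abs(np.fft.rfft(win * x[i*hop:i*hop+frame]))
                         for i in range(n_frames)])                 # (T, F)
        # Group FFT bins into coarse bands and take log band energies
        edges = np.linspace(0, spec.shape[1], n_bands + 1).astype(int)
        return np.stack([np.log(spec[:, a:b].sum(axis=1) + 1e-12)
                         for a, b in zip(edges[:-1], edges[1:])], axis=1)

    e_c, e_d = band_envelopes(clean), band_envelopes(degraded)
    # Per-band correlation across time, averaged over bands
    e_c -= e_c.mean(axis=0)
    e_d -= e_d.mean(axis=0)
    num = (e_c * e_d).sum(axis=0)
    den = np.sqrt((e_c**2).sum(axis=0) * (e_d**2).sum(axis=0)) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)          # stand-in for a speech signal
noisy = speech + 0.5 * rng.standard_normal(16000)
print(envelope_correlation_index(speech, speech))  # near 1.0 for identical signals
print(envelope_correlation_index(speech, noisy))   # additive noise lowers the index
```

Real indices differ in their auditory-motivated filterbanks, normalisation, and time constants, but this input/output shape (two signals in, one scalar out) is the common pattern.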
Bio
Fei Chen (Senior Member, IEEE) received the B.Sc. and M.Phil. degrees from the Department of Electronic Science and Engineering, Nanjing University, Nanjing, China, in 1998 and 2001, respectively, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, in 2005. He continued his research as a postdoctoral fellow and senior research fellow with the University of Texas at Dallas, supervised by Prof. Philipos Loizou, and with The University of Hong Kong, Hong Kong. He is currently a Full Professor with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China, where he leads the Speech and Physiological Signal Processing Research Group. He has authored or co-authored more than 100 journal papers and more than 100 conference papers in IEEE journals and conferences, Interspeech, and the Journal of the Acoustical Society of America. His research interests include speech communication and assistive hearing technologies, brain-computer interfaces, and biomedical signal processing. He was a tutorial speaker at Interspeech 2022, Interspeech 2020, EUSIPCO 2022, APSIPA 2021, and APSIPA 2019, and organized the special session "Signal processing for assistive hearing devices" at ICASSP 2015. Dr. Chen is an APSIPA Distinguished Lecturer (2022-2023) and is currently an Associate Editor for Biomedical Signal Processing and Control and Frontiers in Human Neuroscience.
Clarity Prediction Challenge papers
10:00-10:20 | The 2nd Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction [PDF] [Google Slides] (University of Sheffield; University of Nottingham; University of Salford; Cardiff University) |
10:40-10:57 | A Non-Intrusive Speech Intelligibility Prediction Using Binaural Cues and Time-Series Model with One-Hot Listener Embedding [Paper] (Japan Advanced Institute of Science and Technology, Japan) |
10:57-11:14 | Deep Learning-based Speech Intelligibility Prediction Model by Incorporating Whisper for Hearing Aids [Paper] [Slides] (Academia Sinica, Taiwan; National Taiwan University) |
11:14-11:31 | Prediction of Behavioral Speech Intelligibility using a Computational Model of the Auditory System [Paper] (University of Texas at Dallas, US; Chittagong University of Engineering and Technology, Bangladesh) |
11:31-11:48 | Combining Acoustic, Phonetic, Linguistic and Audiometric data in an Intrusive Intelligibility Metric for Hearing-Impaired Listeners [Paper] [Slides] (University College London, UK) |
11:48-12:05 | A Non-intrusive Binaural Speech Intelligibility Prediction for Clarity-2023 [Paper] [Slides] (AI Lab, CyberAgent, Inc., Japan) |
12:05-12:22 | Pre-Trained Intermediate ASR Features and Human Memory Simulation for Non-Intrusive Speech Intelligibility Prediction in the Clarity Prediction Challenge 2 [Paper] [Slides] (University of Sheffield, UK) |
12:22-12:40 | Temporal-hierarchical features from noise-robust speech foundation models for non-intrusive intelligibility prediction [Paper] [Slides] (Université de Toulon, Aix Marseille Université, France) |
Invited talk
13:30-13:50 | Hearing Aid Speech Enhancement - A User's Perspective (Chime, Ireland) |
Keynote 2
DeLiang Wang, Ohio State University, US
Neural Spectrospatial Filtering
Abstract
As the most widely-used spatial filtering approach for multi-channel signal separation, beamforming extracts the target signal arriving from a specific direction. We present an emerging approach based on multi-channel complex spectral mapping, which trains a deep neural network (DNN) to directly estimate the real and imaginary spectrograms of the target signal from those of the multi-channel noisy mixture. In this all-neural approach, the trained DNN itself becomes a nonlinear, time-varying spectrospatial filter. How does this conceptually simple approach perform relative to commonly-used beamforming techniques on different array configurations and in different acoustic environments? We examine this issue systematically on speech dereverberation, speech enhancement, and speaker separation tasks. Comprehensive evaluations show that multi-channel complex spectral mapping achieves very competitive speech separation results compared to beamforming for different array geometries, and reduces to monaural complex spectral mapping in single-channel conditions, demonstrating the versatility of this new approach for multi-channel and single-channel speech separation. In addition, such an approach is computationally more efficient than popular mask-based beamforming. We conclude that this neural spectrospatial filter provides a broader approach than traditional and DNN-based beamforming.
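The tensor layout behind multi-channel complex spectral mapping can be sketched in a few lines: the real and imaginary STFTs of all microphone channels are stacked as input features, and a network maps them directly to the real and imaginary STFT of the target. The "network" below is a random, untrained two-layer MLP applied per time frame, so it only illustrates the shapes involved, not a working enhancer; all sizes and names are invented for this example.

```python
import numpy as np

M, T, F = 4, 100, 257          # microphones, STFT frames, frequency bins
rng = np.random.default_rng(0)
# Stand-in for the multi-channel noisy mixture STFT
mix_stft = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))

# Input features: real and imaginary parts of every channel, flattened per frame
feats = np.concatenate([mix_stft.real, mix_stft.imag], axis=0)  # (2M, T, F)
x = feats.transpose(1, 0, 2).reshape(T, 2 * M * F)              # (T, 2M*F)

# Random untrained "DNN": linear -> ReLU -> linear, predicting target real+imag.
# Once trained, this mapping acts as a nonlinear, time-varying spectrospatial filter.
H = 64
W1 = rng.standard_normal((2 * M * F, H)) * 0.01
W2 = rng.standard_normal((H, 2 * F)) * 0.01
h = np.maximum(x @ W1, 0.0)    # hidden layer with ReLU
y = h @ W2                     # (T, 2F): predicted real and imaginary parts

# Reassemble the complex single-channel target estimate
est_stft = y[:, :F] + 1j * y[:, F:]   # (T, F)
print(est_stft.shape)
```

Note how the single-channel case (M = 1) keeps exactly the same structure, which is why the approach reduces to monaural complex spectral mapping without modification.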
Bio
DeLiang Wang received the B.S. and M.S. degrees from Peking (Beijing) University and the Ph.D. degree in 1991 from the University of Southern California, all in computer science. Since 1991, he has been with the Department of Computer Science & Engineering and the Center for Cognitive and Brain Sciences at The Ohio State University, where he is a Professor and University Distinguished Scholar. He received the U.S. Office of Naval Research Young Investigator Award in 1996, the 2008 Helmholtz Award and the 2020 Ada Lovelace Service Award from the International Neural Network Society (INNS), the 2007 Outstanding Paper Award of the IEEE Computational Intelligence Society, and the 2019 Best Paper Award of the IEEE Signal Processing Society. He is an IEEE Fellow and an ISCA Fellow. He currently serves as Co-Editor-in-Chief of Neural Networks and as a member of the INNS Board of Governors.
Clarity Enhancement Challenge plans
14:50-15:30 | CEC3 plans and discussion (University of Sheffield; University of Nottingham; University of Salford; Cardiff University) [Slides] |
Invited talks
15:50-16:15 | Project Aria: Investigating Ego-Centric Hearing Augmentation (Meta Reality Labs, US) |
16:15-16:40 | Application of AI-based Signal processing to assistive hearing solutions (Sonova AG, Switzerland) |
16:40-17:00 | Voice Conversion for Lombard Speaking Style with Implicit Acoustic Feature Conditioning (Imperial College London, UK) |
17:00-17:20 | Designing the Audio-Visual Speech Enhancement Challenge (AVSEC) (University of Edinburgh, UK) |
17:20-17:30 | The COG-MHEAR project - Towards cognitively-inspired 5G-IoT enabled, multi-modal Hearing Aids [Slides] (University of Nottingham, UK) |
Daniel Wong, Meta Reality Labs, US
Project Aria: Investigating Ego-Centric Hearing Augmentation
Abstract
Augmented reality glasses provide a practical wearable form-factor for speech enhancement that can leverage multi-microphone processing technology and sensor fusion. One application that Meta Reality Labs Research is focusing on is context-aware hearing augmentation in noisy environments. To help tackle this challenge, Project Aria provides a data-gathering platform for investigating the problem space of ego-centric scene understanding, user understanding and speech enhancement. In this talk, I will discuss the platform and some of the most recent work from Meta on ego-centric hearing augmentation.
Peter Derleth, Sonova AG, Switzerland
Application of AI-based Signal processing to assistive hearing solutions
Abstract
Assistive hearing solutions come in a variety of form factors, are designed to serve various use cases, are targeted at different user groups, and are distributed to the market as consumer or medical products. Each of these aspects influences whether a technological or functional innovation reaches the respective market segment and gets the chance to improve the daily lives of human listeners. The presentation will shed light on existing and near-future (AI-based) hearing aid technology.
Bio
Dr. Peter Derleth (b. 1968) received a degree in applied physics (1995) and a PhD in psychoacoustics (1999) from the University of Oldenburg, Germany. He has been employed at Sonova AG, Switzerland, since 2000, in the position of Principal Expert 'Hearing Performance', which covers the fields of acoustics, audiology, algorithmic research, and performance profiling. His research topics range from acoustic stability enhancement (feedback cancelling) through directional (beamforming) and spectral algorithms (gain models, noise cancelling, frequency manipulations) to binaural and psychoacoustic effects. His latest focus is on applications of AI-based signal processing for improved hearing performance.
Dominika Woszczyk, Imperial College London, UK
Voice Conversion for Lombard Speaking Style with Implicit Acoustic Feature Conditioning
Dominika C Woszczyk (Imperial College London); Sam Ribeiro (Amazon Alexa); Thomas Merritt (Amazon); Daniel Korzekwa (Nvidia)
Abstract
Lombard speaking style in Text-to-Speech (TTS) systems can enhance speech intelligibility and be advantageous in noisy environments and for individuals with hearing loss. However, training such models requires a large amount of data and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation technique to train TTS systems when data from the target speaker in the desired speaking style is unavailable. Our focus in this study is on Lombard speaking style conversion, aiming to convert speaker identity while retaining the distinctive acoustic characteristics of the Lombard style. We compare voice conversion models with implicit and explicit acoustic feature conditioning. Our results show that our implicit conditioning strategy achieves an intelligibility gain comparable to the model conditioned on explicit acoustic features, while also preserving speaker similarity.
Lorena Aldana, University of Edinburgh, UK
Designing the Audio-Visual Speech Enhancement Challenge (AVSEC)
Abstract
The Audio-Visual Speech Enhancement Challenge (AVSEC) sets the first benchmark in the field of audio-visual speech enhancement, providing a carefully designed dataset and scalable protocol for human listening evaluation of AV-SE systems. AV scenes comprise audio and video of a target speaker mixed with an interferer that can be either noise or a competing speaker. Target speaker videos are selected from LRS3. AV-SE systems are evaluated in terms of intelligibility from listening tests with human participants. To evaluate the systems, we propose a scalable and efficient method to assess intelligibility from “in-the-wild stimuli” that does not require a specific sentence structure. This talk will present the scope and limitations of current design choices in AVSEC.
Michael Akeroyd, University of Nottingham, UK
The COG-MHEAR project - Towards cognitively-inspired 5G-IoT enabled, multi-modal Hearing Aids
Michael A Akeroyd (University of Nottingham), Amir Hussain (Edinburgh Napier), Peter Bell (Edinburgh), Ahsan Adeel (Wolverhampton), Qammar Hussain Abbasi (Glasgow), Steve Renals (Edinburgh), Tughrul Arslan (Edinburgh), Tharmalingam Ratnarajah (Edinburgh), Lynne Baillie (Heriot-Watt), Mathini Sellathurai (Heriot-Watt), Muhammad Imran (Glasgow), Emma Hart (Edinburgh Napier), Ahmed Al-Dubai (Edinburgh Napier), William Buchanan (Edinburgh Napier), Alexander Casson (Manchester), & Dorothy Hardy (Edinburgh Napier)
Abstract
The low take-up of hearing aids, their limited use, their stigma, the effort required to use them, and the limitations in what they can do for speech enhancement remain fundamental problems for auditory research. The COG-MHEAR project is a 4-year EPSRC-funded project that is taking a transformative, interdisciplinary approach to addressing some of these issues. We are creating prototypes of multi-modal aids which not only amplify sounds but also use information collected from a range of sensors to improve understanding of speech, including visual information about the speaker's lip movements, hand gestures, and similar cues. But such devices bring challenges in preserving privacy and operating with minimum power and minimum delay. In this talk we will give an overview of the project, some of its results, and some of the remaining challenges.