Abstract
Lipreading has been rapidly developed recently with the help of large-scale datasets and big models. Despite the significant progress made, the performance of lipreading models still falls short when dealing with unseen speakers. Therefore by analyzing the characteristics of speakers when uttering, we propose a novel parameter-efficient fine-tuning method based on spatio-temporal information learning. In our approach, a low-rank adaptation module that can influence global spatial features and a plug-and-play temporal adaptive weight learning module are designed in the front-end and back-end of the lipreading model, which can adapt to the speaker’s unique features such as the shape of the lips and the style of speech, respectively. An Adapter module is added between them to further enhance the spatio-temporal learning. The final experiments on the LRW-ID and GRID datasets demonstrate that our method achieves state-of-the-art performance even with fewer parameters.
Type
Publication
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
In this paper, we propose a novel speaker-adaptive lipreading model.
For front-end network, conv-based LoRA modules are used to adapt to speaker’s space features.
For back-end network, a plug-and-play TAWL module is designed to learn temporal characteristics.
An Adapter module is finally employed to bridge the adaptation knowledge from front-end and back-end.
The experiments show that the proposed method achieve the state-of-the-art performance on both word-level and sentence-level dataset with fewer training parameters.