We are honored to welcome the following keynote speakers to present at the conference:
Speaker 1: Prof. Eng Siong Chng, Nanyang Technological University (NTU), Singapore
Title: Enabling LLM for ASR
Abstract:
Decoder-only LLMs, such as ChatGPT, were originally developed to accept only text input. Recent advances have enabled them to handle other modalities, such as audio, video, and images. This talk focuses on integrating the speech modality into LLMs. The research community has proposed various innovative approaches for this task, including discrete speech representations, integrating pre-trained speech encoders with existing LLM decoder architectures (e.g., Qwen), multitask learning, and multimodal pretraining. In the talk, I will review recent approaches to the ASR task using LLMs and introduce two works from NTU’s Speech Lab: (i) “Hyporadise,” which applies an LLM to the N-best hypotheses generated by a traditional ASR model to improve the top-1 transcription, demonstrating that LLMs not only exceed the performance of traditional language model rescoring but can also recover and generate correct words not found in the N-best hypotheses, an ability we call Generative Error Correction (GER); and (ii) noise-robust ASR with LLMs, which extends the Hyporadise approach with noisy language embeddings that capture the diversity of the N-best hypotheses under low-SNR conditions, showing improved GER performance with fine-tuning.
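To make the GER idea concrete, the sketch below shows one way N-best hypotheses from a conventional ASR decoder could be packed into a prompt for a text-only LLM, which then produces a corrected top-1 transcription. This is a minimal illustration, not the Hyporadise implementation: the prompt wording and the `query_llm` callable are assumptions for demonstration purposes.

```python
# Minimal sketch of LLM-based Generative Error Correction (GER) over N-best
# ASR hypotheses, in the spirit of Hyporadise. The prompt wording and the
# `query_llm` helper are illustrative assumptions, not the released code.

from typing import Callable, List


def build_ger_prompt(nbest: List[str]) -> str:
    """Pack the N-best hypotheses into a single instruction for the LLM."""
    numbered = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "The following are N-best transcription hypotheses from a speech "
        "recognizer. They may contain errors. Output the single most "
        "likely correct transcription; you may combine words from "
        "different hypotheses or use words that none of them contain.\n"
        f"{numbered}\n"
        "Corrected transcription:"
    )


def generative_error_correction(nbest: List[str],
                                query_llm: Callable[[str], str]) -> str:
    """Return the LLM's corrected top-1 transcription.

    `query_llm` is any callable mapping a prompt string to the model's text
    completion (e.g., a wrapper around a chat API or a fine-tuned
    decoder-only model).
    """
    return query_llm(build_ger_prompt(nbest)).strip()


if __name__ == "__main__":
    # Toy N-best list; in practice this comes from beam-search decoding.
    hypotheses = [
        "the whether is nice today",
        "the weather is nice to day",
        "the whether is nice to day",
    ]
    # Stub LLM for demonstration; replace with a real model call.
    fake_llm = lambda prompt: "the weather is nice today"
    print(generative_error_correction(hypotheses, fake_llm))
```

Because the LLM generates the corrected text rather than merely re-ranking the given candidates, it can introduce correct words that appear in none of the N-best hypotheses, which is what distinguishes GER from traditional language-model rescoring.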
Speaker 2: Prof. Xipeng Qiu, Fudan University
Title: From Large Language Model to World Model
Abstract:
Large language models (LLMs) still fall short on complex, multimodal, and long-term memory tasks. To address these limitations, an LLM can interact with a real environment and learn continuously, evolving into a world model that overcomes some of the current limitations of LLMs on tasks requiring an understanding of the physical and social world. However, compared with LLMs, the technical route toward world models is not yet clear, and the future development path remains debated. This talk discusses how to improve the capabilities of LLMs from the perspectives of multilingual and multimodal expansion, embodied learning, and related directions, with the goal of achieving a world model. It will also present some of the latest research progress on the LLM MOSS2.