


Paper ID | Session Name | Title | Authors | Authors’ affiliations | Abstract |
---|---|---|---|---|---|
001 | Human Robot Interaction | Human-Robot Pose Tracking Based on CNN with Color and Geometry Aggregation | Yue Xu, Yinlong Zhang, Yuanhao Liu, Wei Liang, Hongsheng He | Shenyang University of Technology; Guangzhou Institute of Industrial Intelligence; Shenyang Institute of Automation Chinese Academy of Sciences; University of Chinese Academy of Sciences; The University of Alabama | Accurately tracking the robotic arm and human joints is crucial to ensure safety during human-robot interaction. However, traditional pose tracking methods often exhibit insufficient performance and robustness in complex environments. The variations in the robotic arm’s environment and posture make it challenging for traditional methods to accurately capture the positions and posture of its joints. Specifically, when addressing challenges such as high similarity, occlusion, background complexity, and joint recognition failures, they often struggle to provide reliable and accurate results. To address these challenges, this paper proposes a human-robot pose tracking algorithm based on a new convolutional neural network model. To enhance detection accuracy, an improved color detection module is introduced to resolve joint misclassification. A geometric perception module is designed to accurately locate joints even under occlusion. Additionally, innovative iEMA and DBB modules are incorporated. The iEMA module employs edge detection technology to dynamically adjust thresholds for correct matching by improving the edge matching process. The DBB module refines the boundary box parameters for precise localization by introducing adaptive bounding box updates in real-time. This algorithm also integrates human pose recognition, enabling real-time pose recognition, thereby facilitating more intelligent and natural human-robot interaction. The algorithm has been rigorously evaluated on a custom-designed robotic arm platform. Experimental results validate the algorithm’s effectiveness and feasibility. |
002 | Human Robot Interaction | The Impact of Synchronized Visual and Auditory Attention on Human Perception | Lichuan Jiang, Jiani Zhong, Muqing Jian, Xuanzhuo Liu, Siqi Cai, Haizhou Li | The Chinese University of Hong Kong; Technical University of Munich, Germany; University of Bremen | The cocktail party problem shows the remarkable human ability to selectively attend to and recognize one source of auditory input in a noisy environment. However, individuals may struggle to identify a speaker’s voice when they are unfamiliar with the speakers and do not have a clear visual focus, which results in less visual information. This raises the question: How can visual information aid in extracting information from a speaker’s voice? This study explores how synchronized visual and auditory attention impacts human perception in scenarios involving two speakers. Using Tobii Glasses 3 to track participants’ eye movements and pupil diameters, combined with questionnaire responses, we explore how these factors influence speech comprehension. Our results demonstrate that participants achieve greater accuracy in speech comprehension when they focus their gaze on the speaker they are listening to, compared to scenarios where visual attention is divided between speakers or where they rely solely on auditory cues. These findings highlight the effectiveness of synchronizing visual and auditory attention in improving the acquisition and processing of information. |
003 | Soft Robots 1 | Drift-Free Ionotronic Sensing | Canhui Yang | Southern University of Science and Technology | Skin-like soft pressure sensors enable artificial haptic technologies for myriad applications in robotics, healthcare, and beyond. A soft sensor must detect pressure with both high sensitivity and high accuracy. However, for existing soft pressure sensors, viscoelastic creep of the soft materials causes signal drift, resulting in unreliable measurements that might lead to an incorrect trigger or safety concerns. Among the many types of soft pressure sensors, ionotronic sensors exhibit superior sensing properties owing to the nanoscale charge separation at the electric double layer. However, signal drift is particularly prevalent in ionotronic sensors owing to leakage of the ionic solvent, in addition to the viscoelastic creep. This talk will introduce our recent advances in realizing drift-free ionotronic sensing. We do so by designing and copolymerizing a leakage-free and creep-free polyelectrolyte elastomer containing two types of segments: charged segments having fixed cations to prevent ion leakage and neutral slippery segments with a high crosslink density for low creep. We show that an ionotronic sensor using the polyelectrolyte elastomer barely drifts under an ultrahigh static pressure of 500 kPa (close to its Young’s modulus), exhibits a drift rate two to three orders of magnitude lower than that of sensors adopting conventional ionic conductors, and enables steady and accurate control for robotic manipulation. Such drift-free ionotronic sensing promises highly accurate sensing in robotics and beyond. |
004 | Soft Robots 1 | Harnessing Mechanical Instability and Viscosity for Embedded Control and Sensing in Soft Robots | Lishuai Jin | | Over the past decades, there have been considerable advancements in soft robots that attempt to bridge the gap between conventional machines with high-performance but rigid components and biological organisms with remarkable versatility and adaptability. In this talk, I will introduce two strategies to shed light on the longstanding challenges in pneumatic-driven soft robots. First, we develop multifunctional robots that can operate with a single pressure input by coupling actuators with passive valves that can harness the flow characteristics to create functionality. Second, we geometrically design a class of metacaps by introducing an array of ribs to a spherical cap to realize programmable bistabilities and snapping behaviors. These metacaps enable soft robots with embedded control and sensing capabilities, including sensor-less grippers and autonomous swimming robots, facilitating the design of next-generation soft robots with high transient output energy and electronics-free maneuvering. |
005 | Robot Control | SegmentAnything-Based Approach to Scene Understanding and Grasp Generation | Songting Liu, Zhu Haiyue, Zhezhi Lei, Jun Ma, Zhiping Lin | Harbin Engineering University; National University of Singapore; The Hong Kong University of Science and Technology (Guangzhou) | Autonomous robot grasping in multi-object scenarios poses significant challenges, requiring precise grasp candidate detection, determination of object-grasp affiliations, and reasoning about inter-object relationships to minimize collisions and collapses. To address these challenges, this research develops a dedicated grasp detection model called GraspAnything, which is extended from the SegmentAnything (SAM) model. The GraspAnything model receives bounding boxes as prompts and simultaneously outputs the masks of objects and all possible grasp poses for a parallel-jaw gripper. A grasp decoder module is added to the SAM model to enable grasp detection functionality. Experimental results demonstrate the effectiveness of our model in grasp detection tasks. The implications of this research extend to various industrial applications, such as object picking and sorting, where intelligent robot grasping can significantly enhance efficiency and automation. The developed models and approaches contribute to the advancement of autonomous robot grasping in complex, multi-object environments. |
006 | Soft Robots 1 | Modular Embodiment of Control in Pneumatic Soft Robots | Qiguang He | The Chinese University of Hong Kong | Pneumatic soft robots typically interact with their environments via feedback loops consisting of electronic sensors, microcontrollers, and actuators, which can be bulky and complex. Researchers have sought new strategies for achieving autonomous sensing and control in next-generation soft robots. Here, we demonstrate electronics-free approaches for autonomous control of soft robots, whose compositional and structural features embody the sensing, control, and actuation feedback loop of their soft bodies. Specifically, we design multiple modular control units regulated by responsive materials such as liquid crystal elastomers. These modules enable the robot to sense and respond to external stimuli (light, heat, and solvents), causing autonomous changes to the robot’s trajectory. By combining multiple control modules, complex responses can be achieved, such as logical evaluations that require multiple events to occur in the environment before an action is performed. This framework for embodied control offers a new strategy for autonomous soft robots that operate in uncertain or dynamic environments. |
007 | Human Robot Interaction | Potential-Field-Based Motion Planning for Social Robots by Adapting Social Conventions | Ziwei Yin, Zhonghao Zhang, Wanyue Jiang, Shuzhi Sam Ge | Qingdao University; National University of Singapore | Social robot behavior should conform to human social conventions. Social conventions concerning the social distance for interaction, the silence distance for avoiding disturbance, the safety distance for avoiding collision, the left-side passing-by preference, and the face-to-face communication rule are embedded in the motion planning procedure. Potential-field-based motion planning algorithms are designed in this paper, which not only consider the above-mentioned social conventions but also take stationary obstacles and pedestrian avoidance into account. Simulations in different cases are conducted to verify both the effectiveness of the potential field and the compliance with the social conventions. |
008 | Soft Robots 1 | Bioinspired Soft Robotics: Research on Magnetic Field-Regulated Flexible Actuators for Caterpillar-Inspired Locomotion | Huimin Zhu, Qi Chen, Weitian Zhang, Dongning Gao, Hongmiao Tian | Xi’an Jiaotong University | In the context of the widespread application of robotics across various industries, traditional rigid robots are limited by their mechanical structure and drive systems, rendering them inadequate for complex working conditions. In contrast, soft robots, composed of flexible materials, offer greater degrees of freedom and environmental adaptability. This study aims to explore the advantages of soft robots and proposes a magnetically-driven soft robot inspired by the motion of a caterpillar. This is achieved through the coupling of a magnetically-driven flexible membrane actuator and a caterpillar-inspired flexible structure, enabling adaptation to complex working conditions. The study utilized materials with magnetic properties, such as neodymium-iron-boron, which underwent low-dimensional processing, surface modification, and bonding treatments to prepare the magnetically-driven flexible membrane actuator. Subsequent testing and optimization resulted in the selection of a high-performance actuator. Following this, a soft robot structure mimicking the crawling motion of a caterpillar was designed and validated for feasibility. Ultimately, through the optimization of drive and structural parameters, the robot’s crawling speed was enhanced, and its adaptability under different conditions was explored. The latest research findings demonstrate that the coupling of the magnetically-driven flexible membrane actuator with the caterpillar-inspired flexible structure exhibits broad adaptability and excellent motion performance. This study provides new insights and approaches for the development of soft robots, offering crucial references for the performance optimization of soft robots in practical applications, with the potential to advance their utilization in complex working conditions. |
009 | Soft Robots 1 | Self-Propelling Tensegrity Structure | Changyue Liu, Kai Li, Zhijian Wang | Beihang University; Anhui Jianzhu University | Tensegrity structures are a class of soft modular structures that combine stiff compressive rods and soft tensile cables, exhibiting unique characteristics such as large deformability, a high stiffness-to-mass ratio, substantial load-bearing capacity, and exceptional impact resistance. Active and dynamic tensegrity designs show great potential for soft robots. Herein, by integrating thermally responsive cables, non-responsive cables, and stiff rods, we construct a hybrid tensegrity structure that can be self-propelled to move continuously on a hot surface. Owing to its special geometry, the hybrid tensegrity structure can easily realize multimodal self-propelling locomotion, which is very challenging for previously demonstrated self-propelling structures. Last, we construct a modular and re-assemblable tensegrity structure using Velcro tapes to adhere the rods and cables together. We envision that the scalable and modular tensegrity structure could be beneficial for construction in the fields of soft robotics and planetary exploration. |
010 | Human Robot Interaction | Diverse Gaussian Sampling for Human Motion Prediction | Jiefu Luo, Jiansheng Wang, Zhenfei Liu, Jun Cheng | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Chinese University of Hong Kong | In many industrial applications, it is necessary to obtain accurate human motion prediction. The prediction of human movement based on a 3D skeleton, which forecasts future postures using the human skeletal structure, significantly aids in enabling machines to understand human behavior and react more intelligently. Additionally, by predicting human actions, potential dangers can be perceived, especially in safety-related issues, necessitating a broader range of predictions. In this paper, considering the greater diversity generated by Variational Autoencoders (VAEs), we employ VAEs for prediction. In contrast to other studies, our primary focus is on obtaining more diverse samples. Extensive experiments demonstrate that our proposed model performs well on the Human3.6M and HumanEva-I datasets. |
011 | Al for Good | Multi-Scale Separable Convolution and Dilated Attention for Machinery Fault Diagnosis | Lijun Zhang, Dong Qiu, Jinjia Wang | Yanshan University | In recent years, convolutional neural networks (CNNs) have significantly advanced fault diagnosis, enhancing the performance of intelligent diagnostic models. However, due to complex noise in industrial environments, relying solely on CNNs’ local feature extraction is insufficient for obtaining comprehensive fault information. Therefore, this paper proposes a mechanical fault diagnosis model that integrates multi-scale separable convolution and dilated attention mechanisms (MSCDA). This approach leverages the local and global feature extraction capabilities of CNNs and Transformers to improve diagnostic performance in noisy environments. The model first extracts local features from vibration signals using multi-scale separable convolution (MSC), then employs a multi-scale dilated attention (MSDA) mechanism to capture additional feature information. Experimental results demonstrate that the proposed method outperforms other CNN and Transformer-based fault diagnosis methods in terms of accuracy and robustness, validating its superiority. |
012 | AI for Mental Health | CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare | Jingwei Zhu, Minghuan Tan, Min Yang, Ruixue Li, Hamid Alinejad-Rokny | University of Science and Technology of China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Xiangshui County Party School; The University of New South Wales | The rapid progress in Large Language Models (LLMs) has prompted the creation of numerous benchmarks to evaluate their capabilities. This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB), showcasing how dataset diversity and distribution in supervised fine-tuning (SFT) may enhance LLM performance. Remarkably, we successfully trained a smaller base model to achieve scores comparable to larger models, indicating that a diverse and well-distributed dataset can optimize performance regardless of model size. This study suggests that even smaller models may reach high performance levels with carefully curated and varied datasets. By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies. Our results imply that a broader spectrum of training data may enhance a model’s ability to generalize and perform effectively across different medical scenarios, highlighting the importance of dataset quality and diversity in fine-tuning processes. |
013 | Robot Control | Tolerant Tracking Control Protocol for PMSM Based on Policy Iteration Algorithm and Fault Compensation | Shuya Yan, Xiaocong Li, Huaming Qian, Jun Ma, Abdullah Al Mamun | Nanyang Technological University; National University of Singapore; Agency for Science, Technology and Research (A*STAR); The Hong Kong University of Science and Technology (Guangzhou) | In robotics applications, ensuring reliable performance in the presence of actuator faults is essential for maintaining system safety and reliability. This paper presents a tolerant tracking control method for permanent magnet synchronous motors (PMSMs) based on adaptive dynamic programming and fault compensation. The method simultaneously considers tracking accuracy and energy consumption through a policy iteration algorithm. In the optimality analysis of the algorithm, more relaxed conditions are provided to demonstrate that the performance function can converge to a near-optimal value within a finite number of iterations. In practical implementation, an actor-critic network is used to approximate the performance function and control protocol, alongside a fault detection mechanism based on an expanded time horizon, which achieves fault detection from arbitrary initial values. The effectiveness of the proposed algorithm is verified using a high-fidelity PMSM model in Simulink. |
014 | Al for Good | Content-guided Efficient Learner for Audio-Visual Emotion Recognition | Guanjie Huang, Weilin Lin, Li Liu | The Hong Kong University of Science and Technology | Audio-Visual Emotion Recognition (AVER) is essential in various real-world applications. Many methods try to extract and fuse the audio and visual modalities to better comprehend and classify the underlying emotion. Recently, large pre-trained models have brought powerful modality-fusion ability on general datasets and significantly outperformed traditional small-scale models. However, they are less effective at exploiting modality complementarity in some specialized scenarios, owing to conflicts of meaning between the two modalities. In this paper, we propose an efficient fine-tuning method, Content-guided Audio-Visual efficient Learner (C-AVeL), to solve this problem with minimal computational cost. Specifically, we propose an adapter network upon the pre-trained audio and visual transformers for modality fusion. To better fuse the two modalities, we propose to use content tokens with latent attention, where the audio and visual information are aligned under the guidance of the specific content. Extensive experiments on the CREMA-D dataset verify the effectiveness and efficiency of our proposed framework. |
015 | Soft Robots 1 | Mechanics and Design of a Bio-Inspired Self-Rolling Robot | Yutang Zhou, Xudong Liang | Harbin Institute of Technology, Shenzhen | Drosophila larvae, as soft-bodied and segmented animals, can swiftly escape from dangers by continuously rolling in a “C-shaped” bend. Traditional rolling robots, designed for applications such as surveillance and cleaning, often require complex controllers and sophisticated structures to maintain their rolling motion. Here, we present a rolling mechanism inspired by the continuous rolling observed in Drosophila larvae. We show that body curvature develops along the rolling direction and that sequential axial muscle contraction is synchronized with body rotation. Inspired by the continuous rolling mechanism of Drosophila larvae, we present a novel self-rolling robot composed of multiple segments connected by elastic linkages, each equipped with an independent motor drive. Through the interaction of collective segments, the “C-shaped” deformation of the robot is realized in order to perform a continuous self-rolling motion. We evaluate the robot’s performance with varying numbers of segments and degrees of bending. The results demonstrate the effectiveness of our design in achieving stable and efficient rolling motion. This work opens new avenues for the development of bioinspired rolling robots with simplified control mechanisms and structures. |
016 | AI for Mental Health | Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration | Shiwen Ni, Min Yang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences | Intelligent dialogue systems are increasingly used in modern education and psychological counseling, but most existing systems are limited to a single domain: they cannot deal with both educational and psychological issues, and they often lack accuracy and professionalism when handling complex problems. To address these problems, this paper proposes an intelligent dialogue system that combines educational and psychological counseling functions. The system consists of multiple AI agents, including a security detection agent, an intent identification agent, an educational LLM agent, and a psychological LLM agent, which work in concert to ensure the provision of accurate educational knowledge Q&A and psychological support services. Specifically, the system recognizes user-input intentions through an intention classification model and invokes a retrieval-augmented educational large model and a psychological large model fine-tuned with psychological data in order to provide professional educational advice and psychological support. |
017 | Multi-modal Models | Semi-Supervised Speaker Localization with Gaussian-like Pseudo-labeling | Xinyuan Qian, Chen Lu, Yating Zhang, Kainan Chen, Haizhou Li | University of Science and Technology Beijing; Eigenspace GmbH; The Chinese University of Hong Kong | Speaker localization is important for many human-robot interaction applications. Most existing localization studies train models with fully-annotated data through supervised learning strategies, which does not fit real-world scenarios where labeled data is scarce. To address the challenge of limited labeled data, we propose a semi-supervised deep learning algorithm for Direction of Arrival (DoA) estimation. Specifically, the model is enhanced through a process where it generates pseudo-labels for unlabeled data and incorporates a sophisticated filtering mechanism. This refined approach retrains the model by integrating both the labeled data and the data enriched with these pseudo-labels, thereby optimizing its learning capabilities. Experimental results show that it outperforms both supervised learning algorithms and other semi-supervised learning algorithms. |
018 | Al for Good | MCCS: The First Open Multi-Cuer Mandarin Chinese Cued Speech Dataset and Benchmark | Li Liu, Lufei Gao, Wentao Lei, Yuzhi He, Yuxing He, Che Feng, Yue Chen, Zheyu Li | The Hong Kong University of Science and Technology (Guangzhou); The Experimental High School Attached to Beijing Normal University; University of Edinburgh; University of Nottingham Ningbo China | Cued Speech (CS) is an augmented lip-reading system complemented by hand coding, and it is very helpful to deaf people. Automatic CS recognition and generation can facilitate communication between deaf people and others. A CS dataset is essential for establishing AI-based automatic recognition and generation models for CS. Previous CS datasets were mainly in English and French, and their data volume was small, with a single cuer (i.e., a person who performs CS), which hinders research progress in this field. Therefore, we have constructed, for the first time, a Mandarin Chinese CS Dataset (MCCSD) containing 4000 CS videos from four native Chinese CS cuers. Importantly, we propose a novel GAN-based CS video gesture generation baseline for the first time. To further validate the effectiveness of this dataset, we build a benchmark for both automatic CS video recognition and generation. Experimental results demonstrate that MCCSD serves as a valuable benchmark for CS recognition and generation, presenting new challenges and insights for future research. The complete dataset, benchmark, and source code will be made publicly available. |
019 | Medical Robots | Design, Modelling, and Testing of Soft-Rigid Integrated Upper Limb Exoskeleton | Yinan Li, Zihan Pu, Yanqiong Fei | Shanghai Jiao Tong University; Shenzhen Research Institute of Shanghai Jiao Tong University | This paper presents the design, modelling, and testing of a novel soft-rigid integrated upper limb exoskeleton (SR-ULE). The SR-ULE consists of a multi-directional actuator, a unidirectional actuator, and wearable accessories (actuator sheath, back base, etc.). The SR-ULE can assist users with shoulder abduction/adduction, shoulder flexion/extension, and elbow bending/extension movements. From the perspective of material mechanics, the mechanical equilibrium analytical equation for the multi-material fusion of the actuator is obtained. This equation directly describes the analytical relationship between the working pressure, bending curvature, and output torque of the actuator, which has important guiding significance for the fabrication and use of the actuator. The performance parameters of the actuator were obtained through free bending tests and mechanical performance tests. Human wearing tests have demonstrated that the SR-ULE can assist upper limb movement. |
020 | Multi-modal Models | A Review of Human Mesh Reconstruction: Beyond 2D Video Object Segmentation | Peng Wu, Zhicheng Wang, Feiyu Pan, Fangkai Li, Hao Hu, Xiankai Lu, Yiyou Guo | Shandong University; Quanzhou Normal University | Video object segmentation aims to extract 2D object masks by partitioning video frames into multiple objects, and it dominates numerous practical applications such as medical imaging. However, traditional video object segmentation methods predict 2D masks, which are not compatible with 3D scenarios where depth information matters, such as robotic grasping, virtual reality, and autonomous driving. In this paper, we provide a systematic review of 3D human mesh reconstruction (HMR) beyond 2D video object segmentation. Firstly, we review the mainstream video object segmentation methods. Afterward, we transition from 2D video object segmentation to 3D HMR. We further categorize recent HMR methods along the main characteristics that underlie this research field, including the types of model input and the employment of statistical models. Finally, we provide the details of HMR datasets and evaluation metrics. |
021 | Human Robot Interaction | Am I a Social Buddy? A Literature Review on Socially Appealing Design and Implementation Methods for Social Robots | Andreea Niculescu, Kheng Hui Yeo | Institute for Infocomm Research; A*STAR Research Entities | This paper reviews socially appealing design and implementation methods for social robots published between 2020 and 2024, focusing on three critical traits: human-like communication, empathy and emotional intelligence, and personality. Analyzing 29 recent empirical studies, we highlight key trends in human-robot interaction (HRI). Recent advancements in natural language processing (NLP) and multimodal interaction, such as Large Language Models (LLMs) and context-aware frameworks, have significantly improved robots’ ability to handle complex conversations and interact effectively in multi-party settings. Additionally, Generative Adversarial Networks (GANs) have enhanced robots’ expressiveness by generating non-verbal cues like co-speech gestures. Advances in emotion recognition, including multimodal data fusion and physiological sensors, have led to robots that are more responsive and emotionally intelligent, with more pleasant personalities. These developments indicate a shift towards robots that offer not only functional assistance but also emotional support, enhancing overall user satisfaction and engagement. |
022 | Robot Control | Optimization-Based Trajectory Planning for Autonomous Ground Vehicles | Haoran Xu, Qinyuan Ren | Zhejiang University | In social environments, complex interactive scenes and various tasks bring great challenges to the motion planning of autonomous ground vehicles. Application scenarios typically require vehicles to plan a smooth trajectory in real time that takes the shortest amount of time and conforms to all constraints. The construction of an optimal control problem in the state space is a common approach in this field; however, it inevitably entails a compromise between the optimality of the trajectories and the computational efficiency. The proposed method formulates a trajectory optimization problem based on differential flatness theory, which realizes efficient obstacle avoidance while satisfying the nonholonomic constraints of the ground vehicles. The representation of trajectories is simplified, and a trajectory planning problem is constructed in the differential flatness space of vehicles. Furthermore, safe driving corridors are utilized to achieve smooth obstacle avoidance. The output trajectories are tracked by a model predictive controller for deployment on autonomous ground vehicles. Experiments in both simulation and the real world are conducted to demonstrate the feasibility of the algorithms in complex scenarios. |
023 | Robot Control | Mapless Navigation in Factory Environments with Safe RL Approach | Junyi Hou, Qinyuan Ren | Zhejiang University | Navigation of Automated Guided Vehicles (AGVs) in factory environments is a vital application scenario for autonomous systems. However, due to the complexity and variability of these environments, designing collision-free navigation strategies is highly challenging. To address this, this paper introduces a framework that leverages Deep Reinforcement Learning (DRL) for AGV navigation within factory environments. We prioritize safety throughout both the training phase and the execution of navigation strategies. Specifically, we employ a Signed Distance Function (SDF) to accurately represent the spatial relationship between the AGVs and obstacles, and integrate these constraints within the Markov Decision Process (MDP). The proposed method is evaluated in Gazebo simulation environments, where AGVs navigate safely to reach designated storage racks. The experimental outcomes reveal that the proposed method successfully learns obstacle-avoiding navigation, exhibiting strong generalization capabilities across various environment settings. |
024 | AI for Mental Health | A Transformer-based Depression Detection Network Leveraging Speech Emotional Expression Cues | Changqing Xu, Xinyi Wu, Nan Li, Xin Wang, Feng Xu, Rongfeng Su, Nan Yan, Lan Wang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peking University Shenzhen Institute; Good Mood Health Industry Group Co. | In recent years, significant progress has been made in automated depression detection methods using speech and text data combined with deep learning. However, few studies have explored the connection between depression and speech emotions. To address this issue, this paper proposes a novel Transformer-based network leveraging Speech Emotion Information (SEI) for depression detection. The proposed network consists of a primary network used for depression classification and an auxiliary network for speech emotion classification. In the primary network, the pre-trained HuBERT and RoBERTa are used to obtain the short-term acoustic and textual features, respectively. The long-term audio and text features are then aggregated from the short-term features using Transformer-based approaches with average pooling. The SEI extracted from the auxiliary network serves as supplementary features aimed at augmenting the precision of depression recognition. Based on the proposed method, our best experimental results achieved an accuracy of 76.10% at the subject level. The experimental results demonstrate that incorporating speech emotions in depression detection improves diagnostic accuracy, offering a new perspective for research in this area. |
025 | AI for Mental Health | Alzheimer’s Disease Detection Based on Large Language Model Prompt Engineering | Tian Zheng, Xurong Xie, Xiaolan Peng, Hui Chen, Feng Tian | Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences | In light of the growing proportion of older individuals in our society, the timely diagnosis of Alzheimer’s disease has become a crucial aspect of healthcare. In this paper, we propose a non-invasive and cost-effective detection method based on speech technology. The method employs a pre-trained language model in conjunction with techniques such as prompt fine-tuning and conditional learning, thereby enhancing the accuracy and efficiency of the detection process. To address the issue of limited computational resources, this study employs the efficient LoRA fine-tuning method to construct the classification model. Following multiple rounds of training and rigorous 10-fold cross-validation, the prompt fine-tuning strategy based on the LLaMA2 model demonstrated an accuracy of 81.31%, representing a 4.46% improvement over the control group employing the BERT model. This study offers a novel technical approach for the early diagnosis of Alzheimer’s disease and provides valuable insights into model optimization and resource utilization under similar conditions. It is anticipated that this method will prove beneficial in clinical practice and applied research, facilitating more accurate and efficient screening and diagnosis of Alzheimer’s disease. |
026 | Multi-modal Models | Synergized Twin Layer for Federated Action Recognition | Yanshu He | Beijing University of Posts and Telecommunications | In recent years, Federated Learning (FL) has gained significant traction as an effective solution for various computer vision applications, particularly due to its strengths in preserving data privacy and minimizing communication overhead. However, its application to advanced tasks like video-based action recognition introduces unique complications. Additionally, the client drift caused by data heterogeneity remains a significant challenge. To address this challenge, we introduce a novel framework, Synergized Twin Layer for Federated Action Recognition (STL-FAR), which leverages both local and global layers to harmonize client-specific patterns with a global model representation. Specifically, the STL-FAR framework is composed of two main components in Synergized Twin Layer (STL): a local classifier that adapts to local client data, enhancing the robustness and accuracy of action recognition on individual clients, and a global classifier that unifies the local models into a consolidated global model through federated averaging, thereby reducing inter-client discrepancies and improving overall model generalization. Moreover, we incorporate a Cloud-to-Client Knowledge Distillation (CCKD) mechanism within the framework, where the server supervises the client, ensuring consistent and robust performance across clients. We demonstrate the efficacy of STL-FAR through extensive experiments on two benchmark action recognition datasets. Our findings indicate that STL-FAR outperforms existing federated learning methods. This work advances federated action recognition and provides a promising solution to the issue of data heterogeneity in federated learning scenarios. |
027 | Multi-modal Models | Video Question Answering Based on Audio-Visual Hyper Graphs | Shuai Zhang | Beijing University of Posts and Telecommunications | In this paper, we tackle the Visual-Audio Question Answering (VQA) task, which requires addressing questions about objects, sounds, and their interrelationships within videos. Effective VQA necessitates capturing the presence and evolving relationships of subjects and objects over time while leveraging both video and audio modalities. We propose a novel framework, Audio-Visual Hyper-Graph VQA (AVHG-VQA), which utilizes audio-visual modalities to construct situation audio-visual hyper-graphs (AVHG) for answering video-related questions. AVHG provide a structured representation by detailing sub-graphs for individual frames and connecting them with hyper-edges, encapsulating relevant information in a compact form. Our framework involves training a hyper-graph decoder to implicitly identify person-object relationships within video segments. We use cross-attention between predicted AVHG and question embeddings to determine answers. The training process includes two stages: first, extracting relationships from video frames with a pre-trained scene graph model and using these as labels; second, optimizing the model with cross-entropy and Hungarian matching loss functions. We extensively evaluate our framework on the challenging MUSIC-AVQA dataset, which focuses on video-audio modality information. |
028 | AI for Good | Controllable Talking Head Synthesis by Equivariant Data Augmentation for Spatial Coordinates | Wan Ding, Dong-Yan Huang, Zehong Zheng, Tianyu Wang, Linhuang Yan, Xianjie Yang, Penghui Li | Xinyang AI Technology Co. Ltd; UBTech Robotics Corp; Tsinghua Shenzhen International Graduate School | Traditional talking head synthesis algorithms decompose the lips and head-pose information from the facial landmark points based on spatial point registration. In this paper we show that the hypothesis on which traditional point-registration methods depend is too strong and results in unnatural talking heads. Instead of registration, we propose a latent lips-headpose coding method. The proposed method applies self-supervised learning and equivariant data augmentation to the facial landmark points. The experimental results show that the proposed latent lips-headpose coding method outperforms the traditional registration-based methods and can generate natural-looking talking heads with accurate mouth shapes. |
029 | AI for Mental Health | Structured Dialogue System for Mental Health: An LLM Chatbot Leveraging the PM+ Guidelines | Yixiang Chen, Xinyu Zhang, Jinran Wang, Xurong Xie, Nan Yan, Hui Chen, Lan Wang | Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences; University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences Beijing; East China Normal University; Wuhan Research Institute of Posts and Telecommunications | The Structured Dialogue System, referred to as SuDoSys, is an innovative Large Language Model (LLM)-based chatbot designed to provide psychological counseling. SuDoSys leverages the World Health Organization (WHO)’s Problem Management Plus (PM+) guidelines to deliver stage-aware multi-turn dialogues. Existing methods for employing an LLM in multi-turn psychological counseling typically involve direct fine-tuning using generated dialogues, often neglecting the dynamic stage shifts of counseling sessions. Unlike previous approaches, SuDoSys considers the different stages of counseling and stores essential information throughout the counseling process, ensuring coherent and directed conversations. The system employs an LLM, a stage-aware instruction generator, a response unpacker, a topic database, and a stage controller to maintain dialogue flow. In addition, we propose a novel technique that simulates counseling clients to interact with the evaluated system and evaluate its performance automatically. When assessed using both automatic and human evaluations, SuDoSys demonstrates its effectiveness in generating logically coherent responses. The system’s code and program scripts for evaluation are open-sourced. |
030 | AI for Mental Health | Feature Extraction Method Based on Contrastive Learning for Dysarthria Detection | Yudong Yang, Xinyi Wu, Xiaokang Liu, Juan Liu, Jingdong Zhou, Rennan Wang, Xin Wang, Rongfeng Su, Nan Yan, Lan Wang | Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences;University of Chinese Academy of Sciences;University of British Columbia;Peking University Shenzhen Institute | Dysarthria detection is crucial for clinical diagnosis and treatment. However, existing methods predominantly rely on supervised learning, which requires extensive annotated data, resulting in high costs and inconsistent data quality. To address this issue, this paper proposes a feature extraction method for dysarthria detection based on contrastive learning, which does not require annotated data. This method investigates how to extract features from patients and normal individuals using different pre-trained acoustic models. By maximizing the differences in their acoustic feature spaces, this method enhances detection accuracy. Finally, multiple classification methods are employed to detect dysarthria using the extracted features, achieving significant improvements across various evaluation metrics. |
031 | AI for Mental Health | Multi-Source-Domain Adaptation for TMS-EEG based Depression Detection | Jingdong Zhou, Chongyuan Lian, Nan Li, Yudong Yang, Xiaoping Li, Yi Guo, Lan Wang, Nan Yan | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Institute of Neurological and Psychiatric Disorders, Shenzhen Bay Laboratory; Shenzhen People’s Hospital | The development of brain-computer interface technology systems requires accurate decoding of brain activities measured by EEG. However, due to the non-stationary characteristics of EEG signals and intra- and inter-individual variability, it is not easy to construct a reliable and universal evaluation model for different subjects. In practical applications, most target domains are invisible, yet current transfer learning models based on EEG signals mostly assume the target domains are visible. To address this problem, this paper proposes a deep transfer learning framework that reduces individual differences through intra-subject alignment and inter-subject alignment, extracts stable features after superposition averaging using a multi-scale spatio-temporal graph neural network, and employs a multi-source domain distribution normalization method, which enables the model to be effectively generalized to the target domain. The adaptive subject normalization layer introduced in the model gradually realizes the alignment of different source domain distributions during the training process, and executes a Test-Time-Adaptation strategy in the testing phase to achieve dynamic adaptation to the target domain data. The experimental results show that the model outperforms traditional deep learning models on TMS-EEG data, while the ablation experiments verify the effectiveness of the subject-level normalization module in improving the model generalization ability. |
032 | AI for Good | Flying Together with Audio and Video: Enhancing Communication for the Hearing-impaired Through an Emerging Closed Captioning Standard | Luntian Mou, Peize Li, Haiwu Zhao, Qiang Fu, Hong Luo, Cong Liu, Nan Ma, Tiejun Huang, Wen Gao | Beijing University of Technology; Beijing Institute of Artificial Intelligence; Shanghai University of Engineering Science; Photosynthetic AI Tech Co.; China Mobile Information Technology Co.; IFLYTEK Research; Peking University; Peng Cheng Laboratory | As the text-based visual representation of a program’s audio elements, Closed Captioning primarily serves as a technology to enhance communication for the hearing impaired. Since text is much simpler than audio and video, Closed Captioning is traditionally transmitted as supplementary or auxiliary information, either as part of the image or in an extended or private data field of an encoded video bitstream called the video elementary stream, usually accompanied by one or more audio elementary streams. Since Closed Captioning is extremely important for making the audio content of a program accessible to the hearing impaired, we propose to encode the closed caption into a bitstream called the caption elementary stream, which can fly together with the audio and video elementary streams. In other words, closed captions can be stored and transmitted in a manner similar to how audio and video are handled. We have drafted a national standard for Closed Captioning in China, which is now in its final stage of approval and publication. In this paper, the main technical content of the emerging Closed Captioning standard will be introduced. Specifically, the encoding, storage, and transmission of Closed Captioning will be described. Moreover, the decoding and presentation of Closed Captioning under the two scenarios of on-demand streaming and live streaming will also be designed and discussed. Speech-to-Text AI technology enables Closed Captioning to be implemented efficiently with the help of manual proofreading. The emergence of the Closed Captioning standard will enhance accessibility to audio-visual programs on both the broadcasting network and the Internet for the hearing-impaired in China and worldwide. |
033 | Soft Robots 1 | Actuation-function integrated soft robot based on multi-material printing | Mingzhu Zhu | | The design, manufacturing, and control of soft robots are strongly coupled, and emerging manufacturing technologies such as 3D printing provide new ideas for the structural design and system integration of soft robots. Based on multi-material 3D printing technology, a new integrated design and manufacturing method for soft robot actuation structures and functional units is established, which can realize the self-perception and self-adaptation of soft robots under complex working conditions. Several case studies are presented to demonstrate how to use multi-material 3D printing technology to achieve self-sensing and self-adaptation capabilities of soft robots, especially in complex and dynamically changing working environments. Results show that multi-material 3D printing technology not only improves the manufacturing efficiency and cost-effectiveness of soft robots but also expands their functions, enabling them to better adapt to changing environments and tasks. |
034 | Multi-modal Models | FARD: Fully Automated Railway Anomaly Detection System | Yichen Gao, Taocun Yang, Wei Wang | Beijing Jiaotong University; Institute of Computing Technology, China Academy of Railway Sciences Corporation Limited | Foreign object detection is crucial for railway safety, preventing accidents and ensuring smooth operations. Current railway foreign object detection methods face two significant challenges: the scarcity of annotated real-world data and the inability to adapt to complex scenarios. This paper proposes a novel FARD (Fully Automated Railway Anomaly Detection System) approach to address these issues. FARD incorporates two key components: (i) a diffusion model with an inpainting technique to generate a diverse and realistic auxiliary dataset of railway anomalies, effectively representing real-world outliers; (ii) an integrated framework combining a traditional object detection pipeline with a reconstruction-based anomaly detection module for robust foreign object detection in railway environments. Experimental results demonstrate that FARD outperforms traditional object detection methods in identifying anomalies on rail tracks by a large margin. This research offers a robust, data-efficient solution for railway foreign object detection that works well even with limited initial data. |
035 | Multi-modal Models | M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions | Shuai Wang, Pengcheng Zhu, Haizhou Li | Shenzhen Research Institute of Big Data; Fuxi AI Lab, NetEase Inc., Hangzhou, China; Chinese University of Hong Kong (Shenzhen) | Fixed-dimensional speaker embeddings have become the dominant approach in speaker modeling, typically spanning hundreds to thousands of dimensions. The dimensionality is a hyperparameter that is rarely chosen deliberately, and the dimensions are not hierarchically ordered by importance. In large-scale speaker representation databases, reducing the dimensionality of embeddings can significantly lower storage and computational costs. However, directly training low-dimensional representations often yields suboptimal performance. In this paper, we introduce the Matryoshka speaker embedding, a method that allows dynamic extraction of sub-dimensions from the embedding while maintaining performance. Our approach is validated on the VoxCeleb dataset, demonstrating that it can achieve extremely low-dimensional embeddings, such as 8 dimensions, while preserving high speaker verification performance. |
036 | Medical Robots | Continuous Prediction of Multi-Joint Angles Based on Informer Model Using Novel Fused Features | Renyu Wan, Zina Zhu, Guohua Cui, Saixuan Chen | Shanghai University of Engineering Science | Surface electromyography (sEMG) signals, closely related to human motion, have been widely utilized in human-machine interaction systems for the continuous prediction of multi-joint angles. However, achieving precise angle predictions across various movements remains a significant challenge. This study introduces a novel gait feature extraction method and evaluates four models for predicting multi-joint angles during walking, running, upslope, and downslope movements. The results demonstrate that using the fused features as input improves prediction accuracy across all models, with the Informer model outperforming the others. It achieves average correlation coefficients (CC) exceeding 0.99 for all three joints. Furthermore, the mean absolute error (MAE) and root mean square error (RMSE) for the ankle and hip joints are both below 1°, while for the knee joint, both metrics are within 2°. Consequently, the Informer model demonstrates sufficient accuracy to support the continuous prediction of multi-joint angles in various human motion contexts. |
037 | Human Robot Interaction | Arbitrary Surface Modeling in a Common Reality Framework for Human-Robot Collaboration | Fujian Yan, Hongsheng He | School of Computing, Wichita State University; University of Alabama | As robots increasingly participate in assisting people in human-centered environments, individuals without extensive technical training or backgrounds in robotics or programming will eventually need to program and reconfigure their robots to carry out diverse customized tasks. The designed platform integrates an RGB-D camera and a projector to enable human-robot collaboration through simple social cues such as touching. Robots, especially social robots, are deployed in human-centered environments, which are more irregular than pre-setup environments. Therefore, this paper enhances our previous work, Common Reality Human-Robot Collaboration, by calibrating the RGB-D camera and projector to adapt to different types of surfaces, such as flat, curved, and uneven. This paper designs an innovative HRC platform to bridge the knowledge gap between humans and robots while collaborating on the same tasks. We evaluated the communication efficiency of the proposed method and compared the proposed HRC platform with the conventional teach-to-follow method. To investigate the user experience of the proposed platform, we evaluated the results of questionnaires. |
038 | Medical Robots | Design and Analysis of a Soft Bellow-Based Robot for Dysphagia Rehabilitation | Yu Dang | | As the population ages, the incidence of diseases such as stroke is increasing, leading to a growing number of patients with dysphagia, or difficulty swallowing. Dysphagia directly affects patients’ hydration and nutritional intake, potentially leading to serious complications such as aspiration pneumonia, severely threatening patients’ health and quality of life. Currently, the rehabilitation of swallowing function mainly relies on therapists to conduct swallowing training for patients. However, the high incidence and long duration of swallowing rehabilitation training pose a burden on patients and medical resources. There remains a gap in robotic devices for swallowing rehabilitation. To address this gap, this paper proposes a soft bellow-shaped robot for dysphagia rehabilitation. One rehabilitation training mode, in which the therapist manipulates the throat to move horizontally, is selected for the robot to mimic. This robot simulates the force and displacement applied by the therapist’s fingertip during dysphagia rehabilitation. First, swallowing training data were collected and analyzed to understand the variations in displacement and force applied by therapists’ fingertips on the throat. Based on this analysis, the technical requirements for the swallowing rehabilitation robot were summarized. Following these requirements, we designed a physical prototype of the pneumatic bellow-shaped robot. The robot simulates the fingertip movement on both sides of the throat through the linear elongation and contraction of the bellows. The performance of the soft bellows was then characterized, verifying that its output force and displacement meet the technical requirements. This characterization lays a foundation for the modeling and control of the displacement and force of the robot. |
039 | Robot Control | Combined Kinodynamic Motion Planning Method for Multisegment Continuum Manipulators | Jianing Wu, Jinzhao Yang | Sun Yat-sen University | Multisegment continuum manipulators, known for their adaptability in confined spaces, present complex nonlinear dynamics that pose challenges for obstacle-avoidance motion planning. This paper introduces a kinodynamic motion planning approach for cable-driven manipulators. The problem is converted into a nonlinear optimization problem with obstacle and input constraints, segmented into safe and warning subspaces. A mixed complementarity problem is solved for safe path generation, while an improved particle swarm optimization tackles the warning subspace. The framework efficiently manages the nonlinear kinodynamic challenges, with simulations demonstrating the method’s effectiveness. |
040 | Soft Robots 1 | Toward Dielectric Elastomer Actuators with Low Driving Voltages and High Outputs for Soft Machines | Ye Shi | | As one of the most promising artificial muscle materials, dielectric elastomers (DEs) have been widely studied and applied. In recent years, a variety of high-performance dielectric elastomers have been developed, especially the bimodal networked elastomer (PHDE), which has achieved energy and power densities superior to those of natural muscles. However, it remains a challenge to significantly reduce the driving voltage of dielectric elastomers while maintaining their high force/energy outputs. This study will introduce the recent progress from our group on developing multilayer structured dielectric elastomer actuators (DEAs) with low driving voltages and high mechanical outputs. These high-performance DEAs are enabled by simultaneously tuning the mechanical/dielectric properties of PHDE, developing new ways to prepare uniform ultra-thin elastomer films, and modifying the dry-stacking method to fabricate large-area dielectric elastomer stacks. The potential of the newly developed DEAs for applications in wearable devices and soft robots will also be demonstrated. |
041 | AI for Good | Cued Speech-Enhanced Audio-Visual Variational Autoencoder for Speech Enhancement | Lufei Gao, Yan Rong, Li Liu | The Hong Kong University of Science and Technology | Speech enhancement (SE) is essential for improving the quality and intelligibility of speech signals, particularly in noisy environments. In this paper, we propose an innovative approach to Audio-Visual Speech Enhancement (AVSE) by modifying an audio-visual variational autoencoder (AV-VAE) framework to integrate both lip movements and hand gestures from Cued Speech (CS) as visual cues. This is the first work to incorporate hand gestures, in addition to lip movements, within the AVSE task. By introducing hand cues, our approach aims to address the inherent challenges of lip reading, such as the high ambiguity in interpreting lip movements, which can limit the effectiveness of traditional AVSE methods. Leveraging deep learning and computer vision techniques, our method offers a more comprehensive representation of spoken content. Through empirical evaluation, we demonstrate the effectiveness of our approach in enhancing the clarity and quality of speech signals, even in challenging acoustic conditions. The results indicate that the integration of hand cues significantly improves speech quality and intelligibility, providing a promising solution for AVSE in noisy environments. |
042 | Tactile Sensing | Deep-Learning-Based Flexible Manipulation of Catheters with Force Feedback | Chuqiao Lyu, Wenbo Ding | Tsinghua SIGS | Endovascular surgical robots (ESR) have been widely utilized to minimize radiation exposure and alleviate physician fatigue through teleoperation. However, during surgical procedures, ESRs frequently encounter difficulties in replicating the dexterity of human fingers when manipulating soft and slender catheters. Incorporating soft grippers can undoubtedly augment the flexibility of ESRs, but the inherent nonlinear deformation of the materials presents challenges in terms of force sensing and control. To tackle these challenges, this study presents a deep-learning-based approach for flexible catheter manipulation coupled with force feedback. The proposed approach leverages a Long Short-Term Memory (LSTM) model, which is comprehensively trained using datasets that encapsulate both the robot’s motions and the deformations of the soft gripper. This model demonstrates remarkable capabilities in discerning the intricate deformations of the soft gripper and accurately estimating both linear and torsional forces exerted during surgical operations. Moreover, our method not only improves grasping efficiency but also empowers ESRs with force-feedback capability. This mechanism allows ESRs to adaptively adjust their manipulation strategies in response to real-time force feedback, thereby enhancing the precision and safety of surgical interventions. |
043 | Soft Robots 2 | Reprogrammable Bistable Structures and Their Applications | Yingtian Li | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences | The bistable structure has fast response and force amplification capabilities. Rapid release of the energy stored in a bistable structure can improve many robot performances, such as high-speed movement, adaptive sensing, and fast grasping. However, current research on bistable structures mainly focuses on their stable states, with little attention to intermediate states. In pursuit of this objective, a series of studies were undertaken on adjustable bistable states. We first propose a super-tunable bistable structure with programmable energy barriers and triggering forces variable across orders of magnitude, together with its design method. We then apply this tunable bistability to different scenarios, amplifying its different characteristics. We have successively demonstrated that the triggering force of a single-piece bistable structure can be tuned down to 0.1% of its maximum value; that the reprogrammable bistable gripper can selectively grasp objects of different weights and passively capture swimming fish within 180 ms; and that the flexible, ultra-fast bistable gripper can easily capture flying objects incident at speeds of 15 m/s. This series of work expands the frontier of bistable structure design and provides new ideas for the future of robotics and other fields. |
044 | Tactile Sensing | A New Multi-Axis Force Sensor for Measuring the Wheel-Terrain Interaction ahead of the Robotic Vehicles | Mujia Shi, Lihang Feng, Lixin Jia, Aiguo Song | Nanjing Tech University; Southeast University | In extraterrestrial exploration missions, complex terrains affect the terrain traversability of planetary rovers, consequently impacting their mission completion rates. Existing methods that rely on visual perception for environmental sensing are unable to detect soft terrains, which may lead to issues such as wheel sinkage for planetary rovers. In this paper, a forward tactile perception wheel specifically for wheel-terrain contact force detection is designed. Our approach includes mechanical structure design, decoupling methods, and component integration techniques to genuinely incorporate multi-axis sensors into the forward sensing wheel, achieving high-precision and high-reliability wheel-soil interaction detection. Experiments have demonstrated the effectiveness of the designed forward sensing wheel. |
045 | Soft Robots 2 | The Intelligent Underwater Navigation and Advanced Manipulation Empowered by New-conceptual Soft Robotic Technologies | Juntian Qu | | In response to national demand, Shenzhen’s “20+8” industrial cluster, and the strategic goal of building maritime strength, Prof. Juntian Qu’s team carries out research on marine soft robots and underwater intelligent flexible grasping, aiming at intelligent marine cruising and exploration. For marine soft robots, facing underwater intelligent cruise and detection tasks, the goal is a miniaturized intelligent underwater soft robot with more flexible motion and greater endurance. Based on visual images, an intelligent environment perception system for the underwater robot is constructed, and the sensing and actuation systems are integrated to ensure both control robustness and structural stability, ultimately realizing a soft underwater robot with flexible motion and well-developed bionic function. For underwater intelligent flexible grasping, targeting underwater manipulation and control tasks (deep-sea sampling, seabed fishing, etc.), the goal is an underwater intelligent flexible grasping system based on multi-modal fusion sensing. The ultimate aim is a flexible manipulator that can not only complete lossless grasping in complex underwater environments (turbid, dimly lit, strongly interfered, etc.) but also obtain multi-modal grasping perception information. Multimodal sensing characteristics are also proposed to improve the performance of underwater robots in the marine environment. |
046 | Tactile Sensing | A Multimodal Tactile Sensor for Robot Perception and Interaction | Pengwen Xiong, Yuxuan Huang, Yifan Yin | Nanchang University; Jiangxi Key Laboratory of Intelligent Robot, Nanchang, China | Robots with multiple sensors often suffer from weak pairing among the different modalities of information the sensors collect, which leads to poor perception performance during robot interaction. To solve this problem, this paper proposes a Force Vision Sight (FVSight) sensor, which utilizes a distributed flexible tactile sensing array integrated with a vision unit. This innovative approach aims to enhance the overall perceptual capabilities for object recognition. The core idea is to use one perceptual layer to trigger both tactile images and force-tactile arrays. This allows the two heterogeneous tactile modalities to remain consistent in the temporal and spatial dimensions, thus solving the problem of weak pairing between visual and tactile data. Two experiments are specially designed, namely object classification and slip detection. A dataset containing 27 objects with deep presses and shallow presses is collected for classification, and then 20 slip experiments on three objects are conducted. The determination of slip and stationary states is accurately obtained by a covariance operation on the tactile data. The experimental results show the reliability of the generated multimodal data and the effectiveness of our proposed FVSight sensor. |
047 | Tactile Sensing | RoboCare: A Rigid-Soft Hybrid Arm and Tactile Dexterous Hand for Household Assistance | Shilong Mu, Henry Chan, Xinyue Chai, Runze Zhao | TBSI/SIGS Tsinghua University | Humanoid robots are becoming the next generation of embodied intelligence. In particular, the service robot market is expected to reach $154 billion by 2030, mainly used to assist with housework, childcare, and elderly care. Therefore, we propose a new system: RoboCare, a rigid-soft hybrid robotic arm and tactile dexterous hand. The robotic arm contains distributed airbags to provide protection and real-time dynamic tactile information. In addition, the three-finger dexterous hand provides multimodal tactile information through a thin electronic skin and visual-tactile fusion, costing about $800. Based on this, we also propose a control strategy, which was tested in four different manipulation tasks in combination with imitation learning methods, achieving a task success rate of over 80%. We hope that these safe, intelligent, and sensitive robots can become part of the family. |
048 | Tactile Sensing | A Flexible Electret Actuator for Wearable Haptic Interface | Haiyang Wan, Jian Jiao | Future Tech, South China University of Technology | Haptic feedback provides touch sensations that improve the immersiveness of interactive virtual or augmented reality experiences. Flexible haptic feedback technologies are of particular interest for achieving natural human-machine interaction in virtual or augmented reality applications. Herein, we report a flexible wearable electret haptic actuator for human-machine interaction, able to create various vibro-tactile stimulations by modulating the actuating voltage. We describe the principle, design, simulation, materials, and fabrication that serve as the foundations for such haptic actuators. Performance characterizations of mechanical vibrations validate that the haptic actuators cover wide vibration bandwidths in both frequency and amplitude. Subjective identification of the vibro-tactile sensations generated by the haptic actuators indicates their effectiveness. A typical human-machine interaction application in virtual reality shows the potential of the haptic actuators. |
049 | Medical Robots | Parametrically-Designed Artificial Hand with Multifunctional Grasps | John-John Cabibihan, Mohammed Mudassir | Qatar University | The typical methods for developing artificial hands require long development time. It also takes assistance from specialists to fine-tune details such as size, appearance, and fitting. Moreover, most highly-functional electric-powered prostheses are expensive, keeping them beyond the affordability of a large number of users. Sometimes the patients themselves need rehabilitative training to acquire skills in using these devices and accepting their new realities. To address some of these issues, a passive parametric 3D-printed artificial hand design is proposed in this work. The parameters are obtained from anthropometric measurements of the non-injured hand. The technologies for rapid fabrication of custom designs are improving, making them more popular and affordable. The artificial hand model has been designed with parametric modeling techniques. The fabricated passive hand can perform up to 31 of the 33 grasps that the human hand can perform. |
050 | Medical Robots | Active Data-Driven Modeling Towards Autonomous Flexible Endoscopic Surgery | Xiangyu Wang, Yongchun Fang, Ningbo Yu, Jianda Han | Nankai University; Shenzhen Research Institute of Nankai University | The modeling and control of robotic flexible endoscopes remain challenging due to their inherent complex nonlinearity. This paper develops a novel modeling scheme using the radial-basis-function-network-based Koopman operator (RKO), which provides an elaborate data-driven method to model the motion of a robotic flexible endoscope in the clinical scene. Different from the traditional Koopman-operator-based model, the RKO employs a radial basis function network to identify the expressions of the Koopman matrix and its characteristic functions. By utilizing the proposed Koopman operator, a concise linear model can be built to describe the complex dynamics of a soft manipulator. Detailed experimental validation and results will be presented in our future work. |
051 | Tactile Sensing | Mag-Gesture: Smart Knit Gloves Capture Complex Gestures and Programmable Feedback | Runze Zhao, Xinyue Chai, Xiaosa Li, Ziman Chen, Shilong Mu, Liguang Ruan, Wenbo Ding | Tsinghua University; Guangzhou Academy of Fine Arts | Gesture recognition is pivotal for enhancing human-computer interaction, with applications spanning virtual reality, gaming, and assistive technologies. Traditional methods, relying on visual sensors and inertial measurement units, often face limitations such as sensitivity to lighting conditions and privacy concerns. In this paper, we propose a novel gesture recognition system leveraging magnetometers and magnetic units worn on fingertips, which also provide haptic feedback. This method offers robust performance independent of lighting conditions and ensures user privacy. We detail the hardware setup, comprising compact magnetometers and custom-designed magnetic units capable of delivering haptic feedback. Our software architecture involves efficient data acquisition, preprocessing, and feature extraction from the magnetic field data. We employ machine learning algorithms to classify gestures based on the extracted features. Extensive experiments demonstrate the system’s high accuracy and real-time capabilities. We compare our approach with conventional visual and inertial methods, highlighting its superior performance in various environmental conditions and the added benefit of haptic feedback. Despite challenges such as potential magnetic interference and user comfort, our results indicate that magnetometer-based gesture recognition is a promising direction for future research and practical applications. |
052 | Human Robot Interaction | A Novel Human-Robot Interaction System Based On Augmented Reality-Enhanced Teleoperation | Xingchao Wang, Zhenglong Sun | The Chinese University of Hong Kong, Shenzhen | A novel human-robot interaction system combines augmented reality (AR) with robotic teleoperation, enhancing seamless human-robot interaction by allowing users to precisely manipulate objects remotely. Inspired by the concept of teleoperation, the system uses AR glasses to provide real-time visual feedback, enabling operators to perform tasks in a virtual environment, which are then executed by a robotic arm in the real world. This integration of virtual and real elements improves the accuracy and efficiency of remote operations. The interaction system includes selecting and manipulating objects in the virtual environment and realizing these actions in the physical world. This approach is particularly beneficial for critical applications such as medical robotics. Potential applications also include medical training and simulations, offering realistic environments for skill development. Challenges involve ensuring reliability across various scenarios and developing user-friendly interfaces. Experimental results show that the system can achieve high precision, with an operational deviation as low as 0.33 centimeters, which is crucial for precise surgical procedures. The results demonstrate the system’s effectiveness in reducing errors and simplifying complex tasks, enhancing remote precision operations. Future integration aims to achieve fully autonomous operations and improve decision-making capabilities. This advancement could expand the system’s applications across various fields, enhancing human-robot interaction. This AR-based teleoperation system demonstrates significant potential in healthcare and other domains. |
053 | Human Robot Interaction | ROOTED: An Open Source toolkit for Dialogue Systems in Human Robot Interaction | Antonio Galiza Cerdeira Gonzalez, Ikuo Mizuuchi, Bipin Indurkhya | Jagiellonian University; Tokyo University of Agriculture and Technology | Dialogue Systems are integral to enabling human-machine interaction, particularly in social robotics where natural communication is essential. However, there is no standardized framework for developing Dialogue Systems, and the integration of Large Language Models (LLMs) introduced challenges such as hallucinations, outdated information, and sycophancy. These issues can compromise trust and effectiveness in high-stakes applications. Additionally, LLMs often violate Gricean Maxims, affecting user interpretation and system efficacy. Addressing these gaps, we present ROOTED (ROS2 Open-source Toolkit for Efficient Dialogue), the first open-source ROS2 framework for developing adaptable dialogue systems for social robots. ROOTED combines rule-based dialogue generation, web search, and LLMs to better adhere to Gricean Maxims by grounding responses. It also includes components for non-verbal communication analysis and generation. ROOTED’s capabilities are demonstrated through its deployment in the Social Plantroid Robot, which uses ROOTED for managing interactions and emotional responses. |
054 | Human Robot Interaction | Complex Instructions Translation Using Fine-Tuned Large Language Models | Minhazul Arefin, Dang Tran, Hongsheng He | The University of Alabama | Artificial Intelligence has made great progress in the area of Natural Language Processing, notably in Human-Robot Interaction. The use of a controlled robot language is an essential component in enabling robots to comprehend and carry out human directions with pinpoint accuracy. The purpose of this work is to provide a unique methodology that utilizes supervised fine-tuning of large language models in order to increase the accuracy of translation from natural language to Controlled Robot Language. This strategy considerably improves the dependability and effectiveness of human-robot interactions, as demonstrated by our extensive experimental investigation. The results of this study suggest that our methodology has the potential to yield more reliable robotic systems, which would be beneficial to the field of Human-Robot Interaction. |
055 | Human Robot Interaction | Controlled Robotic Language for multiple robots collaboration | Dang Tran, Minhazul Arefin, Hongsheng He | The University of Alabama | Interacting with multiple robots presents significant challenges for non-expert users, especially in systems requiring complex coordination. While communication methods for single robots have been extensively studied, reliable control frameworks for multi-robot systems remain underexplored. This paper introduces a Controlled Robot Language (CRL) framework for multi-robot systems using natural language. The framework focuses on both performance reliability and user-friendliness for end-users. Given a large-context description, the model generates appropriate planning scripts in the Planning Domain Definition Language, which are then used to directly trigger robot actions. The method can automatically identify an action’s subject and object, and return the resource to the appropriate agent. The CRL framework addresses key challenges in multi-robot systems, such as semantic ambiguities, domain generality, and scalability. The method’s effectiveness was demonstrated through both linguistic evaluations and real-world experiments, highlighting its robustness and applicability across various domains. This work contributes to the accessibility and usability of multi-robot systems for regular users, focusing on communication stability and friendliness. |
056 | Human Robot Interaction | Omni-surface Spatial Augmented Reality for Intuitive Human-Robot Collaboration | Akhlak Zaman, Yinlong Zhang, Hongsheng He | The University of Alabama; Shenyang Institute of Automation | Human-robot teaming is increasingly essential in collaborative environments, where effective communication and clear understanding between humans and robots are key to successful interaction. The projection of images on non-planar surfaces using a conventional projector is challenging due to the inherent problem of distortion, which arises from the variation in depth of different points on the surface. The proposed method utilizes an RGB-D sensor to capture surface geometry, allowing us to calculate the extrinsic parameters between the projector and the surface. It can effectively correct the distortion using a pre-warped image that fits the correction area. The pre-warped image is formed based on the surface geometry that the projector displays and accurately replicates the original projection image. Beyond the technical achievement, this research highlights the social acceptance of improved spatial augmented reality in human-robot teams. It fosters better teamwork, trust, and efficiency by enabling more intuitive and reliable interactions. |
057 | Human Robot Interaction | Human Grasp Habits Interpretation and Adaptation for Social Robots | Hui Li, Hongsheng He | The University of Alabama | Adapting to human grasping habits is crucial for social robots to ensure smooth, safe, and culturally appropriate interactions. However, current robotic systems often struggle to interpret and adapt to diverse grasping behaviors, where grasp poses can vary even within the same grasp topology due to individual habits, limiting their effectiveness in social contexts. In this paper, we present a grasp adaptation algorithm that recognizes various grasping habits, classifies them into a standard grasp topology, and determines an appropriate object positioning strategy accordingly. The RGB image of a grasping pose is recognized and abstracted into a set of 21 3D points, which are then mapped to one of six predefined standard grasp topologies using a deep learning network. Based on the identified topology, key points are extracted from the abstracted grasp, and the object positioning strategy is determined using these key points. A reinforcement learning model is developed to perform object positioning. We validate our approach through experiments, demonstrating the effectiveness of the system. |
058 | Soft Robots 2 | Modeling and Control of Coupled Soft Viscoelastic Actuators based on Sparse Identification Method | Jisen Li, Anjing Cheng, Hao Wang, Zhipeng Xu, Jian Zhu | Shenzhen Institute of Artificial Intelligence and Robotics for Society; Chinese University of Hong Kong | Dielectric elastomer actuators (DEAs) are extensively employed in soft robot design due to their intrinsic compliance, substantial voltage-induced deformation, and high energy density. However, the nonlinear dynamic response arising from rate-dependent viscoelasticity poses a challenge in modeling and control. While recent studies predominantly focus on analyzing one-degree-of-freedom (1-DOF) DEAs, the modeling and control of multiple coupled DEAs remain demanding but underexplored, primarily due to a limited understanding of the cross-coupling effect between DEAs. This paper introduces a comprehensive framework for modeling and controlling multiple coupled DEAs, leveraging the sparse identification method. This approach can construct explicit governing equations to describe the viscoelastic and coupling effects of DEAs from experimental measurement data. Utilizing the identified explicit dynamic equations, model predictive controllers are designed for DEAs to accurately track various trajectories. The presented control framework is validated through high-precision tracking control experiments on both 1-DOF and 2-DOF DEAs. The proposed approach represents a promising advancement in modeling multiple coupled viscoelastic DEAs, which may pave the way for accurate control of soft actuators/robots with enhanced versatility and functionality. |
059 | Robot Control | Leader-Follower Formation of a Car-like Robot Using ROS and Trajectory Tracking | Faris Shahab, Anas Dhahri, Adam Hamadi, Mohammad Noorizadeh, John-John Cabibihan, Nader Meskin | Qatar University; Private Higher School of Engineering and Technology (ESPRIT) | This paper explores the integration of the Robot Operating System (ROS) with human-robot interaction models for autonomous vehicle navigation in social robotics. Utilizing ROS as middleware, OptiTrack for positioning and localization, and MATLAB for control and ROS node deployment, this study investigates leader-follower formation dynamics in mixed environments. The leader vehicle is controlled using an RVIZ pointer, serving as a medium for human interaction, while both the leader and follower vehicles maintain safe distances using trajectory tracking methods. These methods are implemented and deployed to the robot cars as ROS nodes. The experiment involves a physical leader car and a virtual follower car, both tracked using OptiTrack. Numerical simulations and real-world experiments demonstrate ROS’s role in enhancing coordination and interaction, providing insights into effective human-robot interaction and collision avoidance. |
060 | Soft Robots 2 | Soft Artificial Muscles for Humanoid Eyeball Motions | Zhen Luo, Jian Zhu | The Chinese University of Hong Kong; Shenzhen Institute of Artificial Intelligence and Robotics for Society | Natural eyeball motions in humanoid robots can contribute to friendly communication, thus improving human-robot interaction. In this paper, we develop antagonist–agonist artificial muscles for humanoid eyeball motions using dielectric elastomer actuators (DEAs). Inspired by human eyeballs, the artificial muscles consist of two pairs of DEAs: one pair for horizontal motion, and the other for vertical motion. The fabrication time of the actuators is significantly decreased due to their simple structure. The antagonist–agonist actuator outperforms the dielectric elastomer minimum energy structure in terms of actuation displacement and response time. We conduct experiments on a life-size human face model. The experiments demonstrate the capability of the antagonist–agonist artificial muscles to mimic eyeball motions in the horizontal, vertical, and diagonal directions. Future work includes modeling and control of the artificial muscles for optimal performance of various humanoid eyeball motions. |
061 | Tactile Sensing | Three-Dimensional Modular Tactile Sensor for Surface Texture Recognition | Hongwu Zhu, Yifei Yang, Jingshu Shi | Skyworth Group Co., Ltd.; Tsinghua Shenzhen International Graduate School | Tactile perception is a primary sensing channel for robots to discern the characteristics of object surfaces for contact modeling and dexterous manipulation. Achieving delicate texture recognition for robots is an open challenge due to the lack of reliable tactile sensing systems and smart pattern recognition algorithms. Herein, an island-shaped three-dimensional force tactile sensor based on magnetic field reconstruction and a novel texture recognition model are designed. We adopt the bionic tactile sensor to collect vibration data from different materials by simply pressing and sliding against them under a fixed contact pressure and speed. With the tactile perception data, a sequential pattern recognition algorithm is proposed for texture recognition. The method achieves a recognition accuracy of 97% across 50 types of fabrics. In the detection process, three-dimensional force data plays a key role in improving performance: in addition to normal force, shear force also provides rich contact characteristics. Our method shows potential benefits for various applications, including enhancing perception in dexterous manipulation and facilitating defect detection within the textile industry. |
062 | Multi-modal Models | Personalized Facial Reaction Generation for Digital Human | Linlin Shen | Shenzhen University | Driven by the rapid progress of LLMs (Large Language Models), digital humans are now able to understand human emotions, perform advanced interactions with users, and even answer many professional questions. However, how to generate personalized facial reactions for digital humans remains underexplored. In this talk, I will first introduce several of our multimodal LLMs, in both general NLP and specific domains such as faces and medical diagnosis, followed by a talking demo of a 2D/3D digital human supported by our Linly LLM. I will then introduce our work on emotion understanding, facial reaction generation, and personalized facial reaction generation. The background, datasets, evaluation metrics, and state of the art on this specific topic will be introduced in detail. Finally, the application of facial reaction generation to social robotics will also be discussed. |