Keynote Speakers


  • Prof. Song-Chun Zhu
  • Dean of Beijing Institute for General Artificial Intelligence, Chair Professor of Peking University, Chair Professor of Basic Science of Tsinghua University
  • Bio: Song-Chun Zhu was born in Ezhou, Hubei Province, China. He is a world-renowned expert in computer vision, statistics and applied mathematics, and artificial intelligence. He graduated from the University of Science and Technology of China in 1991, went to the United States in 1992, and received a PhD in computer science from Harvard University in 1996. From 2002 to 2020 he was a professor in the Departments of Statistics and Computer Science at UCLA and director of the UCLA Center for Vision, Cognition, Learning and Autonomous Robotics. He has published more than 300 papers in top international journals and conferences and has won many international awards in computer vision, pattern recognition, and cognitive science, including three of the top international awards in computer vision, such as the Marr Prize and the Helmholtz Prize. He twice served as chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012, CVPR 2019), and between 2010 and 2020 he twice served as director of MURI (Multidisciplinary University Research Initiative) projects, US multi-university, interdisciplinary collaborations spanning vision, cognitive science, and artificial intelligence. Professor Zhu has long been committed to building a unified mathematical framework for computer vision, cognitive science, and artificial intelligence. After 28 years in the United States, he returned to China in September 2020 to serve as Dean of the Beijing Institute for General Artificial Intelligence, Chair Professor of Peking University, and Chair Professor of Basic Science of Tsinghua University.

  • Title: Computer Vision: A Task-oriented and Agent-based Perspective

  • Abstract: In the past 40+ years, computer vision has been studied from two popular perspectives: i) Geometry-based and object-centered representations in the 1970s-90s, and ii) Appearance-based and view-centered representations in the 2000s-2020s. In this talk, I will argue for a third perspective: iii) agent-based and task-centered representations, which will lead to general-purpose vision systems as integrated parts of AI agents. From this perspective, vision is driven by the large number of daily tasks that AI agents need to perform, including searching, reconstruction, recognition, grasping, social communication, tool use, etc. Vision is thus viewed as a continuous computational process in the service of these tasks. The key concepts in this perspective are physical and social representations of functionality, physics, intentionality, causality, and utility.

  • Prof. Bo Xu
  • President of the Institute of Automation of Chinese Academy of Sciences, Dean of the School of Artificial Intelligence of University of Chinese Academy of Sciences
  • Bio: Bo Xu is currently the President of the Institute of Automation, Chinese Academy of Sciences, Dean of the School of Artificial Intelligence, University of Chinese Academy of Sciences, and a member of the National New Generation Artificial Intelligence Strategic Advisory Committee. He has long been engaged in research on and applications of intelligent speech processing and artificial intelligence technology. He has won the Outstanding Youth Award of the Chinese Academy of Sciences, the First Prize of the Wang Xuan News Science and Technology Progress Award, and other honors. He has led a number of national key projects, including national key support programs, the 863 and 973 programs, and National Natural Science Foundation projects. His research results have been deployed at scale in education, broadcasting and television, and security. In recent years, he has mainly focused on auditory models, brain-like intelligence, cognitive computing, and game intelligence.

  • Title: Three-Modality Large Foundation Model - Exploring the Path to More General Artificial Intelligence

  • Abstract: With the introduction of text models such as GPT-3 and BERT, pre-trained models have been developing rapidly, and dual-modal models for joint image-text learning are also emerging, showing a powerful ability to learn different tasks automatically and to transfer quickly to data from different domains under unsupervised conditions. However, current pre-trained models largely ignore sound. We are surrounded by sound, and speech in particular is not only a means of human communication but also carries emotions and feelings. In this talk, I will introduce the first image-text-audio tri-modal large model, "ZiDongTaiChu", which incorporates the speech modality. The model maps the visual, text, and speech modalities into a unified semantic space through their respective encoders, and then learns the semantic associations and feature alignment among the modalities through a multi-head self-attention mechanism, forming a unified multi-modal knowledge representation. It can perform not only cross-modal understanding but also cross-modal generation, while achieving a balance between the cognitive abilities of understanding and generation. We propose a unified multi-level, multi-task self-supervised learning framework operating at the token, modality, and sample levels, which supports more diverse and broader downstream tasks; in particular, we realize generation of audio from images and of images from audio through the semantic network. The tri-modal large model is an important step towards general artificial intelligence with artistic creation capability, powerful interaction capability, and task generalization capability.
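
A minimal sketch of the architectural idea described in the abstract above, written in a PyTorch style: separate encoders project image, text, and speech token features into one shared semantic space, and a multi-head self-attention layer learns the cross-modal associations. The class name, dimensions, and wiring are illustrative assumptions, not the actual ZiDongTaiChu implementation.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Project three modalities into a shared space and fuse with self-attention."""

    def __init__(self, img_dim=2048, txt_dim=768, aud_dim=512,
                 d_model=512, n_heads=8):
        super().__init__()
        # Per-modality projections into the shared semantic space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Multi-head self-attention over the concatenated token sequence
        # learns semantic associations and feature alignment across modalities.
        self.fuse = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens, aud_tokens):
        # Each input: (batch, seq_len, modality_dim) token features produced by
        # the respective image / text / speech encoders.
        tokens = torch.cat([
            self.img_proj(img_tokens),
            self.txt_proj(txt_tokens),
            self.aud_proj(aud_tokens),
        ], dim=1)                 # concatenate along the token axis
        return self.fuse(tokens)  # jointly attended multi-modal representation
```

A sample-level contrastive loss over pooled representations of matched versus mismatched (image, text, audio) triples would be one way to realize the sample-level alignment mentioned in the abstract.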

  • Prof. Jingyi Yu
  • Vice Dean of Studies of ShanghaiTech University, Executive Dean of the School of Information Science and Technology, ShanghaiTech University
  • Bio: Jingyi Yu is the Vice Dean of Studies of ShanghaiTech University and Executive Dean of its School of Information Science and Technology. Before joining ShanghaiTech, he was a professor in the Department of Computer and Information Sciences at the University of Delaware, USA. He obtained his bachelor's degree in Applied Mathematics and Computer Science from Caltech in 2003, and his master's and doctoral degrees in Electrical Engineering and Computer Science from MIT in 2003 and 2005, respectively. His research covers computer vision, computational imaging, computer graphics, and bioinformatics; he has published more than 120 research papers, of which more than 70 appeared at the international conferences CVPR/ICCV/ECCV and in the journal TPAMI. He has been granted more than 20 United States patents, and received the US National Science Foundation CAREER Award in 2009 and the Air Force Young Investigator Award in 2010. He serves on the editorial boards of IEEE TPAMI, IEEE TIP, and Elsevier CVIU, and is/was a Program Chair of ICPR 2020, IEEE CVPR 2021, IEEE WACV 2021, and ICCV 2025. For his contributions to computer vision and computational imaging, he has been elected an IEEE Fellow.

  • Title: Neural Human Reconstruction: From Rendering to Modeling

  • Abstract: Recent advances in deep learning, in particular neural modeling and rendering, have renewed interest in developing effective 3D imaging solutions. Such techniques aim to overcome the limitations of traditional 3D reconstruction techniques such as structure-from-motion (SfM) and photometric stereo (PS) by reducing reconstruction noise, handling texture-less regions, and synthesizing high-quality free-view renderings. In this talk, I present recent efforts from my group at ShanghaiTech on neural human modeling. Specifically, I demonstrate our latest neural human body reconstructor, deep 3D face synthesizer, anatomically correct 3D hand tracker, and ultra-realistic hair modeler. These solutions can produce dynamic virtual humans at an unprecedented visual quality and may lead to profound changes in metaverse creation technologies. Finally, I will discuss extensions of these techniques to non-line-of-sight (NLOS) imaging systems for hidden object recovery.
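
As background for the neural rendering techniques mentioned above, the sketch below shows the generic differentiable volume-rendering step (alpha compositing of densities and colors sampled along camera rays) that NeRF-style neural renderers build on. It is a simplified illustration under that assumption, not the specific human reconstruction pipeline from the talk; all names and shapes are illustrative.

```python
import torch

def composite_ray(densities, colors, deltas):
    """Alpha-composite per-sample densities and colors along each camera ray.

    densities: (n_rays, n_samples)     non-negative volume density sigma
    colors:    (n_rays, n_samples, 3)  RGB predicted by the network per sample
    deltas:    (n_rays, n_samples)     distance between adjacent samples
    """
    alpha = 1.0 - torch.exp(-densities * deltas)          # opacity of each sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)    # accumulated transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]),     # first sample fully visible
                       trans[:, :-1]], dim=-1)
    weights = alpha * trans                                # per-sample contribution
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2)     # composited pixel color
    return rgb, weights
```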

  • Prof. Lei Zhang
  • Dept. of Computing, The Hong Kong Polytechnic University
  • Bio: Prof. Lei Zhang joined the Department of Computing, The Hong Kong Polytechnic University, as an Assistant Professor in 2006, and has been a Chair Professor in the same department since July 2017. His research interests include computer vision, image and video analysis, and pattern recognition, and he has published more than 200 papers in those areas. Prof. Zhang is an IEEE Fellow, a Senior Associate Editor of IEEE Trans. on Image Processing, and is/was an Associate Editor of IEEE Trans. on Pattern Analysis and Machine Intelligence, SIAM Journal on Imaging Sciences, IEEE Trans. on CSVT, and Image and Vision Computing. He has been consecutively selected as a "Clarivate Analytics Highly Cited Researcher" from 2015 to 2021.

  • Title: Gradient Centralization and Feature Gradient Descent for Deep Neural Network Optimization

  • Abstract: Normalization methods are important for the effective and efficient training of deep neural networks (DNNs). Many popular normalization methods operate on weights, such as weight normalization and weight standardization. We propose a very simple yet effective DNN optimization technique, namely gradient centralization (GC), which operates directly on the gradients of the weights. GC simply centralizes the gradient vectors to have zero mean. It can be easily embedded into current gradient-based optimization algorithms with just one line of code. GC demonstrates several desirable properties, such as accelerating the training process, improving generalization performance, and compatibility with fine-tuning pre-trained models. On the other hand, existing DNN optimizers such as stochastic gradient descent (SGD) mostly perform gradient descent on the weights to minimize the loss, while the final goal of DNN model learning is to obtain a good feature space for data representation. Instead of performing gradient descent on the weights, we propose a method, namely feature SGD (FSGD), to approximate the output features with one-step gradient descent for linear layers. FSGD only needs to store an additional second-order statistic matrix of the input features and uses its inverse to adjust the gradient descent of the weights. FSGD demonstrates much better generalization performance than SGD in classification tasks.
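
A minimal sketch of the gradient-centralization step described above, assuming a PyTorch setting: before the optimizer step, each weight gradient is centralized to zero mean over all axes except the output-channel axis. The paper embeds the same one-line operation inside the optimizer; the standalone helper and its name below are illustrative assumptions.

```python
import torch

def centralize_gradients(model):
    """Gradient centralization: zero-mean each weight gradient over all axes
    except the output-channel axis (dim 0)."""
    for p in model.parameters():
        if p.grad is None or p.grad.dim() < 2:
            continue  # GC targets weight matrices/tensors, not biases or scalars
        g = p.grad
        # The "one line": subtract the per-output-channel mean, kept broadcastable.
        g.sub_(g.mean(dim=tuple(range(1, g.dim())), keepdim=True))

# Typical usage between backward() and the optimizer step:
#   loss.backward()
#   centralize_gradients(model)
#   optimizer.step()
```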

  • Prof. Yoichi Sato
  • University of Tokyo, Japan
  • Bio: Yoichi Sato is a professor at the Institute of Industrial Science, the University of Tokyo. He received his B.S. degree from the University of Tokyo in 1990, and his M.S. and Ph.D. degrees in robotics from the School of Computer Science, Carnegie Mellon University, in 1993 and 1997, respectively. His research interests include first-person vision, gaze sensing and analysis, physics-based vision, and reflectance analysis. He has served or is serving in several conference organization and journal editorial roles, including IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Computer Vision and Image Understanding, CVPR 2023 General Co-Chair, ICCV 2021 Program Co-Chair, ACCV 2018 General Co-Chair, ACCV 2016 Program Co-Chair, and ECCV 2012 Program Co-Chair.

  • Title: Understanding Human Activities from First-Person Perspectives

  • Abstract: Wearable cameras have become widely available as off-the-shelf products. First-person videos captured by wearable cameras provide close-up views of fine-grained human behavior, such as interaction with objects using hands, interaction with people, and interaction with the environment. First-person videos also provide important clues to the intention of the person wearing the camera, such as what they are trying to do or what they are attending to. These advantages are unique to first-person videos and distinguish them from videos captured by fixed cameras such as surveillance cameras. As a result, there has been increasing interest in developing computer vision methods that take first-person videos as input. On the other hand, first-person videos pose major challenges to computer vision due to factors such as continuous and often severe camera motion, a limited field of view, and rapid illumination changes. In this talk, I will describe our attempts to develop first-person vision methods for different tasks, including action recognition, future person localization, and gaze estimation.