HUMAN-CENTERED VISION SYSTEMS

Hamid Aghajan, Stanford University, USA

and

Nicu Sebe, University of Trento, Italy

Abstract

This tutorial will take a holistic view of the research issues and applications of Human-Centered Vision Systems, focusing on three main areas: (1) multimodal interaction: visual (body, gaze, gesture) and audio (emotion) analysis; (2) smart environments; (3) distributed and collaborative fusion of visual information.
Human-computer interaction lies at the crossroads of many research areas (computer vision, multimedia, psychology, artificial intelligence, pattern recognition, etc.) and underpins a wide range of applications. In particular, we aim at developing human-centered information systems. The most important issue here is how to achieve synergy between human and machine. The term human-centered emphasizes that although all existing vision systems were designed with human users in mind, many of them are far from user friendly. What can the scientific and engineering community do to effect a change for the better?
On the one hand, computers are quickly becoming integrated into everyday objects (ubiquitous and pervasive computing), so effective, natural human-computer interaction is becoming critical: in many applications, users need to interact with computers as naturally as they would face to face with another person. On the other hand, the wide range of applications that use multimedia, and the amount of multimedia content now available, mean that building successful computer vision and multimedia applications requires a deep understanding of multimedia content. The success of human-centered vision systems therefore depends on two joint aspects: (1) the way humans interact naturally with such systems (using speech and body language) to express emotion, mood, attitude, and attention, and (2) the human factors that pertain to multimedia data (human subjectivity, levels of interpretation).
In this tutorial, we take a holistic approach to the human-centered vision systems problem. We aim to identify the important research issues and to ascertain potentially fruitful future research directions in relation to the two aspects above. In particular, we introduce key concepts and discuss technical approaches and open issues in the three areas listed above: (1) multimodal interaction, covering visual (body, gaze, gesture) and audio (emotion) analysis; (2) smart environments; and (3) distributed and collaborative fusion of visual information.
The tutorial sets forth application design examples in which a user-centric methodology is adopted across the different stages, from feature and pose estimation in early vision to user behavior modeling in high-level reasoning. The role of querying the user for feedback will be discussed with examples from smart home applications. Several implemented applications based on the notion of user-centric design will be introduced and discussed.
The focus of the short course is therefore on technical analysis and interaction techniques formulated from the perspective of key human factors, in keeping with a user-centered approach to developing Human-Centered Vision Systems.

Outline

This tutorial will enable the participants to understand key concepts, state-of-the-art techniques, and open issues in the areas described below. In relation to the conference, the tutorial will cover parts of the following topic areas:

  • New paradigms for HCI: smart environments, smart networked objects, augmented and mixed reality, ubiquitous computing, pervasive computing, tangible computing, intelligent interfaces, and wearable computing.
  • Vision for smart environments: overview of techniques and the state of the art in body tracking and pose estimation, gaze detection, etc.
  • Multi-camera networks: user activity and behavior modeling, smart homes, occupancy-based services, and distributed and collaborative processing (a minimal fusion sketch follows this list).
  • Multimodal emotion recognition for affective retrieval and affective interfaces: approaches to multimedia content analysis and interaction that use speech and facial expression recognition.
  • Machine learning: adaptive multimodal interfaces and learning of visual concepts from user input for automatic detection and recognition of scenes, objects, or events of interest (see the concept-learning sketch after this list).
  • Multimodal fusion: technical approaches and issues in combining multiple media (e.g., audio-visual) for multimodal interaction and multimedia analysis (see the late-fusion sketch after this list).
  • Interfaces between vision processing modules and high-level reasoning: the role of feedback to vision, knowledge accumulation, user behavior modeling, and environment discovery.
  • Applications: traditional and emerging application areas will be described with specific examples in smart conference room research, arts, interaction for people with disabilities, entertainment, and others.
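
To make the distributed fusion idea concrete, here is a minimal Python sketch, assuming each camera reports a ground-plane position estimate together with a confidence value; the confidence-weighted average stands in for the richer collaborative schemes covered in the tutorial, and all names and data are hypothetical.

    # Collaborative fusion in a camera network: each camera reports a 2D
    # position estimate for a tracked person plus a confidence, and the
    # network fuses them with a confidence-weighted average.
    import numpy as np

    def fuse_estimates(estimates, confidences):
        """Confidence-weighted average of per-camera position estimates.

        estimates   -- (n_cameras, 2) array of (x, y) ground-plane positions
        confidences -- (n_cameras,) array of non-negative weights
        """
        w = np.asarray(confidences, dtype=float)
        w /= w.sum()                      # normalize weights to sum to 1
        return w @ np.asarray(estimates)  # weighted sum of the estimates

    # Three cameras observe the same person; camera 2 is partially
    # occluded, so it reports low confidence and contributes little.
    cams = np.array([[2.1, 3.0], [1.9, 3.2], [3.5, 2.0]])
    conf = np.array([0.9, 0.8, 0.2])
    print(fuse_estimates(cams, conf))  # fused position near cameras 0 and 1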
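
In the same spirit, the sketch below illustrates learning a visual concept from user feedback; the nearest-centroid rule and the two-dimensional features are illustrative assumptions rather than a method advocated in the tutorial.

    # Learning a visual concept from user feedback: the user labels a few
    # frames as positive or negative examples, and a nearest-centroid rule
    # then detects the concept in new frames.
    import numpy as np

    class ConceptLearner:
        def __init__(self):
            self.pos, self.neg = [], []

        def feedback(self, feature, is_example):
            """Record one user-labeled feature vector."""
            (self.pos if is_example else self.neg).append(np.asarray(feature))

        def detect(self, feature):
            """True if the feature is closer to the positive centroid."""
            p = np.mean(self.pos, axis=0)
            n = np.mean(self.neg, axis=0)
            f = np.asarray(feature)
            return np.linalg.norm(f - p) < np.linalg.norm(f - n)

    learner = ConceptLearner()
    learner.feedback([0.9, 0.8], True)   # user: "this frame shows it"
    learner.feedback([0.1, 0.2], False)  # user: "this one does not"
    print(learner.detect([0.7, 0.9]))    # -> True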
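
Finally, a common baseline for audio-visual combination is decision-level (late) fusion, in which each modality outputs class probabilities over the same label set and the system combines them; the emotion labels, scores, and modality weights below are hypothetical.

    # Late (decision-level) audio-visual fusion of emotion posteriors.
    import numpy as np

    EMOTIONS = ["neutral", "happy", "angry", "sad"]

    def late_fusion(p_audio, p_video, w_audio=0.4, w_video=0.6):
        """Weighted sum of per-modality posteriors, renormalized."""
        fused = w_audio * np.asarray(p_audio) + w_video * np.asarray(p_video)
        return fused / fused.sum()

    # Speech analysis leans toward "angry"; facial expression analysis
    # leans toward "happy"; the fusion arbitrates between them.
    p_a = [0.10, 0.15, 0.60, 0.15]
    p_v = [0.10, 0.65, 0.15, 0.10]
    fused = late_fusion(p_a, p_v)
    print(EMOTIONS[int(np.argmax(fused))], fused)  # -> happy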

Background and Potential Target Audience

The short course is intended for PhD students, scientists, engineers, application developers, computer vision specialists, and others interested in the areas of information retrieval and human-computer interaction. A basic understanding of image processing and machine learning is a prerequisite.

Biography

Hamid Aghajan has been a consulting professor of Electrical Engineering at Stanford University since 2003. His group's research spans multi-camera networks and human interfaces for smart, vision-based reasoning environments, with applications to smart homes, occupancy-based services, assisted living and well-being, smart meetings, and avatar-based communication and social interaction. Hamid is co-editor-in-chief of the Journal of Ambient Intelligence and Smart Environments. He has co-edited three volumes: Human-centric Interfaces for Ambient Intelligence; Multi-Camera Networks: Principles and Applications; and Handbook of Ambient Intelligence and Smart Environments. He has been an editorial board member of the book series on Artificial Intelligence and Smart Environments (IOS Press), an associate editor of Machine Vision and Applications, guest editor of the IEEE J-STSP special issue on Distributed Processing in Vision Networks, and guest editor of the CVIU special issue on Multimodal Sensor Fusion. Hamid was co-founder and technical co-chair of the first International Conference on Distributed Smart Cameras (ICDSC 2007) and general co-chair of ICDSC 2008. He has organized short courses on Distributed Vision Processing in Multi-Camera Networks at CVPR 2007, CVPR 2008, ACIVS 2007, and ICASSP 2009, and has chaired the special session on Distributed Processing in Smart Camera Networks at ICASSP 2007, the workshop on Behaviour Monitoring and Interpretation at the German AI Conference 2008, the special session on Vision-based Reasoning at the AITAmI workshop at ECAI 2008, the workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications at ECCV 2008, the special session on Multi-Sensor HCI for Smart Environments at the Face and Gesture Conference 2008, and the workshop on Vision Networks for Behaviour Analysis (VNBA) at ACM Multimedia 2008. Hamid received his Ph.D. in electrical engineering from Stanford University in 1995.

Nicu Sebe is with the Faculty of Science at the University of Amsterdam, The Netherlands, and has recently joined the Faculty of Cognitive Sciences at the University of Trento, Italy, where he leads research on multimedia information retrieval and human-computer interaction in computer vision applications. He is the author of Robust Computer Vision: Theory and Applications (Kluwer, April 2003) and of Machine Learning in Computer Vision (Springer, May 2005). He has been involved in organizing major conferences and workshops on the computer vision and human-centered aspects of multimedia information retrieval, serving among other roles as general co-chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008) and of the ACM International Conference on Image and Video Retrieval (CIVR 2007), and as one of the initiators and a program co-chair of the Human-Centered Multimedia track of the ACM Multimedia 2007 conference. He is the general chair of WIAMIS 2009 and ACM CIVR 2010 and a track chair of WWW 2009 and ICPR 2010. He has served as guest editor for several special issues of IEEE Computer, Computer Vision and Image Understanding, Image and Vision Computing, Multimedia Systems, and ACM TOMCCAP. He has been a visiting professor at the Beckman Institute, University of Illinois at Urbana-Champaign, and in the Electrical Engineering Department of Darmstadt University of Technology, Germany. He was the recipient of a British Telecom Fellowship. He is the co-chair of the IEEE Computer Society Task Force on Human-centered Computing and an associate editor of IEEE Transactions on Multimedia, Machine Vision and Applications, Image and Vision Computing, Electronic Imaging, and the Journal of Multimedia.