HUMAN-CENTERED VISION SYSTEMS
Hamid Aghajan, Stanford University, USA
and
Nicu Sebe, University of Trento, Italy
Abstract
This tutorial will take a holistic view of the research issues and applications of Human-Centered Vision
Systems focusing on three main areas: (1) multimodal interaction: visual (body, gaze, gesture)
and audio (emotion) analysis; (2) smart environments; (3) distributed and collaborative fusion
of visual information.
Human-Computer Interaction lies at the crossroads of many research areas (computer vision, multimedia,
psychology, artificial intelligence, pattern recognition, etc.) and is used in a wide range of applications.
In particular, we aim to develop human-centered information systems. The central issue
here is how to achieve synergy between human and machine. The term human-centered emphasizes
that although all existing vision systems were designed with human users in mind, many of
them are far from user friendly. What can the scientific and engineering community do to effect
a change for the better?
On the one hand, the fact that computers are quickly becoming integrated into everyday objects
(ubiquitous and pervasive computing) implies that effective natural human-computer interaction
is becoming critical (in many applications, users need to be able to interact naturally with computers
the way face-to-face human-human interaction takes place). On the other hand, the wide range of
applications that use multimedia, and the amount of multimedia content currently available, imply
that building successful computer vision and multimedia applications requires a deep understanding
of multimedia content. The success of human-centered vision systems, therefore, depends highly on
two joint aspects: (1) the way humans interact naturally with such systems (using speech and body
language) to express emotion, mood, attitude, and attention, and (2) the human factors that pertain
to multimedia data (human subjectivity, levels of interpretation).
In this tutorial, we take a holistic approach to the human-centered vision systems problem.
We aim to identify the important research issues, and to ascertain potentially fruitful future
research directions in relation to the two aspects above. In particular, we introduce key concepts,
discuss technical approaches and open issues in three areas: (1) multimodal interaction: visual
(body, gaze, gesture) and audio (emotion) analysis; (2) smart environments; (3) distributed and
collaborative fusion of visual information.
The tutorial sets forth application design examples in which a user-centric methodology is adopted
across the different stages, from feature and pose estimation in early vision to user behavior
modeling in high-level reasoning. The role of querying for user feedback will be discussed with
examples in smart home applications. Several implemented applications based on the notion of
user-centric design will be introduced and discussed.
The focus of the short course, therefore, is on technical analysis and interaction techniques
formulated from the perspective of key human factors in a user-centered approach to developing
Human-Centered Vision Systems.
Outline
This tutorial will enable the participants to understand key concepts, state-of-the-art techniques,
and open issues in the areas described below. In relation to the conference, the tutorial will cover
parts of the following topic areas:
- New paradigms for HCI: smart environments, smart networked objects, augmented and mixed realities, ubiquitous computing, pervasive computing, tangible computing, intelligent interfaces, and wearable computing.
- Vision for smart environments: overview of techniques and state of the art in body tracking and pose estimation, gaze detection, etc.
- Multi-camera networks: user activity and behavior modeling, smart homes, occupancy-based services, distributed and collaborative processing.
- Multimodal emotion recognition for affective retrieval and affective interfaces: approaches to multimedia content analysis and interaction that use speech and facial expression recognition.
- Machine learning: adaptive multimodal interfaces and learning of visual concepts from user input for automatic detection and recognition (detection of scenes, objects, or events of interest).
- Multimodal fusion: technical approaches and issues in combining multiple media (e.g., audio-visual) for multimodal interaction and multimedia analysis.
- Interfaces between vision processing modules and high-level reasoning: the role of feedback to vision, knowledge accumulation, user behavior modeling, and environment discovery.
- Applications: traditional and emerging application areas will be described with specific examples in smart conference room research, the arts, interaction for people with disabilities, entertainment, and others.
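As a concrete illustration of the multimodal fusion topic listed above, one common baseline is decision-level (late) fusion, in which each modality's classifier produces class scores that are combined with per-modality reliability weights. The sketch below is a minimal, hypothetical example; the emotion labels, score values, and weights are illustrative, not from any specific system discussed in the tutorial.

```python
# Minimal sketch of decision-level (late) multimodal fusion:
# each modality yields class probabilities, which are combined
# via a normalized weighted sum (all values here are illustrative).

def late_fusion(scores_by_modality, weights):
    """Weighted-sum fusion of per-modality class-probability dicts."""
    fused = {}
    total_w = sum(weights[m] for m in scores_by_modality)
    for modality, scores in scores_by_modality.items():
        w = weights[modality] / total_w
        for label, p in scores.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return fused

# Hypothetical emotion scores from an audio classifier and a
# facial-expression classifier for the same observation.
audio = {"happy": 0.6, "neutral": 0.3, "sad": 0.1}
visual = {"happy": 0.8, "neutral": 0.15, "sad": 0.05}

fused = late_fusion({"audio": audio, "visual": visual},
                    weights={"audio": 0.4, "visual": 0.6})
decision = max(fused, key=fused.get)  # -> "happy"
```

Late fusion is only one point in the design space; feature-level (early) fusion and model-level fusion trade off robustness to missing modalities against the ability to exploit cross-modal correlations.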
Background and Potential Target Audience
The short course is intended for PhD students, scientists, engineers, application developers,
computer vision specialists and others interested in the areas of information retrieval
and human-computer interaction. A basic understanding of image processing and machine learning
is a prerequisite.
Biography
Hamid Aghajan
has been a consulting professor of Electrical Engineering at
Stanford University since 2003. His group's research focuses on multi-camera networks and human
interfaces for smart, vision-based reasoning environments, with application to smart homes, occupancy-based
services, assisted living and well-being, smart meetings, and avatar-based communication and social
interactions. Hamid is co-editor-in-chief of the Journal of Ambient Intelligence and Smart Environments.
He has co-edited three volumes: Human-Centric Interfaces for Ambient Intelligence, Multi-Camera
Networks: Principles and Applications, and Handbook of Ambient Intelligence and Smart Environments.
He has been editorial board member of the book series on Artificial Intelligence and Smart Environments
by IOS Press, associate editor of Machine Vision and Applications, guest editor of IEEE J-STSP special
issue on Distributed Processing in Vision Networks, and guest editor of CVIU special issue on Multimodal
Sensor Fusion. Hamid was a co-founder and technical co-chair of the first International Conference
on Distributed Smart Cameras (ICDSC 2007) and general co-chair of ICDSC 2008. He has organized short
courses on Distributed Vision Processing in Multi-Camera Networks at CVPR 2007, CVPR 2008, ACIVS 2007,
and ICASSP 2009. He has also chaired the special session on Distributed Processing in Smart Camera
Networks at ICASSP 2007, the workshop on Behaviour Monitoring and Interpretation at the German AI Conference 2008,
the special session on Vision-based Reasoning at the AITAmI workshop at ECAI 2008, the workshop on Multi-camera
and Multi-modal Sensor Fusion Algorithms and Applications at ECCV 2008, the special session on Multi-Sensor
HCI for Smart Environments at the Face and Gesture Conference 2008, and the workshop on Vision Networks for
Behaviour Analysis (VNBA) at ACM Multimedia 2008. Hamid obtained his Ph.D. degree in electrical
engineering from Stanford University in 1995.
Nicu Sebe
is with the Faculty of Science, University of Amsterdam, The Netherlands, and recently joined
the Faculty of Cognitive Sciences, University of Trento, Italy, where he leads research in
the areas of multimedia information retrieval and
human-computer interaction in computer vision applications. He is the author of Robust Computer
Vision: Theory and Applications (Kluwer, April 2003) and of Machine Learning in Computer Vision
(Springer, May 2005). He has been involved in organizing major conferences and workshops
addressing the computer vision and human-centered aspects of multimedia information retrieval,
serving as General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008)
and of the ACM International Conference on Image and Video Retrieval (CIVR 2007), and as one of
the initiators and a Program Co-Chair of the Human-Centered Multimedia track at ACM Multimedia
2007. He is the general chair of WIAMIS 2009 and ACM CIVR 2010, and a track chair of WWW 2009
and ICPR 2010. He has served as guest editor for several special issues of IEEE Computer,
Computer Vision and Image Understanding, Image and Vision Computing, Multimedia Systems, and ACM TOMCCAP.
He has been a visiting professor at the Beckman Institute, University of Illinois at Urbana-Champaign, and
at the Electrical Engineering Department, Darmstadt University of Technology, Germany. He was the
recipient of a British Telecom Fellowship. He is the co-chair of the IEEE Computer Society Task Force
on Human-centered Computing and is an associate editor of IEEE Transactions on Multimedia, Machine Vision
and Applications, Image and Vision Computing, Electronic Imaging and of Journal of Multimedia.