Taskflow recognition in everyday manipulation tasks

CoTeSys

Seamless integration of a cognitive system into technical processes and actions of daily life requires abstracting the observed actions into discrete, identifiable elements and generalizing them over a large number of possible variations caused by differences between subjects. Reasons for these variations include different experience levels, kinematic configuration (such as a subject's height and strength), and spatial constraints. This project aims to investigate a combined analysis of perception and workflow segmentation of everyday manipulation tasks, such as preparing meals, cooking, and assembly tasks performed together with a human counterpart. The main objective is to find synchronization points across the varying trials and to couple them to sensor perception in order to focus the attention of the system and to reduce its resource requirements.

An integrated perception and understanding system will be realized in the project that combines mechanisms for activity observation (based on existing sensor systems available in the PIs' laboratories), for learning discrete and continuous probabilistic models of activities, and for symbolic and subsymbolic reasoning about activity task-flows and their representation. The main research contribution will be the real-time identification of the currently performed step within the task-flow, using a model built automatically from performance data. We plan to map critical task-flow events to changes in the sensor data in order to automatically synchronize the recordings and to derive the task vocabulary from them. Figure 1 below shows a schematic overview of such a system.

Figure 1: Overview of the integrated taskflow recognition system. Input from various sensing devices is used to identify characteristic signal patterns and to generate an activity atlas (left). A graphical model is created from the activity atlas and additional, high-level knowledge about typical taskflow steps (center). In the recognition phase, the environment is able to understand the activities of a human actor and to provide context-aware assistance (right).

Training task-flow monitoring systems from unsupervised performance data is a very promising research area at the core of activity recognition systems targeting everyday manipulation tasks. The four principal investigators bring in unique and complementary competences and resources for the successful achievement of this project’s research goals. These goals are threefold:

Efficient perception system: On the sensory level, information coming from the various systems available in the research labs of the PIs will be synchronized and recorded to generate an atlas of sensory data for each demonstrated task-flow. This data will be used for training and for visual feedback. To bridge the gap between the raw signals and the high-level modeling, repetitive patterns inside the raw signals will be identified to form an intermediate vocabulary that will be integrated into the atlas and later used to construct the model. Additionally, essential information such as the relative importance of each sensor with respect to the recognition of subparts of the task-flow will be made explicit. This will also make it possible to analyse the recognition abilities of systems where only a limited number of sensors is available.

Automatic activity modeling: On the cognition level, a probabilistic representation of the taskflow will be automatically learned and constructed based on the atlas of examples and on a simple, high-level task-flow description. The representation will include knowledge about the semantic meaning of actions and will also account for variations in the perceptual information and for varying taskflows. After learning, the model will probabilistically integrate implicit knowledge that is rarely provided by high-level task-flow descriptions and will permit the detection of anomalies.

Monitoring-oriented user interface: For demonstration, a user interface will be designed that permits synchronized replay of recorded task-flows for comparison and evaluation, real-time recognition of actions, anomaly notification, and visual guidance.

The integrated system will be demonstrated in two scenarios: cooking a meal in the Assistive Kitchen and a construction task in the Cognitive Factory. These two scenarios clearly illustrate the importance of task-flow understanding and recognition within a cognition-perception-action loop.

State of the Art

Numerous works on isolated gesture/activity recognition and classification based on sensory data have been presented. In the following, we focus only on related work addressing recognition inside a task-flow or a scenario and on work on pattern or activity discovery within raw sensory data, since the contributions of the project will essentially fall within these two domains.

Using only video surveillance cameras, the AVITRACK European project [2] designed a framework for monitoring the aircraft parking zone of an airport. 3D models of the objects and detailed scenarios composed of events and states are used as a-priori information to ease the recognition. For human activity recognition, [9] propose a software infrastructure based on situation models, assuming that human activities follow a loosely defined script. The situation models are a formal representation of these scripts, defined as networks of situations concerning roles and relations; they are also assumed to be given a-priori. In [18], stochastic grammars are used to describe action sequences; this is used to demonstrate action recognition on the Tower of Hanoi game. In [27], a blood glucose monitoring system is proposed to help elderly people. A propagation network is introduced that is capable of recognizing concurrent actions; its topology is also described a-priori. In a further work [26], the network is only initialized from a few training examples and then refined in an unsupervised fashion from unlabeled data. Similarly to everyday manipulation tasks, frequently occurring surgeries have a well-defined task-flow. [21] propose to model the scenario of an endoscopic surgery as an automaton for robotic control. [15] propose, for the same purpose, to model the transitions between the tasks using conditions on the instruments. In [23], the known ordering of surgical phases and HMMs are used for surgical phase recognition. In [7], it is shown how the semantics of the signals can be used to annotate model nodes within a surgical task, so as to automatically obtain a human-readable model.

In contrast, we aim at using only a very high-level description of the task-flow that contains no indication of the objects used or of the performed interactions, but simply lists the different steps. The training will consist in having the system learn this information from the sample sequences. No assumption will be made on the available sensory data.

Unsupervised activity discovery in raw data is a preliminary step towards automatically building a task-flow model, as it maps the raw data to an intermediate vocabulary. [14] present suffix trees to detect anomalies and recurring activities occurring during everyday manipulation tasks. In [20], a different approach using subsequence comparisons and border refinement is proposed to detect repetitive patterns within multidimensional data. Finally, [19] propose an approach to detect patterns that may occur only in a subset of the data dimensions. A major difficulty within these approaches is the scale of the activities: low scales reveal detailed events that might only be simple constituents of what we usually call an activity. In this project, we aim at discovering patterns at low scales that will be used for the model generation; the components of the graphical model, which will contain probability distributions over these patterns, will model the activities.
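
To make the notion of low-level pattern discovery concrete, the following toy sketch finds the single most similar pair of non-overlapping windows in a multivariate signal by brute force. It is only an illustration of the problem setting; the cited approaches [14, 19, 20] use far more efficient and more general techniques (suffix trees, neighborhood estimation, subdimensional motifs), and all names in the sketch are illustrative.

    # Toy brute-force illustration of recurring-pattern (motif) discovery in a
    # multivariate signal. Not the algorithms of [14,19,20]; purely illustrative.
    import numpy as np

    def find_motif_pair(signal, window, step=1):
        """signal: (T, d) array of feature vectors.
        Returns start indices (i, j) of the two most similar
        non-overlapping windows of the given length."""
        starts = range(0, len(signal) - window + 1, step)
        best_dist, best_pair = np.inf, (None, None)
        for i in starts:
            for j in starts:
                if abs(i - j) < window:      # skip trivial, overlapping matches
                    continue
                d = np.linalg.norm(signal[i:i + window] - signal[j:j + window])
                if d < best_dist:
                    best_dist, best_pair = d, (i, j)
        return best_pair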

Goals and Methods

The project targets the perception, interpretation and analysis of everyday manipulation tasks, which as such follow a known and repetitive task-flow. Many differences may however exist between different instances, even when performed by the same person: subtasks may have different lengths, places might change, numbers and quantities can vary. In addition to these variations, much implicit knowledge is present in all subtasks. This is why, for instance, a novice cook rarely manages to prepare the same dish as the professional cook who wrote the recipe: many steps require refined actions that can only be learned with supervision or experience. In the simpler example of a waiter bringing a dish to a customer, the waiter will naturally put the plate on the table, neither on the floor nor on his head. The assumption of this proposal is that this knowledge can be learned from an atlas of sensory data provided by multiple recordings of the task-flow. Contrary to much existing work, which strives to fully model the task-flow a-priori, we aim at an automatic construction of a probabilistic representation using only a rough high-level description containing the few high-level steps. The probabilistic representation will incorporate the implicit knowledge, cope with variations, and be used for subtask recognition and anomaly detection.

The project is divided into three parts that permit the construction of a fully integrated system.

The first part focuses on the sensory level, to gather and provide meaningful input to the learning and recognition processes. For this part, no equipment is required from CoTeSys, as each partner already possesses sensors from other projects. The chair of Prof. Navab will provide a multi-camera system that can be used for real-time 4D reconstruction and a full-body marker-based tracking system. The chair of Prof. Beetz provides a sensor network including RFIDs and a camera-based full-body tracking system. The group of Prof. Burschka will provide signal-level analysis tools for directly matching events that define the boundaries of procedure steps to changes in the appearance of the images obtained from the cameras. The goal here is to find common robust cues for task segmentation from camera information. A simple example to visualize the process is monitoring the area around the hands of the user to identify tool changes; this can be used directly to monitor changes in the tools used (e.g., a new knife) without explicit posture tracking and is an extension of the PI's previous work on VICs (Visual Interaction Components) in an HCI context. Prof. Essa contributes his experience from the sensor-based Aware Home project at GeorgiaTech.

The second part is the most fundamental and innovative part of this project. It targets the automatic probabilistic representation of the task-flow and of its implicit knowledge, based on an example atlas. Using this model, it will also perform a relative analysis of which sensors are essential with respect to the recognition of subparts of the task-flow.

The third part consists of the integration of the achievements of parts one and two into a user interface, augmented by smart features for visualization, comparison, monitoring and guidance.

In detail, the objectives are:

  1. Perception system
    1. Optimal sensor placement and data fusion. Existing sensor information from stationary cameras and from cameras on the cognitive systems contributes to the resulting model description with varying information gain. One objective is therefore to select the sensor placements that reduce the uncertainty in the information most efficiently. The view planning will be mapped onto the existing stationary sensors in the scene and onto possible actions of mobile agents.
    2. Atlas acquisition. For each scenario, a database of instances of the task-flow, performed by different persons, will be recorded. The raw signal information as well as the feature vectors will be stored in the task-flow atlas, which will further serve as input data for training and as electronic documentation for guidance (a possible layout is sketched after this list).
    3. Vocabulary construction. The raw data contain repetitive patterns that can be considered as low-level recurring events. These events are essential for understanding the higher-level activities, much like phonemes in speech recognition. In order to simplify the model construction, a method to automatically detect these patterns at different scales will be designed to form a low-level event vocabulary.
  2. Learning and use of activity models
    1. Automatic modeling for recognition. Using a high-level task-flow description containing alternatives as input, together with the unlabeled atlas, a method for constructing a probabilistic representation of the task-flow in terms of a graphical model will be designed. The topology and parameters of the model will be derived partially from the task-flow description and principally from the instances of the atlas translated to the intermediate vocabulary. Variations in the realization of the task-flow and implicit knowledge will be automatically inferred during the construction and modeled probabilistically. A state inference method will also be designed to synchronize the real-time data to the model and thus perform on-line task recognition (a minimal filtering sketch is given after this list).
    2. Integration of anomaly detection. Important for the demonstration is the detection of anomalies. An anomaly occurs when a task is skipped or performed in a way that deviates from the implicit knowledge. An algorithm will be designed to infer and distinguish between these two cases, providing its results in terms of likelihoods (see the likelihood sketch after this list).
    3. Optimal sensor set evaluation. In real applications, not all sensors will be available. We distinguish between sensor subsets for model construction and sensor subsets for recognition: for instance, a richer set of sensors can be used for training, while a smaller set is used for real-time recognition. Both cases will be evaluated, i.e. the recognition quality as a function of the sensors used for model construction and as a function of the sensors used for real-time recognition (an evaluation loop is sketched after this list). The quality will be presented in terms of an overall recognition score as well as local recognition scores for each subtask. This will ease the choice of meaningful sensors for an efficient and cost-effective recognition of the taskflow.
    4. Semantic model refinement. During the use of the system, anomalies will occur and be detected. The model will be extended to easily include additional semantic information about the discovered anomalies. This form of supervised semantic integration during the real use of the system will improve the feedback quality and also the identification of anomalies that occur within implicit activities.
  3. Monitoring-oriented user interface
    1. Smart atlas browser. The user interface will enable convenient visualization of the atlas. Different visualization modes will be available for intuitive display of the multidimensional signals, and a synchronized replay of different instances of the task-flow, including signals and videos, will be provided.
    2. Activity recognition. The model, the real-time signals and the estimated state within the taskflow will be displayed. This will allow guidance and monitoring by indicating the next steps, possibly showing synchronized demonstration videos from the atlas, and estimating the remaining time until the end of the subtask or of the taskflow.
    3. Extensions. The interface will be extended to allow adding semantic information to the model while viewing synchronized replays of task-flow instances. The interface will also display the recognition scores obtained when subsets of sensors are used and will permit a simulated replay and detection of the recorded instances as if only these sensors were available. It will also permit selecting which sensor subset is used during a real-time experiment.
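
As a rough illustration of objective 1.2, the following sketch shows one possible in-memory layout for an atlas of synchronized recordings. All type and field names are hypothetical and only indicate the kind of information (common time base, per-sensor streams, optional coarse step labels) the atlas is meant to hold.

    # Hypothetical sketch of an atlas entry; names and fields are illustrative only.
    from dataclasses import dataclass, field
    from typing import Dict, List
    import numpy as np

    @dataclass
    class SensorStream:
        name: str                # e.g. "rfid", "camera_left", "body_tracker"
        timestamps: np.ndarray   # common time base after synchronization (seconds)
        samples: np.ndarray      # (T, d) array of raw signals or feature vectors

    @dataclass
    class TrialRecording:
        subject_id: str
        scenario: str            # e.g. "assistive_kitchen" or "cognitive_factory"
        streams: Dict[str, SensorStream] = field(default_factory=dict)
        step_labels: List[str] = field(default_factory=list)  # coarse high-level steps, if any

    # The atlas is simply a collection of such synchronized trials, later
    # translated into the low-level event vocabulary of objective 1.3.
    Atlas = List[TrialRecording]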
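
For objective 2.1, one simple way to picture the on-line state inference is filtering in a hidden Markov model whose hidden states are the task-flow steps and whose observations are indices into the low-level event vocabulary. The sketch below is a generic illustration of such filtering under these assumptions, not the project's actual graphical model, whose topology and parameters will be derived from the task-flow description and the atlas.

    # Minimal sketch of on-line step estimation by HMM filtering; model
    # parameters (pi, A, B) are assumed to have been learned from the atlas.
    import numpy as np

    def forward_filter(events, start_p, trans_p, emit_p):
        """events  : indices of low-level vocabulary events observed so far
        start_p : (S,)   prior over task-flow steps
        trans_p : (S, S) step-transition probabilities (roughly left-to-right)
        emit_p  : (S, V) probability of each vocabulary event given the step
        Returns P(step_t | events_1..t) for every t."""
        belief = start_p * emit_p[:, events[0]]
        belief /= belief.sum()
        history = [belief]
        for e in events[1:]:
            belief = (belief @ trans_p) * emit_p[:, e]
            belief /= belief.sum()        # keep a normalized posterior
            history.append(belief)
        return np.array(history)

    # The currently performed step is then the mode of the latest posterior:
    # current_step = forward_filter(events, pi, A, B)[-1].argmax()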
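
For objective 2.2, a likelihood-based anomaly test can be illustrated as follows: compute the running log-likelihood of the observed events under the learned model and flag an anomaly when it falls well below the range observed on the atlas. The threshold rule and all names are illustrative assumptions, not the algorithm to be developed in the project.

    # Hedged sketch of likelihood-based anomaly flagging (scaled forward algorithm).
    import numpy as np

    def loglik_per_event(events, start_p, trans_p, emit_p):
        """Average log-likelihood per observed event under the HMM sketched above."""
        alpha = start_p * emit_p[:, events[0]]
        loglik = np.log(alpha.sum())
        alpha /= alpha.sum()
        for e in events[1:]:
            alpha = (alpha @ trans_p) * emit_p[:, e]
            loglik += np.log(alpha.sum())
            alpha /= alpha.sum()
        return loglik / len(events)

    def is_anomalous(events, model_params, atlas_scores, k=3.0):
        """Flag sequences scoring far below the atlas mean (threshold k is illustrative)."""
        score = loglik_per_event(events, *model_params)
        return score < np.mean(atlas_scores) - k * np.std(atlas_scores)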
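
Objective 2.3 can be read as a straightforward evaluation loop over sensor subsets, sketched below with placeholder functions for training and scoring; which subsets are worth enumerating, and how the scores are computed, is precisely what the project will determine.

    # Illustrative evaluation loop; train_model and recognition_score are placeholders.
    from itertools import combinations

    def evaluate_sensor_subsets(atlas, sensors, train_model, recognition_score):
        results = {}
        for r in range(1, len(sensors) + 1):
            for subset in combinations(sensors, r):
                model = train_model(atlas, use_sensors=subset)           # sensors for training
                results[subset] = recognition_score(model, atlas,
                                                    use_sensors=subset)  # sensors for recognition
        return results

    # results maps each sensor subset to an overall recognition score;
    # per-subtask scores would be reported analogously.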

People

Nassir Navab, Lehrstuhl für Informatikanwendungen in der Medizin & Augmented Reality, Fakultät für Informatik, Technische Universität München.

Irfan Essa (Associated PI), Computational Perception Lab, GeorgiaTech, Atlanta, USA.

Darius Burschka, Lehrstuhl für Robotik und Integrierte Systeme, Fakultät für Informatik, Technische Universität München.

Michael Beetz, Lehrstuhl für Bildverstehen und Wissensbasierte Systeme, Fakultät für Informatik, Technische Universität München.

Dipl.-Inf. Oliver Ruepp, Lehrstuhl für Robotik und Integrierte Systeme, Fakultät für Informatik, Technische Universität München.

Nicolas Padoy, Lehrstuhl für Informatikanwendungen in der Medizin & Augmented Reality, Fakultät für Informatik, Technische Universität München.

Diana Mateus, Lehrstuhl für Informatikanwendungen in der Medizin & Augmented Reality, Fakultät für Informatik, Technische Universität München.

References

[1] http://awarehome.imtc.gatech.edu/.

[2] http://www.avitrack.net/.

[3] Seyed-Ahmad Ahmadi, Nicolas Padoy, Sandro Michael Heining, Hubertus Feussner, Martin Daumer, and Nassir Navab. Introducing wearable accelerometers in the surgery room for activity detection. In 7. Jahrestagung der Deutschen Gesellschaft für Computer- und Roboter-Assistierte Chirurgie (CURAC 2008), Leipzig, Germany, September 2008.

[4] Seyed-Ahmad Ahmadi, Tobias Sielhorst, Ralf Stauder, Martin Horn, Hubertus Feussner, and Nassir Navab. Recovery of surgical workflow without explicit models. In MICCAI ’06, pages 420–428, 2006.

[5] Jan Bandouch, Florian Engstler, and Michael Beetz. Accurate human motion capture using an ergonomics-based anthropometric human model. In Fifth International Conference on Articulated Motion and Deformable Objects (AMDO), 2008.

[6] Michael Beetz, Freek Stulp, Bernd Radig, Jan Bandouch, and Nico Blodow. The assistive kitchen - a demonstration scenario for cognitive technical systems. In IEEE 17th International Symposium on Robot and Human Interactive Communication (RO-MAN), 2008.

[7] Tobias Blum, Nicolas Padoy, Hubertus Feussner, and Nassir Navab. Workflow mining for visualization and analysis of surgeries. International Journal of Computer Assisted Radiology and Surgery, 2008.

[8] Tobias Blum, Nicolas Padoy, Hubertus Feussner, and Nassir Navab. Modeling and online recognition of surgical phases using hidden markov models. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), New York, USA, September 2008.

[9] Oliver Brdiczka, James L. Crowley, and Patrick Reignier. Learning situation models for providing context-aware services. In HCI (6), pages 23–32, 2007.

[10] D. Burschka and G. Hager. Stereo-Based Obstacle Avoidance in Indoor Environments with Active Sensor Re-Calibration. In International Conference on Robotics and Automation, pages 2066–2072, 2002.

[11] Darius Burschka. Videobasierte Exploration von Innenräumen am Beispiel eines binokularen Stereo-Kamerasystems. PhD thesis, Department of Electrical Engineering, Technische Universität München, December 1998.

[12] Darius Burschka and Gregory D. Hager. V-GPS(SLAM): – Vision-Based Inertial System for Mobile Robots. In Proc. of ICRA, pages 409–415, April 2004.

[13] G. Hager and D. Burschka. Laser-based Position Tracking and Map Generation. In Proceedings of Robotics and Automation, pages 149–155, August 2000.

[14] Raffay Hamid, Siddhartha Maddi, Aaron Bobick, and Irfan Essa. Structure from statistics: unsupervised activity analysis using suffix trees. In ICCV, 2007.

[15] Seong-Young Ko, Jonathan Kim, Woo-Jung Lee, and Dong-Soo Kwon. Surgery task model for intelligent interaction between surgeon and laparoscopic assistant robot. International Journal of Assistive Robotics and Mechatronics, 8(1):38–46, 2007.

[16] Alexander Ladikos, Selim Benhimane, and Nassir Navab. Real-time 3d reconstruction for collision avoidance in interventional environments. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2008.

[17] Gunter Magin, Achim Ruß, Darius Burschka, and Georg Färber. A dynamic 3D environmental model with real-time access functions for use in autonomous mobile robots. Robotics and Autonomous Systems, 14:119–131, 1995.

[18] David Minnen, Irfan A. Essa, and Thad Starner. Expectation grammars: Leveraging high-level expectations for activity recognition. In CVPR (2), pages 626–632, 2003.

[19] David Minnen, Charles L. Isbell, Irfan A. Essa, and Thad Starner. Detecting subdimensional motifs: An efficient algorithm for generalized multivariate pattern discovery. In ICDM, pages 601–606, 2007.

[20] David Minnen, Thad Starner, Irfan A. Essa, and Charles Lee Isbell Jr. Improving activity discovery with automatic neighborhood estimation. In IJCAI, pages 2814–2819, 2007.

[21] F. Miyawaki, K. Masamune, S. Suzuki, K. Yoshimitsu, and J. Vain. Scrub nurse robot system - intraoperative motion analysis of a scrub nurse and timed-automata-based model for surgery. IEEE Transactions on Industrial Electronics, 52(5):1227–1235, 2005.

[22] Nicolas Padoy, Tobias Blum, Irfan Essa, Hubertus Feussner, Marie-Odile Berger, and Nassir Navab. A boosted segmentation method for surgical workflow analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 102–109, Brisbane, Australia, October 2007.

[23] Nicolas Padoy, Tobias Blum, Hubertus Feussner, Marie-Odile Berger, and Nassir Navab. On-line recognition of surgical activity for monitoring in the operating room. In AAAI, pages 1718–1724, 2008.

[24] Radu Bogdan Rusu, Brian Gerkey, and Michael Beetz. Robots in the kitchen: Exploiting ubiquitous sensing and actuation. In Robotics and Autonomous Systems Journal (Special Issue on Network Robot Systems), 2008.

[25] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and Michael Beetz. Towards 3d point cloud based object maps for household environments. In Robotics and Autonomous Systems Journal (Special Issue on Semantic Knowledge), 2008.

[26] Yifan Shi, Aaron F. Bobick, and Irfan A. Essa. Learning temporal sequence model from partially labeled data. In CVPR (2), pages 1631–1638, 2006.

[27] Yifan Shi, Yan Huang, David Minnen, Aaron F. Bobick, and Irfan A. Essa. Propagation networks for recognition of partially ordered sequential action. In CVPR (2), pages 862–869, 2004.

[28] Tobias Sielhorst, Ralf Stauder, Martin Horn, Thomas Mussack, Armin Schneider, Hubertus Feussner, and Nassir Navab. Simultaneous replay of automatically synchronized videos of surgeries for feedback and visual assessment. International Journal of Computer Assisted Radiology and Surgery, Supplement 1, 2:433–434, June 2006.

[29] Freek Stulp and Michael Beetz. Combining declarative, procedural and predictive knowledge to generate and execute robot plans efficiently and robustly. In Robotics and Autonomous Systems Journal (Special Issue on Semantic Knowledge), 2008.

[30] Freek Stulp and Michael Beetz. Refining the execution of abstract actions with learned action models. In Journal of Artificial Intelligence Research (JAIR), 2008.