

Summary of Visuomotor representations of the peripersonal space in humanoid robots. Active learning by gazing and reaching

Marco Antonelli

  • INTRODUCTION In the last decades, robots have been moving from industry to humans' daily-life environments. This has drastically changed the requirements that robots are expected to meet: instead of acting in well-known and structured setups, they have to deal with changing, unstructured environments and to interact with humans. In such conditions, executing the given task as fast as possible is still a desirable capability, but no longer the priority. Crucial features are now the capability of adapting to unpredicted situations and of acting while ensuring the safety of the humans they are interacting with. For this reason, the embodiment of robots is also changing. On the one hand, new sensors such as RGB-D cameras, robot skins and tactile sensors have been developed to increase the information that robots can acquire about their surroundings. On the other hand, modern robots tend to be endowed with compliant actuators and soft materials because of their manifold advantages, such as avoiding hurting humans or damaging the robot itself. Furthermore, in order to make robots available to a large number of people, new robots are usually less expensive and less accurate than those found in industry. From the computer science point of view, this tendency requires the development of new models for processing and integrating multimodal sensory inputs, both to provide adequate control signals to the actuators and to create faithful representations of the environment. In order to ensure lifelong autonomy of the robot, the control architecture should also be robust to changes in the physical configuration of the robot, due, for example, to wear, damage, or the use of un-modeled tools. Thus, the system should be able to autonomously recalibrate its internal parameters while the robot is interacting with the environment.

    The purpose of this thesis is to investigate new techniques directed toward the achievement of a complete visual awareness of the space surrounding a humanoid robot. In particular, we focus on the sensory information provided by the cameras of the robotic head and by the encoders of the head and limb motors. Our first goal is to obtain a robot capable of observing its surroundings to create a coherent representation of the environment that is purposive for achieving the task at hand. Second, during its normal behavior, the robot should be able to supervise and evaluate the outcome of its actions in order to correct them and keep its internal model up to date.

    BACKGROUND In order to address the goals described in the introduction we take inspiration from biology and in particular from neuroscience. Nature has faced similar problems in the development of organisms, and evolution has led to systems that are extremely efficient in processing sensory information. While multiple differences exist between the typical architectures of robotic systems and the mechanisms of organisms, biological systems provide an important source of inspiration for developing new cognitive abilities on robots. On the other hand, implementing computational models of biological systems on robots can provide useful feedback to the neuroscience community. Even though the proposed system is biologically inspired, we limit the parallelism with biological systems to high-level concepts, and we model low-level characteristics according to the real-time requirements imposed by robotic applications. Herein, we highlight some fundamental principles that emerge from neuroscience findings and which are the "leitmotif" of this study:

    1. Perception and behavior are indissolubly coupled. Unlike many artificial systems, organisms are not passively exposed to the incoming flow of sensory data, but actively seek useful information by coordinating sensory processing with motor activity. Behavior is always present, even when it is not immediately obvious. For example, in humans and many other species, microscopic head and eye movements occur even in the periods of "visual fixation": the brief intervals (~300 ms) in between macroscopic relocations of gaze, in which humans acquire visual information. Despite their small amplitude, they operate a critical reformatting of the spatiotemporal stimulus on the retina (Kuang et al., 2012), are under oculomotor control (Ko et al., 2010), and contribute to the processing of visual information and the establishment of spatial representations (Aytekin and Rucci, 2012; Poletti et al., 2013).

    Using purposive behavior in robotics to simplify ill-posed problems, such as those of early vision, is not novel (Bertero et al., 1988; Aloimonos et al., 1988); however, the contribution that behavior can provide at small scale has only been superficially explored (Hongler et al., 2003; Santini and Rucci, 2007).

    2. Learning is always present. Goal-directed movements aimed at creating a space representation to accomplish a given task require an internal model of the sensori-motor apparatus. The body of biological systems continually changes with age, and the parameters of the internal model have to be adjusted accordingly. This is achieved by encoding the associations among sensori-motor cues in plastic maps that are updated by means of correlation rules. Evidence of this process in humans is provided by the saccadic adaptation paradigm (McLaughlin, 1967). In this setup, while humans perform a saccade toward a target, the target is covertly displaced to make the brain believe that the performed movement was erroneous. Performing this task systematically induces the brain to learn to saccade to the displaced target, showing the plasticity of the internal model of the oculomotor system (Lappe, 2009). This adaptive behavior influences not only the motion of the eye, but also the perception of space (Collins et al., 2007; Lappe, 2009).

    In robotics, neural networks and learning strategies have been used extensively; however, in most cases the system needs a human supervisor that provides the correct teaching signal. Autonomous robots and lifelong learning require a system that can "self-supervise" itself to detect and correct potential failures. This is achieved by avoiding explicit artificial representations and by representing the space in terms of sensory cues such as joint angles. This methodology creates distributed and implicit representations which are intrinsically linked to the behavior: the representations of both the environment and the internal model are inseparably intertwined for achieving an action, and hence they should be developed in parallel (Wörgötter et al., 2009).

    3. Perception relies on the optimal integration of multiple cues. The perception of the surroundings in humans and primates is an extremely parallel process. For example, depth perception relies on the simultaneous processing of more than 20 cues, which are not restricted to the visual modality. The close coupling between behavioral and visual processes yields depth information directly in the motor and proprioceptive modalities (e.g., vergence and focus adjustments).

    Moreover, a substantial body of work shows that humans often integrate different cues following a statistically optimal approach (Ernst and Banks, 2002; Knill and Pouget, 2004; Stocker and Simoncelli, 2006; Freeman et al., 2010; Geisler, 2011; Poletti et al., 2013). This approach implies that sensory processes not only extract the relevant cues, but also estimate their reliability on the basis of previously acquired knowledge. It implies a probabilistic model of the environment, which needs to be modified during the course of experience (Fiser et al., 2010).

    Similar optimal approaches have been used in robotics and computer vision (Wolpert et al., 1995; Davison et al., 2007; Schrater and Kersten, 2000), but they rarely integrate cues from visual, motor, and proprioceptive signals (Ferreira et al., 2013), as humans appear to do. In order to investigate the existing interplay between vision and motor control, and to study how to exploit these interactions, it is essential to have integrated systems that mimic how humans act and how they process data. Even though some authors have developed integrated robotic systems (Rasolzadeh and Björkman, 2010; Azad et al., 2007; Calinon et al., 2007), they typically rely on a computer vision approach, while only a few are inspired by computational neuroscience (Shibata et al., 2001; Hoffmann et al., 2005; McBride et al., 2010; Grzyb et al., 2009). A minimal numerical sketch of this kind of reliability-weighted cue fusion is given below.
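    The following Python snippet illustrates the reliability-weighted (inverse-variance) fusion that underlies the statistically optimal integration discussed in point 3. It is a generic sketch in the spirit of Ernst and Banks (2002), not code from the thesis; the cue values and variances in the example are made up.

```python
import numpy as np

def fuse_cues(estimates, variances):
    """Reliability-weighted (inverse-variance) fusion of independent cues.

    Generic illustration of statistically optimal cue integration,
    not the thesis implementation.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)   # more reliable cues weigh more
    fused = np.sum(weights * estimates)
    fused_variance = 1.0 / np.sum(1.0 / variances)          # never larger than any single cue
    return fused, fused_variance

# Example: depth from motion parallax (0.52 m, noisier) and from vergence (0.48 m, sharper)
depth, var = fuse_cues([0.52, 0.48], [0.02**2, 0.01**2])
print(depth, var)
```

    Note that the fused variance is never larger than that of the most reliable cue, which is the practical reason for integrating cues in the first place.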

    THESIS OUTLINE AND RESULTS This thesis is organized into four main chapters, followed by the conclusions and the appendices. First, we have shown how behavior in the form of "fixational eye movements" can simplify the creation of a coherent representation of the peripersonal space, that is, the space nearby the body (Chapter 2). In Chapter 3, saccadic movements have been used to learn the internal model of the "oculomotor system" of the robot. In Chapter 4, we have introduced arm movements and we have extended the model presented in Chapter 3 to show how the representations of both the environment and the robot's internal model are intertwined and developed in parallel. Finally, we have proposed an integrated architecture that combines visual perception, space representation and behavior. A brief summary of each chapter is included below.

    Chapter 2. We have presented how behavior can be used for creating a coherent representation of the nearby space in a humanoid robot. The goal was to extract depth information from a monocular camera mounted on a robotic head.

    The behavior performed by the robot was intended to simulate the fixational head movements observed in humans during natural fixation (Aytekin and Rucci, 2012). Thus, the robot moved the neck to generate motion-dependent depth cues and it simultaneously moved the eye to keep fixation on a reference object. During the execution of this behavior, the robot integrated proprioceptive and visual cues to generate a retinotopic (pixel-based) depth map that was suitable to be integrated with other depth cues such as binocular disparity. The achieved results showed how this small motion provides faithful depth information in the agent's peripersonal space. The contents of this chapter are based on the following publications: Antonelli et al. (2014a, 2013a).
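    As an illustration of the geometry exploited in this chapter, the sketch below recovers depth from the residual retinal shift produced by a small lateral head translation while fixation is maintained. It assumes a pinhole camera and small motions; the function and variable names are illustrative and do not reproduce the Bayesian multimodal integration actually used in the thesis.

```python
def depth_from_parallax(dx_px, baseline_m, focal_px, fixation_depth_m):
    """Depth from motion parallax while the eye keeps fixation on a reference.

    Pinhole-camera, small-motion approximation:
        dx ~ f * T * (1/Z - 1/Z_fix)   =>   Z = 1 / (dx/(f*T) + 1/Z_fix)

    dx_px:            residual image shift of the point (pixels)
    baseline_m:       lateral head translation T (meters, from proprioception)
    focal_px:         focal length (pixels)
    fixation_depth_m: distance of the fixated reference object (meters)
    """
    inv_depth = dx_px / (focal_px * baseline_m) + 1.0 / fixation_depth_m
    return 1.0 / inv_depth

# Example: 6-pixel shift, 5 mm head translation, f = 800 px, fixating at 0.5 m
print(depth_from_parallax(6.0, 0.005, 800.0, 0.5))  # ~0.29 m, i.e. nearer than the fixated object
```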

    Chapter 3. Robot behavior and multi-modal cue integration led to the extraction of depth information. In order to do so, the internal model of the robot was needed. In Chapter 3 we have presented how this internal model can be learned by means of the behavior itself. This chapter focused on the learning of saccade control. Saccades are ballistic movements that are open-loop with respect to vision; their goal is to generate a proper eye movement that allows the robot to fixate a desired target object. In our model, learning took place after each saccade, when the outcome of the performed action was compared with the sensory perception. The model that we proposed to solve the inverse control problem took inspiration from a computational model of the cerebellum (Porrill et al., 2013) based on recurrent loops between a fixed inverse model and an adaptive forward model. Results showed that the proposed model was suitable to learn the internal model required for executing accurate saccades. We also showed how probabilistic neural networks can be used to select a target in order to speed up the learning process. A pilot study that compared a preliminary version of the proposed computational model with results observed in human saccadic adaptation was presented in Appendix C. The contents of this chapter are based on the following publications: Antonelli et al. (2015, 2014b, 2013b).
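    The following toy example conveys the error-driven nature of this learning scheme: after each open-loop saccade, the remaining retinal error is used to adapt the motor mapping. It is a deliberately simplified one-dimensional gain-adaptation loop; the thesis model is a cerebellum-inspired recurrent architecture with an adaptive forward model (Porrill et al., 2013), which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

true_gain = 0.8          # unknown plant: eye rotation per unit motor command
rough_inverse_gain = 1.0 # fixed, approximate inverse model (assumes unit gain)
adaptive_gain = 0.0      # learned correction, starts at zero
lr = 0.3                 # learning rate

for trial in range(50):
    target = rng.uniform(-20, 20)                # retinal error to be nulled (deg)
    command = (rough_inverse_gain + adaptive_gain) * target
    eye_movement = true_gain * command           # executed (open-loop) saccade
    post_saccadic_error = target - eye_movement  # visual outcome checked after the saccade
    # error-driven update: adjust the correction in proportion to the miss
    adaptive_gain += lr * post_saccadic_error * target / (target**2 + 1e-6)

print(rough_inverse_gain + adaptive_gain)  # converges toward 1/true_gain = 1.25
```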

    Chapter 4. We have shown how the robot can achieve an implicit representation of the peripersonal space. This representation was counterposed to the explicit representation obtained in Chapter 2, in which the position of the objects is represented in Cartesian space. Indeed, the implicit representation was intrinsically related to the degrees of freedom of the robot, and emerged from the ability of the robot to perform a required task. When the robot observed a target, the resulting representation of the target location depended on the task at hand. From this perspective, the 3D representation and the task are intertwined. In this work, reaching and gazing were performed by transforming a retinotopic position of the target into an eye-centered representation or into an arm-centered one, without the necessity to pass through a Cartesian representation. This strategy is likely to be implemented in the parietal cortex of the human brain, where several populations of neurons encode the target position in the vergence-version space and in an arm-centered frame of reference. In this chapter we showed how the robot can take advantage of the visuo-oculomotor transformation learned for saccadic control to create a representation of the space suitable for reaching and grasping tasks. The contents of this chapter are based on the following publications: Chinellato et al. (2011); Antonelli et al. (2012a).
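    As a rough illustration of a direct sensorimotor mapping that bypasses an explicit Cartesian stage, the sketch below learns a function from retinal target position and eye posture to arm joint angles with a radial-basis-function regressor. The regressor, the input/output variables and the toy kinematics are assumptions made for the example; the thesis encodes the transformation with population codes in vergence-version and arm-centered frames of reference.

```python
import numpy as np

class RBFMap:
    """Direct sensorimotor map: (retinal target, eye posture) -> arm command.

    Illustrative radial-basis-function regressor, not the thesis model.
    """
    def __init__(self, centers, sigma=1.0):
        self.centers = np.asarray(centers, dtype=float)   # (K, D) basis centres
        self.sigma = sigma
        self.weights = None                               # (K, M) output weights

    def _phi(self, x):
        d2 = np.sum((x[:, None, :] - self.centers[None, :, :]) ** 2, axis=2)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))      # (N, K) basis activations

    def fit(self, x, y):
        phi = self._phi(np.atleast_2d(x))
        self.weights, *_ = np.linalg.lstsq(phi, np.atleast_2d(y), rcond=None)

    def predict(self, x):
        return self._phi(np.atleast_2d(x)) @ self.weights

# Toy usage: inputs = [retinal x, retinal y, eye pan, eye tilt], outputs = 2 arm joints
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 4))
Y = np.stack([X[:, 0] + 0.5 * X[:, 2], X[:, 1] - 0.3 * X[:, 3]], axis=1)  # fake kinematics
mapping = RBFMap(centers=rng.uniform(-1, 1, size=(40, 4)), sigma=0.7)
mapping.fit(X, Y)
print(np.abs(mapping.predict(X) - Y).mean())  # small residual on the training set
```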

    Chapter 5. The 3D representation of the space presented so far was mainly related to the so-called "vision for action". In Chapter 5 we showed how to integrate "vision for action" with "vision for perception", which is in charge of recognizing and representing objects. These two aspects of vision are processed in the human brain by two parallel and interconnected processing pathways, the dorsal and the ventral streams, respectively (Goodale and Westwood, 2004). In this chapter we proposed a modular software architecture to integrate these two different aspects of vision and we showed the importance of considering them contextually. We concluded that the ventral stream's ability to recognize objects, and thus identify the correct target, was instrumental in calibrating the dorsal stream. The mathematical details of the computational models used in the integrated architecture were provided in Appendix B. The contents of this chapter are based on the following publication: Antonelli et al. (2014c).

    Appendices. We have concluded the thesis with four appendices. In addition to the two appendices mentioned above, we have provided the description of the robots used to test the proposed models in Appendix A, and a CUDA-based implementation of the log-polar transform in Appendix D. We have compared the proposed implementation with other parallel implementations that use shaders and multi-core processors. Log-polar imaging is a kind of foveal, biologically inspired visual representation with advantageous properties in practical applications in computer vision and robotics. The proposed implementation allowed us to achieve real-time performance (30 frames/s) for gray-level images as large as 4096 × 4096 pixels, or for 1024 × 1024 color images. The contents of this appendix are based on the following publication: Antonelli et al. (2012b).
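    For reference, the core of the log-polar mapping can be written in a few lines of NumPy. This is a nearest-neighbor sampling sketch on the CPU, meant only to show the coordinate transform; the implementation in Appendix D runs on CUDA (with shader and multi-core variants) and handles receptive-field averaging and color images, which are not reproduced here.

```python
import numpy as np

def log_polar(image, r_bins=64, theta_bins=128):
    """Map a square gray-level image to log-polar (cortical) coordinates."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cx, cy)
    r_min = 1.0                                          # foveal radius (pixels)
    # sample radii logarithmically and angles uniformly
    rho = np.exp(np.linspace(np.log(r_min), np.log(r_max), r_bins))
    theta = np.linspace(0.0, 2.0 * np.pi, theta_bins, endpoint=False)
    rr, tt = np.meshgrid(rho, theta, indexing="ij")
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    return image[ys, xs]                                 # (r_bins, theta_bins) output

# Example: a 1024x1024 frame is reduced to a 64x128 cortical image
frame = np.random.rand(1024, 1024).astype(np.float32)
print(log_polar(frame).shape)
```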

    MAIN REFERENCES Aloimonos, J., Weiss, I., and Bandyopadhyay, A. (1988). Active vision. Int. J. Comput. Vision, pages 333–356.

    Antonelli, M., del Pobil, A., and Rucci, M. (2013a). Depth Estimation during Fixational Head Movements in a Humanoid Robot, volume 7963 of Lect. Notes Comput. Sc., pages 264–273. Springer Berlin Heidelberg.

    Antonelli, M., del Pobil, A. P., and Rucci, M. (2014a). Bayesian multimodal integration in a robot replicating human head and eye movements. In IEEE International Conference on Robotics and Automation (ICRA), pages 2868–2873.

    Antonelli, M., Duran, A., Chinellato, E., and del Pobil, A. P. (2013b). Speeding-Up the Learning of Saccade Control, volume 8064 of Lecture Notes in Computer Science, pages 12–23. Springer Berlin Heidelberg.

    Antonelli, M., Duran, A., Chinellato, E., and del Pobil, A. P. (2015). Adaptive saccade controller inspired by the primates cerebellum. In IEEE International Conference on Robotics and Automation (ICRA) (accepted).

    Antonelli, M., Duran, A. J., Chinellato, E., and del Pobil, A. P. (2014b). Learning the visual-oculomotor transformation: Effects on saccade control and space representation. Robotics and Autonomous Systems, in press.

    Antonelli, M., Gibaldi, A., Beuth, F., Duran, A., Canessa, A., Chessa, M., Solari, F., del Pobil, A., Hamker, F., Chinellato, E., and Sabatini, S. (2014c). A hierarchical system for a distributed representation of the peripersonal space of a humanoid robot. Autonomous Mental Development, IEEE Transactions on, 6(4):259–273.

    Antonelli, M., Grzyb, B. J., Castelló, V., and del Pobil, A. P. (2012a). Plastic Representation of the Reachable Space for a Humanoid Robot, volume 7426 of Lecture Notes in Computer Science, pages 167–176. Springer Berlin Heidelberg.

    Antonelli, M., Igual, F., Ramos, F., and Traver, V. (2012b). Speeding up the log-polar transform with inexpensive parallel hardware: graphics units and multi-core architectures. Journal of Real-Time Image Processing, 21 October:1–18.

    Aytekin, M. and Rucci, M. (2012). Motion parallax from microscopic head movements during visual fixation. Vision Research, 70:7–17.

    Azad, P., Asfour, T., and Dillmann, R. (2007). Stereo-based 6d object localization for grasping with humanoid robot systems. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 919–924.

    Bertero, M., Poggio, T. A., and Torre, V. (1988). Ill-posed problems in early vision. Proceedings of the IEEE, 76(8):869–889.

    Calinon, S., Guenter, F., and Billard, A. (2007). On learning, representing, and generalizing a task in a humanoid robot. IEEE T. Syst. Man Cy. B, 37(2):286–98.

    Chinellato, E., Antonelli, M., Grzyb, B. J., and del Pobil, A. P. (2011). Implicit sensorimotor mapping of the peripersonal space by gazing and reaching. Autonomous Mental Development, IEEE Transactions on, 3:43–53.

    Collins, T., Doré-Mazars, K., and Lappe, M. (2007). Motor space structures perceptual space: Evidence from human saccadic adaptation. Brain Research, 1172:32–39.

    Davison, A. J., Reid, I. D., Molton, N. D., and Stasse, O. (2007). MonoSLAM: Real-time single camera SLAM. Pattern Anal. Machine Intell., IEEE Trans. on, 29(6):1052–1067.

    Ernst, M. O. and Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–433.

    Ferreira, J. F., Lobo, J., Bessiere, P., Castelo-Branco, M., and Dias, J. (2013). A Bayesian framework for active artificial perception. Cybernetics, IEEE Transactions on, 43(2):699–711.

    Fiser, J., Berkes, P., Orbán, G., and Lengyel, M. (2010). Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences, 14(3):119–130.

    Freeman, T. C. A., Champion, R. A., and Warren, P. A. (2010). A Bayesian model of perceived head-centered velocity during smooth pursuit eye movement. Current Biology, 20(8):757–762.

    Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51(7):771–781.

    Goodale, M. and Westwood, D. (2004). An evolving view of duplex vision: separate but interacting cortical pathways for perception and action. Curr. Opin. Neurobiol., 14(2):203–211.

    Grzyb, B., Chinellato, E., Morales, A., and del Pobil, A. P. (2009). A 3d grasping system based on multimodal visual and tactile processing. Ind. Robot, 36(4):365–369.

    Hoffmann, H., Schenck, W., and Möller, R. (2005). Learning visuomotor transformations for gaze-control and grasping. Biol. Cybern., 93(2):119–130.

    Hongler, M.-O., de Meneses, Y. L., Beyeler, A., and Jacot, J. (2003). The resonant retina: exploiting vibration noise to optimally detect edges in an image. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(9):1051–1062.

    Knill, D. C. and Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719.

    Ko, H. K., Poletti, M., and Rucci, M. (2010). Microsaccades precisely relocate gaze in a high visual acuity task. Nature Neuroscience, 13:1549–1553.

    Kuang, X., Poletti, M., Victor, J. D., and Rucci, M. (2012). Temporal encoding of spatial information during active visual fixation. Current Biology, 22(6):510–514. PMCID: PMC3332095.

    Lappe, M. (2009). What is adapted in saccadic adaptation? The Journal of Physiology, 587(1):5–5.

    McBride, S., Law, J., and Lee, M. (2010). Integration of active vision and reaching from a developmental robotics perspective. IEEE Trans. Auton. Ment. Dev., 2(4):355–366.

    McLaughlin, S. C. (1967). Parametric adjustment in saccadic eye movements. Perception & Psychophysics, 2(8):359–362.

    Poletti, M., Listorti, C., and Rucci, M. (2013). Microscopic eye movements compensate for nonhomogeneous vision within the fovea. Current Biology, 23(17):1691–1695.

    Porrill, J., Dean, P., and Anderson, S. R. (2013). Adaptive filters and internal models: Multilevel description of cerebellar function. Neural Networks, 47:134–149.

    Rasolzadeh, B. and Björkman, M. (2010). An active vision system for detecting, fixating and manipulating objects in the real world. Int. J. Robot. Res., 29(2-3):1–40.

    Santini, F. and Rucci, M. (2007). Active estimation of distance in a robotic system that replicates human eye movement. Robot. Auton. Syst., 55(2):107–121.

    Schrater, P. R. and Kersten, D. (2000). How optimal depth cue integration depends on the task. International Journal of Computer Vision, 40(1):71–89.

    Shibata, T., Vijayakumar, S., Conradt, J., and Schaal, S. (2001). Biomimetic oculomotor control. Adapt. Behav., 9(3-4):189–207.

    Stocker, A. A. and Simoncelli, E. P. (2006). Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, 9(4):578–585.

    Wolpert, D. M., Ghahramani, Z., and Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269(5232):1880–1882.

    Wörgötter, F., Agostini, A., Krüger, N., Shylo, N., and Porr, B. (2009). Cognitive agents: a procedural perspective relying on the predictability of Object-Action-Complexes (OACs). Robot. Auton. Syst., 57(4):420–432.

