Svetlana R. Orlova
Peter the Great Saint Petersburg Polytechnic University (SPbPU), Engineer-Researcher, 29, Politekhnicheskaya ul., Saint Petersburg, 195251, Russia, tel.: +7(911)005-31-30
Alexander V. Lopota
Doctor of Technical Sciences, Associate Professor, Russian State Scientific Center for Robotics and Technical Cybernetics (RTC), Director and Chief Designer, 21, Tikhoretsky pr., Saint Petersburg, 194064, Russia, tel.: +7(812)552-01-10, ORCID: 0000-0001-8095-9905
Received 7 October 2021
Abstract
The article discusses the problem of scene recognition for mobile robotics. The subtasks that have to be solved to achieve a high-level understanding of the environment are considered. The basis is an understanding of the geometry and semantics of the scene, which can be decomposed into the subtasks of robot localization, mapping and semantic analysis. Simultaneous localization and mapping (SLAM) techniques have already been applied successfully and, although some problems remain unresolved for dynamic environments, they do not pose a major obstacle here. The focus of the work is on semantic analysis of the scene, which involves three-dimensional segmentation. The field of 3D segmentation, like image segmentation, has been split into semantic and instance (object) segmentation, which runs counter to the needs of many potential applications. At present, however, panoptic segmentation is beginning to develop; it combines the two and describes the scene most completely. The paper reviews methods of 3D panoptic segmentation and identifies promising approaches. Open problems of scene recognition are also discussed. There is a clear trend towards complex incremental methods of metric-semantic SLAM, which combine segmentation with SLAM, and towards the use of scene graphs, which describe the geometry and semantics of scene elements and the relationships between them. Scene graphs are especially promising for mobile robotics, since they provide a transition from low-level representations of objects and spaces (for example, segmented point clouds) to a description of the scene at a high level of abstraction close to a human one (a list of objects in the scene, their properties and their locations relative to each other).
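To make the scene-graph representation mentioned above more concrete, below is a minimal Python sketch, not taken from the article or any particular library: nodes store an object's semantic label and 3D centroid, edges store the spatial relation between two objects, and a small helper renders the graph as human-readable statements. All class, field and relation names are illustrative assumptions.

# Minimal sketch of a 3D scene graph (illustrative only, not the article's method).
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectNode:
    node_id: int
    label: str                            # semantic class, e.g. "chair"
    centroid: Tuple[float, float, float]  # position in the map frame, metres
    attributes: List[str] = field(default_factory=list)

@dataclass
class SceneGraph:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    # edges: (subject_id, object_id) -> spatial relation, e.g. "standing on"
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def relate(self, subject_id: int, object_id: int, relation: str) -> None:
        self.edges[(subject_id, object_id)] = relation

    def describe(self) -> List[str]:
        """Render the graph as human-readable statements about the scene."""
        return [
            f"{self.nodes[s].label} {rel} {self.nodes[o].label}"
            for (s, o), rel in self.edges.items()
        ]

# Usage: two segmented objects and one spatial relation between them.
graph = SceneGraph()
graph.add_object(ObjectNode(0, "cup", (1.2, 0.4, 0.8)))
graph.add_object(ObjectNode(1, "table", (1.0, 0.5, 0.0)))
graph.relate(0, 1, "standing on")
print(graph.describe())  # ['cup standing on table']

In a real metric-semantic SLAM pipeline the nodes would be produced by 3D panoptic segmentation of the reconstructed map, but the high-level description returned by describe() is the kind of human-readable scene summary the abstract refers to.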
Key words
Mobile robotics, machine vision, computer vision, panoptic segmentation, SLAM, scene graph.
Acknowledgements
The study was carried out with the financial support of RFBR within the framework of research project No. 20-37-90039.
DOI
10.31776/RTCJ.10102
Bibliographic description
Orlova, S. and Lopota, A., 2022. Scene recognition for confined spaces in mobile robotics: current state and tendencies. Robotics and Technical Cybernetics, 10(1), pp.14-24.
UDC identifier:
004.896:004.832