Nikolay S. Filatov
Peter the Great St. Petersburg Polytechnic University (SPbPU), 29, Politekhnicheskaya ul., Saint Petersburg, 195251, Russia; Artificial Intelligence Software Engineer, Celsus (LLC «Medical Screening Systems»), 63, ul. Zhukovskogo, Saint Petersburg, 191014, Russia, ORCID: 0000-0002-0657-1256
UDC identifier: 004.896
EDN: UUJLHI
Abstract. 3D object detection plays a critical role in autonomous driving and robotics, where not only high accuracy but also prediction speed and resilience to sensor failures matter. Existing solutions, whether LiDAR-based or camera-only, often fail to meet all three requirements simultaneously. In this paper, we propose an improvement of a multimodal 3D object detection method based on a multimodal masked autoencoder operating in the latent feature space, and we develop task-specific masking and reconstruction strategies for it. Experiments on the nuScenes dataset demonstrate that the proposed approach outperforms previous performance-oriented solutions in accuracy (mAP, NDS), maintains high throughput (up to 8.23 Hz on an RTX 3060), and shows greater resilience to various sensor failure scenarios.
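To make the central idea of the abstract concrete, below is a minimal PyTorch sketch of a masked autoencoder applied to latent feature tokens rather than raw inputs. The class name LatentMAE, the layer sizes, and the 0.5 masking ratio are illustrative assumptions; this is not the paper's actual implementation or its task-specific masking strategy.

```python
# Hypothetical sketch: masked autoencoding over fused multimodal latent
# tokens (e.g., camera + LiDAR features). All names and hyperparameters
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class LatentMAE(nn.Module):
    def __init__(self, dim=256, depth=2, heads=8, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # learnable placeholder inserted at masked token positions
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)  # predicts the original latent feature

    def forward(self, tokens):
        # tokens: (B, N, C) latent features from the fused multimodal backbone
        B, N, C = tokens.shape
        num_mask = int(N * self.mask_ratio)
        # choose a random subset of token positions to mask in each sample
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, perm[:, :num_mask], True)
        # replace masked latents with the mask token, then decode
        inp = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, C), tokens)
        pred = self.head(self.decoder(inp))
        # reconstruction loss is computed only on the masked positions
        return ((pred - tokens.detach()) ** 2)[mask].mean()
```

In a full training pipeline, such a reconstruction loss would be added to the detection loss so that the fused latent features remain informative when parts of a modality are hidden, which is consistent with the robustness-to-sensor-failure goal stated above.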
Key words: 3D object detection, masked autoencoder, neural networks, autonomous driving
For citation: Filatov, N.S. (2025), "Multimodal masked autoencoder in latent space for 3D object detection", Robotics and Technical Cybernetics, vol. 13, no. 4, pp. 301–308, EDN: UUJLHI. (in Russian).
References
- Yan, J., Liu, Y., Sun, J., Jia, F. et al. (2023), “Cross modal transformer: Towards fast and robust 3d object detection”, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18268–18278, DOI: 10.1109/ICCV51070.2023.01675
- Filatov, N. and Potekhin, R. (2024), “Continuous Token Partitioning for Real-Time Multi-modal 3d Object Detection”, In International Conference on Neuroinformatics, Cham: Springer Nature Switzerland, pp. 426–437, DOI: 10.1007/978-3-031-80463-2_40
- Wang, H., Tang, H., Shi, S., Li, A. et al. (2023), “Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation”, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6792–6802, DOI: 10.1109/ICCV51070
- Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv, DOI: 10.48550/arXiv.1810.04805
- Bao, H., Dong, L., Piao, S., and Wei, F. (2022), “BEiT: BERT Pre-Training of Image Transformers”, arXiv, DOI: 10.48550/arXiv.2106.08254
- He, K., Chen, X., Xie, S., Li, Y. et al. (2022), “Masked autoencoders are scalable vision learners”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, DOI: 10.1109/CVPR52688.2022.01553
- Gao, P., Ma, T., Li, H., Lin, Z. et al. (2022), “ConvMAE: Masked Convolution Meets Masked Autoencoders”, arXiv, DOI: 10.48550/arXiv.2205.03892
- Xie, G., Li, Y., Qu, H. and Sun, Z. (2022), “Masked Autoencoder for Pre-Training on 3D Point Cloud Object Detection”, Mathematics, vol. 10, no. 19, 3549
- Chen, A., Zhang, K., Zhang, R., Wang, Z. et al. (2023), “Pimae: Point cloud and image interactive masked autoencoders for 3d object detection”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5291–5301, DOI: 10.1109/CVPR52729.2023.00512
- Zhang, Y., Chen, J. and Huang, D. (2024), “Cmae-3d: Contrastive masked autoencoders for self-supervised 3d object detection”, International Journal of Computer Vision, vol. 133, pp. 2783–2804, DOI: 10.1007/s11263-024-02313-2
- Liu, Z., Tang, H., Amini, A., Yang, X. et al. (2023), “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation”, In 2023 IEEE international conference on robotics and automation (ICRA), pp. 2774–2781, DOI: 10.1109/ICRA48891.2023.10160968
- Yang, Z., Chen, J., Miao, Z., Li, W. et al. (2022), “Deepinteraction: 3d object detection via modality interaction”, Advances in Neural Information Processing Systems, vol. 35, pp. 1992–2005, DOI: 10.48550/arXiv.2208.11112
- Yan, Y., Mao, Y. and Li, B. (2018), “Second: Sparsely embedded convolutional detection”, Sensors, vol. 18, no. 10, 3337, DOI: 10.3390/s18103337
- Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J. and Auli, M. (2022), “Data2vec: A general framework for self-supervised learning in speech, vision and language”, In International Conference on Machine Learning, pp. 1298–1312, DOI: 10.48550/arXiv.2202.03555
- Caesar, H., Bankiti, V., Lang, A. H., Vora, S. et al. (2019), “nuScenes: A multimodal dataset for autonomous driving”, arXiv, DOI: 10.48550/arXiv.1903.11027
Received 03.03.2025
Revised 24.04.2025
Accepted 31.08.2025