Nikolay S. Filatov
Peter the Great St. Petersburg Polytechnic University (SPbPU), 29, Politekhnicheskaya ul., Saint Petersburg, 195251, Russia; Artificial Intelligence Software Engineer, Celsus (LLC «Medical Screening Systems»), 63, ul. Zhukovskogo, Saint Petersburg, 191014, Russia, ORCID: 0000-0002-0657-1256
UDC identifier: 004.896
EDN: UUJLHI
Abstract. 3D object detection plays a critical role in autonomous driving and robotics, where not only high accuracy but also prediction speed and resilience to sensor failures matter. Existing solutions, whether LiDAR-based or camera-only, often fail to meet all three requirements simultaneously. In this paper, we propose an improvement of a multimodal 3D object detection method based on a multimodal masked autoencoder operating in the latent feature space, and we develop task-specific masking and reconstruction strategies for it. Experiments on the nuScenes dataset demonstrate that the proposed approach outperforms previous performance-oriented solutions in accuracy (mAP, NDS), maintains high throughput (up to 8.23 Hz on an RTX 3060), and shows greater resilience to various sensor failure scenarios.
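To make the central idea of the abstract concrete, below is a minimal PyTorch sketch of a masked autoencoder applied to latent feature tokens rather than raw inputs. The class name LatentMAE, the layer sizes, and the 0.5 masking ratio are illustrative assumptions; this is not the paper's actual implementation or its task-specific masking strategy.

```python
# Hypothetical sketch: masked autoencoding over fused multimodal latent
# tokens (e.g., camera + LiDAR features). All names and hyperparameters
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class LatentMAE(nn.Module):
    def __init__(self, dim=256, depth=2, heads=8, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # learnable placeholder inserted at masked token positions
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)  # predicts the original latent feature

    def forward(self, tokens):
        # tokens: (B, N, C) latent features from the fused multimodal backbone
        B, N, C = tokens.shape
        num_mask = int(N * self.mask_ratio)
        # choose a random subset of token positions to mask in each sample
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, perm[:, :num_mask], True)
        # replace masked latents with the mask token, then decode
        inp = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, C), tokens)
        pred = self.head(self.decoder(inp))
        # reconstruction loss is computed only on the masked positions
        return ((pred - tokens.detach()) ** 2)[mask].mean()
```

In a full training pipeline, such a reconstruction loss would be added to the detection loss so that the fused latent features remain informative when parts of a modality are hidden, which is consistent with the robustness-to-sensor-failure goal stated above.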
Key words: 3D object detection, masked autoencoder, neural networks, autonomous driving
For citation: Filatov, N.S. (2025), "Multimodal masked autoencoder in latent space for 3D object detection", Robotics and Technical Cybernetics, vol. 13, no. 4, pp. 301–308, EDN: UUJLHI. (in Russian).
References
- Yan, J., Liu, Y., Sun, J., Jia, F. et al. (2023), “Cross modal transformer: Towards fast and robust 3d object detection”, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18268–18278, DOI: 10.1109/ICCV51070.2023.01675
- Filatov, N. and Potekhin, R. (2024), “Continuous Token Partitioning for Real-Time Multi-modal 3d Object Detection”, In International Conference on Neuroinformatics, Cham: Springer Nature Switzerland, pp. 426–437, DOI: 10.1007/978-3-031-80463-2_40
- Wang, H., Tang, H., Shi, S., Li, A. et al. (2023), “Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation”, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6792–6802, DOI: 10.1109/ICCV51070
- Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv, DOI: 10.48550/arXiv.1810.04805
- Bao, H., Dong, L., Piao, S., and Wei, F. (2022), “BEiT: BERT Pre-Training of Image Transformers”, arXiv, DOI: 10.48550/arXiv.2106.08254
- He, K., Chen, X., Xie, S., Li, Y. et al. (2022), “Masked autoencoders are scalable vision learners”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, DOI: 10.1109/CVPR52688.2022.01553
- Gao, P., Ma, T., Li, H., Lin, Z. et al. (2022), “ConvMAE: Masked Convolution Meets Masked Autoencoders”, arXiv, DOI: 10.48550/arXiv.2205.03892
- Xie, G., Li, Y., Qu, H. and Sun, Z. (2022), “Masked Autoencoder for Pre-Training on 3D Point Cloud Object Detection”, Mathematics, vol. 10, no. 19, 3549
- Chen, A., Zhang, K., Zhang, R., Wang, Z. et al. (2023), “Pimae: Point cloud and image interactive masked autoencoders for 3d object detection”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5291–5301, DOI: 10.1109/CVPR52729.2023.00512
- Zhang, Y., Chen, J. and Huang, D. (2024), “Cmae-3d: Contrastive masked autoencoders for self-supervised 3d object detection”, International Journal of Computer Vision, vol. 133, pp. 2783–2804, DOI: 10.1007/s11263-024-02313-2
- Liu, Z., Tang, H., Amini, A., Yang, X. et al. (2023), “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation”, In 2023 IEEE international conference on robotics and automation (ICRA), pp. 2774–2781, DOI: 10.1109/ICRA48891.2023.10160968
- Yang, Z., Chen, J., Miao, Z., Li, W. et al. (2022), “Deepinteraction: 3d object detection via modality interaction”, Advances in Neural Information Processing Systems, vol. 35, pp. 1992–2005, DOI: 10.48550/arXiv.2208.11112
- Yan, Y., Mao, Y. and Li, B. (2018), “Second: Sparsely embedded convolutional detection”, Sensors, vol. 18, no. 10, 3337, DOI: 10.3390/s18103337
- Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J. and Auli, M. (2022), “Data2vec: A general framework for self-supervised learning in speech, vision and language”, In International Conference on Machine Learning, pp. 1298–1312, DOI: 10.48550/arXiv.2202.03555
- Caesar, H., Bankiti, V., Lang, A. H., Vora, S. et al. (2019), “nuScenes: A multimodal dataset for autonomous driving”, arXiv, DOI: 10.48550/arXiv.1903.11027
Received 03.03.2025
Revised 24.04.2025
Accepted 31.08.2025