ZebraPoseNet: A Deep UNet-Based Framework with Attention Mechanisms for Animal Pose Estimation
DOI:
https://doi.org/10.54097/d4chza36
Keywords:
Animal Pose Estimation, SE-Enhanced UNet, Transfer Learning, Intelligent Environmental Sensing.
Abstract
Animal pose estimation plays a crucial role in wildlife monitoring, behavioral analysis, and conservation research. Zebras, however, pose particular challenges: visually complex striped patterns, frequent occlusion in natural environments, and a scarcity of annotated datasets. Conventional deep learning models such as UNet struggle with feature extraction and keypoint localization under these conditions, which reduces accuracy and limits generalization. In this study, we propose an improved UNet model that incorporates Squeeze-and-Excitation (SE) attention mechanisms and transfer learning to enhance keypoint detection accuracy. The SE blocks reweight feature channels so that the model emphasizes informative features while suppressing background noise, and transfer learning leverages knowledge from larger animal pose datasets to improve performance on the limited zebra data. We evaluate the model on a custom-labeled zebra dataset and optimize it with a hybrid loss function. Experimental results show that our approach reduces the mean per-keypoint error (MPKE) by 15% compared to the baseline UNet model, highlighting its effectiveness in real-world applications.
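As an illustration of the attention mechanism named in the abstract, the following is a minimal sketch of a generic squeeze-and-excitation gate in plain Python. It is not the authors' implementation: the function name `se_block`, the toy feature map, the weight values, and the reduction ratio are all hypothetical, and biases are omitted. The sketch only shows the three steps of the standard SE formulation (squeeze, excitation, scale) on a tiny input.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation gating to a feature map.

    feature_map: list of C channels, each an H x W list of floats.
    w1: (C/r) x C weights of the bottleneck layer (hypothetical values).
    w2: C x (C/r) weights of the expansion layer (hypothetical values).
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w1
    ]
    # Excitation, step 2: expansion layer followed by a sigmoid,
    # producing one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input (assumed for illustration): 2 channels of size 2 x 2,
# reduction ratio r = 2, so the bottleneck has a single hidden unit.
features = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.5, 0.5], [0.5, 0.5]],  # channel 1
]
w1 = [[1.0, -1.0]]    # shape 1 x 2
w2 = [[1.0], [-1.0]]  # shape 2 x 1

scaled, gates = se_block(features, w1, w2)
print(gates)
```

With these toy weights the first channel receives a gate well above 0.5 and the second a gate well below it, so the first channel is preserved while the second is attenuated. In a trained network the weights are learned, and the gating would typically be applied to the feature maps of the UNet encoder or decoder stages; where exactly the paper places its SE blocks is not stated in the abstract.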
License
Copyright (c) 2025 Highlights in Science, Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







