ZebraPoseNet: A Deep UNet-Based Framework with Attention Mechanisms for Animal Pose Estimation

Hongqian Yu

doi:10.54097/d4chza36

Keywords:

Animal Pose Estimation, SE-Enhanced UNet, Transfer Learning, Intelligent Environmental Sensing.

Abstract

Animal pose estimation plays a crucial role in wildlife monitoring, behavioral analysis, and conservation research. Zebras, however, present particular challenges: their striped coats are visually complex, they are frequently occluded in natural environments, and annotated datasets are scarce. Conventional deep learning models such as UNet struggle with feature extraction and keypoint localization under these conditions, which reduces accuracy and limits generalization. In this study, we propose an improved UNet model that incorporates Squeeze-and-Excitation (SE) attention mechanisms and transfer learning to improve keypoint detection accuracy. The SE blocks reweight feature channels so that the model emphasizes informative features while suppressing background noise, and the transfer learning approach leverages knowledge from larger animal pose datasets to improve performance on limited zebra data. We evaluate the model on a custom-labeled zebra dataset and optimize it with a hybrid loss function. Experimental results show that our approach reduces the mean per-keypoint error (MPKE) by 15% compared to the baseline UNet model, highlighting its effectiveness in real-world applications.
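To make the SE mechanism named in the abstract concrete, the following is a minimal sketch of a single squeeze-and-excitation gate in the general form described by Hu et al. (2018), written in plain Python on nested lists. It is an illustration only and not the authors' implementation: the function name `se_block`, the toy weights `w1` and `w2`, the omission of biases, and the 2-channel input are all assumptions, and the abstract does not state where the SE blocks sit inside the UNet.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a squeeze-and-excitation gate to a C x H x W feature map.

    feature_map: nested lists indexed as [channel][row][col].
    w1: (C/r) x C reduction weights; w2: C x (C/r) expansion weights.
    Biases are omitted to keep the sketch short (an assumption).
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input (hypothetical): 2 channels, 2 x 2 spatial grid, 1 hidden unit.
features = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean activation 1.0
    [[0.0, 0.0], [0.0, 0.0]],  # channel 1, mean activation 0.0
]
w1 = [[1.0, 1.0]]     # hidden = relu(1*1.0 + 1*0.0) = 1.0
w2 = [[2.0], [-2.0]]  # gates = sigmoid(2.0), sigmoid(-2.0)
scaled, gates = se_block(features, w1, w2)
```

Each gate is a single scalar per channel, so the block rescales whole channels rather than individual pixel locations. In a trained network the two weight matrices would be learned jointly with the rest of the model.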
