Unlocking AI's Potential: Zero-Shot and Few-Shot Learning for Voice and Image Recognition

Adnan Ali; Shoaib Farooq; Muhammad Zeeshan Shafi; Muhammad Talha Tahir Bajwa; Jamil Ur Rehman; Hanifullah

doi:10.62019/kcsndc55

Authors

Adnan Ali, Department of Computer Science, National College of Business Administration & Economics Lahore, Sub Campus Bahawalpur, Pakistan.
Shoaib Farooq, National University of Computer and Emerging Sciences (NUCES), Islamabad.
Muhammad Zeeshan Shafi, Department of Computer Science, The Islamia University of Bahawalpur, Pakistan.
Muhammad Talha Tahir Bajwa, Department of Computer Science, University of Agriculture Faisalabad, Pakistan.
Jamil Ur Rehman, Group Head / Senior General Manager (IT), SSGCL, Department of IT.
Hanifullah, Department of Computer Science, Institute of Business Management Sciences, Agriculture University Peshawar, Pakistan.

Abstract

Zero-shot learning (ZSL) and few-shot learning (FSL) have emerged as promising answers to a central limitation of traditional deep models: their need for large volumes of labeled training data. This paper presents a comprehensive experimental study of zero-shot and few-shot learning methods applied to voice and image recognition tasks. We analyze and compare several state-of-the-art architectures, including CLIP, Whisper, and prototypical networks, on benchmark datasets such as ESC-50 for audio classification and mini-ImageNet for image recognition. The experiments evaluate how well models generalize when classes are unseen or sparsely represented during training. Our results show that multimodal models pre-trained on large-scale datasets perform strongly in zero-shot scenarios, whereas metric-based few-shot approaches achieve higher accuracy when only a small amount of supervision is available. We also examine cross-modal transfer, that is, how representations learned in one modality (e.g., voice) can be reused in another (e.g., images). The findings highlight critical trade-offs among model complexity, data efficiency, and recognition accuracy, and offer practical guidance for deploying lightweight, scalable AI systems in resource-limited settings. The paper advances the concept of generalized recognition systems and provides a basis for future research on low-resource learning. The main contributions of this work are a comparative study of ZSL and FSL methods, an analysis of cross-modal transfer, and the identification of key trade-offs that lay the foundation for future research in generalized and adaptive AI.
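
As an illustration of the metric-based few-shot approach mentioned in the abstract, the following is a minimal sketch of the classification rule used by prototypical networks: each class is represented by the mean of its support embeddings, and a query is assigned to the nearest prototype. It is not the authors' implementation. The embeddings are made-up 2-D vectors standing in for the output of a trained encoder, and the function names and class labels are hypothetical.

```python
from collections import defaultdict


def class_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for label, vector in support:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(
        prototypes,
        key=lambda label: squared_distance(query, prototypes[label]),
    )


# Toy 2-way 2-shot episode. The embeddings are invented for illustration;
# in practice they would come from a trained audio or image encoder.
support = [
    ("dog_bark", [0.9, 0.1]),
    ("dog_bark", [1.1, -0.1]),
    ("rain", [-1.0, 0.8]),
    ("rain", [-0.8, 1.2]),
]
prototypes = class_prototypes(support)
prediction = classify([0.7, 0.2], prototypes)
print(prediction)
```

In this toy episode the query lies much closer to the `dog_bark` prototype than to the `rain` prototype, so it is assigned to `dog_bark`. A full prototypical network would additionally train the encoder so that distances in the embedding space reflect class membership.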