Posts by Collection

portfolio

Portfolio item number 1

Published: January 19, 2026

Portfolio item number 2

Published: January 19, 2026

publications

Hardware Architecture for High Radix Adaptive CORDIC Algorithm

Published in National Institute of Technology Karnataka Surathkal, 2015

This thesis presents a hardware architecture for an adaptive CORDIC (COordinate Rotation DIgital Computer) algorithm. The proposed architecture implements a high radix variant that allows for faster convergence compared to traditional fixed radix implementations. The design demonstrates efficient computation of trigonometric, hyperbolic, and other mathematical functions using shift and add operations optimized for FPGA implementation.

BibTeX

@mastersthesis{shah2015hardware,
  title={Hardware Architecture for High Radix Adaptive CORDIC Algorithm},
  author={Shah, Ankit and Oza, Saharsh Samir and Thokala, Tarun and Gujjar, Pratik and David, Sumam},
  school={National Institute of Technology Karnataka Surathkal},
  year={2015}
}

Pipelined implementation of high radix adaptive CORDIC as a coprocessor

Published in 2015 International Conference on Computing and Network Communications (CoCoNet), 2015

The Coordinate Rotational Digital Computer (CORDIC) algorithm allows computation of trigonometric, hyperbolic, natural log and square root functions. This iterative algorithm uses only shift and add operations to converge. Multiple fixed radix variants of the algorithm have been implemented on hardware. These have demonstrated faster convergence at the expense of reduced accuracy. High radix adaptive variants of CORDIC also exist in literature. These allow for faster convergence at the expense of hardware multipliers in the datapath without compromising on the accuracy of the results. This paper proposes a 12 stage deep pipeline architecture to implement a high radix adaptive CORDIC algorithm. It employs floating point multipliers in place of the conventional shift and add architecture of fixed radix CORDIC. This design has been synthesised on a FPGA board to act as a coprocessor. The paper also studies the power, latency and accuracy of this implementation.

BibTeX

@inproceedings{oza2015pipelined,
  title={Pipelined implementation of high radix adaptive CORDIC as a coprocessor},
  author={Oza, Saharsh Samir and Shah, Ankit P and Thokala, Tarun and David, Sumam},
  booktitle={2015 International Conference on Computing and Network Communications (CoCoNet)},
  pages={333--342},
  year={2015},
  organization={IEEE}
}

Repeatability and Scalability of Code at Top level Verification

Published in Regional Engineering Conference 2016, 2016

This paper discusses methodologies and best practices for ensuring repeatable and scalable verification code at the top level of hardware design verification. We present techniques for maintaining consistency across verification runs while enabling the codebase to scale with increasing design complexity.

BibTeX

@inproceedings{shah2016repeatability,
  title={Repeatability and Scalability of Code at Top level Verification},
  author={Shah, Ankit and Bhat, Ajith and Mantri, Rashmin and Saxena, Saurabh and Rishiraj and Shanavas},
  booktitle={Regional Engineering Conference},
  year={2016}
}

Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording

Published in IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events, 2016

In this paper we present our work on Task 1 Acoustic Scene Classification and Task 3 Sound Event Detection in Real Life Recordings. Among our experiments we have low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved the baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6% and for Task 3 we achieved a Segment-Based Error Rate of 0.76 compared to the baseline of 0.91.

BibTeX

@inproceedings{elizalde2016experiments,
  title={Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording},
  author={Elizalde, Benjamin and Kumar, Anurag and Shah, Ankit and Badlani, Rohan and Vincent, Emmanuel and Raj, Bhiksha and Lane, Ian},
  booktitle={IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events},
  year={2016}
}

An approach for self-training audio event detectors using web data

Published in arXiv preprint arXiv:1609.06026, 2016

We propose a self-training approach for audio event detectors that leverages web data. Our method automatically collects and curates training data from the web, enabling the development of robust audio event detection systems without requiring extensive manual annotation. This early work laid the foundation for subsequent research on weakly-supervised and web-based learning for audio understanding.

BibTeX

@article{shah2016approach,
  title={An approach for self-training audio event detectors using web data},
  author={Shah, Ankit and Badlani, Rohan and Kumar, Anurag and Elizalde, Benjamin and Raj, Bhiksha},
  journal={arXiv preprint arXiv:1609.06026},
  year={2016}
}

An Approach for Self Training Audio Event Detectors using Web Data

Published in 25th European Signal Processing Conference (EUSIPCO), 2017

Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in number of samples and hence it is difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset and unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and ran on the unlabeled audio downloaded from YouTube. Whenever the detectors recognized any of the known sounds with high confidence, the unlabeled audio was used to re-train the detectors. The performance of the re-trained detectors is compared to the one from the original detectors using the annotated test set. Results showed an improvement of the AED, and uncovered challenges of using web audio from videos.

BibTeX

@inproceedings{shah2017approach,
  title={An Approach for Self-Training Audio Event Detectors Using Web Data},
  author={Shah, Ankit and Badlani, Rohan and Kumar, Anurag and Elizalde, Benjamin and Raj, Bhiksha},
  booktitle={25th European Signal Processing Conference (EUSIPCO)},
  pages={1863--1867},
  year={2017},
  organization={IEEE}
}

DCASE 2017 challenge setup: Tasks, datasets and baseline system

Published in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

BibTeX

@inproceedings{mesaros2017dcase,
  title={DCASE 2017 challenge setup: Tasks, datasets and baseline system},
  author={Mesaros, Annamaria and Heittola, Toni and Diment, Aleksandr and Elizalde, Benjamin and Shah, Ankit and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas},
  booktitle={DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events},
  year={2017}
}

NELS - Never-Ending Learner of Sounds

Published in Machine Learning for Audio Signal Processing Workshop, NIPS 2017, 2017

Sounds are essential to how humans perceive and interact with the world. These sounds are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we’ve ever seen. However, most of these recordings have undescribed content making necessary methods for automatic audio content analysis, indexing and retrieval. These methods have to address multiple challenges, such as the relation between sounds and language, numerous and diverse sound classes, and large-scale evaluation. We propose a system that continuously learns from the web relations between sounds and language, improves sound recognition models over time and evaluates its learning competency in the large-scale without references. We introduce the Never-Ending Learner of Sounds (NELS), a project for continuously learning of sounds and their associated knowledge.

BibTeX

@inproceedings{elizalde2017nels,
  title={NELS - Never-Ending Learner of Sounds},
  author={Elizalde, Benjamin and Badlani, Rohan and Shah, Ankit and Kumar, Anurag and Raj, Bhiksha},
  booktitle={Machine Learning for Audio Signal Processing Workshop, NIPS 2017},
  year={2017}
}

Content-based Representations of audio using Siamese neural networks

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

In this paper, we focus on the problem of content-based retrieval for audio, which aims to retrieve all semantically similar audio recordings for a given audio clip query. We propose a novel approach which encodes the audio into a vector representation using Siamese Neural Networks. The goal is to obtain an encoding similar for files belonging to the same audio class, thus allowing retrieval of semantically similar audio. We used two similarity measures, Cosine similarity and Euclidean distance, to show that our method is effective in retrieving files similar in audio content. Our results indicate that our neural network-based approach is able to retrieve files similar in content and semantics.

BibTeX

@inproceedings{manocha2018content,
  title={Content-based Representations of Audio Using Siamese Neural Networks},
  author={Manocha, Pranay and Badlani, Rohan and Kumar, Anurag and Shah, Ankit and Elizalde, Benjamin and Raj, Bhiksha},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3136--3140},
  year={2018},
  organization={IEEE}
}

Framework for evaluation of sound event detection in web videos

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

The largest source of sound events is web videos. Most videos lack sound event labels at segment level, however, a significant number of them do respond to text queries, from a match found to their metadata by the search engine. In this paper we explore the extent to which a search query could be used as the true label for the presence of sound events in the videos. For this, we developed a framework for large-scale sound event recognition on web videos. The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets. The datasets are used to train three classifiers, which were then run on 3.7 million video segments. We evaluated performance using the search query as the true label and compare it (on a subset) with human labeling. Both types exhibited close performance, to within 10%, and similar performance trends as the number of evaluated segments increased. Hence, our experiments show potential for using search query as a preliminary true label for sound events in web videos.

BibTeX

@inproceedings{badlani2018framework,
  title={Framework for Evaluation of Sound Event Detection in Web Videos},
  author={Badlani, Rohan and Shah, Ankit and Elizalde, Benjamin and Kumar, Anurag and Raj, Bhiksha},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={761--765},
  year={2018},
  organization={IEEE}
}

A Closer Look at Weak Label Learning for Audio Events

Published in arXiv preprint arXiv:1804.09288, 2018

Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and availability of large scale weakly labeled dataset have finally opened up the possibility of large scale AED. However, a deeper understanding of how weak labels affect the learning for sound events is still missing from literature. In this work, we first describe a CNN based approach for weakly supervised training of audio events. The approach follows some basic design principle desirable in a learning method relying on weakly labeled audio. We then describe important characteristics which naturally arise in weakly supervised learning of sound events. We show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and corruption of labels affects weakly supervised training for audio events. We also study the feasibility of directly obtaining weak labeled data from the web without any manual label and compare it with a dataset which has been manually labeled. The analysis and understanding of these factors should be taken into picture in the development of future weak label learning methods. Audioset, a large scale weakly labeled dataset for sound events is used in our experiments.

BibTeX

@article{shah2018closer,
  title={A Closer Look at Weak Label Learning for Audio Events},
  author={Shah, Ankit and Kumar, Anurag and Hauptmann, Alexander G and Raj, Bhiksha},
  journal={arXiv preprint arXiv:1804.09288},
  year={2018}
}

Natural Language Person Search Using Deep Reinforcement Learning

Published in arXiv preprint arXiv:1809.00365, 2018

We present a method for person search in videos using natural language descriptions and deep reinforcement learning. Our approach formulates person search as a sequential decision-making problem where an agent learns to efficiently navigate through video frames to locate individuals matching natural language queries. The reinforcement learning framework enables the model to develop effective search strategies that balance exploration and exploitation, resulting in faster and more accurate person retrieval compared to exhaustive search baselines.

BibTeX

@article{shah2018natural,
  title={Natural Language Person Search Using Deep Reinforcement Learning},
  author={Shah, Ankit and Vuong, Tyler},
  journal={arXiv preprint arXiv:1809.00365},
  year={2018}
}

Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset

Published in arXiv preprint arXiv:1809.00241, 2018

We present a large-scale study on activity recognition using the Moments in Time dataset, which contains over one million short video clips spanning a diverse set of human activities. Our approach combines visual and temporal features to classify activities in 3-second video clips. We analyze the challenges of recognizing fine-grained activities at scale and propose methods for handling the inherent class imbalance and visual complexity present in real-world video data.

BibTeX

@article{shah2018activity,
  title={Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset},
  author={Shah, Ankit and Kesavamoorthy, Harini and Rane, Poorva and Kalwad, Pramati and Hauptmann, Alexander and Metze, Florian},
  journal={arXiv preprint arXiv:1809.00241},
  year={2018}
}

CADP: A Novel Dataset for CCTV Traffic Camera based Accident Analysis

Published in IEEE International Workshop on Traffic and Street Surveillance for Safety and Security (AVSS), 2018

This paper presents a novel dataset for traffic accidents analysis. Our goal is to resolve the lack of public data for research about automatic spatio-temporal annotations for traffic safety in the roads. Through the analysis of the proposed dataset, we observed a significant degradation of object detection in pedestrian category in our dataset, due to the object sizes and complexity of the scenes. To this end, we propose to integrate contextual information into conventional Faster R-CNN using Context Mining (CM) and Augmented Context Mining (ACM) to complement the accuracy for small pedestrian detection. Our experiments indicate a considerable improvement in object detection accuracy: +8.51% for CM and +6.20% for ACM. Finally, we demonstrate the performance of accident forecasting in our dataset using Faster R-CNN and an Accident LSTM architecture. We achieved an average of 1.684 seconds in terms of Time-To-Accident measure with an Average Precision of 47.25%.

BibTeX

@inproceedings{shah2018cadp,
  title={CADP: A Novel Dataset for CCTV Traffic Camera based Accident Analysis},
  author={Shah, Ankit and Lamare, Jean Baptiste and Nguyen-Anh, Tuan and Hauptmann, Alexander},
  booktitle={2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)},
  pages={1--9},
  year={2018},
  organization={IEEE}
}

Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments

Published in Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 2018

This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are Youtube video excerpts from domestic context which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events) and potential industrial applications.

BibTeX

@inproceedings{serizel2018large,
  title={Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments},
  author={Serizel, Romain and Turpault, Nicolas and Eghbal-Zadeh, Hamid and Shah, Ankit Parag},
  booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)},
  pages={19--23},
  year={2018}
}

Tartan: A retrieval-based socialbot powered by a dynamic finite-state machine architecture

Published in 2nd Proceedings of Alexa Prize (Alexa Prize 2018), 2018

This paper describes the Tartan conversational agent built for the 2018 Alexa Prize Competition. Tartan is a non-goal-oriented socialbot focused around providing users with an engaging and fluent casual conversation. Tartan’s key features include an emphasis on structured conversation based on flexible finite-state models and an approach focused on understanding and using conversational acts. To provide engaging conversations, Tartan blends script-like yet dynamic responses with data-based generative and retrieval models. Unique to Tartan is that our dialog manager is modeled as a dynamic Finite State Machine. To our knowledge, no other conversational agent implementation has followed this specific structure.

BibTeX

@article{larionov2018tartan,
  title={Tartan: A Retrieval-Based Socialbot Powered by a Dynamic Finite-State Machine Architecture},
  author={Larionov, George and Kaden, Zachary and Dureddy, Hima Varsha and Kalejaiye, Gabriel Bayomi T and Kale, Mihir and Potharaju, Srividya Pranavi and Shah, Ankit Parag and Rudnicky, Alexander I},
  journal={arXiv preprint arXiv:1812.01260},
  year={2018}
}

Learning Sound Events From Webly Labeled Data

Published in 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019

In the last couple of years, weakly labeled learning for sound events has turned out to be an exciting approach for audio event detection. In this work, we introduce webly labeled learning for sound events in which we aim to remove human supervision altogether from the learning process. We first develop a method of obtaining labeled audio data from the web (albeit noisy), in which no manual labeling is involved. We then describe deep learning methods to efficiently learn from these webly labeled audio recordings. In our proposed system, WeblyNet, two deep neural networks co-teach each other to robustly learn from webly labeled data, leading to around 17% relative improvement over the baseline method. The method also involves transfer learning to obtain efficient representations.

BibTeX

@inproceedings{kumar2019learning,
  title={Learning Sound Events From Webly Labeled Data},
  author={Kumar, Anurag and Shah, Ankit and Hauptmann, Alexander G and Raj, Bhiksha},
  booktitle={Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI)},
  pages={2772--2778},
  year={2019}
}

Multimodal Behavioral Markers Exploring Suicidal Intent in Social Media Videos

Published in 21st ACM International Conference on Multimodal Interaction (ICMI), 2019

BibTeX

@inproceedings{shah2019multimodal,
  title={Multimodal Behavioral Markers Exploring Suicidal Intent in Social Media Videos},
  author={Shah, Ankit and Sharma, Vasu and Vaibhav, Vaibhav and Alismail, Mahmoud and Morency, Louis-Philippe},
  booktitle={2019 International Conference on Multimodal Interaction (ICMI)},
  pages={409--413},
  year={2019},
  organization={ACM}
}

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Published in Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a follow-up to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. We introduce the Domestic Environment Sound Event Detection (DESED) dataset, mixing a part of last year’s dataset and an additional synthetic, strongly labeled, dataset provided this year that we describe in more detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

BibTeX

@inproceedings{turpault2019sound,
  title={Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis},
  author={Turpault, Nicolas and Serizel, Romain and Shah, Ankit Parag and Salamon, Justin},
  booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)},
  pages={253--257},
  year={2019}
}

CMU Sounds for COVID Project

Published in CMU Language Technologies Institute, 2020

The CMU Sounds for COVID Project is a research initiative aimed at developing audio-based detection methods for COVID-19. The project collects and analyzes voice and respiratory sounds to identify acoustic biomarkers that may be indicative of COVID-19 infection. This work contributes to the broader effort of developing non-invasive screening tools for respiratory illnesses using machine learning and signal processing techniques.

BibTeX

@misc{raj2020cmusounds,
  title={CMU Sounds for COVID Project},
  author={Raj, Bhiksha and Singh, Rita and Shah, Ankit and Striner, Benjamin and Memon, Shahan Ali and Sanil, Vedant and Deshmukh, Soham and Kangrui, Ruan and Al Ismail, Mahmoud and Dhamyal, Hira and Sakr, Majd and Wolfe, Nicholas and Ur, Shmuel},
  howpublished={\url{https://node.dev.cvd.lti.cmu.edu/}},
  year={2020}
}

Sound event detection in synthetic domestic environments

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020

We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.

BibTeX

@inproceedings{serizel2020sound,
  title={Sound Event Detection in Synthetic Domestic Environments},
  author={Serizel, Romain and Turpault, Nicolas and Shah, Ankit and Salamon, Justin},
  booktitle={2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={86--90},
  year={2020},
  organization={IEEE}
}

Training image classifiers using Semi-Weak Label Data

Published in arXiv preprint arXiv:2103.10608, 2021

In Multiple Instance learning (MIL), weak labels are provided at the bag level with only presence/absence information known. However, there is a considerable gap in performance in comparison to a fully supervised model, limiting the practical applicability of MIL approaches. Thus, this paper introduces a novel semi-weak label learning paradigm as a middle ground to mitigate the problem. We define semi-weak label data as data where we know the presence or absence of a given class and the exact count of each class as opposed to knowing the label proportions. We then propose a two-stage framework to address the problem of learning from semi-weak labels. It leverages the fact that counting information is non-negative and discrete. Experiments are conducted on generated samples from CIFAR-10. We compare our model with a fully-supervised setting baseline, a weakly-supervised setting baseline and learning from proportion (LLP) baseline. Our framework not only outperforms both baseline models for MIL-based weakly supervised setting and learning from proportion setting, but also gives comparable results compared to the fully supervised model.

BibTeX

@article{zhang2021training,
  title={Training Image Classifiers Using Semi-Weak Label Data},
  author={Zhang, Anxiang and Shah, Ankit and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2103.10608},
  year={2021}
}

Feature extraction and evaluation for BioMedical Question Answering

Published in arXiv preprint arXiv:2105.14013, 2021

This work presents a comprehensive study on feature extraction techniques for biomedical question answering tasks. We explore various feature representations including traditional NLP features and modern transformer-based embeddings for handling complex biomedical terminology and domain-specific language. Our evaluation covers multiple benchmark datasets and provides insights into the most effective feature extraction strategies for improving QA performance in the biomedical domain.

BibTeX

@article{shah2021feature,
  title={Feature Extraction and Evaluation for BioMedical Question Answering},
  author={Shah, Ankit and Singh, Srishti and Tao, Shih-Yen},
  journal={arXiv preprint arXiv:2105.14013},
  year={2021}
}

An overview of techniques for biomarker discovery in voice signal

Published in arXiv preprint arXiv:2110.04678, 2021

This paper provides a comprehensive survey of voice-based biomarker discovery methods. We review the current state of research on extracting health-related information from voice signals, including techniques for detecting various medical conditions such as respiratory diseases, neurological disorders, and mental health issues. The survey covers signal processing methods, machine learning approaches, and deep learning techniques that have been applied to identify acoustic biomarkers in human voice.

BibTeX

@article{singh2021overview,
  title={An Overview of Techniques for Biomarker Discovery in Voice Signal},
  author={Singh, Rita and Shah, Ankit and Dhamyal, Hira},
  journal={arXiv preprint arXiv:2110.04678},
  year={2021}
}

Reasoning for Audio Visual Scene-Aware Dialog Track in DSTC10

Published in Dialog System Technology Challenge 10 (DSTC10), 2021

This paper presents our approach to the Audio Visual Scene-Aware Dialog (AVSD) track in DSTC10. We introduce reasoning methods that enable the dialog system to understand and respond to questions about video content by leveraging both audio and visual modalities. Our approach demonstrates improved temporal reasoning capabilities for grounding dialog responses in specific moments within the video.

BibTeX

@inproceedings{geng2021reasoning,
  title={Reasoning for Audio Visual Scene-Aware Dialog Track in DSTC10},
  author={Geng, Shijie and Gao, Peng and Cherian, Anoop and Marks, Tim K and Hori, Chiori and Shah, Ankit},
  booktitle={Dialog System Technology Challenge 10 (DSTC10)},
  year={2021}
}

Triple Attention Network architecture for MovieQA

Published in arXiv preprint arXiv:2111.09531, 2021

We propose a Triple Attention Network architecture for question answering on movies. Our model incorporates three attention mechanisms that jointly attend to visual frames, subtitles, and plot descriptions to answer complex questions about movie content. The triple attention mechanism enables effective reasoning across multiple modalities, achieving competitive performance on the MovieQA benchmark.

BibTeX

@article{shah2021triple,
  title={Triple Attention Network Architecture for MovieQA},
  author={Shah, Ankit and Lin, Tzu-Hsiang and Wu, Shijie},
  journal={arXiv preprint arXiv:2111.09531},
  year={2021}
}

Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10

Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022

The Audio-Visual Scene-Aware Dialog (AVSD) task was proposed in the Dialog System Technology Challenge (DSTC), where an AVSD dataset was collected and AVSD technologies were developed. An AVSD challenge track was hosted at both the 7th and 8th DSTCs (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, a third AVSD challenge is proposed, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and the new extension of the AVSD dataset for DSTC10, for which human-generated temporal reasoning data were collected. A baseline system was built using an AV-transformer and the new datasets were released for the challenge.

BibTeX

@inproceedings{hori2022overview,
  title={Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10},
  author={Hori, Chiori and Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Le Roux, Jonathan and Marks, Tim K},
  booktitle={Proceedings of DSTC10 Workshop at AAAI-2022},
  year={2022}
}

DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022

We participated in the third challenge for the Audio-Visual Scene-Aware Dialog (AVSD) task in DSTC10. The target of the task was updated by two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. The baseline system built using an AV-transformer was released along with the new dataset including temporal reasoning for DSTC10-AVSD. This paper introduces a new system that extends the baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performances on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network (RPN).

BibTeX

@inproceedings{shah2022dstc10,
  title={DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning},
  author={Shah, Ankit Parag and Hori, Takaaki and Le Roux, Jonathan and Hori, Chiori},
  booktitle={Proceedings of DSTC10 Workshop at AAAI-2022},
  year={2022}
}

Ontological Learning from Weak Labels

Published in arXiv preprint arXiv:2203.02483, 2022

Ontologies encompass a formal representation of knowledge through the definition of concepts or properties of a domain, and the relationships between those concepts. In this work, we seek to investigate whether using this ontological information will improve learning from weakly labeled data, which are easier to collect since it requires only the presence or absence of an event to be known. We use the AudioSet ontology and dataset, which contains audio clips weakly labeled with the ontology concepts and the ontology providing the “Is A” relations between the concepts. We first re-implemented the model proposed by soundevent_ontology with modification to fit the multi-label scenario and then expand on that idea by using a Graph Convolutional Network (GCN) to model the ontology information to learn the concepts. We find that the baseline Siamese does not perform better by incorporating ontology information in the weak and multi-label scenario, but that the GCN does capture the ontology knowledge better for weak, multi-labeled data.

BibTeX

@article{tang2022ontological,
  title={Ontological Learning from Weak Labels},
  author={Tang, Larry and Chou, Po Hao and Zheng, Yi Yu and Ge, Ziqian and Shah, Ankit and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2203.02483},
  year={2022}
}

On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice

Published in arXiv preprint arXiv:2204.04802, 2022

Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers rather it can be successfully done with just standard features and simple binary classifiers. In fact, we show that the latter is not only more accurate and interpretable and also more computationally efficient in that they can be run locally on small devices. We demonstrate this from a human-curated dataset collected and calibrated in clinical settings. On this dataset which comprises over 1000 speakers, a simple binary classifier is able to achieve 94% detection accuracy.

BibTeX

@article{shah2022pragmatism,
  title={On the Pragmatism of Using Binary Classifiers Over Data Intensive Neural Network Classifiers for Detection of COVID-19 from Voice},
  author={Shah, Ankit and Dhamyal, Hira and Gao, Yang and Singh, Rita and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2204.04802},
  year={2022}
}

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Published in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performances on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.

BibTeX

@inproceedings{shah2022avsd, title={Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning}, author={Shah, Ankit P and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Marks, Tim K and Le Roux, Jonathan and Hori, Chiori}, booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={7732--7736}, year={2022}, organization={IEEE} }

Automated Audio Captioning and Language-Based Audio Retrieval

Published in arXiv preprint arXiv:2207.04156, 2022

This work addresses two complementary tasks: automated audio captioning, which generates natural language descriptions for audio clips, and language-based audio retrieval, which retrieves audio clips based on textual queries. We present a unified framework that leverages cross-modal learning between audio and text representations to enable both caption generation and retrieval capabilities. Our approach demonstrates strong performance on standard benchmarks for both tasks.

BibTeX

@article{gomes2022automated,
  title={Automated Audio Captioning and Language-Based Audio Retrieval},
  author={Gomes, Clive and Park, Hyejin and Kollman, Patrick and Song, Yi and Houndayi, Iffanice and Shah, Ankit},
  journal={arXiv preprint arXiv:2207.04156},
  year={2022}
}

Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms

Published in arXiv preprint arXiv:2303.03591, 2023

We propose a novel approach to learning generalized audio representations through batch embedding covariance regularization combined with Constant-Q transforms. Our method encourages the model to learn diverse and decorrelated features by regularizing the covariance matrix of embedding vectors within each batch. We demonstrate that using Constant-Q transforms as input features, which provide logarithmically-spaced frequency resolution similar to human auditory perception, combined with our regularization technique leads to more robust and transferable audio representations across various downstream tasks.

BibTeX

@article{shah2023approach,
  title={Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms},
  author={Shah, Ankit and Chen, Shuyi and Zhou, Kejun and Chen, Yue and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2303.03591},
  year={2023}
}

Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

Published in arXiv preprint arXiv:2303.09048, 2023

Voice over Internet Protocol (VoIP) platforms have become essential for modern communication, yet audio quality often suffers due to network conditions, background noise, and codec limitations. This study explores signal processing and deep learning improvements for VoIP communications, focusing on perceptual quality enhancement, speech intelligibility improvement, and acoustic optimization. We present methods that can be deployed in real-time to significantly improve the user experience in voice calls.

BibTeX

@article{konan2023improving,
  title={Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms},
  author={Konan, Joseph and Bhargave, Ojas and Agnihotri, Shikhar and Lee, Hojeong and Shah, Ankit and Han, Shuo and Zeng, Yunyang and Shu, Amanda and Liu, Haohui and Chang, Xuankai and others},
  journal={arXiv preprint arXiv:2303.09048},
  year={2023}
}

DCASE task 7: Foley sound synthesis

Published in DCASE 2023 Challenge Technical Report, 2023

This technical report describes our submission to DCASE 2023 Task 7 on Foley sound synthesis. The task challenges participants to develop systems capable of synthesizing realistic Foley sounds given semantic class labels. We present our approach using neural audio synthesis techniques and discuss the challenges and lessons learned from the competition.

BibTeX

@techreport{pillay2023dcase,
  title={DCASE Task 7: Foley Sound Synthesis},
  author={Pillay, Ashwin and Betko, Sage and Liloia, Ari and Chen, Hao and Shah, Ankit},
  institution={DCASE 2023 Challenge},
  year={2023}
}

An approach to ontological learning from weak labels

Published in ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

We present an approach to ontological learning from weak labels that leverages hierarchical relationships between audio event classes. Using the AudioSet ontology as our testbed, we demonstrate that incorporating structural knowledge about how concepts relate to each other can improve learning from weakly labeled data. Our method uses graph neural networks to model the ontological relationships and shows improved performance on both parent and child concepts in the audio event hierarchy.

BibTeX

@inproceedings{shah2023approach,
  title={An Approach to Ontological Learning from Weak Labels},
  author={Shah, Ankit and Tang, Larry and Chou, Po Hao and Zheng, Yi Yu and Ge, Ziqian and Raj, Bhiksha},
  booktitle={ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Understanding and mitigating the label noise in pre-training on downstream tasks

Published in arXiv preprint arXiv:2309.17002, 2023

Pre-training on large-scale datasets has become the foundation of modern deep learning, but these datasets often contain significant label noise. This work explores how label noise in pre-training data affects model performance on downstream tasks. We provide theoretical and empirical analysis of the noise propagation mechanism and propose mitigation strategies that can improve the robustness of pre-trained models when transferred to clean downstream tasks.

BibTeX

@article{chen2023understanding,
  title={Understanding and Mitigating the Label Noise in Pre-Training on Downstream Tasks},
  author={Chen, Hao and Wang, Jiahao and Shah, Ankit and Tao, Ran and Wei, Hongxin and Xie, Xing and Sugiyama, Masashi and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2309.17002},
  year={2023}
}

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

Published in arXiv preprint arXiv:2309.04641, 2023

Foley sound synthesis is the art of creating sound effects that match on-screen actions in film and video production. This paper explores domain-specific enhancements for neural Foley sound synthesis systems. We investigate architectural modifications, training strategies, and conditioning mechanisms that can improve the quality and realism of synthesized sounds for specific categories like footsteps, impacts, and environmental sounds.

BibTeX

@article{pillay2023exploring,
  title={Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer},
  author={Pillay, Ashwin and Betko, Sage and Liloia, Ari and Chen, Hao and Shah, Ankit},
  journal={arXiv preprint arXiv:2309.04641},
  year={2023}
}

Online Active Learning For Sound Event Detection

Published in arXiv preprint arXiv:2309.14460, 2023

We propose an online active learning framework for sound event detection that efficiently selects the most informative samples for annotation. Our approach enables continuous model improvement with minimal human labeling effort by identifying uncertain or novel audio segments in streaming data. The method is designed for practical deployment scenarios where labeled data is scarce and annotation resources are limited.

BibTeX

@article{lindsey2023online,
  title={Online Active Learning For Sound Event Detection},
  author={Lindsey, Mark and Shah, Ankit and Kubala, Francis and Stern, Richard M},
  journal={arXiv preprint arXiv:2309.14460},
  year={2023}
}

LoFT: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model

Published in arXiv preprint arXiv:2310.04445, 2023

Large Language Models (LLMs) have demonstrated remarkable capabilities but remain vulnerable to adversarial attacks. This paper introduces LoFT (Local proxy Fine-Tuning), a method for improving the transferability of adversarial attacks against LLMs. By fine-tuning a local proxy model to better approximate the target model’s behavior, we can craft more effective transferable adversarial examples. Our approach provides insights into LLM vulnerabilities and can inform the development of more robust models.

BibTeX

@article{shah2023loft,
  title={LoFT: Local Proxy Fine-Tuning for Improving Transferability of Adversarial Attacks Against Large Language Model},
  author={Shah, Muhammad Awais and Sharma, Rishabh and Dhamyal, Hira and Olivier, Raphael and Shah, Ankit and Konan, Joseph and Alharthi, Dareen and Shirol, Hazim Taha and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2310.04445},
  year={2023}
}

Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

Published in arXiv preprint arXiv:2310.07161, 2023

Speech enhancement algorithms for Voice over IP (VoIP) platforms must consider not only objective metrics but also psychoacoustic factors that affect human perception. This paper discusses the psychoacoustic challenges that arise when deploying speech enhancement systems on VoIP platforms, including issues related to perceptual quality, listening comfort, and cognitive load. We provide guidelines for developing speech enhancement systems that are optimized for human listeners in real-world communication scenarios.

BibTeX

@article{konan2023psychoacoustic,
  title={Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms},
  author={Konan, Joseph and Agnihotri, Shikhar and Bhargave, Ojas and Han, Shuo and Zeng, Yunyang and Shah, Ankit and Raj, Bhiksha},
  journal={arXiv preprint arXiv:2310.07161},
  year={2023}
}

Audio-visual fine-tuning of audio-only ASR models

Published in arXiv preprint arXiv:2312.09369, 2023

This paper investigates using visual information from lip movements to fine-tune audio-only automatic speech recognition (ASR) models. We demonstrate that incorporating visual cues during fine-tuning can improve ASR performance, particularly in noisy conditions where audio quality is degraded. Our approach enables leveraging the complementary nature of audio and visual modalities without requiring visual features at inference time.

BibTeX

@article{may2023audiovisual,
  title={Audio-Visual Fine-Tuning of Audio-Only ASR Models},
  author={May, Avner and Serdyuk, Dmitriy and Shah, Ankit Parag and Braga, Otavio and Siohan, Olivier},
  journal={arXiv preprint arXiv:2312.09369},
  year={2023}
}

Overview of the Tenth Dialog System Technology Challenge: DSTC10

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

This paper provides a comprehensive overview of the Tenth Dialog System Technology Challenge (DSTC10). The challenge featured five tracks covering various aspects of dialog systems including knowledge-grounded dialog, audio-visual scene-aware dialog, and task-oriented dialog. We describe the task definitions, datasets, evaluation metrics, and summarize the approaches and results from participating teams.

BibTeX

@article{yoshino2024overview,
  title={Overview of the Tenth Dialog System Technology Challenge: DSTC10},
  author={Yoshino, Koichiro and Chen, Yun-Nung and Crook, Paul and Kottur, Satwik and Li, Jinchao and Hedayatnia, Behnam and Moon, Seungwhan and Fei, Zhengcong and Li, Zekang and Zhang, Jinchao and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={32},
  pages={765--778},
  year={2024},
  publisher={IEEE}
}

Conformer is All You Need for Visual Speech Recognition

Published in ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Visual speech recognition (VSR) aims to transcribe speech from silent video of a speaker’s face. In this work, we demonstrate that the Conformer architecture, which combines convolutions with self-attention, is highly effective for VSR tasks. Our experiments show that a properly configured Conformer model achieves state-of-the-art performance on standard VSR benchmarks, suggesting that the architectural innovations developed for audio speech recognition transfer well to the visual domain.

BibTeX

@inproceedings{chang2024conformer,
  title={Conformer is All You Need for Visual Speech Recognition},
  author={Chang, Oscar and Liao, Hank and Serdyuk, Dmitriy and Shah, Ankit and Siohan, Olivier},
  booktitle={ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  pages={10501--10505},
  year={2024},
  organization={IEEE}
}

Importance of negative sampling in weak label learning

Published in ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Weak label learning involves training models when only presence or absence of classes is known at the bag level, without precise temporal or spatial boundaries. This paper analyzes the importance of negative sampling strategies in weak label learning. We show that careful selection of negative examples significantly impacts model performance and propose principled approaches for constructing effective negative samples. Our experiments demonstrate substantial improvements on audio event detection benchmarks.

BibTeX

@inproceedings{shah2024importance,
  title={Importance of Negative Sampling in Weak Label Learning},
  author={Shah, Ankit and Tang, Fuyu and Ye, Zelin and Singh, Rita and Raj, Bhiksha},
  booktitle={ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  pages={7530--7534},
  year={2024},
  organization={IEEE}
}

Harnessing business and media insights with large language models

Published in arXiv preprint arXiv:2406.06559, 2024

This paper presents applications of large language models for extracting actionable insights from business and media data. We demonstrate how LLMs can be leveraged to analyze news articles, financial reports, and social media content to identify trends, sentiments, and emerging topics relevant to business decision-making. Our system provides automated summarization, entity extraction, and relationship mapping capabilities that enable faster and more comprehensive analysis of large document collections.

BibTeX

@article{bao2024harnessing,
  title={Harnessing Business and Media Insights with Large Language Models},
  author={Bao, Yujia and Shah, Ankit Parag and Narang, Neeru and Rivers, Jonathan and Maksey, Rajeev and Guan, Lan and Barrere, Louise N and Evenson, Shelley and Basole, Rahul and Miao, Connie and others},
  journal={arXiv preprint arXiv:2406.06559},
  year={2024}
}

Automatic dataset construction (ADC): Sample collection, data curation, and beyond

Published in arXiv preprint arXiv:2408.11338, 2024

Building high-quality datasets is crucial for training machine learning models, but manual curation is expensive and time-consuming. This paper presents Automatic Dataset Construction (ADC), a framework for automating the process of sample collection, data curation, and quality control. We describe methods for automatic data filtering, deduplication, and annotation that can significantly reduce the human effort required to create large-scale training datasets while maintaining data quality.

BibTeX

@article{liu2024automatic,
  title={Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond},
  author={Liu, Minghao and Di, Zonglin and Wei, Jiahao and Wang, Zixin and Zhang, Haotian and Xiao, Runze and Wang, Hongru and Pang, Jiajun and Chen, Hao and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2408.11338},
  year={2024}
}

LLM Unlearning via Loss Adjustment with Only Forget Data

Published in arXiv preprint arXiv:2410.11143, 2024

Machine unlearning enables the removal of specific information from trained models without full retraining. This paper proposes a method for unlearning in large language models (LLMs) that requires only access to the data to be forgotten, not the original training data. Our approach uses loss adjustment techniques to selectively reduce the model’s ability to generate or recall specific information while preserving performance on other tasks.

BibTeX

@article{wang2024llm,
  title={LLM Unlearning via Loss Adjustment with Only Forget Data},
  author={Wang, Yaxuan and Wei, Jiahao and Liu, Chenyu and Pang, Jiajun and Liu, Qinghao and Shah, Ankit Parag and Bao, Yujia and Liu, Yang and Wei, Wei},
  journal={arXiv preprint arXiv:2410.11143},
  year={2024}
}

Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection

Published in arXiv preprint arXiv:2410.03904, 2024

Audio anomaly detection is crucial for applications ranging from industrial monitoring to smart home security. However, progress in this field is limited by the lack of comprehensive benchmark datasets. This paper introduces AADG (Audio Anomaly Data Generator), a framework for generating synthetic benchmark data for audio anomaly detection. Our framework allows researchers to create customized datasets with controllable anomaly types, signal-to-noise ratios, and background conditions.

BibTeX

@article{raghavan2024did,
  title={Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection},
  author={Raghavan, Ksheeraja and Gode, Samiran and Shah, Ankit and Raghavan, Surabhi and Burgard, Wolfram and Raj, Bhiksha and Singh, Rita},
  journal={arXiv preprint arXiv:2410.03904},
  year={2024}
}

Improving data efficiency via curating LLM-driven rating systems

Published in arXiv preprint arXiv:2410.10877, 2024

Training data quality is critical for machine learning model performance. This paper presents methods for curating training data using LLM-driven rating systems that can assess data quality and relevance. Our approach improves data efficiency by identifying and prioritizing high-quality samples, reducing the amount of data needed to achieve target performance levels while avoiding the costs of manual data annotation.

BibTeX

@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={arXiv preprint arXiv:2410.10877},
  year={2024}
}

Computational Audition with Imprecise Labels

Published in Carnegie Mellon University PhD Thesis, 2024

This doctoral dissertation presents research on computational audition techniques that can learn from imprecise labels. The work addresses fundamental challenges in audio understanding when training data has weak, noisy, or partial annotations. The thesis covers methods for sound event detection, audio classification, and representation learning that are robust to various forms of label imprecision, enabling practical deployment of audio AI systems where perfectly labeled data is unavailable.

BibTeX

@phdthesis{shah2024computational,
  title={Computational Audition with Imprecise Labels},
  author={Shah, Ankit Parag},
  school={Carnegie Mellon University},
  year={2024}
}

Imprecise label learning: A unified framework for learning with various imprecise label configurations

Published in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024

Learning from imprecise labels is a fundamental challenge in machine learning where labels may be partial, noisy, or uncertain. This paper proposes a unified framework for learning with various imprecise label configurations, including partial labels, complementary labels, noisy labels, and weak labels. Our theoretical analysis establishes connections between different imprecise label settings, and we develop practical algorithms that can handle multiple types of label imprecision simultaneously.

BibTeX

@inproceedings{chen2024imprecise,
  title={Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations},
  author={Chen, Hao and Shah, Ankit and Wang, Jiahao and Tao, Ran and Wang, Yidong and Li, Xing and Xie, Xing and Sugiyama, Masashi and Singh, Rita and Raj, Bhiksha},
  booktitle={Advances in Neural Information Processing Systems 37 (NeurIPS 2024)},
  year={2024}
}

Enhancing Retrieval for ESGLLM via ESG-CID–A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS

Published in arXiv preprint arXiv:2503.10674, 2025

Environmental, Social, and Governance (ESG) reporting requires mapping disclosures across different frameworks such as GRI (Global Reporting Initiative) and ESRS (European Sustainability Reporting Standards). This paper introduces ESG-CID, a Disclosure Content Index finetuning dataset designed to enhance retrieval capabilities for ESG-focused language models. Our dataset enables better mapping between different sustainability reporting frameworks, facilitating more accurate and efficient ESG data analysis.

BibTeX

@article{ahmed2025enhancing,
  title={Enhancing Retrieval for ESGLLM via ESG-CID--A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS},
  author={Ahmed, Shafiuddin Rehan and Shah, Ankit Parag and Tran, Quan Hung and Khetan, Vivek and Kang, Sukryool and Mehta, Ankit and Bao, Yujia and Wei, Wei},
  journal={arXiv preprint arXiv:2503.10674},
  year={2025}
}

ProRefine: Inference-time Prompt Refinement with Textual Feedback

Published in arXiv preprint arXiv:2506.05305, 2025

Effective prompt engineering is crucial for eliciting desired behaviors from large language models. This paper introduces ProRefine, a method for refining prompts at inference time using textual feedback. Our approach iteratively improves prompts based on the model’s outputs and explicit feedback signals, enabling better task performance without modifying the underlying model weights.

BibTeX

@article{pandita2025prorefine,
  title={ProRefine: Inference-time Prompt Refinement with Textual Feedback},
  author={Pandita, Deepak and Weerasooriya, Tharindu Cyril and Shah, Ankit Parag and Homan, Christopher M and Wei, Wei},
  journal={arXiv preprint arXiv:2506.05305},
  year={2025}
}

Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings

Published in arXiv preprint arXiv:2506.20609, 2025

This paper presents methods for classifying firearm types through acoustic analysis of gunshot recordings. We develop a hierarchical classification system that can distinguish between different categories of firearms based on their acoustic signatures. Our analysis reveals distinctive spectral and temporal features that enable reliable firearm identification, with applications in forensic audio analysis and public safety systems.

BibTeX

@article{shah2025deciphering,
  title={Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings},
  author={Shah, Ankit and Singh, Rita and Raj, Bhiksha and Hauptmann, Alexander},
  journal={arXiv preprint arXiv:2506.20609},
  year={2025}
}

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Published in arXiv preprint arXiv:2508.20453, 2025

As LLM agents become more capable of using tools, robust benchmarks are needed to evaluate their performance on complex real-world tasks. This paper introduces MCP-Bench, a benchmark for evaluating tool-using LLM agents through the Model Context Protocol (MCP). Our benchmark includes diverse tasks that require agents to interact with multiple MCP servers, handle multi-step workflows, and manage state across extended interactions.

BibTeX

@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zihan and Chang, Qiyuan and Patel, Harsh and Biju, Sonia and Wu, Chenghan E and Liu, Qinghao and Ding, Aoyuan and Rezazadeh, Arman and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}

Small Language Models: Architecture, Evolution, and the Future of Artificial Intelligence

Published in Preprints, 2026

This work surveys the landscape of Small Language Models (SLMs), defined as having fewer than 15 billion parameters. We introduce a novel multi-axis taxonomy to categorize these models by their genesis, architecture, and optimization goals. Our research demonstrates that state-of-the-art SLMs match or exceed larger models in specialized domains like mathematical reasoning and code generation, while suggesting the future of AI involves hybrid ecosystems where specialized SLMs manage most tasks locally, escalating complex queries to cloud-based LLMs.

BibTeX

@article{shah2026small,
  title={Small Language Models: Architecture, Evolution, and the Future of Artificial Intelligence},
  author={Shah, Ankit Parag and Hosseini, Mohammad-Parsa and Park, Su Min and Miao, Connie and Wei, Wei},
  journal={Preprints},
  year={2026},
  doi={10.20944/preprints202601.0973}
}

talks

A Framework towards Large scale Learning of Sound Events

Published: August 01, 2017

A Closer Look at Weak Label Learning of Sound Events

Published: July 01, 2018

Learning from Weak Labels

Published: September 18, 2022

Computational Audition with Imprecise Labels

Published: October 11, 2024

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

Teaching experience 2

Workshop, University 1, Department, 2015

volunteer

Teaching Assistant

Published: January 01, 2013

Teaching Assistant

Published: January 01, 2014

Mentor at Junior Academy

Published: March 16, 2017

Contributor/Organizer for Task 4 DCASE 2017 Challenge

Published: November 16, 2017

Organizer for Task 4 DCASE 2018 Challenge

Published: March 01, 2018

Organizer for Task 4 DCASE 2019 Challenge

Published: March 01, 2019