Publications

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Published in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

In previous work, we proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.

[Paper Link]

@article{shah2021audio, title={Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning}, author={Shah, Ankit P and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Marks, Tim K and Le Roux, Jonathan and Hori, Chiori}, journal={arXiv preprint arXiv:2110.06894}, year={2021} }

ProRefine: Inference-time Prompt Refinement with Textual Feedback

Published in arXiv preprint arXiv:2506.05305, 2025

Work exploring prompt refinement techniques for large language models using feedback at inference time.

Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Christopher M. Homan, Wei Wei. "ProRefine: Inference-time Prompt Refinement with Textual Feedback." arXiv preprint arXiv:2506.05305, 2025.

Enhancing Retrieval for ESGLLM via ESG-CID–A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS

Published in arXiv preprint arXiv:2503.10674, 2025

Dataset enabling retrieval improvements for ESG-related language models.

Shafiuddin Rehan Ahmed, Ankit Parag Shah, Quan Hung Tran, Vivek Khetan, Sukryool Kang, Ankit Mehta, Yujia Bao, Wei Wei. "Enhancing Retrieval for ESGLLM via ESG-CID--A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS." arXiv preprint arXiv:2503.10674, 2025.

Computational Audition with Imprecise Labels

Published in Carnegie Mellon University PhD Thesis, 2024

Doctoral dissertation on computational audition techniques using imprecise labels.

Ankit Parag Shah. "Computational Audition with Imprecise Labels." PhD thesis, Carnegie Mellon University, 2024.

Improving data efficiency via curating LLM-driven rating systems

Published in arXiv preprint arXiv:2410.10877, 2024

Paper presenting rating system curation strategies for large language models.

Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei. "Improving data efficiency via curating LLM-driven rating systems." arXiv preprint arXiv:2410.10877, 2024.

Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection

Published in arXiv preprint arXiv:2410.03904, 2024

Framework for generating benchmark datasets for audio anomaly detection.

Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh. "Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection." arXiv preprint arXiv:2410.03904, 2024.

LLM Unlearning via Loss Adjustment with Only Forget Data

Published in arXiv preprint arXiv:2410.11143, 2024

Method for removing information from large language models using only data to forget.

Y Wang, J Wei, C.Y. Liu, J Pang, Q Liu, A.P. Shah, Y Bao, Y Liu, W Wei, "LLM Unlearning via Loss Adjustment with Only Forget Data", arXiv preprint arXiv:2410.11143, 2024.

Automatic dataset construction (ADC): Sample collection, data curation, and beyond

Published in arXiv preprint arXiv:2408.11338, 2024

Explores automated processes for building and refining large datasets.

M Liu, Z Di, J Wei, Z Wang, H Zhang, R Xiao, H Wang, J Pang, H Chen, ... "Automatic dataset construction (ADC): Sample collection, data curation, and beyond", arXiv preprint arXiv:2408.11338, 2024.

Harnessing business and media insights with large language models

Published in arXiv preprint arXiv:2406.06559, 2024

Application of large language models for extracting business and media insights.

Yujia Bao, Ankit Parag Shah, Neeru Narang, Jonathan Rivers, Rajeev Maksey, Lan Guan, Louise N Barrere, Shelley Evenson, Rahul Basole, Connie Miao, et al. "Harnessing business and media insights with large language models." arXiv preprint arXiv:2406.06559, 2024.

Importance of negative sampling in weak label learning

Published in ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Conference paper analyzing negative sampling strategies for weak label learning.

Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj. "Importance of negative sampling in weak label learning." In ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7530-7534, 2024.

Imprecise label learning: A unified framework for learning with various imprecise label configurations

Published in Advances in Neural Information Processing Systems 37, 2024

Comprehensive study proposing a unified framework for learning from imprecisely labeled data.

H Chen, A Shah, J Wang, R Tao, Y Wang, X Li, X Xie, M Sugiyama, ... "Imprecise label learning: A unified framework for learning with various imprecise label configurations", Advances in Neural Information Processing Systems 37, 2024.

Conformer is All You Need for Visual Speech Recognition

Published in ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

A study on applying Conformer architecture to visual speech recognition tasks.

O Chang, H Liao, D Serdyuk, A Shah, O Siohan, "Conformer is All You Need for Visual Speech Recognition," in ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024.

Overview of the Tenth Dialog System Technology Challenge: DSTC10

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

Comprehensive overview of the DSTC10 challenge and participating systems.

Koichiro Yoshino, Yun-Nung Chen, Paul Crook, Satwik Kottur, Jinchao Li, Behnam Hedayatnia, Seungwhan Moon, Zhengcong Fei, Zekang Li, Jinchao Zhang, et al. "Overview of the Tenth Dialog System Technology Challenge: DSTC10." IEEE/ACM Transactions on Audio, Speech, and Language Processing 32:765-778, 2023.

Audio-visual fine-tuning of audio-only ASR models

Published in arXiv preprint arXiv:2312.09369, 2023

Investigates using visual information to fine-tune audio-only automatic speech recognition models.

Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan. "Audio-visual fine-tuning of audio-only ASR models." arXiv preprint arXiv:2312.09369, 2023.

Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

Published in arXiv preprint arXiv:2310.07161, 2023

Discussion of psychoacoustic aspects affecting speech enhancement algorithms for VoIP.

Joseph Konan, Shikhar Agnihotri, Ojas Bhargave, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj. "Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms." arXiv preprint arXiv:2310.07161, 2023.

Loft: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model

Published in arXiv preprint arXiv:2310.04445, 2023

Research on enhancing adversarial attack transferability for large language models using local proxy fine-tuning.

M.A. Shah, R. Sharma, H. Dhamyal, R. Olivier, A. Shah, J. Konan, D. Alharthi, ... "Loft: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model", arXiv preprint arXiv:2310.04445, 2023.

Online Active Learning For Sound Event Detection

Published in arXiv preprint arXiv:2309.14460, 2023

Paper proposing an online active learning method for sound event detection.

Mark Lindsey, Ankit Shah, Francis Kubala, Richard M. Stern. "Online Active Learning For Sound Event Detection." arXiv preprint arXiv:2309.14460, 2023.

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

Published in arXiv preprint arXiv:2309.04641, 2023

Investigation of improvements for neural Foley sound synthesis systems.

Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah. "Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer." arXiv preprint arXiv:2309.04641, 2023.

Understanding and mitigating the label noise in pre-training on downstream tasks

Published in arXiv preprint arXiv:2309.17002, 2023

Work exploring strategies to handle label noise during model pre-training.

H Chen, J Wang, A Shah, R Tao, H Wei, X Xie, M Sugiyama, B Raj, "Understanding and mitigating the label noise in pre-training on downstream tasks", arXiv preprint arXiv:2309.17002, 2023.

DCASE task 7: Foley sound synthesis

Published in Technical Report, June 2023, 2023

Technical report describing the DCASE task on Foley sound synthesis.

Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah. "DCASE task 7: Foley sound synthesis." Technical Report, June 2023.

An approach to ontological learning from weak labels

Published in ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Conference paper introducing an ontological learning method using weak labels.

Ankit Shah, Larry Tang, Po Hao Chou, Yi Yu Zheng, Ziqian Ge, Bhiksha Raj. "An approach to ontological learning from weak labels." In ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1-5, 2023.

Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

Published in arXiv preprint arXiv:2303.09048, 2023

Study exploring signal processing improvements for VoIP communications.

Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Hojeong Lee, Ankit Shah, Shuo Han, Yunyang Zeng, Amanda Shu, Haohui Liu, Xuankai Chang, et al. "Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms." arXiv preprint arXiv:2303.09048, 2023.

Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms

Published in arXiv preprint arXiv:2303.03591, 2023

Technique for learning robust audio representations using embedding covariance regularization.

Ankit Shah, Shuyi Chen, Kejun Zhou, Yue Chen, Bhiksha Raj. "Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms." arXiv preprint arXiv:2303.03591, 2023.

Automated Audio Captioning and Language-Based Audio Retrieval

Published in arXiv preprint arXiv:2207.04156, 2022

Work on captioning audio clips and retrieving audio using language descriptions.

Clive Gomes, Hyejin Park, Patrick Kollman, Yi Song, Iffanice Houndayi, Ankit Shah. "Automated Audio Captioning and Language-Based Audio Retrieval." arXiv preprint arXiv:2207.04156, 2022.

On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice

Published in arXiv, 2022

Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sounds of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers; rather, it can be successfully done with just standard features and simple binary classifiers. In fact, we show that the latter are not only more accurate and interpretable but also more computationally efficient, in that they can be run locally on small devices. We demonstrate this on a human-curated dataset collected and calibrated in clinical settings. On this dataset, which comprises over 1000 speakers, a simple binary classifier is able to achieve 94% detection accuracy.

[Paper Link] [Webpage]

@article{shah2022pragmatism,title={On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice}, author={Shah, Ankit and Dhamyal, Hira and Gao, Yang and Singh, Rita and Raj, Bhiksha}, journal={arXiv preprint arXiv:2204.04802}, year={2022}}

DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022

We participated in the third challenge for the Audio-Visual Scene-Aware Dialog (AVSD) task in DSTC10. The target of the task was updated with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. The baseline system built using an AV-transformer was released along with the new dataset including temporal reasoning for DSTC10-AVSD. This paper introduces a new system that extends the baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network (RPN). We confirmed that our system outperformed the baseline system and the previous state of the art on the AVSD test sets for DSTC7, DSTC8, and DSTC10. Furthermore, the temporal reasoning using RPN outperformed the attention method of the baseline system.

[Paper Link] [Webpage]

@inproceedings{shah2022dstc10, title={DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning}, author={Shah, Ankit Parag and Hori, Takaaki and Le Roux, Jonathan and Hori, Chiori}}

Ontological Learning from Weak Labels

Published in arXiv, 2022

Ontologies encompass a formal representation of knowledge through the definition of concepts or properties of a domain, and the relationships between those concepts. In this work, we investigate whether using this ontological information improves learning from weakly labeled data, which are easier to collect since they require only the presence or absence of an event to be known. We use the AudioSet ontology and dataset, which contains audio clips weakly labeled with the ontology concepts, with the ontology providing the "Is A" relations between the concepts. We first re-implemented the model proposed by soundevent_ontology with modifications to fit the multi-label scenario and then expanded on that idea by using a Graph Convolutional Network (GCN) to model the ontology information to learn the concepts. We find that the baseline Siamese does not perform better by incorporating ontology information in the weak and multi-label scenario, but that the GCN does capture the ontology knowledge better for weak, multi-labeled data. In our experiments, we also investigate how different modules can tolerate noise introduced by weak labels and better incorporate ontology information. Our best Siamese-GCN model achieves mAP=0.45 and AUC=0.87 for lower-level concepts and mAP=0.72 and AUC=0.86 for higher-level concepts, which is an improvement over the baseline Siamese but about the same as our models that do not use ontology information.

[Paper Link] [Webpage]

@article{tang2022ontological, title={Ontological Learning from Weak Labels}, author={Tang, Larry and Chou, Po Hao and Zheng, Yi Yu and Ge, Ziqian and Shah, Ankit and Raj, Bhiksha}, journal={arXiv preprint arXiv:2203.02483}, year={2022}}

Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10

Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022

The Audio-Visual Scene-Aware Dialog (AVSD) task was proposed in the Dialog System Technology Challenge (DSTC), where an AVSD dataset was collected and AVSD technologies were developed. An AVSD challenge track was hosted at both the 7th and 8th DSTCs (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, a third AVSD challenge is proposed, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and the new extension of the AVSD dataset for DSTC10, for which human-generated temporal reasoning data were collected. A baseline system was built using an AV-transformer and the new datasets were released for the challenge. Finally, this paper reports the challenge results of 12 systems submitted to the AVSD task in DSTC10. The two systems using a GPT-2-based multimodal transformer achieved the best performance for human rating, BLEU4, and CIDEr. The temporal reasoning performed by those systems outperformed the baseline method with temporal attention.

[Paper Link] [Webpage]

@article{horioverview, title={Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10}, author={Hori, Chiori and Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Le Roux, Jonathan and Marks, Tim K} }

Triple Attention Network architecture for MovieQA

Published in arXiv preprint arXiv:2111.09531, 2021

Model proposing a triple attention network for question answering on movies.

Ankit Shah, Tzu-Hsiang Lin, Shijie Wu. "Triple Attention Network architecture for MovieQA." arXiv preprint arXiv:2111.09531, 2021.

Reasoning for Audio Visual Scene-Aware Dialog Track in DSTC10

Published in DSTC10, 2021

Overview of reasoning approaches for the AVSD track in DSTC10.

Shijie Geng, Peng Gao, Anoop Cherian, Tim K. Marks, Chiori Hori, Ankit Shah. "Reasoning for Audio Visual Scene-Aware Dialog Track in DSTC10." DSTC10.

An overview of techniques for biomarker discovery in voice signal

Published in arXiv preprint arXiv:2110.04678, 2021

Survey of voice-based biomarker discovery methods.

Rita Singh, Ankit Shah, Hira Dhamyal. "An overview of techniques for biomarker discovery in voice signal." arXiv preprint arXiv:2110.04678, 2021.

Feature extraction and evaluation for BioMedical Question Answering

Published in arXiv preprint arXiv:2105.14013, 2021

Study on feature extraction techniques for biomedical QA tasks.

Ankit Shah, Srishti Singh, Shih-Yen Tao. "Feature extraction and evaluation for BioMedical Question Answering." arXiv preprint arXiv:2105.14013, 2021.

Training image classifiers using Semi-Weak Label Data

Published in arXiv, 2021

In Multiple Instance Learning (MIL), weak labels are provided at the bag level with only presence/absence information known. However, there is a considerable gap in performance in comparison to a fully supervised model, limiting the practical applicability of MIL approaches. Thus, this paper introduces a novel semi-weak label learning paradigm as a middle ground to mitigate the problem. We define semi-weak label data as data where we know the presence or absence of a given class and the exact count of each class, as opposed to knowing the label proportions. We then propose a two-stage framework to address the problem of learning from semi-weak labels. It leverages the fact that counting information is non-negative and discrete. Experiments are conducted on generated samples from CIFAR-10. We compare our model with a fully supervised setting baseline, a weakly supervised setting baseline, and a learning from proportion (LLP) baseline. Our framework not only outperforms both baseline models for the MIL-based weakly supervised setting and the learning from proportion setting, but also gives comparable results compared to the fully supervised model. Further, we conduct thorough ablation studies to analyze performance across datasets and its variation with batch size, losses, architectural changes, bag size, and regularization.

[Paper Link] [Webpage]

@article{zhang2021training,title={Training image classifiers using Semi-Weak Label Data},author={Zhang, Anxiang* and Shah, Ankit* and Raj, Bhiksha}, journal={arXiv preprint arXiv:2103.10608},year={2021}}

Sound event detection in synthetic domestic environments

Published in 45th International Conference on Acoustics, Speech, and Signal Processing 2020, 2020

We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.

[Paper Link] [Webpage]

Romain Serizel, Nicolas Turpault, Ankit Shah, Justin Salamon, "Sound event detection in synthetic domestic environments", 45th International Conference on Acoustics, Speech, and Signal Processing 2020

CMU sounds for covid project

Published in https://node.dev.cvd.lti.cmu.edu/, 2020

Project description for the CMU sounds for COVID initiative.

Bhiksha Raj, Rita Singh, Ankit Shah, Benjamin Striner, Shahan Ali Memon, Vedant Sanil, Soham Deshmukh, Ruan Kangrui, Mahmoud Al Ismail, Hira Dhamyal, Majd Sakr, Nicholas Wolfe, Dr. Shmuel Ur. "CMU sounds for covid project", 2020.

Multimodal Behavioral Markers Exploring Suicidal Intent in Social Media Videos

Published in 21st ACM International Conference on Multimodal Interaction 2019, 2019

Suicide is one of the leading causes of death in the modern world. In this digital age, individuals are increasingly using social media to express themselves and often use these platforms to express suicidal intent. Various studies have inspected behavioral markers of suicidal intent in controlled environments, but it remains unexplored whether such markers generalize to suicidal intent expressed on social media. In this work, we set out to study multimodal behavioral markers related to suicidal intent when expressed in social media videos. We explore verbal, acoustic, and visual behavioral markers in the context of identifying individuals at higher risk of a suicide attempt. Our analysis reveals a set of predominant multimodal behavioral markers indicative of suicidal intent in social media videos.

[Paper Link] [Webpage]

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Published in Detection and Classification of Acoustic Scenes and Events 2019, 2019

[Paper Link] [Webpage]

Nicolas Turpault, Romain Serizel, Ankit Shah, Justin Salamon, "Sound event detection in domestic environments with weakly labeled data and soundscape synthesis", Detection and Classification of Acoustic Scenes and Events 2019

Learning Sound Events From Webly Labeled Data

Published in 2019 International Joint Conference on Artificial Intelligence, 2018

In the last couple of years, weakly labeled learning for sound events has turned out to be an exciting approach for audio event detection. In this work, we introduce webly labeled learning for sound events in which we aim to remove human supervision altogether from the learning process. We first develop a method of obtaining labeled audio data from the web (albeit noisy), in which no manual labeling is involved. We then describe deep learning methods to efficiently learn from these webly labeled audio recordings. In our proposed system, WeblyNet, two deep neural networks co-teach each other to robustly learn from webly labeled data, leading to around 17% relative improvement over the baseline method. The method also involves transfer learning to obtain efficient representations.

[Paper Link] [Webpage]

Anurag Kumar, Ankit Shah, Alex Hauptmann, Bhiksha Raj, "Learning Sound Events From Webly Labeled Data", arXiv preprint, 2018

Tartan: A retrieval-based socialbot powered by a dynamic finite-state machine architecture

Published in 2nd Proceedings of Alexa Prize (Alexa Prize 2018), 2018

This paper describes the Tartan conversational agent built for the 2018 Alexa Prize Competition. Tartan is a non-goal-oriented socialbot focused around providing users with an engaging and fluent casual conversation. Tartan's key features include an emphasis on structured conversation based on flexible finite-state models and an approach focused on understanding and using conversational acts. To provide engaging conversations, Tartan blends script-like yet dynamic responses with data-based generative and retrieval models. Unique to Tartan is that our dialog manager is modeled as a dynamic Finite State Machine. To our knowledge, no other conversational agent implementation has followed this specific structure.

[Paper Link]

George Larionov, Zachary Kaden, Hima Varsha Dureddy, Gabriel Bayomi T. Kalejaiye, Mihir Kale, Srividya Pranavi Potharaju, Ankit Parag Shah, Alexander I Rudnicky, "Tartan: A retrieval-based socialbot powered by a dynamic finite-state machine architecture", 2nd Proceedings of Alexa Prize (Alexa Prize 2018).

Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments

Published in Detection and Classification of Acoustic Scenes and Events 2018, 2018

This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from domestic contexts, which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events, ...) and potential industrial applications.

[Paper Link] [Webpage]

Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, Ankit Parag Shah. "Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments", Detection and Classification of Acoustic Scenes and Events 2018

CADP: A Novel Dataset for CCTV Traffic Camera based Accident Analysis

Published in IEEE International Workshop on Traffic and Street Surveillance for Safety and Security, 2018, 2018

This paper presents a novel dataset for traffic accident analysis. Our goal is to resolve the lack of public data for research on automatic spatio-temporal annotations for traffic safety on the roads. Through the analysis of the proposed dataset, we observed a significant degradation of object detection in the pedestrian category in our dataset, due to the object sizes and complexity of the scenes. To this end, we propose to integrate contextual information into conventional Faster R-CNN using Context Mining (CM) and Augmented Context Mining (ACM) to improve the accuracy of small pedestrian detection. Our experiments indicate a considerable improvement in object detection accuracy: +8.51% for CM and +6.20% for ACM. Finally, we demonstrate the performance of accident forecasting on our dataset using Faster R-CNN and an Accident LSTM architecture. We achieved an average of 1.684 seconds in terms of the Time-To-Accident measure with an Average Precision of 47.25%.

[Paper Link] [Webpage]

Ankit Shah*, Jean Baptiste Lamare*, Tuan Nguyen Anh*, Alexander Hauptmann, "CADP: A Novel Dataset for CCTV Traffic Camera based Accident Analysis", International Workshop on Traffic and Street Surveillance for Safety and Security, Nov 2018.

Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset

Published in arXiv preprint arXiv:1809.00241, 2018

Large-scale activity recognition study using the Moments in Time dataset.

Ankit Shah, Harini Kesavamoorthy, Poorva Rane, Pramati Kalwad, Alexander Hauptmann, Florian Metze. "Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset." arXiv preprint arXiv:1809.00241, 2018.

Natural Language Person Search Using Deep Reinforcement Learning

Published in arXiv preprint arXiv:1809.00365, 2018

Method for person search in videos using natural language and reinforcement learning.

Ankit Shah, Tyler Vuong. "Natural Language Person Search Using Deep Reinforcement Learning." arXiv preprint arXiv:1809.00365, 2018.

A Closer Look at Weak Label Learning for Audio Events

Published in Preprint and Under Review, 2018

Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and the availability of large-scale weakly labeled datasets have finally opened up the possibility of large-scale AED. However, a deeper understanding of how weak labels affect the learning of sound events is still missing from the literature. In this work, we first describe a CNN-based approach for weakly supervised training of audio events. The approach follows some basic design principles desirable in a learning method relying on weakly labeled audio. We then describe important characteristics which naturally arise in weakly supervised learning of sound events. We show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and corruption of labels affect weakly supervised training for audio events. We also study the feasibility of directly obtaining weakly labeled data from the web without any manual labeling and compare it with a dataset which has been manually labeled. The analysis and understanding of these factors should be taken into account in the development of future weak label learning methods. AudioSet, a large-scale weakly labeled dataset for sound events, is used in our experiments.

[Paper Link]

Ankit Shah, Anurag Kumar, Alexander Hauptmann, Bhiksha Raj, "A Closer Look at Weak Label Learning for Audio Events", ArXiv e-prints, 2018

NELS-Never-Ending Learner of Sounds

Published in Neural Information Processing Systems (NIPS 2017), 2018

Sounds are essential to how humans perceive and interact with the world. These sounds are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we’ve ever seen. However, most of these recordings have undescribed content, making methods for automatic audio content analysis, indexing and retrieval necessary. These methods have to address multiple challenges, such as the relation between sounds and language, numerous and diverse sound classes, and large-scale evaluation. We propose a system that continuously learns relations between sounds and language from the web, improves sound recognition models over time and evaluates its learning competency at large scale without references. We introduce the Never-Ending Learner of Sounds (NELS), a project for continuously learning sounds and their associated knowledge, available online at nels.cs.cmu.edu.

[Paper Link]

Elizalde, Benjamin, Rohan Badlani, Ankit Shah, Anurag Kumar, and Bhiksha Raj. "NELS-Never-Ending Learner of Sounds."

Framework for evaluation of sound event detection in web videos

Published in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, 2017

The largest source of sound events is web videos. Most videos lack sound event labels at the segment level; however, a significant number of them do respond to text queries, through a match found by the search engine in their metadata. In this paper we explore the extent to which a search query could be used as the true label for the presence of sound events in the videos. For this, we developed a framework for large-scale sound event recognition on web videos. The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets. The datasets are used to train three classifiers, which were then run on 3.7 million video segments. We evaluated performance using the search query as the true label and compared it (on a subset) with human labeling. Both types exhibited close performance, to within 10%, and similar performance trends as the number of evaluated segments increased. Hence, our experiments show potential for using the search query as a preliminary true label for sound events in web videos.

[Paper Link]

Badlani, Rohan, Ankit Shah, Benjamin Elizalde, Anurag Kumar, and Bhiksha Raj. "Framework for evaluation of sound event detection in web videos." arXiv preprint arXiv:1711.00804 (2017).

Content-based Representations of audio using Siamese neural networks

Published in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, 2017

In this paper, we focus on the problem of content-based retrieval for audio, which aims to retrieve all semantically similar audio recordings for a given audio clip query. We propose a novel approach which encodes the audio into a vector representation using Siamese Neural Networks. The goal is to obtain similar encodings for files belonging to the same audio class, thus allowing retrieval of semantically similar audio. We used two similarity measures, cosine similarity and Euclidean distance, to show that our method is effective in retrieving files similar in audio content. Our results indicate that our neural network-based approach is able to retrieve files similar in content and semantics.
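As a rough illustration of the retrieval step described above, the following sketch ranks stored clips against a query using the two measures the abstract names. It assumes the encodings have already been produced by some network; the three-dimensional vectors, the clip names, and the `retrieve` helper are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, library, k=2, measure="cosine"):
    """Return the names of the k library clips closest to the query encoding."""
    if measure == "cosine":
        ranked = sorted(library.items(),
                        key=lambda item: cosine_similarity(query, item[1]),
                        reverse=True)
    else:
        ranked = sorted(library.items(),
                        key=lambda item: euclidean_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical encodings; a real system would obtain these from the trained network.
library = {
    "dog_bark_1": [0.9, 0.1, 0.0],
    "dog_bark_2": [0.8, 0.2, 0.1],
    "siren_1": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

top_cosine = retrieve(query, library, k=2, measure="cosine")
top_euclidean = retrieve(query, library, k=2, measure="euclidean")
```

In this toy example both measures place the two bark clips ahead of the siren clip, which is the behavior one would hope for when clips of the same class receive similar encodings.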

[Paper Link]

Manocha, Pranay, Rohan Badlani, Anurag Kumar, Ankit Shah, Benjamin Elizalde, and Bhiksha Raj. "Content-based Representations of audio using Siamese neural networks." arXiv preprint arXiv:1710.10974 (2017).

DCASE 2017 challenge setup: tasks, datasets and baseline system

Published in Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision-making process, as well as the evaluation of system output using task-specific metrics.

[Paper Link]

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017.

An Approach for Self Training Audio Event Detectors using Web Data

Published in 25th European Signal Processing Conference (EUSIPCO), 2017

Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in number of samples and hence it is difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset and unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube. Whenever the detectors recognized any of the known sounds with high confidence, the unlabeled audio was used to re-train the detectors. The performance of the re-trained detectors is compared to that of the original detectors using the annotated test set. Results showed an improvement of the AED, and uncovered challenges of using web audio from videos.
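The self-training loop described above can be sketched in a few lines. This is only an illustration under strong assumptions: a one-dimensional nearest-centroid classifier stands in for the audio event detectors, and the confidence score, the 0.9 threshold, and all function names are hypothetical rather than taken from the paper.

```python
def train(examples):
    """Fit a toy detector: one centroid (mean feature) per label."""
    sums, counts = {}, {}
    for feature, label in examples:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, feature):
    """Return (label, confidence) for the nearest centroid.

    The confidence is an assumed score: 1 minus the share of the total
    distance that belongs to the winning label.
    """
    distances = {label: abs(feature - c) for label, c in centroids.items()}
    label = min(distances, key=distances.get)
    total = sum(distances.values())
    confidence = 1.0 if total == 0 else 1.0 - distances[label] / total
    return label, confidence


def self_train(labeled, unlabeled, threshold=0.9):
    """Train on labeled data, pseudo-label confident unlabeled items, re-train."""
    detector = train(labeled)
    accepted = []
    for feature in unlabeled:
        label, confidence = predict(detector, feature)
        if confidence >= threshold:
            # Only high-confidence detections are added to the training pool.
            accepted.append((feature, label))
    return train(labeled + accepted), accepted


labeled = [(0.0, "dog"), (1.0, "dog"), (9.0, "siren"), (10.0, "siren")]
unlabeled = [0.8, 5.2, 9.9]  # stand-ins for features of web audio segments
retrained, accepted = self_train(labeled, unlabeled)
```

Here the ambiguous middle item is rejected, while the two items close to a known class are pseudo-labeled and shift the re-trained centroids slightly.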

[Paper Link]

Ankit Shah, Rohan Badlani, Anurag Kumar, Benjamin Elizalde, Bhiksha Raj, "An Approach for Self-Training Audio Event Detectors Using Web Data", in 25th European Signal Processing Conference (EUSIPCO), 2017

Archive ouverte HAL

Published in 2016

Open-access record referencing work on acoustic scene classification and sound event detection.

Benjamin Elizalde, Anurag Kumar, Ankit Shah, Rohan Badlani, Emmanuel Vincent. "Archive ouverte HAL."

An approach for self-training audio event detectors using web data

Published in arXiv preprint arXiv:1609.06026, 2016

Early work proposing self-training of audio event detectors using web data.

Ankit Shah, Rohan Badlani, Anurag Kumar, Benjamin Elizalde, Bhiksha Raj. "An approach for self-training audio event detectors using web data." arXiv preprint arXiv:1609.06026, 2016.

Experiments on DCASE Challenge 2016 Acoustic Scene Classification and Sound Event Detection in Real Life Recording

Published in IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events, 2016

In this paper we present our work on Task 1 Acoustic Scene Classification and Task 3 Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved on the baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a Segment-Based Error Rate of 0.76 compared to the baseline of 0.91.

[Paper Link]

Elizalde, Benjamin, Anurag Kumar, Ankit Shah, Rohan Badlani, Emmanuel Vincent, Bhiksha Raj, and Ian Lane. "Experimentation on the DCASE challenge 2016: Task 1—Acoustic scene classification and task 3—Sound event detection in real life audio." IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).

Pipelined implementation of high radix adaptive CORDIC as a coprocessor

Published in 2015 International Conference on Computing and Network Communications (CoCoNet), 2016

The Coordinate Rotational Digital Computer (CORDIC) algorithm allows computation of trigonometric, hyperbolic, natural log and square root functions. This iterative algorithm uses only shift and add operations to converge. Multiple fixed-radix variants of the algorithm have been implemented in hardware. These have demonstrated faster convergence at the expense of reduced accuracy. High-radix adaptive variants of CORDIC also exist in the literature. These allow for faster convergence at the expense of hardware multipliers in the datapath, without compromising the accuracy of the results. This paper proposes a 12-stage deep pipeline architecture to implement a high-radix adaptive CORDIC algorithm. It employs floating-point multipliers in place of the conventional shift-and-add architecture of fixed-radix CORDIC. This design has been synthesised on an FPGA board to act as a coprocessor. The paper also studies the power, latency and accuracy of this implementation.
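For readers unfamiliar with the conventional shift-and-add iteration mentioned above, here is a minimal sketch of classical radix-2 CORDIC in rotation mode, computing sine and cosine. It is not the high-radix adaptive architecture the paper proposes; the function name and iteration count are assumptions, and multiplications by powers of two stand in for the bit shifts a fixed-point hardware datapath would use.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Classical radix-2 rotation-mode CORDIC, valid for |theta| <= pi/2.

    Returns (sin(theta), cos(theta)) as floating-point approximations.
    """
    # Elementary rotation angles atan(2^-i); hardware would store these in a table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # starting from the inverse of the total gain cancels that stretch.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward zero residual angle
        shift = 2.0 ** -i             # a right shift by i bits in fixed point
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x


approx_sin, approx_cos = cordic_sin_cos(0.5)
```

Each iteration contributes roughly one more bit of accuracy, which is why fixed-radix designs need many stages and why higher-radix variants are attractive when faster convergence matters.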

[Paper Link]

S. S. Oza, A. P. Shah, T. Thokala and S. David, "Pipelined implementation of high radix adaptive CORDIC as a coprocessor," 2015 International Conference on Computing and Network Communications (CoCoNet), Trivandrum, 2015, pp. 333-342.

Repeatability and Scalability of Code at Top level Verification

Published in Regional Engineering Conference 2016, 2016

Conference paper discussing methods to ensure repeatable and scalable verification code.

Ankit Shah, Ajith Bhat, Rashmin Mantri, Saurabh Saxena, Rishiraj, Shanavas. "Repeatability and Scalability of Code at Top level Verification." Regional Engineering Conference, 2016.

Hardware Architecture for High Radix Adaptive CORDIC Algorithm

Published in National Institute of Technology Karnataka Surathkal, 2015

PhD thesis presenting a hardware architecture for an adaptive CORDIC algorithm.

Ankit Shah, Saharsh Samir Oza, Tarun Thokala, Pratik Gujjar, Sumam David. "Hardware Architecture for High Radix Adaptive CORDIC Algorithm." PhD thesis, National Institute of Technology Karnataka Surathkal, 2015.