Challenge Special Sessions

ASRU 2019 hosts two Challenge Special Sessions. The details are as follows:

ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection, Analysing Operational Settings

Organizers: Andreas Nautsch, Xin Wang, Massimiliano Todisco, Md Sahidullah, Ville Vestman
Scientific committee: Junichi Yamagishi, Tomi Kinnunen, Kong Aik Lee, Hector Delgado, Nicholas Evans

The detection of fake audio designed to spoof automatic speaker verification (ASV) systems is in high demand to secure a host of applications, ranging from smart home devices to online banking and payment solutions. The 2019 “Automatic Speaker Verification Spoofing and Countermeasures” (ASVspoof) challenge has two sub-challenges, involving a “Logical Access” scenario (LA; text-to-speech synthesis/voice conversion attacks, TTS/VC) and a “Physical Access” scenario (PA; replay attacks), to advance the future horizons in spoofed and fake audio detection.

For the Interspeech 2019 ASVspoof Challenge, more than 60 international industrial and academic teams submitted automated countermeasures without knowing: i) for the LA scenario, the attack instrument type (which TTS/VC algorithms were used, among eleven unknown attack types), and ii) for the PA scenario, the environment in which human speech was captured and the replay setup in which presentation attacks were carried out (simulated and laboratory scenario testing).

In the ASRU “ASVspoof 2019 Analyses” Special Session, participants are given full access to this information so that they can analyze and strengthen countermeasures in known operational environments; in particular, the keys, metadata, and additional data are released to participants via Edinburgh DataShare. The objectives of the ASVspoof 2019 Analyses special session are: 1) LA: classification of human (bona fide) speech signals vs. machine-generated speech signals produced by state-of-the-art TTS and VC technologies, and 2) PA: classification of human (bona fide) speech signals vs. captured and replayed speech signals (simulated and laboratory scenario testing).
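
As a concrete illustration of scoring such a bona fide vs. spoofed classification task, the sketch below computes an equal error rate (EER), a common summary metric for binary detection tasks of this kind (the challenge's primary metric is the t-DCF). All scores here are made-up placeholders, not challenge data.

```python
import numpy as np

# Hypothetical countermeasure scores (higher = more likely bona fide);
# these numbers are illustrative only.
bonafide_scores = np.array([2.1, 1.8, 2.5, 1.2])
spoof_scores = np.array([-0.5, 0.3, 1.5, -1.0])

# Sweep candidate decision thresholds over all observed scores.
thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
fars = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false acceptance rate
frrs = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection rate

# EER: operating point where FAR and FRR are (approximately) equal.
i = int(np.argmin(np.abs(fars - frrs)))
eer = (fars[i] + frrs[i]) / 2  # → 0.25 for these placeholder scores
```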

We suggest two alternatives, without narrowing the creativity of participants. For one, participants could employ different ASV (or other speech technology) systems to investigate the tandem performance on spoofed and fake audio in their operational case studies. For another, with the aim of analysing discrimination performance between bona fide and spoofed speech across operational settings, participants could report the maximum ‘min t-DCF’ across all LA and all PA settings, respectively.
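
The second suggestion can be sketched as follows: for each operational setting, the min t-DCF is the minimum of that setting's t-DCF curve over countermeasure thresholds, and the reported figure is the maximum of these per-setting minima within each scenario. The setting names and t-DCF values below are hypothetical placeholders, not real challenge results.

```python
# Hypothetical t-DCF values per countermeasure threshold, one curve per setting.
la_tdcf_curves = {            # LA: one curve per TTS/VC attack type (names illustrative)
    "A07": [0.40, 0.12, 0.25],
    "A10": [0.55, 0.31, 0.44],
}
pa_tdcf_curves = {            # PA: one curve per replay configuration (names illustrative)
    "replay_near": [0.30, 0.09, 0.18],
    "replay_far":  [0.60, 0.22, 0.35],
}

def max_min_tdcf(curves):
    # min over thresholds within each setting, then max across settings:
    # the worst-case setting after per-setting threshold optimization.
    return max(min(curve) for curve in curves.values())

la_reported = max_min_tdcf(la_tdcf_curves)  # worst-case LA setting
pa_reported = max_min_tdcf(pa_tdcf_curves)  # worst-case PA setting
```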

The Fifth Edition of the Multi-Genre Broadcast Challenge: MGB-5

Organizers: Ahmed Ali, Younes Samih, Ahmed Abdelali, Hamdy Mubarak, Suwon Shon, James Glass, Steve Renals, Peter Bell, Khalid Choukri

The MGB-5 challenge is an evaluation of speech recognition and dialect identification techniques using YouTube recordings. The data is highly diverse, spanning the whole range of YouTube genres. Our aim is to encourage researchers to evaluate the latest research techniques using large quantities of realistic data with immediate real-world applications, and to encourage approaches to adaptation and to semi-supervised and unsupervised learning.

The Moroccan Arabic Automatic Speech Recognition task comprises 14 hours of speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. We assume that the MGB-5 data is not enough by itself to build robust speech recognition systems, but that it could be useful for adaptation and for hyper-parameter tuning of models built using the MGB-2 data. Therefore, we suggest reusing the MGB-2 training data in this challenge and treating the provided in-domain data as (supervised) adaptation data. In addition to the transcribed 14 hours, the full programs are also provided, which amount to 48 hours for all 93 programs. This data can be used for in-domain speech or genre adaptation.

The Fine-grained Arabic Dialect Identification (ADI) task is the identification of speech from YouTube as one of 17 dialects (ADI17). Previous studies on Arabic dialect identification from the audio signal have been limited to five dialect classes by the lack of a suitable speech corpus. To enable a fine-grained analysis of Arabic dialect speech, we collected Arabic dialect data from YouTube. For the training set, about 3,000 hours of Arabic dialect speech from 17 countries across the Arab world were collected. For the development and test sets, about 280 hours of speech were collected. After automatic speaker linking and dialect labeling by human annotators, we selected a 57-hour subset to use as the development and test sets for performance evaluation.