ISLRN

MATERIAL Farsi-English Language Pack

Full Official Name: MATERIAL Farsi-English Language Pack

Submission date: Nov. 19, 2024, 8:35 p.m.

Introduction

MATERIAL Farsi-English Language Pack (LDC2024S13) was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 70 hours of Farsi conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

The Farsi speech in this release represents that spoken in the Greater Tehran, Central/Southwest, Northeast, and Northwest dialect regions of Iran, as well as a standard formal dialect in use throughout the country. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately a third of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release. The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query (simple, conceptual, hybrid) and by their relevance to query search terms.

Speech data is presented as either two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. All text data is UTF-8 encoded.
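For readers unfamiliar with the A-law encoding mentioned above, the following is a minimal sketch of standard G.711 A-law expansion to linear PCM. It is not taken from the release documentation, which remains the authority on file handling. The function names `alaw_to_linear` and `decode_alaw` are hypothetical, and the sketch assumes the raw sample bytes have already been extracted from the WAV or SPHERE container, since no header parsing is shown.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code (0-255) to a signed linear PCM value."""
    if not 0 <= code <= 0xFF:
        raise ValueError("A-law code must fit in one byte")
    code ^= 0x55                      # undo the even-bit inversion used by G.711
    mantissa = (code & 0x0F) << 4     # 4-bit mantissa, scaled
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a run of A-law bytes into a list of linear PCM samples."""
    return [alaw_to_linear(b) for b in payload]


if __name__ == "__main__":
    # Smallest positive/negative codes and the largest positive code.
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA])))
```

At an 8 kHz sampling rate with one byte per sample, 8000 bytes correspond to one second of audio for a single channel.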

Creator(s)

Aric Bills

Sarra Chouder

Cassian Corey

Marjan Davoodian

Eyal Dubinski

Corinna Ellis

Reza Farnam

Paul Gibby

Luke Hartwig

Dagmara Kalnins

Michael Kazi

Julie Lam

Hanh Le

Nicolas Malyska

Sarah Marvi

Sara McConnell

Jennifer Melot

Alyssa Mensch

Alex Moore

Michelle Morrison

Shelley Paget

Frederick Richardson

Annette Roberts

Carl Rubino

Marjan Sadeghi Moaddel

Bern Samko

Kenneth Saw

Pradeepti Sen

Rosanna Smith

Jonathan Taylor

Brian Thompson

Audrey Tong

Richard Tong

Andrew Weller

Sasha Wilmoth

Jennifer Yu

Ilya Zavorin

Distributor(s)

Linguistic Data Consortium

Right Holder(s)

Status: Accepted

ISLRN: 202-347-751-598-9

Version

1.0

Source

https://catalog.ldc.upenn.edu/LDC2024S13

Resource Type

Primary Text

Media Type

Audio

Text

Language(s)

English

Persian

Access Medium

Web Download