WorldCat Identities

Linguistic Data Consortium

Overview
Works: 1,143 works in 1,682 publications in 6 languages and 5,197 library holdings
Genres: Excerpts; Conference papers and proceedings
Roles: Other
Classifications: P98, 410.285
Most widely held works by Linguistic Data Consortium
English Web treebank

4 editions published between 2012 and 2017 in English and held by 21 WorldCat member libraries worldwide

English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains. *Data* This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1,174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated. Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006. Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006. Emails are messages sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users which were processed to remove any content not generated by human users. The reviews in this corpus were gleaned from online reviews of businesses and services on various Google web sites written by individuals. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available. Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions which may be on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.
Arabic treebank

5 editions published between 2004 and 2010 in Arabic and held by 20 WorldCat member libraries worldwide

"It consists of 599 distinct newswire stories from the Lebanese publication An Nahar with part-of-speech (POS), morphology, gloss and syntactic treebank annotation in accordance with the Penn Arabic treebank (PATB) guidelines developed in 2008 and 2009. This release represents a significant revision of LDC's previous ATB3 publications: Arabic treebank: part 3 v 1.0 LDC2004T11 and Arabic treebank: part 3 (full corpus) v 2.0 (MPG + syntactic analysis) LDC2005T20"--LDC online catalogue
The New York times annotated corpus

2 editions published in 2008 in English and held by 18 WorldCat member libraries worldwide

"The New York Times Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes: over 1.8 million articles (excluding wire services articles that appeared during the covered period); over 650,000 article summaries written; over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors; over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com; Java tools for parsing corpus documents from .xml into a memory resident object. As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles."--Index HTML document
2009 CoNLL shared task

3 editions published in 2012 in English and Chinese and held by 17 WorldCat member libraries worldwide

" ... Contains the Chinese and English trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) shared task evaluation"--LDC catalog
ACE 2005 English SpatialML annotations by Christy Doran

4 editions published between 2008 and 2011 in English and held by 16 WorldCat member libraries worldwide

The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kind of information recognized and extracted from text includes entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is primarily on geography and culturally-relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services. In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06.
COLING 2000 : proceedings of the 18th International Conference on Computational Linguistics : proceedings of the conference, Universität des Saarlandes, Saarbrücken, Germany, 31 July-4 August 2000 by International Conference on Computational Linguistics

1 edition published in 2000 in English and held by 16 WorldCat member libraries worldwide

BBN pronoun coreference and entity type corpus by Ralph M Weischedel

2 editions published in 2005 in English and held by 15 WorldCat member libraries worldwide

"This publication supplements the 1 million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types. All annotation was done by hand at BBN using proprietary annotation tools. This corpus was developed by BBN to support the ACE and AQUAINT programs."--Index.html
Chinese Treebank 9.0

2 editions published in 2016 in Chinese and held by 14 WorldCat member libraries worldwide

"Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs."--LDC online catalog
Spanish timebank 1.0

3 editions published in 2012 in Spanish and held by 14 WorldCat member libraries worldwide

Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Spanish texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language. TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Spanish TimeBank 1.0 is annotated in three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored for the specifics of the Spanish language. Temporal relations in Spanish present distinctions of verbal mood (e.g., indicative, subjunctive, conditional, etc.) and grammatical aspect (e.g., imperfective) which are absent in English. Spanish TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages such as English, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual text entailment, opinion mining or question answering. Spanish TimeBank 1.0 is the Spanish language complement to Catalan TimeBank 1.0 LDC2012T10. LDC has released other corpora incorporating TimeBank annotation: TimeBank 1.2 LDC2006T08, FactBank 1.0 LDC2009T23 and ModeS TimeBank 1.0 LDC2012T01. *Data* Spanish TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are news stories and fiction from the AnCora corpus. The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).
2006 NIST/USF evaluation resources for the VACE program, meeting data test set (Visual)

2 editions published in 2011 in English and held by 14 WorldCat member libraries worldwide

"It contains approximately twenty hours of meeting room video data collected in 2005 and 2006 and annotated for the VACE (video analysis and content extraction) 2006 face and person tracking tasks. The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding."--LDC online catalogue
GALE phase 2 Arabic broadcast conversation parallel text, part 1

2 editions published in 2012 in Arabic and held by 13 WorldCat member libraries worldwide

"GALE phase 2 Arabic broadcast conversation parallel text part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for phase 2 of the DARPA GALE (Global autonomous language exploitation) program. This corpus contains modern standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction"--LDC online catalogue
Arabic treebank, broadcast news v1.0

2 editions published in 2012 in Arabic and held by 13 WorldCat member libraries worldwide

Arabic Treebank - Broadcast News v1.0 was developed at the Linguistic Data Consortium (LDC). It consists of 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation. The ongoing PATB project supports research in Arabic-language natural language processing and human language technology development. The methodology and work leading to the release of this publication are described in detail in the documentation accompanying this corpus. *Data* This release contains 432,976 source tokens before clitics were split, and 517,080 tree tokens after clitics were separated for treebank annotation. The source materials are Arabic broadcast news stories collected by LDC during the period 2005-2008 from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV. The transcripts were produced by LDC.
Chinese gigaword fifth edition (Visual)

3 editions published in 2011 in Chinese and held by 13 WorldCat member libraries worldwide

"It is a comprehensive archive of newswire text data that has been acquired from Chinese news sources by LDC at the University of Pennsylvania. Chinese gigaword fifth edition includes all of the content of the fourth edition of Chinese gigaword (LDC2009T27) plus new data covering the period from January 2009 through December 2010"--LDC online catalogue
GALE phase 3 and 4 Chinese broadcast news parallel text

2 editions published in 2016 in Chinese and held by 13 WorldCat member libraries worldwide

"This corpus contains Chinese source text and corresponding English translations selected from broadcast news data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction. ... Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features."--LDC catalog
GALE phase 2 Arabic broadcast conversation parallel text, part 2

3 editions published in 2012 in Arabic and held by 13 WorldCat member libraries worldwide

"Developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction."--Index.html file
Factbank 1.0 by Roser Sauri

2 editions published in 2009 in English and held by 13 WorldCat member libraries worldwide

"FactBank 1.0 ... consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to those events ... The combination of the factuality values in FactBank with the structural information in TimeML-annotated corpora facilitates the development of tools aimed at automatically identifying the factuality values of events, a component fundamental in tasks requiring some degree of text understanding, such as textual entailment, question answering, or narrative understanding."--LDC online catalogue
English speed networking conversational transcripts

2 editions published in 2016 in English and held by 13 WorldCat member libraries worldwide

"[C]ontains 388 transcripts of English face-to-face and instant messaging conversations about business ideas collected in 2014 and 2015 from participants (undergraduate students) playing different power roles. This corpus was created to examine communication accommodation, specifically, the ways in which an individual's linguistic style, or how an individual communicates, is affected by social power and personality."--LDC catalog
Chinese news translation text

2 editions published in 2005 in English and held by 13 WorldCat member libraries worldwide

"To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Chinese source materials. The source Chinese text and its English translations were selected and translated in different LDC projects during the time period of February, 2003 to January, 2005. A total of about 474K Chinese characters were selected from two sources, namely Xinhua and AFP, and translation services were provided by seven translation agencies. Each Chinese news story was translated once."
Mawukakan lexicon

2 editions published in 2005 in Multiple languages and Mandingo and held by 13 WorldCat member libraries worldwide

"The Mawukakan Dictionary is the first publication of an ongoing project aiming to build an Electronic Dictionary of four Mandekan [Eastern Manding languages of the Mande Group of the Niger-Congo family] at the Linguistic Data Consortium (LDC) of the University of Pennsylvania. The other three variants of Mandekan involved are the Bambara or Bamanankan [Mali], the Maninka or Maninkakan [Guinea-Conakry] and the Odienne Jula or Wojenekakan [Cote d'Ivoire]. The lack of written tradition makes such a dictionary project extremely important. For the dictionary of a small language like Mawukakan (less than half of a million speakers) to be the most useful, it has to combine the linguistic component with a cultural component. The fact that the Mawukakan-English lexicon is coupled with a Mawukakan-French one makes this project a bit more important, given the Mawukakan speakers live mostly in the francophone area of West Africa. The project consists in the collection of the largest amount possible of data on the Mandekan and the Manding culture, and making it available electronically at the LDC for the research community."
GALE phase 3 and 4 Chinese broadcast conversation parallel text

2 editions published in 2016 in Chinese and held by 13 WorldCat member libraries worldwide

"GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text includes 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation. ... The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features."--LDC catalog
 

Alternative Names

Controlled identity: University of Pennsylvania. School of Arts and Sciences

LDC

LDC (Linguistic Data Consortium)

Linguistic Data Consortium maison d'édition

University of Pennsylvania Linguistic Data Consortium

University of Pennsylvania. School of Arts and Sciences. Linguistic Data Consortium

リングイスティック・データ・コーソシアム
