University of Limerick Institutional Repository

Recognising specific named entities in a new restricted domain using conditional random fields

dc.contributor.advisor Sutcliffe, Richard
dc.contributor.author Gabbay, Igal
dc.date.accessioned 2013-08-22T11:16:17Z
dc.date.available 2013-08-22T11:16:17Z
dc.date.issued 2013
dc.identifier.uri http://hdl.handle.net/10344/3344
dc.description peer-reviewed en_US
dc.description.abstract Named-entity recognition (NER) plays a vital role in information extraction, question answering and text mining. Classic NER research has focused on tagging instances of PERSON, LOCATION and ORGANISATION in the newswire domain. New fine-grained NER (FG-NER) covers subtypes of the classic NEs. The goal of this study was to investigate an FG-NER scenario with a set of new specific NEs (SNEs) typical of a new restricted journalistic domain. Reports on the birth of animals in zoos were identified as such a productive domain. A 700-document corpus (241K tokens) named ZooBirth was compiled from a newspaper archive and annotated. It contained 2,811 instances of the ten most frequent numerical SNEs shortlisted from 43 candidates. Using Conditional Random Fields (CRF) allowed testing positional and order-within-document features, which were hypothesized to improve the tagging of SNEs. In support of positional features, analysis of the distribution of SNEs within documents yielded SNE-specific patterns. The token position feature produced a statistically significant but modest improvement in the case of two SNEs (82.2 to 84.4 strict precision, and 59.5 to 61.1 F-measure). Order-effect features yielded a statistically significant improvement in F-measure when tagging the weight at birth (from 68.4 to 71.1 strict, and from 75.5 to 80.6 lenient). In the final stage of the study, a novel technique named subtractive tagging was introduced to enrich negative examples when training the CRF. When tagging the newborn animal’s date of birth and the age of its mother, strict recall improved from 52.8 to 60.1 and from 65.5 to 68.9, respectively, with statistical significance. en_US
dc.language.iso eng en_US
dc.publisher University of Limerick en_US
dc.subject named-entity recognition en_US
dc.subject NER en_US
dc.subject information extraction en_US
dc.title Recognising specific named entities in a new restricted domain using conditional random fields en_US
dc.type info:eu-repo/semantics/doctoralThesis en_US
dc.type.supercollection all_ul_research en_US
dc.type.supercollection ul_published_reviewed en_US
dc.type.supercollection ul_theses_dissertations en_US
dc.rights.accessrights info:eu-repo/semantics/openAccess en_US

