University of Limerick Institutional Repository

Automatic subject classification of textual documents using limited or no training data

dc.contributor.advisor Mahdi, Abdulhussain
dc.contributor.author Joorabchi, Arash
dc.date.accessioned 2011-11-30T12:13:06Z
dc.date.available 2011-11-30T12:13:06Z
dc.date.issued 2010
dc.identifier.uri http://hdl.handle.net/10344/1631
dc.description peer-reviewed en_US
dc.description.abstract With the explosive growth in the number of electronic documents available on the Internet, intranets, and digital libraries, there is a greater need than ever for automatic systems capable of indexing and organising such large volumes of data. Automatic Text Classification (ATC) has become one of the principal means for enhancing the performance of information retrieval systems and organising digital libraries and other textual collections. Within this context, the use of Machine Learning (ML) algorithms has been the dominant approach to ATC since the 1990s. However, one of the major obstacles to the deployment of ML-based ATC systems in practical real-world applications is the lack of labelled datasets of sufficient quality and quantity for training the ML algorithms. The aim of this work is to address this problem by investigating two lines of research: (a) the development of new bootstrapping methods which automate the process of creating the labelled corpora required for training ML-based ATC systems; and (b) the development of a new breed of ATC algorithms which are unsupervised, and therefore do not require any training data. To achieve this aim, the project has mainly focused on utilising two knowledge sources whose potential application in ATC has yet to be fully explored: conventional library organisation resources (e.g., library classification schemes, thesauri, and online public access catalogues), and the linkage among documents in the form of citation and reference networks. In relation to bootstrapping methods for ML-based ATC systems, our investigation has resulted in the development of two new methods. These methods greatly reduce the human involvement in building training datasets by using the documents and textual content abundantly available on the Internet as training samples.
The other major contribution of this work is the development and evaluation of a new unsupervised ATC method which is capable of classifying a wide range of documents with high accuracy according to a library classification scheme without requiring any training data. This method, named Bibliography Based ATC (BB-ATC), is based on the hypothesis that the citations and references in a document can be used as primary sources of information to determine the subject of the document with high accuracy. The proposed BB-ATC method automatically mines the citation and reference networks among documents and uses the classification metadata of manually classified documents to predict the subject/class of unlabelled documents. Finally, our further investigation into the application of citation networks in topical indexing of documents has resulted in the development of a new unsupervised keyword/keyphrase extraction method for scientific documents, which is based on the same underlying hypothesis as BB-ATC. The developed keyphrase extraction method does not require any training data and yields an accuracy similar to that obtained by human indexers and by state-of-the-art ML-based keyphrase extraction methods, whose accuracy is highly dependent on the quality and quantity of the manually labelled training data. en_US
dc.language.iso eng en_US
dc.publisher University of Limerick en_US
dc.subject electronic documents en_US
dc.subject algorithms en_US
dc.title Automatic subject classification of textual documents using limited or no training data en_US
dc.type Doctoral thesis en_US
dc.type.supercollection all_ul_research en_US
dc.type.supercollection ul_published_reviewed en_US
dc.type.supercollection ul_theses_dissertations en_US
dc.type.restriction none en
