University of Limerick Institutional Repository

Automatic text classification using bag of words and bag of concepts based representations

dc.contributor.advisor Mahdi, Abdulhussain E.
dc.contributor.author Alahmadi, Alaa
dc.date.accessioned 2016-09-12T14:01:16Z
dc.date.available 2016-09-12T14:01:16Z
dc.date.issued 2016
dc.identifier.uri http://hdl.handle.net/10344/5224
dc.description peer-reviewed en_US
dc.description.abstract Automatic Text Classification (ATC) is one of the most important tasks in data mining for organizing information and discovering knowledge. The goal of ATC is to reduce the need to manually organize large collections of text documents by assigning one or more predefined categories to a given document using appropriate natural language processing techniques. Overall, the classification process involves three components: text pre-processing, text representation, and the classifier, which is built using a Machine Learning (ML) algorithm. In general, existing text representations are based on the Bag-of-Words (BOW) and Bag-of-Concepts (BOC) models and their variations. The BOW representation model ignores the semantic connections between words: it breaks terms into their constituent words and treats synonymous words as independent words with no semantic association. The BOC model addresses these limitations by using concepts as features to represent text in ATC systems. The aim of this work is to investigate and assess the effect of commonly available text representation models on the performance of ATC systems, in terms of classification accuracy and implementation efficiency. To achieve that, both BOW and BOC representation models are used with the ATC system, and Wikipedia is used as a knowledge base to provide concepts. In addition, different strategies that use both words and concepts to build combined models are reviewed and compared with the BOW and BOC representation models. Moreover, these representation models are evaluated in ATC systems for two languages, English and Arabic. For the Arabic ATC system, different variations of the BOW representation model, which result from the different methods used in the text pre-processing component, are compared. Furthermore, WordNet is used as a knowledge base to provide concepts for representing Arabic text in the ATC system. 
This is then followed by attempts to enrich text representation by combining the features of both BOW and BOC models, in order to further enhance the performance of ATC. Our investigation has resulted in the development of two new strategies, namely Adding Unmapped Concepts (AUC) and Using Concepts for Terms which do not appear in the Document (CTD). Both strategies improve the performance of ATC systems in comparison with the BOW and BOC representation models. They also bring text classification to a qualitatively new level of performance when compared with other strategies. In addition, the CTD strategy reduces the time and memory required compared with other strategies used to enrich text representation in ATC systems. The results of our experiments show that text representation is a key element affecting the performance of both English and Arabic ATC systems, and that the developed strategies yield improvements in both languages. Furthermore, using Wikipedia concepts to build the BOC model for Arabic ATC is more efficient for representing text than the BOW model, which is not in line with what has been reported for English ATC. The reason is the complex nature of the Arabic language, which has rich morphology and a high degree of inflection and derivation. In addition, Arabic suffers from poor morphological tools, which makes Wikipedia concepts better features for representing text. en_US
dc.language.iso eng en_US
dc.publisher University of Limerick en_US
dc.subject automatic text classification en_US
dc.subject ATC en_US
dc.subject knowledge discovery en_US
dc.title Automatic text classification using bag of words and bag of concepts based representations en_US
dc.type info:eu-repo/semantics/doctoralThesis en_US
dc.type.supercollection all_ul_research en_US
dc.type.supercollection ul_published_reviewed en_US
dc.type.supercollection ul_theses_dissertations en_US
dc.rights.accessrights info:eu-repo/semantics/openAccess en_US
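
The contrast the abstract draws between BOW and BOC representations can be illustrated with a minimal sketch. This is not the thesis's implementation: the CONCEPT_MAP dictionary is a hypothetical toy stand-in for a knowledge base such as Wikipedia, the function names are invented for illustration, and the greedy longest-match lookup is an assumed mapping strategy.

```python
from collections import Counter

# Hypothetical toy concept map standing in for a knowledge base such as
# Wikipedia. The entries are illustrative only, not taken from the thesis.
CONCEPT_MAP = {
    ("machine", "learning"): "Machine_learning",
    ("car",): "Automobile",
    ("automobile",): "Automobile",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in CONCEPT_MAP)


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive pre-processing step)."""
    return text.lower().split()


def bag_of_words(text):
    """BOW: count each word independently, ignoring any relation between words."""
    return Counter(tokenize(text))


def bag_of_concepts(text):
    """BOC: count concepts found by greedy longest-match lookup in CONCEPT_MAP.

    Words that map to no concept are skipped.
    """
    tokens = tokenize(text)
    concepts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in CONCEPT_MAP:
                concepts[CONCEPT_MAP[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return concepts


if __name__ == "__main__":
    doc = "car automobile machine learning"
    print(bag_of_words(doc))
    print(bag_of_concepts(doc))
```

In this sketch, BOW treats "car" and "automobile" as two unrelated features and splits "machine learning" into two words, while BOC maps the synonyms to one concept and keeps the multi-word term as a single feature.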