University of Limerick Institutional Repository

Investigating the impact of pre‑processing techniques and pre‑trained word embeddings in detecting Arabic health information on social media

dc.contributor.author Albalawi, Yahya
dc.contributor.author Buckley, Jim
dc.contributor.author Nikolov, Nikola S.
dc.date.accessioned 2021-07-13T07:05:06Z
dc.date.available 2021-07-13T07:05:06Z
dc.date.issued 2021
dc.identifier.uri http://hdl.handle.net/10344/10336
dc.description peer-reviewed en_US
dc.description.abstract This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another only for testing the generality of our models. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set. en_US
dc.language.iso eng en_US
dc.publisher Springer Open en_US
dc.relation.ispartofseries Journal of Big Data;8, 95
dc.subject Deep learning en_US
dc.subject Health information en_US
dc.title Investigating the impact of pre‑processing techniques and pre‑trained word embeddings in detecting Arabic health information on social media en_US
dc.type info:eu-repo/semantics/article en_US
dc.type.supercollection all_ul_research en_US
dc.type.supercollection ul_published_reviewed en_US
dc.identifier.doi 10.1186/s40537-021-00488-w
dc.contributor.sponsor Taibah University Al-Ula, Saudi Arabia en_US
dc.contributor.sponsor SFI en_US
dc.relation.projectid 13/RC/2094 en_US
dc.rights.accessrights info:eu-repo/semantics/openAccess en_US
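
The abstract reports both F1 score and accuracy, and the two metrics do not always rank models the same way. The following is a minimal Python sketch of how each metric is computed from confusion-matrix counts. The `Counts` class, the function names, and all the numbers are hypothetical illustrations and are not taken from the paper or its data sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for a binary classifier (hypothetical helper)."""
    tp: int  # health-related tweets correctly flagged
    fp: int  # other tweets wrongly flagged as health-related
    fn: int  # health-related tweets that were missed
    tn: int  # other tweets correctly left unflagged


def accuracy(c: Counts) -> float:
    """Fraction of all examples that were classified correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f1_score(c: Counts) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Two made-up classifiers scored on the same made-up test set
# (40 positive and 160 negative examples). These are NOT results from the paper.
model_x = Counts(tp=30, fp=10, fn=10, tn=150)
model_y = Counts(tp=40, fp=25, fn=0, tn=135)

for name, c in (("model_x", model_x), ("model_y", model_y)):
    print(f"{name}: accuracy={accuracy(c):.3f}  F1={f1_score(c):.3f}")
```

In this made-up case `model_x` has the higher accuracy while `model_y` has the higher F1 score, because F1 ignores true negatives and so rewards recall on the minority class. The sketch only illustrates why a record such as this one lists both figures; it makes no claim about how the authors computed theirs.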

