YouTube-Based Dataset of User Comments in the Iraqi Dialect

Safaa Hameed; Asia Mehdi

Authors

Safaa Hameed, College of Computer Science and Information Technology, University of Kerbala, Kerbala, Iraq
Asia Mehdi, College of Computer Science and Information Technology, University of Kerbala

Keywords:

Hate speech detection, Iraqi dialect, YouTube, natural language processing

Abstract

The widespread use of social media has greatly improved communication by enabling the quick and unrestricted sharing of ideas and opinions. However, some people abuse this freedom to disseminate hate speech. As a result, abusive language has become more common online, with a negative impact on society, especially in regions with a high level of dialectal variation. In Iraqi society, almost every province has its own dialect that differs greatly from those of other provinces, and this linguistic diversity poses a significant challenge to building effective hate speech detection tools. For this study, a dataset of 548,661 comments in the Iraqi dialect was first gathered from YouTube. After extensive preprocessing, including the removal of random and duplicate comments, the dataset was reduced to 120,000 comments. The data were then manually classified into four categories: Abusive, Offensive, Hate speech, and Normal language. The resulting dataset provides a foundational resource for research on hate speech identification and related natural language processing tasks. Natural Language Processing (NLP) algorithms cannot be trained for hate speech identification without a labeled dataset containing actual instances of linguistically offensive material. This dataset was therefore created to train models on Iraqi hate speech.
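
The duplicate-removal step and the four-category label scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `Label`, `normalize`, and `deduplicate` are hypothetical, the whitespace normalization is an assumed choice, and the filtering of "random" comments is not modeled because the abstract does not define it.

```python
import re
from enum import Enum


class Label(Enum):
    """The four annotation categories named in the abstract."""
    ABUSIVE = "Abusive"
    OFFENSIVE = "Offensive"
    HATE_SPEECH = "Hate speech"
    NORMAL = "Normal"


# Assumption: comments that differ only in whitespace count as duplicates.
_WHITESPACE = re.compile(r"\s+")


def normalize(comment: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return _WHITESPACE.sub(" ", comment).strip()


def deduplicate(comments):
    """Return normalized comments in first-seen order, dropping empty and repeated ones."""
    seen = set()
    kept = []
    for raw in comments:
        text = normalize(raw)
        if not text or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = ["هلا  بيك", "هلا بيك", "   ", "شكرا"]
cleaned = deduplicate(sample)
```

Here `cleaned` keeps one copy of the repeated comment and drops the blank one. Each surviving comment would then be assigned one `Label` value by a human annotator.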
