YouTube-Based Dataset of User Comments in the Iraqi Dialect
Keywords:
Hate speech detection, Iraqi dialect, YouTube, natural language processing
Abstract
The widespread use of social media has made communication easier by enabling the quick and unrestricted sharing of ideas and opinions. However, some users abuse this freedom to disseminate hate speech. As a result, abusive language has become more common online, which harms society, especially in regions with a high degree of dialectal variation. In Iraq, almost every province has its own dialect that differs greatly from those of other provinces, and this linguistic diversity poses a significant challenge to building effective hate speech detection systems. For this study, 548,661 comments in the Iraqi dialect were first collected from YouTube. After extensive preprocessing, including the removal of random and duplicate comments, the dataset was reduced to 120,000 comments. The comments were then manually classified into four categories: Abusive, Offensive, Hate speech, and Normal language. Natural Language Processing (NLP) algorithms cannot be trained for hate speech identification without a labeled dataset containing real instances of offensive language. This dataset was therefore created to train models on Iraqi hate speech, and it provides a foundational resource for research on hate speech identification and related NLP tasks.
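As an illustration only, the duplicate-removal step and the four-way label scheme described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline used to build the dataset: the names `Label`, `normalize`, and `deduplicate` are hypothetical, duplicates are assumed to mean comments that are identical after whitespace normalization, and the removal of "random" comments is not modeled because its criteria are not specified here.

```python
import re
from enum import Enum


class Label(Enum):
    """The four annotation categories named in the abstract."""
    ABUSIVE = "Abusive"
    OFFENSIVE = "Offensive"
    HATE_SPEECH = "Hate speech"
    NORMAL = "Normal"


def normalize(comment: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal.

    Assumption: whitespace is the only normalization applied.
    """
    return re.sub(r"\s+", " ", comment).strip()


def deduplicate(comments):
    """Drop empty comments and exact duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for raw in comments:
        text = normalize(raw)
        if not text or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique
```

A call such as `deduplicate(raw_comments)` would be applied to the collected comments before manual annotation, with each surviving comment later assigned one `Label` value by an annotator.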