Jigsaw - an Implementation of LDA

To further understand the importance of topic modeling, we will look at a public dataset: the Jigsaw Unintended Bias dataset [Jigsaw and Google, 2019].

This dataset consists of roughly two million public comments from the Civil Comments [Bogdanoff, 2017] platform, released so that researchers could study and improve civility in online conversations. Civil Comments was a social media platform that used peer-reviewed submission: commenters rated the civility of other comments before their own were, in turn, rated by others.

These comments were then annotated by human raters for various toxic conversational attributes. Additional labels with sociodemographic identifiers were included to help models understand bias in context. Here, we will use this dataset to examine potential bias in conversations about people with physical, mental, and learning disabilities.

We have filtered for all comments that have a value for any of the parameters ‘Intellectual or Learning Disability’, ‘Psychiatric or Mental Illness’, ‘Physical Disability’, and ‘Other Disability’, leaving 18,665 statements in our corpus. We can assume that these statements are most likely about people with disabilities, but with the help of topic modeling, we can confirm this.
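The exact filtering code used to build the PWD.csv file is not shown here, but one plausible sketch, assuming the annotation columns `physical_disability`, `intellectual_or_learning_disability`, `psychiatric_or_mental_illness`, and `other_disability` (with NaN meaning no value was provided), would keep any row where at least one of them is annotated:

```python
import pandas as pd

# Toy frame mimicking the four disability-related annotation columns;
# NaN means the comment was never annotated for that identity.
df = pd.DataFrame({
    "comment_text": ["comment a", "comment b", "comment c"],
    "physical_disability": [0.25, None, None],
    "intellectual_or_learning_disability": [None, None, 0.0],
    "psychiatric_or_mental_illness": [None, None, None],
    "other_disability": [None, None, None],
})

disability_cols = ["physical_disability", "intellectual_or_learning_disability",
                   "psychiatric_or_mental_illness", "other_disability"]

# Keep rows where at least one of the four columns has a value
filtered = df[df[disability_cols].notna().any(axis=1)]
print(len(filtered))  # comments a and c survive
```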

Warning

Some comments in the dataset may have explicit language.

We will import the necessary libraries here. Two of them, nltk and scikit-learn, will also need to be installed.

%%capture
!pip install nltk
%%capture
!pip install scikit-learn
import pandas as pd
import nltk
import re
from sklearn.decomposition import LatentDirichletAllocation

We will be reading the data into a data frame for easy analysis.

df = pd.read_csv(r'https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/PWD.csv')

This is what the data looks like:

df.head()
Unnamed: 0 id comment_text split created_date publication_id parent_id article_id rating funny ... white asian latino other_race_or_ethnicity physical_disability intellectual_or_learning_disability psychiatric_or_mental_illness other_disability identity_annotator_count toxicity_annotator_count
0 7705 6216834 No sympathy for these two knuckleheads. train 2017-10-25 00:52:00.913992+00 21 NaN 392998 approved 1 ... 0.0 0.0 0.0 0.0 0.25 0.0 0.0 0.0 4 58
1 8073 5625069 Wow!\nYour progressive psychosis has become ex... train 2017-07-20 03:30:15.579733+00 54 5624305.0 357183 rejected 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 10
2 8115 5690713 Or.... maybe there IS chaos because the "presi... train 2017-07-31 17:02:58.167475+00 102 5690153.0 361265 approved 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 62
3 8125 470493 I'll take someone who's physically ill over on... train 2016-09-12 02:41:50.084427+00 21 NaN 145747 approved 0 ... 0.0 0.0 0.0 0.0 0.75 0.0 1.0 0.0 4 68
4 8263 941207 Mental Illness at work again, again, again, ag... train 2017-02-02 22:38:09.291374+00 13 NaN 165832 rejected 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 70

5 rows × 47 columns

Below are just a few of the comments in this group. You can view more comments by changing the parameter in the ‘head’ function.

pd.set_option('display.max_colwidth', None)
df.comment_text.head(3)
0                                                                                          No sympathy for these two knuckleheads.
1                                        Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.
2    Or.... maybe there IS chaos because the "president" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.
Name: comment_text, dtype: object

Let’s try removing unnecessary words and cleaning the statements for analysis of topics.

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dudas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# cleaning master function
def clean_tweet(tweet, bigrams=False):
    tweet = tweet.lower() # lower case
    tweet = re.sub('['+my_punctuation + ']+', ' ', tweet) # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet) # remove double spacing
    tweet = re.sub(r'[0-9]+', '', tweet) # remove numbers
    tweet_token_list = [word for word in tweet.split(' ')
                            if word not in my_stopwords] # remove stopwords

    tweet_token_list = [word_rooter(word) if '#' not in word else word
                        for word in tweet_token_list] # apply word rooter
    if bigrams:
        tweet_token_list = tweet_token_list+[tweet_token_list[i]+'_'+tweet_token_list[i+1]
                                            for i in range(len(tweet_token_list)-1)]
    tweet = ' '.join(tweet_token_list)
    return tweet


df['clean_tweet'] = df.comment_text.apply(clean_tweet)
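To see what this cleaning pipeline does, here is a stripped-down, self-contained sketch of the same steps (lower-casing, punctuation stripping, number removal, stopword filtering) using a tiny hard-coded stopword list in place of NLTK's, and omitting stemming:

```python
import re

# Tiny illustrative stopword subset (NLTK's English list is much longer)
stopwords = {"the", "a", "is", "are", "in", "for"}
punct = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'

def clean(text):
    text = text.lower()                                      # lower case
    text = re.sub('[' + re.escape(punct) + ']+', ' ', text)  # strip punctuation
    text = re.sub(r'\s+', ' ', text)                         # collapse spacing
    text = re.sub(r'[0-9]+', '', text)                       # remove numbers
    return ' '.join(w for w in text.split(' ')
                    if w and w not in stopwords)             # drop stopwords

print(clean("The Senate voted 51-49 in 2018!"))  # -> "senate voted"
```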

We will be converting the statements to a vector format for the machine to understand.

from sklearn.feature_extraction.text import CountVectorizer

# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.99, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(df['clean_tweet']).toarray()

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out() # use get_feature_names() on scikit-learn < 1.0

For the current analysis, let’s have the model extract ten topics from the dataset. (You can experiment with the number of topics.)

number_of_topics = 10

model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
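Before fitting on the full corpus, it can help to see the shapes LDA works with on a toy term-count matrix: after fitting, `model.components_` holds one row of (unnormalized) word weights per topic, and `transform()` gives each document's topic mixture, which sums to 1 per document:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy term-count matrix: 3 documents x 4 vocabulary terms
tf_toy = np.array([[2, 0, 1, 0],
                   [0, 3, 0, 1],
                   [1, 1, 1, 1]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tf_toy)

print(lda.components_.shape)  # (n_topics, n_terms) -> (2, 4)
print(doc_topics.shape)       # (n_documents, n_topics) -> (3, 2)
```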

Here, the model performs the topic modeling analysis. (This might take a little while.)

model.fit(tf)
LatentDirichletAllocation(random_state=0)
#Function to display the topics generated.
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
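The slicing expression `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is worth unpacking: `argsort()` returns indices in ascending order of weight, and the reversed slice takes the last `no_top_words` of them, i.e. the indices of the largest weights in descending order. A small example with `no_top_words = 2`:

```python
import numpy as np

weights = np.array([0.1, 3.0, 0.5, 2.0])

# Indices in ascending order of weight: [0, 2, 3, 1]
ascending = weights.argsort()

# Reversed slice keeps the top 2 indices, largest weight first
no_top_words = 2
top = weights.argsort()[:-no_top_words - 1:-1]
print(top)  # [1 3] -> weights 3.0 and 2.0
```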

The output below lists, for each topic, its most important words along with their weights. With further analysis, we can understand the behavior of the dataset and the types of conversations that occur in it.

Try changing the no_top_words variable to show more or fewer words in each topic.

no_top_words = 15
display_topics(model, tf_feature_names, no_top_words)
Topic 0: trump (3452.3), presid (1031.5), vote (813.8), like (780.9), elect (579.5), republican (550.2), obama (542.2), peopl (540.2), man (499.8), democrat (488.0), get (475.6), white (453.6), think (435.9), support (432.4), donald (410.1)
Topic 1: mental (3351.9), ill (1993.1), health (1213.7), medic (706.8), http (630.5), help (527.0), need (504.5), www (467.5), treatment (453.5), com (415.4), get (362.2), doctor (342.3), patient (320.2), year (302.0), disord (297.7)
Topic 2: canada (591.5), muslim (582.0), countri (539.3), us (519.8), world (490.3), canadian (462.9), islam (448.3), right (409.8), like (400.1), govern (385.2), peopl (378.4), christian (373.7), liber (358.4), one (333.1), blind (329.4)
Topic 3: mental (1186.5), peopl (708.3), drug (555.8), ill (538.9), health (497.7), problem (491.9), use (476.3), issu (456.4), caus (444.5), suicid (405.2), depress (381.5), gun (369.9), violenc (320.6), alcohol (310.5), addict (278.2)
Topic 4: gun (1385.3), mental (1156.3), peopl (981.1), law (844.9), kill (699.6), polic (683.4), ill (674.7), one (526.8), would (521.2), get (519.6), person (451.4), right (422.9), go (364.9), could (349.7), guy (347.4)
Topic 5: school (840.5), kid (723.0), year (590.5), go (514.7), time (507.9), one (500.8), get (458.6), student (385.0), work (373.3), need (370.2), would (369.8), like (340.8), cathol (337.7), educ (329.4), church (309.2)
Topic 6: mental (1058.1), comment (848.3), like (678.6), would (668.2), think (650.4), person (629.7), say (608.7), know (607.6), issu (572.7), ill (569.4), make (567.4), one (553.2), peopl (513.0), use (494.8), someon (482.5)
Topic 7: white (1220.1), peopl (1076.2), black (651.0), disord (537.1), person (529.5), … (486.9), right (484.1), gender (454.1), women (446.7), … (443.7), mental (441.8), group (415.7), hate (392.3), male (363.5), one (360.8)
Topic 8: mental (1836.1), peopl (1793.0), health (1464.6), homeless (1367.5), care (1296.8), need (1169.6), work (956.9), ill (903.6), get (825.9), help (795.8), money (740.8), pay (702.1), problem (662.4), mani (660.3), would (660.1)
Topic 9: god (954.9), one (934.0), women (905.2), life (830.1), peopl (798.2), would (740.9), church (686.1), like (656.3), men (634.9), children (619.3), know (613.1), live (592.2), child (582.4), woman (581.8), love (553.7)

We can go through the topics obtained to understand the most common conversations that occur in this dataset. We will look into this content in detail in the following section.

Now to the analysis!