Jigsaw - an Implementation of LDA¶
To further understand the importance of topic modeling, we will look at a public dataset: the Jigsaw Unintended Bias dataset [Jigsaw and Google, 2019].
This dataset consists of roughly two million public comments from the Civil Comments [Bogdanoff, 2017] platform, released so that researchers could understand and improve civility in online conversations. Civil Comments was a social media platform that used a peer-review submission model: commenters rated the civility of other comments before their own were, in turn, rated by others.
These comments were then annotated by human raters for various toxic conversational attributes. Additional labels for sociodemographic identities were included to help machine learning models account for bias in context. Here, we will examine the utility of this dataset and the presence of potential bias in conversations about people with physical, mental, and learning disabilities.
We have filtered for all comments that were given a value for at least one of the parameters ‘Intellectual or Learning Disability’, ‘Psychiatric or Mental Illness’, ‘Physical Disability’, and ‘Other Disability’, leaving 18,665 statements in the corpus. We can assume that these statements are most likely about people with disabilities, and with the help of topic modeling we can confirm this.
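That filtering step can be sketched as follows. This is a hypothetical reconstruction on a toy DataFrame, not the exact code used to build the corpus; the column names match the real dataset, but the rows are made up, and NaN stands for a comment that was never annotated for that identity group.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full Jigsaw frame (the real one has 47 columns).
toy = pd.DataFrame({
    "comment_text": ["a", "b", "c", "d"],
    "physical_disability": [0.25, np.nan, np.nan, np.nan],
    "intellectual_or_learning_disability": [np.nan, 0.0, np.nan, np.nan],
    "psychiatric_or_mental_illness": [np.nan, np.nan, 1.0, np.nan],
    "other_disability": [np.nan] * 4,
})

disability_cols = [
    "physical_disability",
    "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness",
    "other_disability",
]

# keep rows where at least one disability column was given any value,
# even 0.0 (an annotated "not present" still counts as annotated)
filtered = toy[toy[disability_cols].notna().any(axis=1)]
print(list(filtered.comment_text))  # ['a', 'b', 'c']
```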
Warning
Some comments in the dataset may have explicit language.
We will import the necessary libraries here. A couple of libraries may also need to be installed.
%%capture
!pip install nltk
%%capture
!pip install scikit-learn
import pandas as pd
import nltk
import re
from sklearn.decomposition import LatentDirichletAllocation
We will be reading the data into a data frame for easy analysis.
df = pd.read_csv(r'https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/PWD.csv')
This is what the data looks like:
df.head()
Unnamed: 0 | id | comment_text | split | created_date | publication_id | parent_id | article_id | rating | funny | ... | white | asian | latino | other_race_or_ethnicity | physical_disability | intellectual_or_learning_disability | psychiatric_or_mental_illness | other_disability | identity_annotator_count | toxicity_annotator_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7705 | 6216834 | No sympathy for these two knuckleheads. | train | 2017-10-25 00:52:00.913992+00 | 21 | NaN | 392998 | approved | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 4 | 58 |
1 | 8073 | 5625069 | Wow!\nYour progressive psychosis has become ex... | train | 2017-07-20 03:30:15.579733+00 | 54 | 5624305.0 | 357183 | rejected | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 | 0.0 | 4 | 10 |
2 | 8115 | 5690713 | Or.... maybe there IS chaos because the "presi... | train | 2017-07-31 17:02:58.167475+00 | 102 | 5690153.0 | 361265 | approved | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 | 0.0 | 4 | 62 |
3 | 8125 | 470493 | I'll take someone who's physically ill over on... | train | 2016-09-12 02:41:50.084427+00 | 21 | NaN | 145747 | approved | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.0 | 1.0 | 0.0 | 4 | 68 |
4 | 8263 | 941207 | Mental Illness at work again, again, again, ag... | train | 2017-02-02 22:38:09.291374+00 | 13 | NaN | 165832 | rejected | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 | 0.0 | 4 | 70 |
5 rows × 47 columns
Below are just a few of the comments in this group. You can view more comments by changing the argument to the ‘head’ function.
pd.set_option('display.max_colwidth', None)
df.comment_text.head(3)
0 No sympathy for these two knuckleheads.
1 Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.
2 Or.... maybe there IS chaos because the "president" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.
Name: comment_text, dtype: object
Let’s remove unnecessary words and clean the statements in preparation for topic analysis.
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\dudas\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
# cleaning master function
def clean_tweet(tweet, bigrams=False):
    tweet = tweet.lower() # lower case
    tweet = re.sub('[' + my_punctuation + ']+', ' ', tweet) # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet) # remove double spacing
    tweet = re.sub(r'([0-9]+)', '', tweet) # remove numbers
    tweet_token_list = [word for word in tweet.split(' ')
                        if word not in my_stopwords] # remove stopwords
    tweet_token_list = [word_rooter(word) if '#' not in word else word
                        for word in tweet_token_list] # apply word rooter
    if bigrams:
        tweet_token_list = tweet_token_list + [tweet_token_list[i] + '_' + tweet_token_list[i+1]
                                               for i in range(len(tweet_token_list)-1)]
    tweet = ' '.join(tweet_token_list)
    return tweet
df['clean_tweet'] = df.comment_text.apply(clean_tweet)
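To see what this pipeline does to an individual comment, here is a simplified, self-contained sketch of the same steps. It swaps NLTK's stopword list and Porter stemmer for a tiny inline stopword set (so it runs without any downloads); the regex steps mirror the function above.

```python
import re

# tiny stand-in stopword set; the real code uses nltk's English list
stopwords = {"the", "a", "is", "for", "these"}
punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

def clean_text(text):
    text = text.lower()                                            # lower case
    text = re.sub('[' + re.escape(punctuation) + ']+', ' ', text)  # strip punctuation
    text = re.sub(r'\s+', ' ', text)                               # collapse whitespace
    text = re.sub(r'[0-9]+', '', text)                             # remove numbers
    tokens = [w for w in text.split(' ') if w and w not in stopwords]
    return ' '.join(tokens)

print(clean_text("No sympathy for these 2 knuckleheads."))
# → 'no sympathy knuckleheads'
```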
We will be converting the statements to a vector format for the machine to understand.
from sklearn.feature_extraction.text import CountVectorizer
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.99, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')
# apply transformation
tf = vectorizer.fit_transform(df['clean_tweet']).toarray()
# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()
For the current analysis, let’s have the model extract ten topics from the dataset (you can experiment with the number of topics).
number_of_topics = 10
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
Here, the model performs the topic modeling analysis. (This might take a little while.)
model.fit(tf)
LatentDirichletAllocation(random_state=0)
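What fitting actually produces can be seen on a small synthetic term-frequency matrix: components_ holds one (unnormalized) word-weight vector per topic, while fit_transform returns a topic distribution for each document. A minimal sketch:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# synthetic term-frequency matrix: 6 documents x 4 vocabulary words
tf_toy = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 2],
    [1, 0, 2, 0],
    [0, 1, 0, 3],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tf_toy)

print(lda.components_.shape)  # (2, 4): one word-weight row per topic
print(doc_topics.shape)       # (6, 2): one topic mixture per document
print(doc_topics[0].sum())    # each row of doc_topics sums to 1
```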
# Function to display the topics generated.
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)] = ['{}'.format(feature_names[i])
                                                      for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)] = ['{:.1f}'.format(topic[i])
                                                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
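The slicing trick inside the function is worth a brief look: argsort returns indices in ascending order of weight, and the slice [:-no_top_words - 1:-1] walks backwards over the last no_top_words entries, i.e. the largest weights in descending order. A tiny illustration:

```python
import numpy as np

topic = np.array([0.1, 5.0, 2.0, 0.3])
no_top_words = 2
# argsort gives [0, 3, 2, 1]; the reversed slice keeps the top 2
top = topic.argsort()[:-no_top_words - 1:-1]
print(top)  # [1 2]: indices of the two largest weights
```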
Each pair of columns in the table shows the most important words in one topic, along with their weights. With further analysis, we can understand the behavior of the dataset and the type of conversations that occur in it.
Try changing the no_top_words variable to show more or fewer words in each topic.
no_top_words = 15
display_topics(model, tf_feature_names, no_top_words)
Topic 0 words | Topic 0 weights | Topic 1 words | Topic 1 weights | Topic 2 words | Topic 2 weights | Topic 3 words | Topic 3 weights | Topic 4 words | Topic 4 weights | Topic 5 words | Topic 5 weights | Topic 6 words | Topic 6 weights | Topic 7 words | Topic 7 weights | Topic 8 words | Topic 8 weights | Topic 9 words | Topic 9 weights | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | trump | 3452.3 | mental | 3351.9 | canada | 591.5 | mental | 1186.5 | gun | 1385.3 | school | 840.5 | mental | 1058.1 | white | 1220.1 | mental | 1836.1 | god | 954.9 |
1 | presid | 1031.5 | ill | 1993.1 | muslim | 582.0 | peopl | 708.3 | mental | 1156.3 | kid | 723.0 | comment | 848.3 | peopl | 1076.2 | peopl | 1793.0 | one | 934.0 |
2 | vote | 813.8 | health | 1213.7 | countri | 539.3 | drug | 555.8 | peopl | 981.1 | year | 590.5 | like | 678.6 | black | 651.0 | health | 1464.6 | women | 905.2 |
3 | like | 780.9 | medic | 706.8 | us | 519.8 | ill | 538.9 | law | 844.9 | go | 514.7 | would | 668.2 | disord | 537.1 | homeless | 1367.5 | life | 830.1 |
4 | elect | 579.5 | http | 630.5 | world | 490.3 | health | 497.7 | kill | 699.6 | time | 507.9 | think | 650.4 | person | 529.5 | care | 1296.8 | peopl | 798.2 |
5 | republican | 550.2 | help | 527.0 | canadian | 462.9 | problem | 491.9 | polic | 683.4 | one | 500.8 | person | 629.7 | ” | 486.9 | need | 1169.6 | would | 740.9 |
6 | obama | 542.2 | need | 504.5 | islam | 448.3 | use | 476.3 | ill | 674.7 | get | 458.6 | say | 608.7 | right | 484.1 | work | 956.9 | church | 686.1 |
7 | peopl | 540.2 | www | 467.5 | right | 409.8 | issu | 456.4 | one | 526.8 | student | 385.0 | know | 607.6 | gender | 454.1 | ill | 903.6 | like | 656.3 |
8 | man | 499.8 | treatment | 453.5 | like | 400.1 | caus | 444.5 | would | 521.2 | work | 373.3 | issu | 572.7 | women | 446.7 | get | 825.9 | men | 634.9 |
9 | democrat | 488.0 | com | 415.4 | govern | 385.2 | suicid | 405.2 | get | 519.6 | need | 370.2 | ill | 569.4 | ’ | 443.7 | help | 795.8 | children | 619.3 |
10 | get | 475.6 | get | 362.2 | peopl | 378.4 | depress | 381.5 | person | 451.4 | would | 369.8 | make | 567.4 | mental | 441.8 | money | 740.8 | know | 613.1 |
11 | white | 453.6 | doctor | 342.3 | christian | 373.7 | gun | 369.9 | right | 422.9 | like | 340.8 | one | 553.2 | group | 415.7 | pay | 702.1 | live | 592.2 |
12 | think | 435.9 | patient | 320.2 | liber | 358.4 | violenc | 320.6 | go | 364.9 | cathol | 337.7 | peopl | 513.0 | hate | 392.3 | problem | 662.4 | child | 582.4 |
13 | support | 432.4 | year | 302.0 | one | 333.1 | alcohol | 310.5 | could | 349.7 | educ | 329.4 | use | 494.8 | male | 363.5 | mani | 660.3 | woman | 581.8 |
14 | donald | 410.1 | disord | 297.7 | blind | 329.4 | addict | 278.2 | guy | 347.4 | church | 309.2 | someon | 482.5 | one | 360.8 | would | 660.1 | love | 553.7 |
We can examine the generated topics to understand the most common conversations in this dataset. We will look at this content in detail in the following section.