Jigsaw - an Implementation of LDA

To further understand the importance of topic modeling, we will look at a public dataset: the Jigsaw Unintended Bias dataset [Jigsaw and Google, 2019].

This dataset consists of roughly two million public comments from the Civil Comments [Bogdanoff, 2017] platform, released so that researchers could study and improve civility in online conversations. Civil Comments was a social media platform that used peer-reviewed submission: commenters rated the civility of other comments before their own were, in turn, rated by others.

These comments were then annotated by human raters for various toxic conversational attributes. Additional labels with sociodemographic identifiers were included to help models understand bias in context. Here, we will use this dataset to examine potential bias in conversations about people with physical, mental, and learning disabilities.

We have filtered for all comments that have a value for any of the parameters ‘Intellectual or Learning Disability’, ‘Psychiatric or Mental Illness’, ‘Physical Disability’, and ‘Other Disability’, leaving 18,665 statements in our corpus. We can assume that these statements are most likely about people with disabilities, but with the help of topic modeling, we can confirm this.
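The exact filtering code used to build the PWD.csv file is not shown here, but one plausible sketch, assuming the annotation columns `physical_disability`, `intellectual_or_learning_disability`, `psychiatric_or_mental_illness`, and `other_disability` (with NaN meaning no value was provided), would keep any row where at least one of them is annotated:

```python
import pandas as pd

# Toy frame mimicking the four disability-related annotation columns;
# NaN means the comment was never annotated for that identity.
df = pd.DataFrame({
    "comment_text": ["comment a", "comment b", "comment c"],
    "physical_disability": [0.25, None, None],
    "intellectual_or_learning_disability": [None, None, 0.0],
    "psychiatric_or_mental_illness": [None, None, None],
    "other_disability": [None, None, None],
})

disability_cols = ["physical_disability", "intellectual_or_learning_disability",
                   "psychiatric_or_mental_illness", "other_disability"]

# Keep rows where at least one of the four columns has a value
filtered = df[df[disability_cols].notna().any(axis=1)]
print(len(filtered))  # comments a and c survive
```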

Warning

Some comments in the dataset may have explicit language.

We will import the necessary libraries here. Two of them, nltk and scikit-learn, will also need to be installed.

%%capture
!pip install nltk
%%capture
!pip install scikit-learn
import pandas as pd
import nltk
import re
from sklearn.decomposition import LatentDirichletAllocation

We will be reading the data into a data frame for easy analysis.

df = pd.read_csv(r'https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/PWD.csv')

This is what the data looks like:

df.head()
Unnamed: 0 id comment_text split created_date publication_id parent_id article_id rating funny ... white asian latino other_race_or_ethnicity physical_disability intellectual_or_learning_disability psychiatric_or_mental_illness other_disability identity_annotator_count toxicity_annotator_count
0 7705 6216834 No sympathy for these two knuckleheads. train 2017-10-25 00:52:00.913992+00 21 NaN 392998 approved 1 ... 0.0 0.0 0.0 0.0 0.25 0.0 0.0 0.0 4 58
1 8073 5625069 Wow!\nYour progressive psychosis has become ex... train 2017-07-20 03:30:15.579733+00 54 5624305.0 357183 rejected 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 10
2 8115 5690713 Or.... maybe there IS chaos because the "presi... train 2017-07-31 17:02:58.167475+00 102 5690153.0 361265 approved 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 62
3 8125 470493 I'll take someone who's physically ill over on... train 2016-09-12 02:41:50.084427+00 21 NaN 145747 approved 0 ... 0.0 0.0 0.0 0.0 0.75 0.0 1.0 0.0 4 68
4 8263 941207 Mental Illness at work again, again, again, ag... train 2017-02-02 22:38:09.291374+00 13 NaN 165832 rejected 0 ... 0.0 0.0 0.0 0.0 0.00 0.0 1.0 0.0 4 70

5 rows × 47 columns

Below are just a few of the comments in this group. You can view more comments by changing the parameter in the ‘head’ function.

pd.set_option('display.max_colwidth', None)
df.comment_text.head(3)
0                                                                                          No sympathy for these two knuckleheads.
1                                        Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.
2    Or.... maybe there IS chaos because the "president" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.
Name: comment_text, dtype: object

Let’s try removing unnecessary words and cleaning the statements for analysis of topics.

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dudas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# cleaning master function
def clean_tweet(tweet, bigrams=False):
    tweet = tweet.lower() # lower case
    tweet = re.sub('['+my_punctuation + ']+', ' ', tweet) # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet) # remove double spacing
    tweet = re.sub(r'[0-9]+', '', tweet) # remove numbers
    tweet_token_list = [word for word in tweet.split(' ')
                            if word not in my_stopwords] # remove stopwords

    tweet_token_list = [word_rooter(word) if '#' not in word else word
                        for word in tweet_token_list] # apply word rooter
    if bigrams:
        tweet_token_list = tweet_token_list+[tweet_token_list[i]+'_'+tweet_token_list[i+1]
                                            for i in range(len(tweet_token_list)-1)]
    tweet = ' '.join(tweet_token_list)
    return tweet


df['clean_tweet'] = df.comment_text.apply(clean_tweet)
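To see what this cleaning pipeline does, here is a stripped-down, self-contained sketch of the same steps (lower-casing, punctuation stripping, number removal, stopword filtering) using a tiny hard-coded stopword list in place of NLTK's, and omitting stemming:

```python
import re

# Tiny illustrative stopword subset (NLTK's English list is much longer)
stopwords = {"the", "a", "is", "are", "in", "for"}
punct = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'

def clean(text):
    text = text.lower()                                      # lower case
    text = re.sub('[' + re.escape(punct) + ']+', ' ', text)  # strip punctuation
    text = re.sub(r'\s+', ' ', text)                         # collapse spacing
    text = re.sub(r'[0-9]+', '', text)                       # remove numbers
    return ' '.join(w for w in text.split(' ')
                    if w and w not in stopwords)             # drop stopwords

print(clean("The Senate voted 51-49 in 2018!"))  # -> "senate voted"
```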

We will be converting the statements to a vector format for the machine to understand.

from sklearn.feature_extraction.text import CountVectorizer

# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.99, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(df['clean_tweet']).toarray()

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out() # use get_feature_names() on scikit-learn < 1.0

For the current analysis, let’s have the model extract ten topics from the dataset. (You can experiment with the number of topics.)

number_of_topics = 10

model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
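Before fitting on the full corpus, it can help to see the shapes LDA works with on a toy term-count matrix: after fitting, `model.components_` holds one row of (unnormalized) word weights per topic, and `transform()` gives each document's topic mixture, which sums to 1 per document:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy term-count matrix: 3 documents x 4 vocabulary terms
tf_toy = np.array([[2, 0, 1, 0],
                   [0, 3, 0, 1],
                   [1, 1, 1, 1]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tf_toy)

print(lda.components_.shape)  # (n_topics, n_terms) -> (2, 4)
print(doc_topics.shape)       # (n_documents, n_topics) -> (3, 2)
```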

Here, the model performs the topic modeling analysis. (This might take a little while.)

model.fit(tf)
LatentDirichletAllocation(random_state=0)
#Function to display the topics generated.
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
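The slicing expression `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is worth unpacking: `argsort()` returns indices in ascending order of weight, and the reversed slice takes the last `no_top_words` of them, i.e. the indices of the largest weights in descending order. A small example with `no_top_words = 2`:

```python
import numpy as np

weights = np.array([0.1, 3.0, 0.5, 2.0])

# Indices in ascending order of weight: [0, 2, 3, 1]
ascending = weights.argsort()

# Reversed slice keeps the top 2 indices, largest weight first
no_top_words = 2
top = weights.argsort()[:-no_top_words - 1:-1]
print(top)  # [1 3] -> weights 3.0 and 2.0
```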

The output below lists, for each topic, its most important words along with their weights. With further analysis, we can understand the behavior of the dataset and the types of conversations that occur in it.

Try changing the no_top_words variable to show more or fewer words in each topic.

no_top_words = 15
display_topics(model, tf_feature_names, no_top_words)
Topic 0: trump (3452.3), presid (1031.5), vote (813.8), like (780.9), elect (579.5), republican (550.2), obama (542.2), peopl (540.2), man (499.8), democrat (488.0), get (475.6), white (453.6), think (435.9), support (432.4), donald (410.1)
Topic 1: mental (3351.9), ill (1993.1), health (1213.7), medic (706.8), http (630.5), help (527.0), need (504.5), www (467.5), treatment (453.5), com (415.4), get (362.2), doctor (342.3), patient (320.2), year (302.0), disord (297.7)
Topic 2: canada (591.5), muslim (582.0), countri (539.3), us (519.8), world (490.3), canadian (462.9), islam (448.3), right (409.8), like (400.1), govern (385.2), peopl (378.4), christian (373.7), liber (358.4), one (333.1), blind (329.4)
Topic 3: mental (1186.5), peopl (708.3), drug (555.8), ill (538.9), health (497.7), problem (491.9), use (476.3), issu (456.4), caus (444.5), suicid (405.2), depress (381.5), gun (369.9), violenc (320.6), alcohol (310.5), addict (278.2)
Topic 4: gun (1385.3), mental (1156.3), peopl (981.1), law (844.9), kill (699.6), polic (683.4), ill (674.7), one (526.8), would (521.2), get (519.6), person (451.4), right (422.9), go (364.9), could (349.7), guy (347.4)
Topic 5: school (840.5), kid (723.0), year (590.5), go (514.7), time (507.9), one (500.8), get (458.6), student (385.0), work (373.3), need (370.2), would (369.8), like (340.8), cathol (337.7), educ (329.4), church (309.2)
Topic 6: mental (1058.1), comment (848.3), like (678.6), would (668.2), think (650.4), person (629.7), say (608.7), know (607.6), issu (572.7), ill (569.4), make (567.4), one (553.2), peopl (513.0), use (494.8), someon (482.5)
Topic 7: white (1220.1), peopl (1076.2), black (651.0), disord (537.1), person (529.5), … (486.9), right (484.1), gender (454.1), women (446.7), … (443.7), mental (441.8), group (415.7), hate (392.3), male (363.5), one (360.8)
Topic 8: mental (1836.1), peopl (1793.0), health (1464.6), homeless (1367.5), care (1296.8), need (1169.6), work (956.9), ill (903.6), get (825.9), help (795.8), money (740.8), pay (702.1), problem (662.4), mani (660.3), would (660.1)
Topic 9: god (954.9), one (934.0), women (905.2), life (830.1), peopl (798.2), would (740.9), church (686.1), like (656.3), men (634.9), children (619.3), know (613.1), live (592.2), child (582.4), woman (581.8), love (553.7)

We can go through the topics obtained to understand the most common conversations that occur in this dataset. We will look into this content in detail in the following section.

Now to the analysis!