Visualizing and Analyzing Jigsaw¶
import pandas as pd
import re
import numpy as np
In the previous section, we explored how to generate topics from a textual dataset using LDA. But how can these results be put to use?
In this section, we will look at ways to read and interpret the topics, and understand how they can be applied.
We will now import the preloaded LDA results that were generated in the previous section.
df = pd.read_csv("https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/topics.csv")
df.head()
|   | Unnamed: 0 | Topic 0 words | Topic 0 weights | Topic 1 words | Topic 1 weights | Topic 2 words | Topic 2 weights | Topic 3 words | Topic 3 weights | Topic 4 words | ... | Topic 5 words | Topic 5 weights | Topic 6 words | Topic 6 weights | Topic 7 words | Topic 7 weights | Topic 8 words | Topic 8 weights | Topic 9 words | Topic 9 weights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | trump | 3452.3 | mental | 3351.9 | canada | 591.5 | mental | 1186.5 | gun | ... | school | 840.5 | mental | 1058.1 | white | 1220.1 | mental | 1836.1 | god | 954.9 |
| 1 | 1 | presid | 1031.5 | ill | 1993.1 | muslim | 582.0 | peopl | 708.3 | mental | ... | kid | 723.0 | comment | 848.3 | peopl | 1076.2 | peopl | 1793.0 | one | 934.0 |
| 2 | 2 | vote | 813.8 | health | 1213.7 | countri | 539.3 | drug | 555.8 | peopl | ... | year | 590.5 | like | 678.6 | black | 651.0 | health | 1464.6 | women | 905.2 |
| 3 | 3 | like | 780.9 | medic | 706.8 | us | 519.8 | ill | 538.9 | law | ... | go | 514.7 | would | 668.2 | disord | 537.1 | homeless | 1367.5 | life | 830.1 |
| 4 | 4 | elect | 579.5 | http | 630.5 | world | 490.3 | health | 497.7 | kill | ... | time | 507.9 | think | 650.4 | person | 529.5 | care | 1296.8 | peopl | 798.2 |

5 rows × 21 columns
We will visualize these results to understand the major themes present in them.
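Before turning to the interactive visualization below, here is a minimal sketch of how the topic words and weights could also be plotted locally with matplotlib. It assumes the column layout shown above (e.g. "Topic 0 words" and "Topic 0 weights").

import matplotlib.pyplot as plt

def plot_topic(df, topic):
    # Horizontal bar chart of the top words for one topic, heaviest word on top
    words = df[f"Topic {topic} words"]
    weights = df[f"Topic {topic} weights"]
    plt.barh(words[::-1], weights[::-1])
    plt.title(f"Topic {topic}: top words by weight")
    plt.xlabel("LDA word weight")
    plt.tight_layout()
    plt.show()

plot_topic(df, 0)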
%%html
<iframe src='https://flo.uri.sh/story/941631/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941631/?utm_source=embed&utm_campaign=story/941631' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
An Overview of the Analysis¶
From the above visualization, one anomaly stands out: the dataset we are examining is supposed to relate to people with physical, mental, and learning disabilities, yet the extracted topics contain only a small subset of words related to that subject. Topic 2 has words that address the themes we expected the dataset to cover, but the dominant theme across the top 5 topics is political. (The top 10 topics also show themes related to religion, which is quite interesting.) LDA thus helped us understand what conversations the dataset actually consists of.
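As a quick check of this observation, the topic table itself can be scanned for disability-related terms. The keyword list below is purely illustrative, and the words are stemmed to match the LDA output above.

# Illustrative (stemmed) keywords; not an exhaustive list of disability-related terms
keywords = {"mental", "ill", "health", "disabl", "disord"}
for topic in range(10):
    top_words = set(df[f"Topic {topic} words"].dropna())
    overlap = top_words & keywords
    print(f"Topic {topic}: {sorted(overlap) if overlap else 'no disability-related terms'}")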
From the word collection, we also notice certain words, such as ‘kill’, that can be categorized as ‘toxic’. To analyze this further, we can classify each word with an NLP toxicity classifier.
To demonstrate an example of a toxicity analysis framework, the code below shows the Unitary library in Python in action. [Hanu and Unitary team, 2020]
This library provides a toxicity score (on a scale from 0 to 1) for the sentence that is passed to it.
To access this model, you will need an API key for the Hugging Face Inference API; the model is hosted at https://huggingface.co/unitary/toxic-bert. Here is an example of what this would look like.
import requests

# Hugging Face Inference API endpoint for the unitary/toxic-bert model
API_URL = "https://api-inference.huggingface.co/models/unitary/toxic-bert"
# Replace the placeholder with your own Hugging Face API key
headers = {"Authorization": "Bearer api_XXXXXXXXXXXXXXXXXXXXXXXXXXX"}

def query(payload):
    # Send the text to the hosted model and return the list of label/score pairs
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
query({"inputs": "addict"})
[[{'label': 'toxic', 'score': 0.9272779822349548},
{'label': 'severe_toxic', 'score': 0.00169223896227777},
{'label': 'obscene', 'score': 0.03694247826933861},
{'label': 'threat', 'score': 0.0017220545560121536},
{'label': 'insult', 'score': 0.02829463966190815},
{'label': 'identity_hate', 'score': 0.004070617724210024}]]
You can replace <insert word here> in the code below with any word or sentence to see the results it generates.
This example gives an idea of how machine learning can be used for toxicity analysis.
query({"inputs": "<insert word here>"})
[[{'label': 'toxic', 'score': 0.5101907849311829},
{'label': 'severe_toxic', 'score': 0.07646821439266205},
{'label': 'obscene', 'score': 0.12113521993160248},
{'label': 'threat', 'score': 0.07763686031103134},
{'label': 'insult', 'score': 0.11923719942569733},
{'label': 'identity_hate', 'score': 0.09533172845840454}]]
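Building on the query helper above, here is a small sketch of how the top LDA words could be scored in a loop, keeping only the ‘toxic’ label. It assumes the list-of-lists response format shown in the outputs above.

def toxic_scores(words):
    # Score each word and keep only the "toxic" label from the model's response
    scores = {}
    for word in words:
        result = query({"inputs": word})
        labels = {item["label"]: item["score"] for item in result[0]}
        scores[word] = labels.get("toxic")
    return scores

# Example: score the top words of Topic 0 from the LDA results
toxic_scores(df["Topic 0 words"].head().tolist())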
%%html
<iframe src='https://flo.uri.sh/story/941681/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941681/?utm_source=embed&utm_campaign=story/941681' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
The Bias¶
The visualization shows how contextually toxic words surface as important words within various topics in this dataset. Any Natural Language Processing model trained on this dataset can learn these associations and produce a skewed analysis of the population in question, i.e., people with mental, physical, and learning disabilities. This can lead to very discriminatory classifications.
An Example¶
To better illustrate the impact, we take the words most strongly associated with the word ‘mental’ from the results. Below is a network graph of these commonly associated words. Words such as ‘kill’ and ‘gun’ appear with the closest association, which can lead a machine to contextualize the word ‘mental’ alongside such words.
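Conceptually, such associations can be approximated by counting which words co-occur with ‘mental’ in the same comment. The sketch below assumes a list called documents holding the cleaned, tokenized comments (not loaded in this section), so it is only illustrative.

from collections import Counter

def co_occurrences(documents, target="mental", top_n=10):
    # Count how often other tokens appear in the same comment as the target word
    counts = Counter()
    for tokens in documents:
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts.most_common(top_n)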
%%html
<iframe src='https://flo.uri.sh/visualisation/6867000/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6867000/?utm_source=embed&utm_campaign=visualisation/6867000' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
It is hence essential to be aware of the dataset that is being used to analyze a specific population. With LDA, we were able to see that this dataset is not a good representation of the disabled community. To bring about a movement toward unbiased AI, we need to perform this kind of preliminary analysis, and more, so as not to cause unintended discrimination.
The Dashboard¶
Below is the complete data visualization dashboard of the topic analysis. Feel free to experiment and compare various labels to your liking.
%%html
<iframe src='https://flo.uri.sh/visualisation/6856937/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6856937/?utm_source=embed&utm_campaign=visualisation/6856937' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
Thank you!¶
We thank you for your time!