Visualizing and Analyzing Jigsaw

import pandas as pd
import re
import numpy as np

In the previous section, we explored how to generate topics from a textual dataset using LDA. But how can these topics be used in practice?

In this section, we look at ways to read the topics and at how they can be applied.

We will now load the saved LDA results that were produced in the previous section.

df = pd.read_csv("https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/topics.csv")
df.head()
Unnamed: 0 Topic 0 words Topic 0 weights Topic 1 words Topic 1 weights Topic 2 words Topic 2 weights Topic 3 words Topic 3 weights Topic 4 words ... Topic 5 words Topic 5 weights Topic 6 words Topic 6 weights Topic 7 words Topic 7 weights Topic 8 words Topic 8 weights Topic 9 words Topic 9 weights
0 0 trump 3452.3 mental 3351.9 canada 591.5 mental 1186.5 gun ... school 840.5 mental 1058.1 white 1220.1 mental 1836.1 god 954.9
1 1 presid 1031.5 ill 1993.1 muslim 582.0 peopl 708.3 mental ... kid 723.0 comment 848.3 peopl 1076.2 peopl 1793.0 one 934.0
2 2 vote 813.8 health 1213.7 countri 539.3 drug 555.8 peopl ... year 590.5 like 678.6 black 651.0 health 1464.6 women 905.2
3 3 like 780.9 medic 706.8 us 519.8 ill 538.9 law ... go 514.7 would 668.2 disord 537.1 homeless 1367.5 life 830.1
4 4 elect 579.5 http 630.5 world 490.3 health 497.7 kill ... time 507.9 think 650.4 person 529.5 care 1296.8 peopl 798.2

5 rows × 21 columns
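The wide layout above (one "words" column and one "weights" column per topic) can be awkward to sort or filter. As a minimal sketch, the pairs can be stacked into a long table with one row per topic word. The small `sample` frame below is a stand-in built from values in the preview, and the names `to_long`, `long_df`, and `top_words` are hypothetical helpers, not part of the original analysis.

```python
import pandas as pd

# Stand-in for the first rows of topics.csv (values copied from the preview above);
# the real file has ten topics and more rows per topic.
sample = pd.DataFrame({
    "Topic 0 words": ["trump", "presid", "vote"],
    "Topic 0 weights": [3452.3, 1031.5, 813.8],
    "Topic 1 words": ["mental", "ill", "health"],
    "Topic 1 weights": [3351.9, 1993.1, 1213.7],
})

def to_long(wide):
    """Stack 'Topic N words' / 'Topic N weights' column pairs into one long table."""
    frames = []
    for col in wide.columns:
        if not col.endswith(" words"):
            continue
        topic = col[: -len(" words")]  # e.g. "Topic 0"
        frames.append(pd.DataFrame({
            "topic": topic,
            "word": wide[col],
            "weight": wide[f"{topic} weights"],
        }))
    return pd.concat(frames, ignore_index=True)

long_df = to_long(sample)

# Two highest-weighted words of each topic.
top_words = (long_df.sort_values("weight", ascending=False)
                    .groupby("topic")
                    .head(2))
print(top_words)
```

Assuming the column names match the preview, the same function should also work on the full `df` loaded above, since it only looks at columns ending in " words".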

We will visualize these results to understand what major themes are present in them.

%%html

<iframe src='https://flo.uri.sh/story/941631/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941631/?utm_source=embed&utm_campaign=story/941631' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>

An Overview of the analysis

From the above visualization, one anomaly stands out: the dataset we are examining is supposed to be about people with physical, mental, and learning disabilities, yet only a small subset of the extracted topic words relate to that subject. Topic 2 has words that address the themes we expected the dataset to contain. But the major theme in the Top 5 topics is political. (The Top 10 topics show themes related to religion as well, which is quite interesting.) LDA thus helped us understand what conversations the dataset actually contains.

From the word collection, we also notice certain words, such as ‘kill’, that can be categorized as ‘toxic’. To analyze this further, we can score each word with an NLP classifier.

To demonstrate an example of a toxicity analysis framework, the code below shows the toxic-bert model from Unitary at work in Python. [Hanu and Unitary team, 2020]

The model returns a score between 0 and 1 for each of six labels (toxic, severe_toxic, obscene, threat, insult, identity_hate) for the text that is passed to it.

To get access to this model, you will need to get an API key at https://huggingface.co/unitary/toxic-bert. Here is an example of what this would look like.

import requests

API_URL = "https://api-inference.huggingface.co/models/unitary/toxic-bert"
# Replace the placeholder with your own API key.
headers = {"Authorization": "Bearer api_XXXXXXXXXXXXXXXXXXXXXXXXXXX"}

def query(payload):
    """POST the payload to the hosted model and return the parsed JSON response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
query({"inputs": "addict"})
[[{'label': 'toxic', 'score': 0.9272779822349548},
  {'label': 'severe_toxic', 'score': 0.00169223896227777},
  {'label': 'obscene', 'score': 0.03694247826933861},
  {'label': 'threat', 'score': 0.0017220545560121536},
  {'label': 'insult', 'score': 0.02829463966190815},
  {'label': 'identity_hate', 'score': 0.004070617724210024}]]

Replace <insert word here> in the code with any word or sentence to see the scores generated for it.

This example gives an idea of how ML can be used for toxicity analysis.

query({"inputs": "<insert word here>"})
[[{'label': 'toxic', 'score': 0.5101907849311829},
  {'label': 'severe_toxic', 'score': 0.07646821439266205},
  {'label': 'obscene', 'score': 0.12113521993160248},
  {'label': 'threat', 'score': 0.07763686031103134},
  {'label': 'insult', 'score': 0.11923719942569733},
  {'label': 'identity_hate', 'score': 0.09533172845840454}]]
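The nested list returned by `query` is easier to work with once it is flattened. The sketch below assumes the response has the shape shown in the outputs above (a list holding one list of label/score dictionaries). The helper names `label_scores` and `is_flagged`, and the 0.5 threshold, are illustrative assumptions and not part of the Unitary model or the Hugging Face API.

```python
def label_scores(response):
    """Flatten the response for a single input into a {label: score} dict."""
    return {item["label"]: item["score"] for item in response[0]}

def is_flagged(response, label="toxic", threshold=0.5):
    """Return True when the chosen label's score reaches the (assumed) threshold."""
    return label_scores(response).get(label, 0.0) >= threshold

# Response copied from the "addict" query shown earlier.
addict_response = [[
    {'label': 'toxic', 'score': 0.9272779822349548},
    {'label': 'severe_toxic', 'score': 0.00169223896227777},
    {'label': 'obscene', 'score': 0.03694247826933861},
    {'label': 'threat', 'score': 0.0017220545560121536},
    {'label': 'insult', 'score': 0.02829463966190815},
    {'label': 'identity_hate', 'score': 0.004070617724210024},
]]

scores = label_scores(addict_response)
top_label = max(scores, key=scores.get)  # label with the highest score
print(top_label, scores[top_label])
print(is_flagged(addict_response))
```

Because the helpers only read the parsed JSON, they can be applied to the result of `query({"inputs": word})` for each topic word in turn. The threshold is a judgment call and would need tuning for any real use.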
%%html

<iframe src='https://flo.uri.sh/story/941681/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941681/?utm_source=embed&utm_campaign=story/941681' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>

The Bias

The visualization shows how contextually toxic words emerge as important words within various topics of this dataset. Any Natural Language Processing model trained on this dataset may absorb these associations and produce skewed analysis for the population under consideration, i.e., people with mental, physical, and learning disabilities. This can lead to highly discriminatory classifications.

An Example

To illustrate the impact better, we take the words most closely associated with the word ‘mental’ in the results. Below is a network graph that shows the commonly associated words. Words such as ‘kill’ and ‘gun’ appear with the closest association. A model trained on this data may therefore learn to associate the word ‘mental’ with such words.

%%html
<iframe src='https://flo.uri.sh/visualisation/6867000/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6867000/?utm_source=embed&utm_campaign=visualisation/6867000' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
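One simple way to approximate this kind of association is to count how many topics each word shares with ‘mental’. The sketch below is an assumption about how such edges could be derived and is not necessarily how the graph above was built. The word lists are copied from the `df.head()` preview for a subset of topics, and `shared_topic_counts` is a hypothetical helper.

```python
from collections import Counter

# Top-five words per topic, copied from the df.head() preview.
# Only a subset of topics is listed; the real table has ten topics.
topic_words = {
    "Topic 0": ["trump", "presid", "vote", "like", "elect"],
    "Topic 1": ["mental", "ill", "health", "medic", "http"],
    "Topic 3": ["mental", "peopl", "drug", "ill", "health"],
    "Topic 4": ["gun", "mental", "peopl", "law", "kill"],
    "Topic 8": ["mental", "peopl", "health", "homeless", "care"],
}

def shared_topic_counts(topics, target):
    """Count, for every other word, how many topics it shares with `target`."""
    counts = Counter()
    for words in topics.values():
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

neighbours = shared_topic_counts(topic_words, "mental")

# Each (word, count) pair could serve as a weighted edge to "mental" in a graph.
for word, count in neighbours.most_common():
    print(f"mental -- {word}: {count}")
```

Even in this small subset, ‘gun’ and ‘kill’ show up as neighbours of ‘mental’ through Topic 4. A fuller version would likely use the topic weights rather than plain counts.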

It is therefore essential to understand the dataset being used to analyze a specific population. With LDA, we were able to see that this dataset is not a good representation of the disabled community. To move toward unbiased AI, we need to perform this kind of preliminary analysis, and more, so as not to cause unintended discrimination.

The Dashboard

Below is the complete data visualization dashboard of the topic analysis. Feel free to experiment and compare various labels to your liking.

%%html

<iframe src='https://flo.uri.sh/visualisation/6856937/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6856937/?utm_source=embed&utm_campaign=visualisation/6856937' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>

Thank you!

We thank you for your time!