Visualizing and Analyzing Jigsaw¶
import pandas as pd
import re
import numpy as np
In the previous section, we explored how to generate topics from a textual dataset using LDA. But how can these results be put to use?
In this section, we will look at ways to read and interpret the topics, and understand how they can be applied.
We will now import the preloaded LDA results that were generated in the previous section.
df = pd.read_csv("https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/topics.csv")
df.head()
|   | Unnamed: 0 | Topic 0 words | Topic 0 weights | Topic 1 words | Topic 1 weights | Topic 2 words | Topic 2 weights | Topic 3 words | Topic 3 weights | Topic 4 words | ... | Topic 5 words | Topic 5 weights | Topic 6 words | Topic 6 weights | Topic 7 words | Topic 7 weights | Topic 8 words | Topic 8 weights | Topic 9 words | Topic 9 weights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | trump | 3452.3 | mental | 3351.9 | canada | 591.5 | mental | 1186.5 | gun | ... | school | 840.5 | mental | 1058.1 | white | 1220.1 | mental | 1836.1 | god | 954.9 |
| 1 | 1 | presid | 1031.5 | ill | 1993.1 | muslim | 582.0 | peopl | 708.3 | mental | ... | kid | 723.0 | comment | 848.3 | peopl | 1076.2 | peopl | 1793.0 | one | 934.0 |
| 2 | 2 | vote | 813.8 | health | 1213.7 | countri | 539.3 | drug | 555.8 | peopl | ... | year | 590.5 | like | 678.6 | black | 651.0 | health | 1464.6 | women | 905.2 |
| 3 | 3 | like | 780.9 | medic | 706.8 | us | 519.8 | ill | 538.9 | law | ... | go | 514.7 | would | 668.2 | disord | 537.1 | homeless | 1367.5 | life | 830.1 |
| 4 | 4 | elect | 579.5 | http | 630.5 | world | 490.3 | health | 497.7 | kill | ... | time | 507.9 | think | 650.4 | person | 529.5 | care | 1296.8 | peopl | 798.2 |

5 rows × 21 columns
We will visualize these results to understand the major themes present in them.
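Before turning to the interactive visualization below, here is a minimal sketch of how the topic words and weights could also be plotted locally with matplotlib. It assumes the column layout shown above (e.g. "Topic 0 words" and "Topic 0 weights").

import matplotlib.pyplot as plt

def plot_topic(df, topic):
    # Horizontal bar chart of the top words for one topic, heaviest word on top
    words = df[f"Topic {topic} words"]
    weights = df[f"Topic {topic} weights"]
    plt.barh(words[::-1], weights[::-1])
    plt.title(f"Topic {topic}: top words by weight")
    plt.xlabel("LDA word weight")
    plt.tight_layout()
    plt.show()

plot_topic(df, 0)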
%%html
<iframe src='https://flo.uri.sh/story/941631/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941631/?utm_source=embed&utm_campaign=story/941631' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
An Overview of the Analysis¶
From the above visualization, one anomaly stands out: the dataset we are examining is supposed to relate to people with physical, mental, and learning disabilities, yet the extracted topics contain only a small subset of words related to that subject. Topic 2 has words that address the themes we expected the dataset to cover, but the dominant theme across the top 5 topics is political. (The top 10 topics also show themes related to religion, which is quite interesting.) LDA thus helped us understand what conversations the dataset actually consists of.
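As a quick check of this observation, the topic table itself can be scanned for disability-related terms. The keyword list below is purely illustrative, and the words are stemmed to match the LDA output above.

# Illustrative (stemmed) keywords; not an exhaustive list of disability-related terms
keywords = {"mental", "ill", "health", "disabl", "disord"}
for topic in range(10):
    top_words = set(df[f"Topic {topic} words"].dropna())
    overlap = top_words & keywords
    print(f"Topic {topic}: {sorted(overlap) if overlap else 'no disability-related terms'}")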
From the word collection, we also notice certain words, such as ‘kill’, that can be categorized as ‘toxic’. To analyze this further, we can classify each word with an NLP toxicity classifier.
To demonstrate an example of a toxicity analysis framework, the code below shows the Unitary library in Python in action. [Hanu and Unitary team, 2020]
This library provides a toxicity score (on a scale from 0 to 1) for the sentence that is passed to it.
To access this model, you will need an API key for the Hugging Face Inference API; the model is hosted at https://huggingface.co/unitary/toxic-bert. Here is an example of what this would look like.
import requests

# Hugging Face Inference API endpoint for the unitary/toxic-bert model
API_URL = "https://api-inference.huggingface.co/models/unitary/toxic-bert"
# Replace the placeholder with your own Hugging Face API key
headers = {"Authorization": "Bearer api_XXXXXXXXXXXXXXXXXXXXXXXXXXX"}

def query(payload):
    # Send the text to the hosted model and return the list of label/score pairs
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
query({"inputs": "addict"})
[[{'label': 'toxic', 'score': 0.9272779822349548},
{'label': 'severe_toxic', 'score': 0.00169223896227777},
{'label': 'obscene', 'score': 0.03694247826933861},
{'label': 'threat', 'score': 0.0017220545560121536},
{'label': 'insult', 'score': 0.02829463966190815},
{'label': 'identity_hate', 'score': 0.004070617724210024}]]
You can replace <insert word here> in the code below with any word or sentence to see the results it generates.
This example gives an idea of how machine learning can be used for toxicity analysis.
query({"inputs": "<insert word here>"})
[[{'label': 'toxic', 'score': 0.5101907849311829},
{'label': 'severe_toxic', 'score': 0.07646821439266205},
{'label': 'obscene', 'score': 0.12113521993160248},
{'label': 'threat', 'score': 0.07763686031103134},
{'label': 'insult', 'score': 0.11923719942569733},
{'label': 'identity_hate', 'score': 0.09533172845840454}]]
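Building on the query helper above, here is a small sketch of how the top LDA words could be scored in a loop, keeping only the ‘toxic’ label. It assumes the list-of-lists response format shown in the outputs above.

def toxic_scores(words):
    # Score each word and keep only the "toxic" label from the model's response
    scores = {}
    for word in words:
        result = query({"inputs": word})
        labels = {item["label"]: item["score"] for item in result[0]}
        scores[word] = labels.get("toxic")
    return scores

# Example: score the top words of Topic 0 from the LDA results
toxic_scores(df["Topic 0 words"].head().tolist())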
%%html
<iframe src='https://flo.uri.sh/story/941681/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/story/941681/?utm_source=embed&utm_campaign=story/941681' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
The Bias¶
The visualization shows how contextually toxic words surface as important words within various topics in this dataset. Any Natural Language Processing model trained on this dataset can learn these associations and produce a skewed analysis of the population in question, i.e., people with mental, physical, and learning disabilities. This can lead to very discriminatory classifications.
An Example¶
To better illustrate the impact, we take the words most strongly associated with the word ‘mental’ from the results. Below is a network graph of these commonly associated words. Words such as ‘kill’ and ‘gun’ appear with the closest association, which can lead a machine to contextualize the word ‘mental’ alongside such words.
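Conceptually, such associations can be approximated by counting which words co-occur with ‘mental’ in the same comment. The sketch below assumes a list called documents holding the cleaned, tokenized comments (not loaded in this section), so it is only illustrative.

from collections import Counter

def co_occurrences(documents, target="mental", top_n=10):
    # Count how often other tokens appear in the same comment as the target word
    counts = Counter()
    for tokens in documents:
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts.most_common(top_n)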
%%html
<iframe src='https://flo.uri.sh/visualisation/6867000/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6867000/?utm_source=embed&utm_campaign=visualisation/6867000' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
It is hence essential to be aware of the dataset that is being used to analyze a specific population. With LDA, we were able to see that this dataset is not a good representation of the disabled community. To bring about a movement toward unbiased AI, we need to perform this kind of preliminary analysis, and more, so as not to cause unintended discrimination.
The Dashboard¶
Below is the complete data visualization dashboard of the topic analysis. Feel free to experiment and compare various labels to your liking.
%%html
<iframe src='https://flo.uri.sh/visualisation/6856937/embed' title='Interactive or visual content' class='flourish-embed-iframe' frameborder='0' scrolling='no' style='width:100%;height:600px;' sandbox='allow-same-origin allow-forms allow-scripts allow-downloads allow-popups allow-popups-to-escape-sandbox allow-top-navigation-by-user-activation'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6856937/?utm_source=embed&utm_campaign=visualisation/6856937' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>
Thank you!¶
We thank you for your time!