Latent Dirichlet Allocation (LDA)

The Wonderful Wizard of Oz

The Wonderful Wizard of Oz [Baum, 1900] via https://www.gutenberg.org/ebooks/55

To start our discussion, we should introduce what Topic Modeling is and how it can be applied.

Note

“Topic modeling is a principled approach for discovering topics from a large corpus of text documents [Liu, 2020] (pg. 159).”

Already, we have a few things to unpack. What are the topics? How are they defined? Do we define them, or does the computer? What is a large corpus? How many documents do we need?

Let’s start with “a large corpus of text documents.” Two documents 📄, five documents 📄, or ten million documents 📄 can all be thought of as our corpus. Yes, even one document 📄 can be used for topic modeling. So defining a “large corpus of text documents” is somewhat subjective.
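To make this concrete, a corpus in code is often nothing more than a list of strings, one string per document. A minimal sketch (the text below is made up purely for illustration):

# A "corpus" is simply a collection of documents; here, a Python list of strings.
corpus = [
    "Dorothy lived in the midst of the great Kansas prairies.",
    "The Scarecrow wished for brains, and the Tin Woodman for a heart.",
]
print(len(corpus), "documents in our tiny corpus")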

As specified by Liu [Liu, 2020], we can start this conversation using one of the two basic types of topic models: probabilistic latent semantic analysis (pLSA) or Latent Dirichlet Allocation (LDA). For our conversation, we will be using Latent Dirichlet Allocation.

Latent Dirichlet Allocation

Pronunciation

Latent

Dirichlet

Allocation

Our pronunciation stems from a talk by David Blei, a professor of Statistics and Computer Science at Columbia University, titled “Probabilistic Topic Models and User Behavior” [Blei, 2017]. The citation provides a link to the original YouTube video (which is a great resource), and it is especially helpful for the pronunciation.

What is Latent Dirichlet Allocation or LDA?

LDA is an unsupervised learning model.

Note

Topic Modeling with Documents 📄

  • supervised - Our documents 📄 are pre-labeled with the given topic(s). We can then train 🏋️ and test 🧪 (and, optionally, validate). Usually the data is split (see the sketch after this note):

    • training 🏋️ 80%

    • testing 🧪 20%

  • unsupervised - Data is not labeled. So, we have no idea what the topics are beforehand. That being said, we can (and will) define the number of topics.
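For the supervised case, the 80/20 split mentioned above is commonly done with scikit-learn’s train_test_split. A minimal sketch, with made-up documents and labels (we will not need this for LDA, since LDA is unsupervised):

from sklearn.model_selection import train_test_split

# Hypothetical pre-labeled documents (supervised setting); text and labels are made up.
docs = ["a story about a wizard", "a tale of a tin man", "notes on a scarecrow", "a road of yellow brick", "an emerald city guide"]
labels = ["fantasy", "fantasy", "fantasy", "travel", "travel"]

# 80% training 🏋️, 20% testing 🧪
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.2, random_state=42)
print(len(X_train), "training documents;", len(X_test), "testing documents")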

So, coming back to our original questions:

  • What are topics?

    • The topics will be X sets of terms (we define this X), each of which will (could) have a common theme.

  • How are they defined?

    • This is what we will get to in this notebook.

  • Do we define or does the computer?

    • LDA is unsupervised, so we define the number of topics. The computer provides the topics themselves.

  • What is a large corpus? and How many documents do we need?

    • A bit subjective here. There is a great discussion of this question by Tang et al. [Tang et al., 2014]. If you have a chance, read all of their points, but to sum them up:

      • The number of documents does matter, but at some point, increasing the number does not warrant better results. Sampling 1,000 documents from a corpus of 1,000,000 could give the same, if not better, results than using all 1,000,000 documents (a sampling sketch follows this list).

      • The size of the documents also plays a role, so documents should not be too short. Very large documents can themselves be sampled and still yield the desired output.
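As promised, here is a small sketch of the sampling idea from Tang et al. [Tang et al., 2014], using a hypothetical corpus of one million made-up documents:

import random

# Hypothetical: a huge corpus of one million (made-up) documents.
big_corpus = ["document number %d" % i for i in range(1_000_000)]

# Sampling 1,000 of them can often be enough for stable topics.
random.seed(0)
sample = random.sample(big_corpus, k=1000)
print(len(sample), "sampled documents")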

A Picture == 1000 Words

One of the best representations of what LDA is and how to utilize it can be found in Blei’s work Probabilistic topic models [Blei, 2012]. Please note that the images and figure text come directly from that work; all credit should go to Blei [Blei, 2012].

The intuitions behind latent Dirichlet allocation “Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative—they are not fit from real data. [Blei, 2012] (Page 3)”

Real inference with LDA “Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article. [Blei, 2012] (Page 4)”
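To make Figure 1 a bit more concrete, here is a small sketch of the generative story LDA assumes, using numpy, a made-up vocabulary, and two made-up topics (like the figure, nothing here is fit from real data):

import numpy as np

rng = np.random.default_rng(0)

# Two made-up "topics": each is a probability distribution over the vocabulary.
vocab_sketch = ["dorothy", "scarecrow", "wizard", "gene", "dna", "organism"]
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "story" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "science" topic
])

# 1. Choose a distribution over topics for the document (the histogram in Figure 1).
doc_topic = rng.dirichlet(alpha=[1.0, 1.0])

# 2. For each word: choose a topic assignment (the colored coin),
#    then choose the word from that topic's distribution.
words = []
for _ in range(10):
    z = rng.choice(len(topics), p=doc_topic)        # topic assignment
    w = rng.choice(len(vocab_sketch), p=topics[z])  # word drawn from that topic
    words.append(vocab_sketch[w])

print(doc_topic)
print(" ".join(words))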

Let’s Try an Example

For our example, we will be using a subset of books by L. Frank Baum that are in the public domain (again, thank you https://www.gutenberg.org).

The HTML for each book can be found at https://www.gutenberg.org/. We will go through one example of how to get the text from a book using Python. Please note, this will not be the most optimal way to do this, but we hope the process is clear enough for you to try with other books or manuscripts.

Get the HTML for the Book

We are going to use two libraries for this; one is part of the Python standard library, called urllib.

import urllib

the other is a favorite of ours, called Beautiful Soup [Richardson, 2019].

from bs4 import BeautifulSoup

urllib will get the document, and BeautifulSoup makes it easy to parse.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.gutenberg.org/files/55/55-h/55-h.htm" 
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

Here we remove any CSS (style) or JavaScript (script):

for script in soup(["script", "style"]):
    script.extract()

Finally, get the text and add it to our document list.

text = soup.get_text()
documents = []
documents.append(text)

We will repeat this process for the other four books; since the steps are identical, we can loop over the remaining URLs.

urls = [
    "https://www.gutenberg.org/files/54/54-h/54-h.htm",
    "https://www.gutenberg.org/files/33361/33361-h/33361-h.htm",
    "https://www.gutenberg.org/files/22566/22566-h/22566-h.htm",
    "https://www.gutenberg.org/files/26624/26624-h/26624-h.htm",
]

for url in urls:
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    # remove any CSS (style) or JavaScript (script), as before
    for script in soup(["script", "style"]):
        script.extract()
    documents.append(soup.get_text())
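A quick (optional) sanity check that we really did collect five books, and that each has a non-trivial amount of text:

print(len(documents), "documents")
for i, text in enumerate(documents):
    print(i, len(text), "characters")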

Create Tokens and Vocabulary

Now that we have our books, we need to tokenize the stories by word and then create a vocabulary out of these tokens. scikit-learn is a fantastic library that we will be using throughout the notebook [Buitinck et al., 2013].

%%capture
!pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
df = cv.fit_transform(documents)
vocab = cv.get_feature_names_out()
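It can help to peek at the shape of the resulting document-term matrix: each row is a book and each column is a vocabulary term (the exact numbers will depend on the downloaded HTML).

# rows are documents (books); columns are vocabulary terms
print(df.shape)
print(len(vocab), "unique tokens")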

Let’s take a look at the tokens and the number of occurrences of each token.

print (df[0])
  (0, 8074)	3198
  (0, 6159)	89
  (0, 3803)	99
  (0, 2718)	14
  (0, 5464)	976
  (0, 9007)	28
  (0, 8988)	44
  (0, 5599)	169
  (0, 1347)	119
  (0, 3404)	5
  (0, 896)	5
  (0, 8107)	196
  (0, 4381)	284
  (0, 3334)	354
  (0, 8586)	28
  (0, 596)	18
  (0, 599)	4
  (0, 4220)	544
  (0, 8514)	15
  (0, 7641)	19
  (0, 553)	1738
  (0, 5174)	24
  (0, 5551)	65
  (0, 5670)	2
  (0, 9040)	19
  :	:
  (0, 5615)	1
  (0, 2552)	1
  (0, 5054)	1
  (0, 406)	1
  (0, 8792)	1
  (0, 2067)	1
  (0, 1411)	1
  (0, 6150)	1
  (0, 5057)	1
  (0, 3884)	1
  (0, 5540)	1
  (0, 7093)	1
  (0, 6146)	1
  (0, 4814)	1
  (0, 5326)	1
  (0, 8698)	1
  (0, 1844)	1
  (0, 5293)	1
  (0, 2728)	1
  (0, 4889)	1
  (0, 5817)	1
  (0, 3045)	1
  (0, 6145)	1
  (0, 7822)	1
  (0, 5337)	1

The second number listed is the token index, and we use the vocab list to see what the actual word is. As an example, look at the first line.

(0, 8074) 3198

Token 8074 was used 3198 times. Token 8074 is:

print (vocab[8074])
the

Not that surprising; the word “the” is used that many times.

Note

Because there are many commonly used terms like this, we want to remove these words from our dataset. These words are called stopwords, and they should be removed. We showcase this later.
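For reference, scikit-learn ships with a built-in English stopword list, the same one we will pass as stop_words='english' later. A quick peek:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# scikit-learn's built-in English stopwords (a frozenset of common words)
print(len(ENGLISH_STOP_WORDS), "stopwords")
print(sorted(ENGLISH_STOP_WORDS)[:10])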

From here, we are actually at the point where we can run LDA.

There are many more inputs available for LDA, but we will focus on two:

  • n_components - the number of topics, again, we need to specify this

  • doc_topic_prior - this relates to the Dirichlet distribution (the next notebook goes into full detail about the Dirichlet distribution and how it relates to LDA)

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 4, doc_topic_prior=1)
lda.fit(df)
LatentDirichletAllocation(doc_topic_prior=1, n_components=4)
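Echoing the “topic proportions” in Figure 2, we could also peek at each book’s inferred topic mixture with transform (the exact numbers will vary from run to run):

import numpy as np

# each row is a book; each column is that book's proportion of a topic
doc_topics = lda.transform(df)
for i, proportions in enumerate(doc_topics):
    print("Book %d:" % i, np.round(proportions, 2))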

To print out the top-10 words per topic, we used a solution from StackOverflow [blacksite, 2017].

import numpy as np 
topic_words = {}
n_top_words = 10
for topic, comp in enumerate(lda.components_):
    # np.argsort(comp) returns the indices that would sort comp in ascending order
    # ex. arr = [3,7,1,0,3,6]
    # np.argsort(arr) -> [3, 2, 0, 4, 5, 1]
    # reversing with [::-1] gives descending order, so the first n_top_words indices
    # point at the words most heavily weighted in this topic
    word_idx = np.argsort(comp)[::-1][:n_top_words]

    # store the words most relevant to the topic
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))
Topic: 0
  the, and, to, of, in, you, it, was, that, he
Topic: 1
  id, prop, dentist, dent, printing, privilege, privileges, prize, professors, defense
Topic: 2
  id, prop, dentist, dent, printing, privilege, privileges, prize, professors, defense
Topic: 3
  id, prop, dentist, dent, printing, privilege, privileges, prize, professors, defense

Looking at this, we do not get a clear picture of the topics. This time, let’s remove those stopwords and see how important 🧼cleaning the data can be🧼!

from sklearn.feature_extraction.text import CountVectorizer

# we can add this to the tokenization step
cv = CountVectorizer(stop_words='english')
df = cv.fit_transform(documents)
vocab = cv.get_feature_names_out()
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 4, doc_topic_prior=1)
lda.fit(df)
LatentDirichletAllocation(doc_topic_prior=1, n_components=4)
topic_words = {}
n_top_words = 10
for topic, comp in enumerate(lda.components_):
    # same ranking logic as above: argsort ascending, reverse to descending, take the top n_top_words
    word_idx = np.argsort(comp)[::-1][:n_top_words]

    # store the words most relevant to the topic
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))
Topic: 0
  55, cheer, roared, contentedly, cackling, heaps, 100, enchantments, attractive, growling
Topic: 1
  dorothy, said, man, little, scarecrow, oz, king, asked, tin, shaggy
Topic: 2
  said, horse, scarecrow, wizard, tip, pg, dorothy, saw, boy, asked
Topic: 3
  55, cheer, roared, contentedly, cackling, heaps, 100, enchantments, attractive, growling

Much better!
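With the cleaner vocabulary, we could also check which topic each book leans toward most heavily (again, the assignments will vary from run to run):

# for each book, print the topic with the highest inferred proportion
doc_topics = lda.transform(df)
for i, proportions in enumerate(doc_topics):
    print("Book %d -> Topic %d" % (i, proportions.argmax()))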

Moving On

In the next section, we spend a reasonable amount of time talking about the Dirichlet distribution and how it relates to LDA.