Introduction to the Notebook


This notebook is built with Jupyter Book [Community, 2020]. We had no part in developing Jupyter Book (though we sure do thank its authors); our contribution is the content of this notebook.

Welcome to our introduction to, and application of, latent Dirichlet allocation (LDA) [Blei et al., 2003]. Our hope with this notebook is to discuss LDA in a way that makes it approachable as a machine learning technique. From “when to use LDA” to “applying LDA to talk about bias,” we have tried our best to keep the material accessible. If we are missing anything, feel free to click the GitHub logo button at the top-right of the page.


Notebook Introduction - Provides details on how to run this Jupyter Notebook in Binder, Google Colab, or even in the browser itself.

Latent Dirichlet Allocation (LDA) - Introduces topic modeling and LDA, including an example of its application using Python.

Dirichlet Distribution - Looks at the Dirichlet distribution, using the Chinese Restaurant Process to illustrate how it is derived and used in LDA.

Jigsaw - an Implementation of LDA - Provides a use case for LDA by applying it to Jigsaw Unintended Bias in Toxicity Classification (a dataset from Kaggle).

Visualizing and Analyzing Jigsaw - Finally, takes the results from LDA + Jigsaw and provides visualizations and analysis of the findings.
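
As a small taste of the Dirichlet Distribution chapter, here is a minimal sketch of the Chinese Restaurant Process named in the outline above. It is an illustration under our own assumptions, not the code used in the later chapters: the function name chinese_restaurant_process and the parameters n_customers, alpha, and seed are hypothetical, and only the Python standard library is used. Each new customer opens a new table with probability alpha / (n + alpha), where n is the number of customers already seated, and otherwise joins an existing table with probability proportional to its current size.

```python
import random
from typing import List, Optional


def chinese_restaurant_process(
    n_customers: int, alpha: float, seed: Optional[int] = None
) -> List[int]:
    """Return the table sizes after seating n_customers one at a time.

    alpha is the concentration parameter: larger values make new tables
    more likely. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    tables: List[int] = []  # tables[k] = number of customers at table k

    for n_seated in range(n_customers):
        # Draw a point on [0, n_seated + alpha]. The first n_seated units
        # are split among existing tables by size; the last alpha units
        # correspond to opening a new table.
        draw = rng.uniform(0, n_seated + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if draw < cumulative:
                tables[k] += 1  # join existing table k
                break
        else:
            tables.append(1)  # no table chosen: open a new one

    return tables


if __name__ == "__main__":
    sizes = chinese_restaurant_process(100, alpha=1.0, seed=0)
    print("number of tables:", len(sizes))
    print("table sizes:", sizes)
```

Running the sketch with different values of alpha shows the "rich get richer" behavior: a few tables tend to collect most customers, while a larger alpha spreads customers over more tables.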


We know it is tradition to have references at the end of books, but when you are standing on the shoulders of giants, you thank them first.


David J Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983, pages 1–198. Springer, 1985.


L. Frank Baum. The Wonderful Wizard of Oz. George M. Hill Company, 1900. URL:


blacksite. 2017. Accessed on 2021-08-06T18:24:07Z. URL:


David Blei. Prof. David Blei - Probabilistic Topic Models and User Behavior. YouTube, Feb 2017. URL:


David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.


David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.


Aja Bogdanoff. Saying goodbye to civil comments. Dec 2017. URL:


Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D³ data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.


Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108–122. 2013.


Executable Books Community. Jupyter Book. Feb 2020. doi:10.5281/zenodo.4539666.


Laura Hanu and Unitary team. Detoxify. GitHub, 2020.


Jigsaw and Google. Jigsaw unintended bias in toxicity classification. 2019. URL:


Bing Liu. Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press, 2020.


Panupong Pasupat. DP: Chinese restaurant process viewpoint. Jan 2021. URL:


Leonard Richardson. 2019. URL:


Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. Understanding the limiting factors of topic modeling via posterior contraction analysis. In International Conference on Machine Learning, 190–198. PMLR, 2014.


Wayne Werner. 2016. Accessed on 2021-08-04T13:42:07Z.