Early Modern Thought Online: The Blog

"Early Modern Thought Online" (EMTO) is a database offering access to about 13.500 digitized source texts from early modern philosophy and related disciplines like history of science and history of theology provided by libraries in Europe and overseas. In the present stage of its development, EMTO presents mainly links to external resources. This blog intends to show how to profit from concepts and methods of the digital humanities. It will give practical advice on how to use digitised sources. We will present digital collections relevant to our field, and discuss their relevance for early modern philosophy and history of ideas. But we want to do philosophy as well: present ongoing research related to sources present in EMTO. We hope that this blog, as well as EMTO as a whole, will be a helpful tool and provide a lively forum for discussion. EMTO is on Twitter, Facebook and Google+. Original content is licensed via the CC by-nc-sa 3.0 license.



computatiohumanitatis:

Visualizing Topic Models (Ted Underwood via The Stone and the Shell)

Posted November 11, 2012

How do you visualize a whole topic model? It’s easy to pull out a single topic and visualize it — as a word cloud, or as a frequency distribution over time. But it’s also risky to focus on a single topic, because in LDA, the boundaries between topics are ontologically sketchy.
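
For concreteness, here is a minimal sketch of the single-topic case: given one row of a topic-word matrix, the wordcloud library can render it straight from a word-to-weight mapping. The vocabulary and weights below are invented placeholders, not drawn from the model discussed here.

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Hypothetical inputs: one row of a fitted topic-word matrix plus the
    # matching vocabulary. These five words and weights are invented.
    vocab = ["providence", "scripture", "motion", "substance", "grace"]
    weights = [0.30, 0.25, 0.20, 0.15, 0.10]

    # WordCloud accepts a {word: weight} mapping directly, so a single
    # topic can be rendered without touching the rest of the model.
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(dict(zip(vocab, weights)))

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()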

After all, LDA will create as many topics as you ask it to. If you reduce that number, topics that were separate have to fuse; if you increase it, topics have to undergo fission. So it can be misleading to make a fuss about the fact that two discourses are or aren’t “included in the same topic.” (Ben Schmidt has blogged a nice example showing where this goes astray.) Instead we need to ask whether discourses are relatively near each other in the larger model.
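
The fusion/fission point can be reproduced in miniature: fit the same corpus with different topic counts and watch the documents regroup, since topic identity is always relative to the chosen k. A toy sketch using scikit-learn's LDA; the four-line corpus is purely illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # A deliberately tiny corpus; any real document-term matrix works
    # the same way.
    docs = [
        "god grace scripture providence faith",
        "motion body force matter mechanics",
        "soul mind substance idea perception",
        "church doctrine heresy council reform",
    ]
    X = CountVectorizer().fit_transform(docs)

    for k in (2, 4, 8):
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        # Each document's dominant topic; the partition changes with k,
        # so "two discourses share a topic" is only meaningful relative to k.
        print(k, lda.transform(X).argmax(axis=1))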

But visualizing the larger model is tricky. The go-to strategy for something like this in digital humanities is usually a network graph. I have some questions about that strategy, but since examples are more fun than abstract skepticism, I should start by providing an illustration. The underlying topic model here was produced by LDA on the top 10k words in 872 volume-length documents. Then I produced a correlation matrix of topics against topics. Finally I created a network in Gephi by connecting topics that correlated strongly with each other (see the notes at the end for the exact algorithm). Topics were labeled with their single most salient word, except in three cases where I changed the label manually. The size of each node is roughly log-proportional to the number of tokens in the topic; nodes are colored to reflect the genre most prominent in each topic. (Since every genre is actually represented in every topic, this is only a rough and relative characterization.) (via Visualizing topic models. | The Stone and the Shell)
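
As a rough sketch of that pipeline (not the exact algorithm from the post's notes): correlate the columns of a document-topic matrix to get topics against topics, keep edges above some cutoff, and write the graph out for Gephi. The random matrix, the token counts, and the 0.2 threshold are all stand-in assumptions.

    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    doc_topic = rng.dirichlet(np.ones(10), size=872)      # placeholder for real model output
    topic_tokens = rng.integers(1_000, 50_000, size=10)   # placeholder token counts

    # Correlate topics against topics across the 872 documents.
    corr = np.corrcoef(doc_topic.T)

    G = nx.Graph()
    for t in range(corr.shape[0]):
        # Node size roughly log-proportional to the topic's token count.
        G.add_node(t, size=float(np.log(topic_tokens[t])))
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr[i, j] > 0.2:  # assumed cutoff, not the post's actual threshold
                G.add_edge(i, j, weight=float(corr[i, j]))

    nx.write_gexf(G, "topic_graph.gexf")  # Gephi reads GEXF directly

Gephi opens the resulting GEXF file directly, and the size attribute can then drive node scaling, analogous to the log-proportional sizing described above.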