class: center, middle, inverse, title-slide # Quanlify with ease: Combining quantitative and qualitative corpus analysis ### Andreas Blaette ### SSHOC Webinar | April 16, 2020 --- # Corpus analysis: An emerging field. ## What are the challenges? -- There is no lack ... - of algorithms - of ideas and projects on research data management (implementing FAIR data principles) But ... -- * Acquisition of NLP techniques in the social sciences & humanities still piecemeal -- * Tools for processing text that scale well are just emerging -- * Availability of data for replication is missing -- * Reproducibility of data (getting FAIRER) is wanting -- * Integration of quantitative and qualitative approaches to text is technically difficulat --- # The PolMine Project (www.polmine.de) ## Research - Data - Code - Tutorials - Centre -- ### Corpora - **GermaParl**: Corpus of the German Bundestag, regional parliaments  - **UNGA** Verbatim Records of the United Nations General Assembly  - **MigParl**: Debates on migration and integration in Germany's Regional Parliaments - **MigPress**: Corpus of reports on migration and integration in Süddeutsche Zeitung und Frankfurter Allgemeine Zeitung (2000-2019) -- *(Toolchain for corpus preparation: **frapp**, **bignlp**, **biglda**, **trickypdf**)* -- ### Packages for corpus analysis - **polmineR**: Elementary vocabulary for corpus analysis  - **RcppCWB**: R Wrapper for the C Code of the Corpus Workbench (using C++/Rcpp)  - **cwbtools**: Tools to create and manage CWB indexed corpora  --- # Things are evolving. ## But where do we stand? -- * Acquisition of NLP techniques in the social sciences & humanities:<br/> _Working with large-scale, linguistically annotated corpora_ -- * Tools for processing text that scale well are just emerging:<br/> _Packages such as bignlp, biglda_ -- * Availability of data for replication is missing:<br/> _Depositing data with Zenodo (or other open science repositories)_ -- * Reproducibility of data (getting FAIRER):<br/> _100% reproducibility of data_ -- * <span style="background-color:yellow">Integration of quantitative and qualitative approaches to text is an unfulfilled promise</span> --- # Assumption, Problem, Vision and Plan **Assumption:** The validity of our research with large-scale corpora depends on our ability to combine the quantitative and the qualitative analysis of textual data. -- **Problem:** It takes a software engineer to implement an integrated environment for the quantitative and qualitative analysis of text. You need funding to entertain the cooperation with a software engineer. Funding ends. Solutions are not sustainable. -- **Vision:** Wouldn't it be great to have an open source, modular toolset that can flexibly be used by ordinary computational social scientists to implement "quanlitative" workflows? -- **Plan of this Talk** 1. Theory is Code - The "Quanlification" Frontier 2. Quanlification as a Matter of Design 3. Implementing Quanlifictation 4. Work Ahead - Getting Things Done 5. Discussion ??? We do not want to get rid of the collaboration with the guys from computer science. But interdisciplinary constellations should be based on a demand of what they find interesting. Usually, that's not writing GUIs for other people. --- background-image: url(img/Le_Penseur_min.jpg) background-size: cover class: nobackground, inverse .attribution[Photo: [Daniel Stockman](https://www.flickr.com/photos/84989911@N00/4694248310)] # Theory is Code ## The "Quanlification" Frontier --- # From text to numbers ## The idea of "distant reading" - "[…] the trouble with close reading […] is that it necessarily depends on an extremely small canon. […] what we really need is a little pact with the devil: we know how to read texts, so now let‘s learn how not to read them. Distant reading, where distance, let me repeat is, is a condition of knowledge. It allows you to focus on units that are much smaller or much larger than the text: devices, themes, types – or genres and systems. And if, between the very small and the very large, the text itself disappears, well, this is one of the cases where one can justifiably say, Less is more.“ (Moretti [2000] 2013: 49) -- ## The "text as data" movement: - scaling party positions (wordscore and wordfish) as a driver -"[…] while our method is designed to analyse the content of a text, it is not necessary for an analyst using the technique to understand, or even read, the texts to which the technique is applied. " (Laver, Benoit & Garry 2003) --- # Text + Numbers = Quanlification ## An obsolete methodological divide? -- .pull-left[ ### Quantity - Text as Data - Natural Language Processing (NLP) - Data / Text Mining - Machine Learning (ML) - "Validate, validate, validate" (Grimmer et al. 2013) ] -- .pull-right[ ### Quality - eHumanities / Digital Humanities - Corpus Linguistics - Computational Linguistics - "blended reading" (Stulpe, Lemke 2015), "scalable reading" (Weitin 2017) & related concepts ] -- ### Quanlification - Epistemological disputes notwithstanding: The necessity to combine qualitative and quantitative approaches to text is conceptually undisputed. - Software inhibits combining quantity and quality: Tools are there, but setting up a quanlitative project is expensive: Difficult without a dedicated software engineer --- background-image: url(img/clipart1299822.png) background-size: cover class: nobackground, inverse .attribution[[www.clipartmax.com](https://www.clipartmax.com/middle/m2i8N4N4K9m2H7b1_room-and-pillar-mining-room-and-pillar-mining/)] # A Matter of Design ## Data Structures Matter --- # Three-Tier Architecture: C & R & More  --- # Verbs and nouns for corpus analysis ## polmineR: A basic vocabulary for quanlification - __Corpora and subcorpora__ - corpus objects: *corpus()* - subsetting corpora: *partition()* / *subset()* -- - __Quantification__ - counting: *hits()*, *count()*, *dispersion()* (and *size()*) - cooccurrences: *cooccuurrences()*, *Cooccurrences()* - feature extraction: *features()* - term-document-matrices: *as.sparseMatrix()*, *as.TermDocumentMatrix()* -- - __Qualitative analysis__ - Keywords-in-context / concordances: *kwic()* - full text (of a subcorpus): *get_token_stream()*, *as.markdown()*, *as.html()*, *read()* --- # polmineR - the People's Corpus Miner -- - Prerequisites: - Any kind of computer that still has keys (Windows, Linux, macOS) - Installation of R/RStudio -- - Three lines of code will get you polmineR and GermaParl (or any other corpus) ```r install.packages("polmineR") install.packages("GermaParl") # includes small sample dataset GermaParl::germaparl_download_corpus() # get the full corpus (1 GB) ``` -- - Start your inquiry ```r library(polmineR) count("GERMAPARL", query = "Europa") kwic("GERMAPARL", query = "Europa") ``` - Installation options: Local install, or R, RStudio, OpenCPU on server (remote corpus access) --- background-image: url(img/Nerd.jpg) background-size: cover class: nobackground, inverse # Implementing Quanlification ## The Look and Feel --- # Reading Anywhere: 'fulltext' htmlwidget ## Problem Statement - **Read Anything:** Cooccurrences, concordances, subcorpora, topic models - you want to contextualize all of it - **Read Anywhere:** Include fulltext output into (html) documents and slides, and in GUIs ## Implementation - polmineR: Implementation of a `read()`-method - GUI: Package 'fulltext' that renders input data into an "htmlwidget" (a truly flexible device) ## Demo - HTML documents with scrollable fulltext [-> example](https://ablaette.github.io/sshoc_webinar_slides/gallery/fulltext_doc.html) - Slides with `fulltext` htmlwidget [-> example](https://ablaette.github.io/sshoc_webinar_slides/gallery/fulltext_slides.html) - polmineR shiny App [-> example](http://34.209.39.169:3838/polmineR/) --- # Highlighting and Tooltipping ## Problem Statement - **Highlighting with multiple colors:** The statistical analysis of text yields variable dictionaries with word weights - visualising multiple dictionaries at the same time will help to gather the semantic sense of numeric analyses - **Tooltipping:** Colors alone may be misleading. Show numeric information on demand ## Implementation - polmineR package: Implementation of methods `highlight()` and `tooltips()` - GUI: Enrich input data for htmlwidgets and amend CSS ## Demo - Validating sentiment analyses with interactive KWIC tables [-> example](https://polmine.github.io/ValidationWorkflows/sentiment_analysis.html#17) - Evaluate topic models with flexdashboard [-> example](https://polmine.github.io/gallery/topicmodel_flexdashboard.html) - Understanding the data behind cooccurrence graphs [-> example](https://ablaette.github.io/sshoc_webinar_slides/gallery/unga_gradget.html) --- # Blei 2012: Intuition behind LDA  --- # Annotation ## Problem Statement - **Annotation and intersubjectivity**: In the qualitative research tradition, annotations are a means to communicate evaluative decisions to other researchers - **Annotation and machine learning** Annotations (generating labelled data) are a precondition for machine learning ## Implementation - polmineR level: Implementation of `edit()`-methods - Htmlwidget with annotation functionality (different from `fulltext` htmlwidget, as it returns values) ## Demo - Annotate any class inheriting from the `textstat` class (i.e. tables - kwic, cooccurrences, features) [-> example](http://34.209.39.169:8787) - Simple text annotation with the `annolite` package - Annotate cooccurrence graphs [-> example](https://ablaette.github.io/sshoc_webinar_slides/gallery/unga_gradget.html) --- # Annotation - Code Example: Annotate a cooccurrences object ```r library(polmineR) s <- cooccurrences("UNGA", "sustainability") # could also be kwic etc. annotations(s) <- list(name = "annotation", what = "") edit(s) ``` - Code Example: Annotate fulltext ```r library(polmineR) library(annolite) # at github.com/PolMine/annolite, dev-branch data <- corpus("GERMAPARLMINI") %>% subset(speaker == "Volker Kauder") %>% subset(date == "2009-11-10") %>% as("partition") %>% as.fulltextdata(headline = "Volker Kauder (CDU)") anno <- annotate(data) ``` --- background-image: url(img/318_3.jpg) background-size: cover class: nobackground, inverse .attribution[[www.runder-tisch.info](http://www.runder-tisch.info/bergleute.php?bergleute=2)] # Work Ahead ## Getting Things Done --- # The Nature of the Beast ## A modular toolset (not a framework) - turn whatever is already there into tools for quanlification - set of htmlwidgets (crosstalk enabled) as elements - shiny modules, shiny apps, and shiny gadgets - flexdashboards - Rmarkdown templates (for documents, slides, flexdashboards) -- ## Skills required - Know when to use what (presenting research results is different from research) - It will take sound documentation, tutorials, recipes to illuminate the toolset! --- # A Suite of R Packages for Quanlification ## Conscious Uncoupling and Modularization - [*fulltext*](https://github.com/PolMine/fulltext): Toolset to generate fulltext display from corpus data<br/> [](https://www.gnu.org/licenses/gpl-3.0) [](https://www.tidyverse.org/lifecycle/#experimental) [](https://travis-ci.org/PolMine/fulltext) [](https://ci.appveyor.com/project/PolMine/fulltext) [](https://codecov.io/gh/PolMine/fulltext/branch/master) - [*gradget*](https://github.com/PolMine/gradget): Graph annotation widget<br/> [](https://www.tidyverse.org/lifecycle/#experimental) [](https://travis-ci.org/PolMine/gradget) - [*annolite*](https://github.com/PolMine/annolite): Leightweight Fulltext Display and Annotation Tools<br/> [](https://www.tidyverse.org/lifecycle/#experimental) [](https://travis-ci.org/PolMine/annolite) [](https://codecov.io/gh/PolMine/annolite/branch/dev) [](https://ci.appveyor.com/project/PolMine/annolite) - [*topicanalysis*](https://github.com/PolMine/topicanalysis): Auxiliary functions for topicmodelling.<br/> [](https://www.tidyverse.org/lifecycle/#maturing) [](https://www.gnu.org/licenses/gpl-3.0) [](https://travis-ci.org/PolMine/topicanalysis) [](https://ci.appveyor.com/project/PolMine/topicanalysis) [](https://codecov.io/gh/PolMine/topicanalysis/branch/master) - [*quanlify*](https://github.com/PolMine/quanlify):Toolset for the qualitative validation of quantitative text analysis<br/> [](https://www.tidyverse.org/lifecycle/#experimental) **Beware! All of this is experimental!** --- # Discussion. Frontier. ## Vision - Offer a very flexible set of tools to implement all kinds of workflows that entail distant and close reading with minimal cost - A **l**eightweight **i**nfrastructure for the **qu**anlitat**i**ve analysis of text **d**ata ("liquid") - Towards a people's framework for quanlification -- ## Discussion Points - Alternative approaches, relevant previous work - Balance between GUI and console? - Will there be users? - How to build a community? - Role for EOSC/SSHOC?