To begin this interactive lesson, click on the rocket ship in the top navigation and then select “Binder” to launch the Jupyter Notebook. It may take a few moments to load the first time you launch it.

The rocket and Binder logos

When you are in the interactive environment, you will see a Jupyter logo in the upper left-hand corner.

The Jupyter logo


Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using a model called TF-IDF, which stands for term frequency-inverse document frequency. TF-IDF encompasses both ‘aboutness’ and ‘uniqueness’ by looking at how often a term appears in a document (=‘aboutness’) and how unusual it is for that same term to appear in other documents in your dataset (=‘uniqueness’).

Essentially, TF-IDF attempts to determine which words are distinctive in individual documents, relative to a larger corpus. The idea is that words that are common within a particular document and uncommon within the other documents you’re comparing it to can tell you something significant about the language in a document.
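
To make the idea concrete, here's a small, hand-rolled sketch of one common TF-IDF formulation. The toy documents are made up for illustration, and the gensim model we'll use later applies its own weighting and normalization, so treat this only as a rough picture of the math:

import math

# A toy corpus of three very short "documents"
docs = [
    ["otter", "river", "otter"],
    ["river", "bank", "water"],
    ["otter", "holt", "water"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                     # term frequency: how often the term appears in this document
    df = sum(1 for d in corpus if term in d) # document frequency: how many documents contain the term
    idf = math.log(len(corpus) / df)         # inverse document frequency: rarer terms get a bigger boost
    return tf * idf

print(tf_idf("otter", docs[0], docs)) # common in this document, but it also appears in another one
print(tf_idf("bank", docs[1], docs))  # appears only once, but in no other document, so it scores higher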

Fun fact: TF-IDF was used in early search engines to do relevance ranking, until clever folks figured out how to break that by ‘keyword stuffing’ their websites.

As you work through this notebook, you’ll take the following steps:

  • Import a dataset

  • Find the query used to build the dataset within the dataset’s metadata

  • Write a helper function to help clean up a single token

  • Clean each document of your dataset, one token at a time

  • Compute the most significant words in your corpus using TF-IDF and a library called gensim

To perform this analysis, we’ll need to treat the words in our dataset as tokens. What’s a token? It’s a string of text. For our purposes, think of a token as a single word. Tokenizing text and counting words can get very complicated (for example, is “you’re” one word or two? What about “part-time”?), but we’ll keep things simple for this lesson and treat “tokens” as strings of text separated by white space.
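
For example, here is a minimal sketch of that white-space approach (the sample sentence is made up for illustration):

# Splitting on white space with Python's built-in str.split()
text = "You're counting part-time tokens in a document"
tokens = text.split()
print(tokens)
# Prints: ["You're", 'counting', 'part-time', 'tokens', 'in', 'a', 'document']

Notice that “you’re” and “part-time” each come through as a single token, which is exactly the simplification we’re making in this lesson.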

First we’ll import gensim and the tdm_client library. The tdm_client library contains functions for connecting to the JSTOR server that contains our corpus dataset. As a reminder, functions are pieces of code that perform specific tasks, and we can get additional functions by installing libraries like gensim and tdm_client. We’ll also import a library that will suppress unnecessary warnings.

import warnings
warnings.filterwarnings('ignore')

import gensim

import tdm_client

Next we’ll pull in our dataset.

Did you build your own dataset? In the next code cell, you’ll supply the dataset ID provided when you created your dataset. Now’s a good time to make sure you have it handy.

Didn’t create a dataset? Here are a few to choose from, with dataset IDs in red:

  • Documents published in African American Review, Black American Literature Forum, and Negro American Literature Forum (from JSTOR): b4668c50-a970-c4d7-eb2c-bb6d04313542

  • ‘Civilian Conservation Corps’ from Chronicling America: 9fa82dbc-9269-6deb-9720-179b4ba5e451

  • ‘children’ from Documenting the American South: c002acf3-6d4a-c4c8-7c7c-a173dfdaec39

A sample dataset ID of data derived from searching JSTOR for ‘Civilian Conservation Corps’ (‘e2a07be0-39f4-4b9f-b3d1-680bb04dc580’) is provided in the code cell below.

Pasting your unique dataset ID in place of the red text below tells the notebook which dataset to pull from the JSTOR server. (No output will show.) To frame this in terms of our introduction to Python, running the line of code below will create a new variable called dataset_id whose value is the ID for the dataset you’ve chosen.

# Initializes the "dataset_id" variable
dataset_id = 'e2a07be0-39f4-4b9f-b3d1-680bb04dc580'

Now, we want to create one more variable for storing information about our dataset, using a function from tdm_client called get_description():

# Initializes the "dataset_info" variable
dataset_info = tdm_client.get_description(dataset_id)

Let’s double-check to make sure we have the correct dataset. We can confirm the original query by viewing the search_description.

dataset_info["search_description"]

Using our new variable, we can also find the total number of documents in the dataset:

dataset_info["num_documents"]

We’ve verified that we have the correct dataset, and we’ve checked how many documents are in there. So, now let’s pull the dataset into our working environment. This code might take a minute or two to run; recall that while it’s still running, you’ll see this: In [*].

dataset_file = tdm_client.get_dataset(dataset_id)

Next, let’s create a helper function that can standardize and clean up the tokens in our dataset. The function will:

  • Change all tokens (aka words) to lower case, so that ‘otter’ and ‘Otter’ are counted as the same token

  • Discard tokens with non-alphabetical characters

  • Discard any tokens less than 4 characters in length

Question to ponder: Why do you think we want to discard tokens that are less than 4 characters long?

def process_token(token): # Defines a function `process_token` that takes the argument `token`
    token = token.lower() # Changes the token to lower case
    if len(token) < 4: # Discards any tokens that are less than 4 characters long
        return
    if not(token.isalpha()): # Discards any tokens with non-alphabetic characters
        return
    return token # Returns the cleaned, lower-cased token if it passed both checks
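
If you’d like to check how the function behaves before applying it to the whole dataset, you can try a few sample tokens (these examples are arbitrary):

print(process_token("Otters"))   # prints "otters": lower-cased and kept
print(process_token("the"))      # prints "None": fewer than 4 characters
print(process_token("o'clock"))  # prints "None": contains a non-alphabetic character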

Now let’s cycle through each document in the corpus with our helper function. This may take a while to run. After we run the code below, we will have a list called documents in which each entry pairs a document’s ID with its cleaned tokens. The print() function at the end isn’t necessary to create this list, but it gives us some output to confirm when the code has been run.

# This is a bit of setup that establishes how the documents will be read in
reader = tdm_client.dataset_reader(dataset_file)

# Creates a new variable called `documents`: a list that will contain all of our documents
documents = []

# This is the code that will actually read through your documents and apply the cleaning routines
for n, document in enumerate(reader):
    this_doc = []
    _id = document["id"]
    for token, count in document["unigramCount"].items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    documents.append((_id, this_doc))
print("Document tokens collected and processed")

To get a sense of what we’ve just created, you can run the line of code below. It will show the ID and the cleaned tokens collected for the first text in documents (Python starts counting at zero). If you want to see a different document, you can change the number in the brackets.

documents[0]

The next few lines will set us up for TF-IDF analysis. First, let’s create a gensim dictionary: a master list of all the tokens across all the documents in our corpus, in which each token gets a unique identifier. The dictionary doesn’t contain any information on word frequencies yet; it’s essentially a catalog of the unique tokens in our corpus, each with its own identifier.

dictionary = gensim.corpora.Dictionary([d[1] for d in documents])
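
If you’re curious what this looks like, the dictionary’s token2id attribute maps each token to its identifier; the line below shows the first five entries (the exact tokens will depend on your dataset):

# Peek at a few token-to-identifier pairs in the gensim dictionary
list(dictionary.token2id.items())[:5]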

The next step is to connect the word frequency data found within documents to the unique identifiers from our gensim dictionary. To do this, we’ll create a bag of words corpus: this is a list that contains information on both word frequency and unique identifiers for the tokens in our corpus.

bow_corpus = [dictionary.doc2bow(doc) for doc in [d[1] for d in documents]]
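
Each entry in a document’s bag of words is a pair of (token identifier, count). If you’d like to inspect a sample, the line below shows the first ten pairs for the first document (the numbers will depend on your dataset):

# Shows the first ten (token identifier, count) pairs for the first document
bow_corpus[0][:10]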

The next step is to create the gensim TF-IDF model, which will control how TF-IDF is calculated:

model = gensim.models.TfidfModel(bow_corpus)

Finally, we’ll apply our model to the bow_corpus to create our results in corpus_tfidf, which is a list of each document in our dataset, along with the TF-IDF scores for the tokens in the document.

corpus_tfidf = model[bow_corpus]
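
To see what the scores look like for a single text, you can apply the model to one document’s bag of words; each entry is a pair of (token identifier, TF-IDF score), and the exact values will depend on your dataset:

# Applies the model to the first document and shows its first ten (token identifier, score) pairs
model[bow_corpus[0]][:10]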

Now that we have those pieces in place, we can run the following code cells to find the most distinctive terms, by TF-IDF, in our dataset. These are terms that appear frequently in a particular text but rarely or never appear in other texts. First, we’ll build a Python dictionary that maps each token to a TF-IDF score, and then we’ll sort its items by score. We won’t get any output from this code; it just sets things up. This will also take some time to run.

# Maps each token to a TF-IDF score (the last score seen is kept for tokens that appear in several documents)
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}

# Sorts the tokens by TF-IDF score, from highest to lowest
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

Let’s see what we got! The code below will print out the top 20 most distinctive terms in the entire corpus, with their TF-IDF scores. If you want to see more terms, you can change the number in the square brackets.

for term, weight in sorted_td[:20]:
    print(term, weight)

It can also be useful to look at individual documents. This code will print the most significant word, by TF-IDF, for the first 20 documents in the corpus. Do you see any interesting or surprising results?

for n, doc in enumerate(corpus_tfidf): # Steps through the transformed documents
    if n >= 20: # Stops after the first 20 documents
        break
    if len(doc) < 1: # Skips any documents that have no tokens
        continue
    word_id, score = max(doc, key=lambda x: x[1]) # Finds the highest-scoring token in this document
    print(documents[n][0], dictionary.get(word_id), score)

Want to keep going? There are many additional text analysis and Python notebooks at Constellate.org.


Adapted by Jen Ferguson from notebooks created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License; lightly modified by Sarah Connell in summer 2021. See here for the originals and additional text analysis notebooks.