Basic text analysis on unigram and bigram frequencies can be helpful when digging into datasets of unstructured text. The most frequent bigrams, or pairs of adjacent words, tell you which phrases are most common in your corpus.
We’ll use bigrams to find the most common phrases from users in our user_comments table.
Simple Lists of Words
The first step in making our bigrams is to convert our paragraphs of text into lists of words. We could use the handy regexp_split_to_table function like this:
select regexp_split_to_table(
  lower(comment),
  E'[^a-z0-9_]+'
)
from user_comments
order by id
The E'[^a-z0-9_]+' regular expression parameter lets us split the comments on anything that isn’t a letter, number, or underscore. This takes care of punctuation and differences in spacing, helping to clean up the data.
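To see the split in isolation, we can run it on a string literal (a made-up comment, not a row from user_comments):

```sql
-- Split a sample string on every run of non-word characters.
select regexp_split_to_table(
  lower('Great product, loved it!'),
  E'[^a-z0-9_]+'
) as word;
-- word
-- --------
-- great
-- product
-- loved
-- it
-- (plus one empty-string row, from the trailing '!')
```

Note the trailing punctuation produces an empty-string row; regexp_split_to_table keeps empty fields at the boundaries, so in practice you may want to filter those out.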
Arrays of Words
Unfortunately, we cannot use regexp_split_to_table because it doesn’t give us a way to keep the words in order, which will be critical for constructing the bigrams later on. Instead we’ll convert the comments into arrays, and then work up to an ordered list of words.
Making the comments into arrays of words is straightforward (we’ll be building on this CTE):
with word_list as (
  select id as comment_id,
         string_to_array(
           regexp_replace(lower(comment), E'[^a-z0-9_]+', ' ', 'g'),
           ' '
         ) as word_array
  from user_comments
)
Inside the CTE we first use regexp_replace with the 'g' (global) flag to turn every run of non-word characters into a single space, then string_to_array with a space as its split parameter to convert the cleaned comments into arrays. At the same time we’ll select the id of the original comment as that will be helpful later.
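To sanity-check the intermediate result, we can select straight from the CTE (the limit here is only for inspection):

```sql
with word_list as (
  select id as comment_id,
         string_to_array(
           regexp_replace(lower(comment), E'[^a-z0-9_]+', ' ', 'g'),
           ' '
         ) as word_array
  from user_comments
)
select comment_id, word_array
from word_list
limit 5;
```

Each row should show a comment_id alongside an array like {great,product,loved,it}.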
Ordered Lists of Words
Now that we have our comments as arrays, we can break them out into rows and keep the order:
word_indexes as (
  select comment_id,
         word_array,
         generate_subscripts(word_array, 1) as word_id
  from word_list
)
We’re using generate_subscripts to output one row for each word in the array containing the index of the word. It’s just a number, not the word itself, so we need to bring the word_array and comment_id values along for the ride.
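generate_subscripts is easy to see in isolation on an array literal:

```sql
-- One row per array element, each containing the 1-based index
-- (the second argument, 1, is the array dimension to walk).
select generate_subscripts(array['hello','brave','new','world'], 1) as word_id;
-- word_id
-- -------
--       1
--       2
--       3
--       4
```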
Then we’ll use the array indexes output by generate_subscripts to pull out the word for each index:
numbered_words as (
  select comment_id,
         word_array[word_id] as word,
         word_id
  from word_indexes
)
Now we have one row for each word containing its original comment_id, the word itself, and word_id, the word’s position within the array (and also within the original comment).
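As a self-contained sketch, a values list can stand in for user_comments so the pipeline so far can be tried without the real table (the id 7 and comment text here are made up):

```sql
-- Hypothetical stand-in for the user_comments table.
with user_comments(id, comment) as (
  values (7, 'love this app')
),
word_list as (
  select id as comment_id,
         string_to_array(
           regexp_replace(lower(comment), E'[^a-z0-9_]+', ' ', 'g'),
           ' '
         ) as word_array
  from user_comments
),
word_indexes as (
  select comment_id, word_array, generate_subscripts(word_array, 1) as word_id
  from word_list
),
numbered_words as (
  select comment_id, word_array[word_id] as word, word_id
  from word_indexes
)
select comment_id, word, word_id from numbered_words;
-- comment_id | word | word_id
-- -----------+------+--------
--          7 | love |       1
--          7 | this |       2
--          7 | app  |       3
```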
From here it’s easy to make bigrams: we only need to join numbered_words to itself for each comment!
select nw1.word, nw2.word
from numbered_words nw1
join numbered_words nw2
  on nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id
Notice how we joined each comment on itself (nw1.comment_id = nw2.comment_id) since bigrams cannot span comments. And joining adjacent words is simply making sure their positions within the array are off by one: nw1.word_id = nw2.word_id - 1.
And with this list of bigrams, adding in the count(1) and group by gives us our bigram frequencies:
select nw1.word, nw2.word, count(1)
from numbered_words nw1
join numbered_words nw2
  on nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id
group by 1, 2
order by 3 desc
With these bigram frequencies you’ll be able to see which phrases are most frequent in your data!
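For reference, the fragments above assemble into one complete query:

```sql
-- Count bigram frequencies across all comments in user_comments.
with word_list as (
  -- Clean each comment and convert it to an array of words.
  select id as comment_id,
         string_to_array(
           regexp_replace(lower(comment), E'[^a-z0-9_]+', ' ', 'g'),
           ' '
         ) as word_array
  from user_comments
),
word_indexes as (
  -- One row per word position in each array.
  select comment_id, word_array, generate_subscripts(word_array, 1) as word_id
  from word_list
),
numbered_words as (
  -- Attach the actual word to each position.
  select comment_id, word_array[word_id] as word, word_id
  from word_indexes
)
-- Pair each word with its successor within the same comment, then count.
select nw1.word, nw2.word, count(1)
from numbered_words nw1
join numbered_words nw2
  on nw1.word_id = nw2.word_id - 1
 and nw1.comment_id = nw2.comment_id
group by 1, 2
order by 3 desc;
```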