Content is available and accessible everywhere these days! A study from Nielsen found that adult Americans spend over 11 hours a day reading, listening, watching and interacting with media, which may even be higher now with so many individuals stuck at home. With the influx of content available, it might make you wonder: Is there a quantitative way to take a closer look at the text available to us?
Text mining, also known as text data mining, is the process of deriving high-quality information from text. The ultimate goal is to extract numeric measures from a text variable that can be used in quantitative modeling.
Why is Text Mining Important?
Text mining can be used for anything from finding simple patterns to much more complex sentiment analysis. Basic statistics can be used for simple analyses like counting the number of times a word is mentioned or capturing the number of words in all capital letters.
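To make those basic statistics concrete, here is a minimal sketch in plain Python. It is not Minitab's implementation; the sample sentence and the variable names are made up for illustration, and the tokenizing rule (letters and apostrophes only) is an assumption.

```python
import re
from collections import Counter

# Hypothetical sample text, not taken from the wine reviews in this article.
text = "I LOVE this wine. The wine is GREAT and I love the finish."

# Split the text into word tokens (letters and apostrophes only).
tokens = re.findall(r"[A-Za-z']+", text)

# Count how often each word is mentioned, ignoring case.
word_counts = Counter(token.lower() for token in tokens)

# Collect words written entirely in capital letters.
# Single letters such as "I" are skipped so they do not inflate the count.
all_caps = [token for token in tokens if len(token) > 1 and token.isupper()]

print(word_counts["wine"])  # number of times "wine" is mentioned
print(all_caps)             # words in all capital letters
```

Lowercasing before counting treats "LOVE" and "love" as the same word, while the all-caps check runs on the original tokens so the emphasis is not lost.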
Once you capture the summary statistics, you can use visualizations like bar charts to show the most frequently occurring words graphically or word clouds to show a powerful image of them. This is particularly helpful if you want to get a sense of the feelings and attitudes around a product or process.
Good news! You can tap into text mining as it's now available with the new Python Integration in the latest version of Minitab Statistical Software.
Bringing Text to Life: Tapping into Wine Reviews and Inverse Document Frequency
For illustration purposes, let’s use a simple example of analyzing five different reviews about a certain type of wine. By running the analysis through Minitab using a call to Python, you can get an easy-to-read table of summary statistics that looks like this:
As you can see, out of the five reviews, the word “wine” appeared three times while the word “love” appeared twice with all the other words appearing only once. Minitab also provides the Inverse Document Frequency (IDF) for each word which is calculated as follows:
IDF = ln (N/DF)
With N = the number of observations (in this case five for the five total reviews) and DF = the number of documents where a given word occurs.
Mathematically speaking, a word that is present in all observations has N/DF = 1, so its IDF = ln(1) = 0. The word with the lowest IDF is therefore the one present in the most observations, whereas a word that is present in only one observation has the largest possible IDF, ln(N).
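The IDF formula above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the code Minitab runs. The five reviews below are invented (they are not the reviews behind this article's table), and the helper names `document_frequency` and `idf` are hypothetical.

```python
import math
import re

# Hypothetical reviews for illustration only; not the article's data.
reviews = [
    "I love this wine",
    "Great wine for the price",
    "Too sweet for me",
    "We love the smooth finish",
    "This wine pairs well with cheese",
]

def document_frequency(docs):
    """Return DF: the number of documents in which each word occurs."""
    df = {}
    for doc in docs:
        # Use a set so a word repeated within one document counts once.
        for word in set(re.findall(r"[a-z']+", doc.lower())):
            df[word] = df.get(word, 0) + 1
    return df

def idf(n_docs, doc_freq):
    """IDF = ln(N / DF), using the natural logarithm as in the formula above."""
    return math.log(n_docs / doc_freq)

n = len(reviews)  # N = 5 observations
df = document_frequency(reviews)
idf_scores = {word: idf(n, count) for word, count in df.items()}

# List the words from lowest IDF (most widespread) to highest.
for word in sorted(idf_scores, key=idf_scores.get):
    print(f"{word}: DF={df[word]}, IDF={idf_scores[word]:.4f}")
```

With N = 5, a word found in three reviews has IDF = ln(5/3), roughly 0.51; a word found in two reviews has ln(5/2), roughly 0.92; and a word found in a single review has the maximum, ln(5), roughly 1.61.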
In this case, it is clear that wine has the lowest IDF because it appears in the most reviews. Based on these summary statistics, we can conclude that more people love the wine than not, and in general, the reviews are positive.
For those of us who are more visual, we can also see this sample analysis in the word cloud:
As you can see, wine is the most common and therefore largest word, but glancing at the word cloud will give you a positive sense from the overall reviews.
Try It For Yourself
Text mining is implemented using the new Python connectivity available in Minitab. Don't worry if you have never used Python before — we supply the Python installation and usage instructions (find everything you need to know about Python integration here). Once the extension has been successfully installed, it's easy to continue executing standard text mining tasks in Minitab.
Want to learn to do more with Python in Minitab? Check out our help example or talk to Minitab for more advanced work like sentiment analysis, bag of words, and latent semantic analysis!