Power BI — 3 Data Preparation Tips to Enhance your Word Cloud

The Microsoft Power BI team recently unveiled a new Word Cloud custom visual for the Gallery. Read the formal announcement and see the short video by Amir Netz.

As Amir mentions, word clouds can be used to quickly display unstructured text. While they fall short in terms of precision, the ability to roughly judge frequency based on size can be helpful for initial text exploration. The current Power BI visual offers options to control the maximum number of words, “word breaking”, font size, rotation and stop words. Color and font options are curiously absent. For more detail on how these options work, see Amir’s video.

From a casual perspective, the Power BI word cloud will be enough to get a sense of your text, create some data art, and move on. For more serious text analysis, it is worth creating a wordcloud with R in Power BI or avoiding it in favor of a more precise method. R offers packages for text mining and a more robust word cloud (check out tm and wordcloud). For the large majority of Power BI users though, the new Word Cloud custom visual will work fine. Here’s a test: if you have never heard of stemming, you are free to enjoy the Power BI word cloud to the full.

Problems with Raw Text

As with any data work, there is basic wrangling and reshaping that often needs to take place. With text data, this is especially true. In Amir’s example, he pulled sample text from tweets about Power BI into the following word cloud.

Power BI Word Cloud

If you are like most observers, you probably focused on the large words. You may have casually glanced around to get a feel for the smaller words as well. In terms of color, there is too much variation to differentiate. Without the option to control colors right now, they do not mean anything anyway.

Now take a few extra seconds to look more closely. Notice anything else?

What about case sensitivity? Roadmap versus roadmap, Azure versus azure, etc. Word clouds are used to roughly judge frequency, but it becomes hard to gauge a word’s frequency relative to other words when it appears multiple times in the same cloud. Maybe the two roadmaps together occur just as frequently as Microsoft. We will never know. Raw, unstructured text is typically littered with initial problems like this that require basic data preparation to get better value from your visual.

Roadmap and roadmap are the same word

Use Power Query to Prepare your Text Data

Fortunately, Power BI has a built-in way to handle many of the quality issues associated with raw text: Power Query (or that tool which has been renamed and which we refuse to stop calling Power Query). It won’t handle anything like misspellings or breaking down words to their stems, but it will help present a better visual with only a few basic steps.

Tip 1: Use Lowercase (or Title Case)

The easiest way to solve duplicates caused by case sensitivity in the aforementioned example is to use lowercase formatting. Title case (“Capitalize Each Word”) could be an option as well, but may make the word cloud a bit more difficult to read. Uppercase could also be selected, but THERE ARE TONE CONSIDERATIONS THAT YOU HAVE TO KEEP IN MIND (no one appreciates an angry word cloud). It could also be argued that upper case is more difficult to read at a glance than the other two types.

In Power Query, perform these steps in only two clicks by selecting Formatand then lowercase or Capitalize Each Word.

Convert to lowercase

Tip 2: Manually Account for Punctuation if Needed

The Word Cloud visual accounts for most punctuation, but it is not perfect. For example, the following input has an ellipsis at the end of the word “cloud”. When the word “cloud…” is placed in a word cloud, most of the punctuation is removed, but “cloud…” and “cloud” are considered distinct. Even attempting a stop word of “…” will not work because “cloud…” is considered one word.

Source data with ellipsis

“cloud” and “cloud…”

In Power Query, this problem can be remedied using Replace Values. If your dataset is being used for more than a word cloud, it would probably be best to Duplicate Column, and then Replace Values. That way, your cleaner text can be used for the word cloud while the raw text is still available if needed elsewhere.

Use Duplicate Column if you need to preserve the original text

Use Replace Values to remove stubborn characters

Tip 3: Duplicate Words Within the Same Record

What happens if the same word appears multiple times in the same record? This situation could be common in longer text fields, and it leads to repetition in the word cloud. Using an extreme example, I have “word” repeated four times in one record and then again in the second. The Word Cloud visual shows a large “word” (frequency across multiple records) and several smaller “word” values (frequency within the same record).

Source data with word repetition in same record

Inline repetition leads to word cloud repetition

The solution in this case should reduce everything down to one “word” in the cloud.

In Power Query, the solution can be attempted by using Transpose to change rows to columns, Split to parse the words in each column into separate columns, and then Transpose again so that we are left with one column.NOTE: Be careful with this as it may create some problems if you are working with a lot of records. This is not ideal (if anyone has any other thoughts using either the ribbon or M, please share).

Final Thoughts

Based on the recent activity in the Power BI Custom Visual Gallery, I think that I need to revisit my initial assessment in the BlueGranite blog written at the time that the Gallery launched:

Is Power BI a business-centered product that should only offer core visualizations that align with expert opinions and research? Should Power BI go to the other extreme and offer more capabilities for data art and unlimited creativity? To a large extent, it no longer matters from a product perspective.

That holds true, but not in the way that I had envisioned. Since the Gallery launch, we have seen new weekly custom visuals published by Microsoft like the Chiclet Slicer, Timeline Slicer, and now the Word Cloud. I had assumed that independent developers would be the ones pushing the boundaries while the Power BI team might take a more conservative approach to content creation. Instead, do you want eye candy that’s somewhat informational? There’s the Word Cloud. Creative and functional? Chiclet. Traditional with a twist? Timeline. There’s no guessing what they will unleash next week.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s