Power BI Analytics Challenge: Part 3 – CSV Benchmarks + Filtering with R during Import

Using the R Script in Power BI does not have to be reserved for advanced transformations or for connecting to sources that Power BI does not easily ingest. Sometimes the R Script is simply faster and more convenient, depending on what you want to do. To be fair, it is often slower as well, depending on the data source and other factors, but that does not appear to be the case with large text files.

I had never paid much attention to CSV load times in Power BI before, but using the Text/CSV option under Get Data in Power BI Desktop is noticeably slower than using R for large CSV files. Another benefit of using R is that you can filter the data in your R script while loading it into Power BI, which avoids spending time loading a complete file via the CSV connector and then filtering it afterward.


REMINDER: In my first two posts on the Power BI Analytics Challenge, I laid out the various topics and mentioned that I had selected the one using the American Time Use Survey. For the Challenge, participants can submit one entry in each of the three categories by June 4. Due to time constraints, however, I chose to focus on only one category instead of competing in all three.

 

SOURCE DATA:

The ATUS data from the Bureau of Labor Statistics consists of numerous CSV files containing survey responses for daily activities and their duration. I imported two of the available CSVs for my Power BI data model.

  • Activity Summary file
    • Daily summary of minutes broken down by activity code
    • Includes extra features such as a subset of respondent data
    • Spans 2003 to 2015
    • atussum.csv
    • 156 MB
    • 170,842 rows
    • 455 columns (one column for each of the 400+ activities)
  • Respondents file
    • Full set of attributes describing survey respondents
    • Spans 2003 to 2015
    • atusresp.csv
    • 65 MB
    • 170,842 rows
    • 132 columns

 

POWER BI    vs    POWER BI + R    vs    Standalone R

Along the way, I explored how Power BI Desktop’s CSV connector compares to the R Script. A quick benchmark on my workstation with the 156 MB Activity Summary file showed a modest improvement when using Power BI’s R Script for a file this size:

  • Power BI “Text/CSV” – 146 seconds
  • Power BI “R Script” – 105 seconds
  • R – 8 seconds (not a typo)

The R code that I used for these times was a straightforward file load of the CSV’s entire contents using the readr package.

library(readr)

# Load the full Activity Summary CSV with no filtering
file.path <- "[...]/atussum.csv"
atussum <- read_csv(file.path)
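
If you want to reproduce the standalone R measurement, base R’s system.time() is a simple way to capture it. The snippet below is a minimal sketch, assuming you substitute your own local path for the placeholder; your numbers will of course vary with hardware and disk speed.

library(readr)

# Time a full, unfiltered load of the CSV in standalone R
file.path <- "[...]/atussum.csv"
timing <- system.time(
    atussum <- read_csv(file.path)
)
timing["elapsed"]  # wall-clock seconds for the load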


There is some serious overhead, though, when using the R Script in Power BI versus R alone (13x in this case!). Even with that overhead, the R Script still loaded faster than the standard CSV connector. I do not know what causes this substantial delay with the R Script, but it could have something to do with the R.Execute() function in Power Query being a wrapper for an ADO.NET query. The first Power Query / M code block below calls R without using the R Script, instead using ADO.NET directly. It is equivalent to the R Script version in the second block (and it takes just as long to run).

let
    rQuery = "library(readr)#(lf)#(lf)file.path <- ""C:/[..]/atussum.csv""#(lf)df <- read_csv(file.path)",
    Source = AdoDotNet.DataSource("R.Provider", "Key=This Is So Cool"),
    Final = Value.NativeQuery(Source, rQuery),
    Result = RData.FromBinary(Final{0}[Result]),
    df = Result[df]
in
    df

let
    Source = R.Execute("library(readr)#(lf)#(lf)file.path <- ""C:/[...]/atussum.csv""#(lf)atussum <- read_csv(file.path)"),
    atussum1 = Source{[Name="atussum"]}[Value]
in
    atussum1

 

FILTERING CSV DATA DURING IMPORT

If you do not wish to load all of the content from your CSV file(s) at once, there is an alternative: the R Script allows you to pre-filter the data. I chose to include only 2008 to 2015. My reasons for limiting both the Activity Summary and Respondents files to these years were twofold:

1) Most of the activity codes are stable from 2008 onward. Before that, the activity lexicon contains numerous notes stating that a given code had an alternate version prior to 2005, 2008, and so on.

2) I can more easily get my PBIX file below the 50 MB threshold required by the Analytics Challenge. Without excluding those early years, my PBIX was well over 50 MB using only the Activity Summary file. (This was also before trimming down the number of columns; many columns will not make it into my final report.)

Using Power BI alone, I would first have had to import the entire CSV at a cost of over two minutes, filter it by year in the Query Editor, select Close and Apply, and then wait another two minutes while the data loaded.

My import script for the Activity Summary appears below, along with additional transformation in Power BI. The tuyear variable corresponds to the survey year, and keeping only 2008 to 2015 reduces the number of records from 170,842 to 97,920. Because all of the activity codes are columns, I also performed an unpivot in Power BI. Unpivoting this data to get the activity codes onto rows instead of columns makes it easier to join to the Activity Lexicon and explore durations by activity category. The row count increases to 42.2 million because there are over 400 activities.

library(readr)
library(dplyr)

# Load the Activity Summary CSV and keep only the 2008-2015 survey years
file.path <- "C:/[...]/atussum.csv"
atussum <- read_csv(file.path) %>%
    filter(tuyear >= 2008)

Here is the same script as it appears in Power Query, including the unpivot step:

let
    Source = R.Execute("library(readr)#(lf)library(dplyr)#(lf)#(lf)file.path <- ""C:/[...]/atussum.csv""#(lf)atussum <- read_csv(file.path) %>%#(lf) filter(tuyear >= 2008)"),
    atussum1 = Source{[Name="atussum"]}[Value],
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(atussum1, {"tucaseid", "gemetsta", "gtmetsta", "peeduca", "pehspnon", "ptdtrace", "teage", "telfs", "temjot", "teschenr", "teschlvl", "tesex", "tespempnot", "trchildnum", "trdpftpt", "trernwa", "trholiday", "trspftpt", "trsppres", "tryhhchild", "tudiaryday", "tufnwgtp", "tehruslt", "tuyear"}, "Attribute", "Value")
in
    #"Unpivoted Columns"

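As an aside, the unpivot does not have to happen in Power Query at all. Below is a sketch of how the same reshape could be pushed into the R script itself using tidyr’s pivot_longer() (a different function than the M step above; it requires tidyr 1.0 or later). The id column vector mirrors the columns that Table.UnpivotOtherColumns leaves in place. I have not benchmarked this variant against the M unpivot, and at 42 million output rows it would need plenty of RAM, so treat it as an alternative rather than a recommendation.

library(readr)
library(dplyr)
library(tidyr)

# Columns to keep in place -- the same list passed to Table.UnpivotOtherColumns
id.cols <- c("tucaseid", "gemetsta", "gtmetsta", "peeduca", "pehspnon",
             "ptdtrace", "teage", "telfs", "temjot", "teschenr", "teschlvl",
             "tesex", "tespempnot", "trchildnum", "trdpftpt", "trernwa",
             "trholiday", "trspftpt", "trsppres", "tryhhchild", "tudiaryday",
             "tufnwgtp", "tehruslt", "tuyear")

file.path <- "C:/[...]/atussum.csv"
atussum <- read_csv(file.path) %>%
    filter(tuyear >= 2008) %>%
    # Rotate every activity-code column into Attribute/Value rows
    pivot_longer(cols = -all_of(id.cols),
                 names_to = "Attribute",
                 values_to = "Value")
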
The import script for the Respondents file was very similar, and it likewise reduces the record count from 170,842 to 97,920. I did not need to unpivot the Respondents data.

library(readr)
library(dplyr)

# Load the Respondents CSV and apply the same year filter
file.path <- "C:/[...]/atusresp.csv"
atusresp <- read_csv(file.path) %>%
    filter(tuyear >= 2008)

And again as it appears in Power Query:

let
    Source = R.Execute("library(readr)#(lf)library(dplyr)#(lf)#(lf)file.path <- ""C:/Users/deldersveld/Desktop/AnalyticsChallenge-ATUS/atusresp.csv""#(lf)atusresp <- read_csv(file.path) %>%#(lf) filter(tuyear >= 2008)"),
    atusresp1 = Source{[Name="atusresp"]}[Value]
in
    atusresp1

At this stage, the PBIX file was down to 50.5 MB. I still needed to review and remove some columns from the Activity Summary table that duplicated data from the Respondents file, import the Activity Lexicon, and account for all of the attribute values being keys rather than values. Don’t worry, though: I ultimately end up under the 50 MB threshold.
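
That column cleanup can also be pushed into the R script so that unwanted columns never reach Power BI in the first place. Here is a minimal sketch using dplyr’s select(); the two columns dropped below are placeholders for illustration, not the actual list of duplicates I ended up removing.

library(readr)
library(dplyr)

file.path <- "C:/[...]/atussum.csv"
atussum <- read_csv(file.path) %>%
    filter(tuyear >= 2008) %>%
    # Placeholder column names -- substitute the real duplicates
    # identified between the Activity Summary and Respondents files
    select(-c(trdpftpt, trernwa))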

 

STILL HERE? WANT A FREE eBOOK?

Did you learn something about using R in Power BI from this post? Interested in more? I had a small hand in a free eBook about using R in Power BI that BlueGranite published last week. It covers R scripts, R visuals, and “R-powered” custom visuals.

Please download a copy and spread the word!

If there is enough interest based on downloads of this initial eBook, one potential follow-up project could be a “Powe[R] BI Cookbook” with numerous examples targeted specifically at using R in Power BI for both data transformation and visualization. Please show your support for this potential future endeavor by downloading the eBook and leaving comments.

-David

 

 

All images in this post are available for free use under Creative Commons CC0 from https://pixabay.com.