Getting Started with Data Analysis and Visualization with SQL Server and R


By: Nai Biao Zhou | Last Updated: 2019-01-11 | Related Tips: More > SQL Server 2017

Problem

Sarka [1] pointed out that, in a real-life data science project, professionals spend about 70% of their time on data overview and data preparation. SQL Server professionals already know how to query, transform, and cleanse data, so they can work on data science projects after mastering some statistical techniques. Furthermore, SQL Server 2017 Machine Learning Services (SQL MLS) allows them to write SQL stored procedures containing R code, which gives them access to the wide variety of statistical and graphical techniques provided by R; the powerful graphics capabilities in R fascinate many users [2]. Professionals with extensive SQL Server knowledge can pick up these new skills quickly through an exercise, which this tip provides.

Solution

We are going to create a demographic dashboard that gives a comprehensive overview of customers. Knowing its customers well can bring a significant competitive advantage to a business. Demographic characteristics, for example educational experience level, occupation, and total purchase amount, help the business develop effective marketing strategies.

In this exercise, we will construct a frequency bar graph to display the number of customers in each occupation. Then we will introduce the concepts of relative frequency and cumulative relative frequency and construct a less-than cumulative relative frequency diagram of customer educational experience levels. After introducing the concept of the percentile, we will use a boxplot to show the positions of outliers and a five-number summary of individual customer total purchases in 2013. Finally, we will group the total purchase amounts into classes and construct a histogram to represent the frequency distribution of individual customer total purchases in 2013.

We are going to use data from the AdventureWorks sample database "AdventureWorks2017.bak" [3]. We will use "R Tools for Visual Studio (RTVS)", introduced in my other tip "Getting Started with Data Analysis on the Microsoft Platform — Examining Data".

The solution was tested with SQL Server Management Studio V17.4, Microsoft Visual Studio Community 2017, Microsoft R Client 3.4.3, and Microsoft ML Server 9.3 on Windows 10 Home 10.0. The DBMS is Microsoft SQL Server 2017 Enterprise Edition (64-bit).

1 – Comparing Nominal, Ordinal, Interval and Ratio Variables

When we use a dashboard to exhibit customer demographic data, we face different types of measurement. Educational experience level and purchase amount, for example, are measured on different scales. Stevens' classification of scales of measurement [4] has been widely used to define the four types of variables in statistics: nominal, ordinal, interval, and ratio.

A nominal variable, also called a categorical variable, represents data by name. Customer occupation, for instance, is a nominal variable with a small number of discrete values. We can determine whether two values are equal, but we cannot order them. Sometimes we may use numeric values to represent categories; for example, we may use the numeric values 0 and 1 to represent the two genders, female and male, respectively. These numeric values are labels only; we cannot order them from lowest to highest.

An ordinal variable preserves rank ordering in the data, but the difference between rankings is not defined. The five levels of educational experience in the AdventureWorks database, for example, are ordered as partial high school, high school, partial college, bachelors, and graduate degree. The category high school means more educational experience than partial high school, and the category graduate degree means more educational experience than bachelors. However, the difference between high school and partial high school is not necessarily the same as the difference between graduate degree and bachelors.

An interval variable is quantitative in nature: the difference between two values is meaningful. A typical example is the measure of temperature, where equal intervals of temperature correspond to equal volumes of mercury expansion. For example, if the temperature in three rooms A, B, and C is 12 degrees, 14 degrees, and 16 degrees, respectively, the temperature difference between rooms B and A is the same as the difference between rooms C and B. However, the zero value of an interval variable does not indicate the absence of the quantity; zero degrees on the Celsius scale, for example, does not mean there is no temperature in a room. In addition, it may not make sense to apply multiplication and division to interval variables; for instance, the statement that 30 degrees is twice as warm as 15 degrees is not accurate.

A ratio variable has all the characteristics of an interval variable and, in addition, a meaningful absolute zero value. A zero-income customer, for example, is a customer with no income. Therefore, we can construct a meaningful fraction (or ratio) with a ratio variable. Another example of a ratio variable is weight: if box A weighs 5 pounds and box B weighs 10 pounds, we can say box B is twice as heavy as box A.

It is important to know how these four types of variables differ because the statistical and graphical techniques used in an analysis depend on the type of variable. For example, it would not make sense to compute an average of customer occupations. Table 1 presents a simplified version of Stevens' classification table [4]. The value "Yes" in the table denotes that we can apply the corresponding statistical technique to the variable type, and the value "No" indicates that we cannot.

Statistical Technique | Nominal | Ordinal | Interval | Ratio
Frequency Distribution | Yes | Yes | Yes | Yes
Mode | Yes | Yes | Yes | Yes
Contingency Correlation | Yes | Yes | Yes | Yes
Median | No | Yes | Yes | Yes
Percentiles | No | Yes | Yes | Yes
Mean | No | No | Yes | Yes
Standard Deviation | No | No | Yes | Yes
Rank-order Correlation | No | No | Yes | Yes
Product-moment Correlation | No | No | Yes | Yes
Coefficient of Variation | No | No | No | Yes

Table 1 – The differences among nominal, ordinal, interval, and ratio variables [4]

Variables in statistics can also be broadly classified into two types: qualitative and quantitative. Nominal and ordinal variables are the two sub-classifications of qualitative variables; interval and ratio variables are the two sub-classifications of quantitative variables.
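To make these distinctions concrete in R, the language we use throughout this tip, here is a minimal sketch; the values are illustrative and are not taken from the AdventureWorks database:

# Nominal: unordered labels; only equality tests and counting make sense
occupation <- factor(c("Manual", "Clerical", "Manual"))

# Ordinal: ordered labels; comparisons are meaningful, differences are not
education <- factor(c("High School", "Bachelors", "High School"),
                    levels = c("High School", "Bachelors"),
                    ordered = TRUE)

# Interval: differences are meaningful, but there is no absolute zero
temperature <- c(12, 14, 16)

# Ratio: an absolute zero exists, so ratios are meaningful
weight <- c(5, 10)

table(occupation)            # frequency distribution: valid for all four types
education[1] < education[2]  # ordering: valid from ordinal up; TRUE here
mean(temperature)            # mean: valid from interval up
weight[2] / weight[1]        # ratio: valid only for ratio variables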

2 – Analyzing a Nominal Variable

With knowledge of customer occupations, a business can promote additional products or services to existing customers, a practice known as cross-selling. Five occupations have been identified in the AdventureWorks system: clerical, management, manual, professional, and skilled manual. These are the values taken by the occupation variable, and each value represents a category called a class. We can then answer the business's question by using the class frequency, which is the number of observations in the data set falling in a class [5].

2.1 Construct a Frequency Distribution for a Nominal Variable

Let's randomly pick 20 customers' occupations from the database, as shown in Table 2.

1 Manual | 6 Manual | 11 Skilled Manual | 16 Management
2 Professional | 7 Clerical | 12 Clerical | 17 Skilled Manual
3 Management | 8 Management | 13 Professional | 18 Management
4 Clerical | 9 Professional | 14 Professional | 19 Skilled Manual
5 Professional | 10 Skilled Manual | 15 Manual | 20 Professional

Table 2 – Occupation data on 20 customers

We count the occurrences of each occupation; these counts are the class frequencies for the 5 classes. For comparison, a class frequency can be expressed as a fraction of the total number of observations, and these fractions constitute a relative frequency distribution [6]. Table 3 shows the two ways we summarized the nominal variable: the class frequency and the class relative frequency. We can easily identify that the value "Professional" appears 6 times in the sample data set; this value is the mode of the data set, another statistical measure.

Class | Frequency | Relative Frequency
Clerical | 3 | 0.15
Management | 4 | 0.20
Manual | 3 | 0.15
Professional | 6 | 0.30
Skilled Manual | 4 | 0.20

Table 3 – Frequency table for occupation data on 20 customers

The summary table reveals that 30% of customer occupations are professional, 15% are clerical, and another 15% are manual. This may prompt the business to add more products or services that professionals are likely to buy. The table also suggests that the business should carry out marketing activities to attract customers whose occupations are clerical or manual.
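Before automating this in the database, we can verify the numbers in Table 3 with a few lines of R. This is a small sketch using the 20 sampled occupations from Table 2:

# The 20 occupations from Table 2, in sample order
occupation <- c("Manual", "Professional", "Management", "Clerical", "Professional",
                "Manual", "Clerical", "Management", "Professional", "Skilled Manual",
                "Skilled Manual", "Clerical", "Professional", "Professional", "Manual",
                "Management", "Skilled Manual", "Management", "Skilled Manual", "Professional")

freq <- table(occupation)   # class frequencies: 3, 4, 3, 6, 4
prop.table(freq)            # relative frequencies: 0.15, 0.20, 0.15, 0.30, 0.20
names(which.max(freq))      # the mode: "Professional"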

2.2 Construct a Frequency Distribution through SQL MLS

Since there are 18,508 individual customers in the AdventureWorks database, it is inefficient to count the occurrences of each customer occupation manually. R provides a function, "table()", for creating a frequency table. We are going to run R code within a SQL stored procedure to obtain the frequency table, using the "R Tools for Visual Studio sample projects" as a starting point. This sample project from Microsoft provides an in-depth introduction to R through the extensive comments in two source files [7]. If you are a first-time user of RTVS, my other tip "Getting Started with Data Analysis on the Microsoft Platform — Examining Data" provides a step-by-step procedure to create a stored procedure containing R code.

2.2.1 Create a New Stored Procedure “sp_analyze_nominal_variable”

Add a new stored procedure to the sample project. As shown in Figure 1, three files are added to the project. We are going to write R code in the R file, write a database query in the query file, and modify the template file to integrate the R code and the query into a stored procedure.


Figure 1 – Add a new stored procedure in the project through RTVS

2.2.2 Write a SQL Query to Retrieve Customer Occupations

Open the file "sp_analyze_nominal_variable.Query.sql" and place the following SQL code into the file. The query retrieves every individual customer's occupation from the database.

SELECT p.Occupation
 FROM [Sales].[vIndividualCustomer] c inner join [Sales].[vPersonDemographics] p
            ON c.BusinessEntityID = p.BusinessEntityID

2.2.3 Create a Frequency Table in R

Open the R script file "sp_analyze_nominal_variable.R" and use the following R code to replace the existing code. Note that the test code is un-commented so that we can test the R code through RTVS; we must comment it out before publishing the stored procedure.

# @InputDataSet: input data frame, result of SQL query execution
# @OutputDataSet: data frame to pass back to SQL
 
# Test code
library(RODBC)
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel, iconv(paste(readLines('your work folder/visualize data/sp_analyze_nominal_variable.query.sql', encoding = 'UTF-8', warn = FALSE), collapse = '\n'), from = 'UTF-8', to = 'ASCII', sub = ''))
odbcClose(channel)
 
# Create one-way contingency table from one categorical variable
freq_table <- table(InputDataSet$Occupation)
 
# Convert a table with counts to a table with proportions
relative_freq_table <- prop.table(freq_table)
 
# Get the category names in the contingency table
category_name <- names(freq_table)
 
# Convert a table data to a vector
vector_freq_table <- as.vector(freq_table)
 
# Convert a table data to a vector
vector_relative_freq_table <- as.vector(relative_freq_table)
 
# Construct a data-frame as the output data set
OutputDataSet <- data.frame("Occupation" = category_name, "Frequency" = vector_freq_table, "Relative Frequency" = vector_relative_freq_table)

Add a database connection to the project; this step creates an R file, "Settings.R". For convenience, we add another variable, "dbConnection", to the source code, as shown below:

# Application settings file.
# File content was generated on 2018-11-16 9:09:31 PM.
settings <- as.environment(list())
# [Category] SQL
# [Description] Database connection string
# [Editor] ConnectionStringEditor
settings$dbConnection <- 'Driver={SQL Server};Server=(local);Database=AdventureWorks2017;Trusted_Connection=yes'
dbConnection <- 'Driver={SQL Server};Server=(local);Database=AdventureWorks2017;Trusted_Connection=yes'

We should run the Settings.R file to assign the connection string to the variable "dbConnection". Then we can test the R code in the file "sp_analyze_nominal_variable.R". After running the R code, we obtain the value of the variable "OutputDataSet" in the R interactive window, as shown in Figure 2.


Figure 2 – Output in the R interactive window

2.2.4 Modify the Stored Procedure Template

Open the template file "sp_analyze_nominal_variable.Template.sql" and use the following script to replace the existing code.

CREATE PROCEDURE [sp_analyze_nominal_variable]
AS
BEGIN
EXEC sp_execute_external_script @language = N'R'
    , @script = N'_RCODE_'
    , @input_data_1 = N'_INPUT_QUERY_'
--- Edit this line to handle the output data frame.
    WITH RESULT SETS
   ((
      Occupation         nvarchar(50),
      Frequency          int,
      Relative_Frequency decimal(12,10)
   ));
END;

2.2.5 Publish the Stored Procedure to the Database

Before publishing the stored procedure, we should comment out the test code in the R file, as shown in Figure 3. Then we use the "Publish Stored Procedure" command from the menu "R Tools -> Data" to publish the stored procedure.


Figure 3 – Comment out the test code before publishing

2.2.6 Run the Stored Procedure

Run the stored procedure through SSMS. The result should look like Figure 4. The result reveals that the most common customer occupation is professional, and the least common is manual.


Figure 4 – Run the stored procedure through SSMS

2.3 Create a Frequency Distribution Bar Chart

Although the result shown in Figure 4 adequately describes the customer occupations in the database, business users may prefer a graphical presentation. One of the most widely used graphical methods for describing qualitative data is the bar graph [5]. We can use the plot function in the base installation to construct an occupation frequency bar chart.

Run the R function "ls()" in the interactive window to ensure the variable "InputDataSet" still exists. Then run the following R code in the R interactive window to obtain the bar chart shown in Figure 5. The bar chart makes it obvious that the number of professionals is the highest and the number of manual workers is the lowest.

plot(InputDataSet$Occupation, main = "Occupation",
     xlab = "Occupation",
     ylab = "Number of Customers",
     col = "blue")


Figure 5 – A bar chart for customer occupation frequency distribution

3 – Analyzing an Ordinal Variable

The AdventureWorks system defines a list of educational experience levels, ordered from lowest to highest: partial high school, high school, partial college, bachelors, and graduate degree. In addition to asking for the number of customers at each level, business users also ask, for example, how many customers are below a given level. The solution is to look at the cumulative frequency distribution, which shows a running total of the frequencies in the distribution. The running total of the relative frequencies is called a cumulative relative frequency distribution [6].

3.1 Construct a Cumulative Frequency Distribution for an Ordinal Variable

Let's randomly select 20 customers' educational experience levels from the AdventureWorks database. Since the order of levels is meaningful, we can organize the raw data in ascending order, as shown in Table 4. From the array, we can immediately read off some summary statistics. The lowest level is high school and the highest level is graduate degree. The median, the value at the halfway point, is bachelors. The mode, the most frequently occurring value, is also bachelors. We usually do not compute the mean of an ordinal variable.

1 High School | 6 Partial College | 11 Bachelors | 16 Bachelors
2 High School | 7 Partial College | 12 Bachelors | 17 Bachelors
3 High School | 8 Partial College | 13 Bachelors | 18 Graduate Degree
4 Partial College | 9 Partial College | 14 Bachelors | 19 Graduate Degree
5 Partial College | 10 Bachelors | 15 Bachelors | 20 Graduate Degree

Table 4 – Educational levels of 20 customers

To obtain the cumulative distribution, we compute the running total of the class frequencies. The cumulative relative frequency is obtained by dividing the running total by the total number of observations in the data set, as shown in Table 5.

Class | Frequency | Cumulative Frequency | Cumulative Relative Frequency
High School | 3 | 3 | 0.15
Partial College | 6 | 9 | 0.45
Bachelors | 8 | 17 | 0.85
Graduate Degree | 3 | 20 | 1.00

Table 5 – Cumulative frequency table for educational experience levels of 20 customers
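R's cumsum() function, which we will run against the full database in the next section, reproduces these running totals. A quick sketch using the class frequencies from Table 5:

# Class frequencies of the 20 sampled education levels (Table 5)
freq <- c("High School" = 3, "Partial College" = 6, "Bachelors" = 8, "Graduate Degree" = 3)

cumulative_freq <- cumsum(freq)                     # 3 9 17 20
cumulative_rel_freq <- cumulative_freq / sum(freq)  # 0.15 0.45 0.85 1.00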

3.2 Construct a Cumulative Frequency Distribution through R

The R function "cumsum()" computes cumulative sums. Let's add a new script file, "compute_cumulative_frequency_distribution.R", to the project and replace the content of the file with the following R code:

library(RODBC)
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel,
                         "
      SELECT rtrim(ltrim(p.Education)) AS Education
      FROM [Sales].[vIndividualCustomer] c inner join [Sales].[vPersonDemographics] p
           ON c.BusinessEntityID = p.BusinessEntityID                                
                         ")
 
odbcClose(channel)
 
# R sorts strings in alphabetical order by default.
# We use the factor() function to inform R about the correct order
Education_Level = factor(InputDataSet$Education, order = TRUE,
                   levels = c("Partial High School", "High School", "Partial College", "Bachelors", "Graduate Degree"))
 
# Create one-way contingency table from one categorical variable
freq_table <- table(Education_Level)
 
# Get cumulative frequencies
cumulative_freq <- cumsum(freq_table)
 
# Get cumulative relative frequencies
cumulative_relative_freq <- cumulative_freq / nrow(InputDataSet)
 
# Get the category names in the contingency table
category_name <- names(freq_table)
 
# Convert a table data to a vector
vector_cumulative_freq <- as.vector(cumulative_freq)
 
# Convert a table data to a vector
vector_cumulative_relative_freq <- as.vector(cumulative_relative_freq)
 
# Construct a dataframe as the output data set
OutputDataSet <- data.frame("Education" = category_name, 
                            "Cumulative Frequency" = vector_cumulative_freq, 
                            "Cumulative Relative Frequency" = vector_cumulative_relative_freq)
print(OutputDataSet)

The output in the R interactive window should look like Figure 6. More than half of the customers do not hold a bachelor's degree.


Figure 6 – Output in the R interactive window

3.3 Create a Cumulative Relative Frequency Diagram

Although the output shown in Figure 6 summarizes the customer educational experience levels in the database, business users may prefer a graphical presentation. We can construct a less-than cumulative relative frequency graph by plotting the less-than cumulative relative frequency against the upper class limit [6].

To demonstrate the less-than cumulative frequencies, I re-organized the data shown in Figure 6 into the less-than cumulative frequencies shown in Table 6. Note that the less-than cumulative frequencies reflect the number of customers below a particular educational experience level. For example, 54% of customers have not obtained a bachelor's degree.

Educational Experience | Frequency | Less-than Cumulative Frequency | Less-than Cumulative Relative Frequency
No educational experience to less than Partial High School | 0 | 0 | 0.00
Partial High School to less than High School | 1582 | 1582 | 0.09
High School to less than Partial College | 3297 | 4879 | 0.26
Partial College to less than Bachelors | 5075 | 9954 | 0.54
Bachelors to less than Graduate Degree | 5362 | 15316 | 0.83
Graduate Degree to less than the Next Higher Degree | 3192 | 18508 | 1.00

Table 6 – Less-than cumulative frequencies and less-than cumulative relative frequencies

Run the following R code in the R interactive window to obtain the diagram shown in Figure 7. The x-axis tick values, from 1 to 6, correspond to the upper limits in Table 6. From the less-than cumulative relative frequency diagram, we can easily find the percentage of customers whose educational experience is below a chosen level. For example, 83% of customers do not have a graduate degree.

education_level <- seq(1, length(category_name))
plot(education_level, vector_cumulative_relative_freq,
                    main = "Less-than Cumulative Relative Frequency Diagram",
                    xlab = "Educational Experience Level",
                    ylab = "Cumulative Relative Frequency")
lines(education_level, vector_cumulative_relative_freq)


Figure 7 – Less-than cumulative relative frequency diagram of educational experience levels

4 – Analyzing a Ratio Variable

Business users want to analyze individual customers' historical purchases to gain insight into customer buying behavior. Thus, the total purchase amount of each individual customer in the past year is of interest to them. In this case, the total purchase amount is the variable of interest, and it is numerical in nature. A total purchase amount of zero indicates that the customer did not purchase anything from AdventureWorks last year, so the total purchase amount in the last year is a ratio variable. A ratio variable has all the characteristics of an interval variable; therefore, the statistical techniques used in this section apply to both interval and ratio variables.

4.1 Percentiles

Business users like to identify the most profitable customers and design marketing efforts to keep these customers happy. Customer profitability is defined as the difference between the revenues and the costs associated with a customer. In this exercise, we are going to look at the customer total purchase amount and compare the total purchase amount of one customer in 2013 to the other customers' purchase amounts in the same year.

We assume the total purchase amount of a customer is $3,084 in 2013. To find the percentage of customers whose total purchase amount in 2013 is less than $3,084, we can use a positional measure from statistics, the percentile. There is no universal definition of the percentile; we are going to use the definition introduced in [6]. Percentiles are values that divide a data set into 100 equal parts. The value at the x-th percentile is denoted P_x. In the given example, if P_85 = 3084, we can say that roughly 85% of customers spent less than $3,084 and 15% of customers spent more.

We can compute percentiles for ordinal, interval, or ratio variables. The first step is to arrange the raw data into an ascending array; we then use the following equation to compute the position of a percentile in the array [6]:

position of P_x = x(N + 1)/100

where P_x denotes the value at the x-th percentile, x denotes the number of the desired percentile, and N denotes the total number of observations in the data set. The last step is to find the value at that position, interpolating between the two neighboring values when the position is not a whole number.

Having introduced percentiles, we can use a five-number summary, consisting of the minimum value, Q1 (P_25), the median (P_50), Q3 (P_75), and the maximum value, to describe a set of data. We will compute a five-number summary of the individual customer total purchase amounts in 2013.

I used the following SQL script to randomly pick 56 records from the AdventureWorks database. The query results are presented in Table 7.

DECLARE @purchase_totals TABLE
(
  purchase_total money
)
 
INSERT INTO @purchase_totals
SELECT sum([SubTotal]) AS Total_Purachase_2013
FROM [Sales].[SalesOrderHeader] o INNER JOIN [Sales].[Customer] c
                                  ON o.CustomerID = c.CustomerID
WHERE c.PersonID is not null
      and year([OrderDate])= 2013
Group by c.PersonID
 
SELECT top 0.5 percent purchase_total
FROM @purchase_totals
ORDER BY NEWID()
2457.33 | 2983.34 | 564.99 | 132.97 | 69.97 | 751.34 | 23.78 | 34.98
37.27 | 162689.36 | 2458.92 | 3121.30 | 4.99 | 2369.97 | 2492.32 | 1779.47
96.46 | 103.48 | 69.99 | 24.99 | 33.98 | 39.98 | 86.45 | 83.95
2334.97 | 23.78 | 828.47 | 782.99 | 4217.30 | 1249.84 | 4.99 | 34.47
3084.02 | 553.97 | 63.97 | 123.21 | 3360.42 | 68243.95 | 135970.05 | 2294.99
101.45 | 139462.79 | 106.95 | 23.78 | 68.49 | 42.28 | 2365.94 | 49.97
1700.99 | 726.27 | 1382.97 | 2466.32 | 2479.94 | 2511.32 | 323.99 | 34.99

Table 7 – 56 randomly selected customer total purchase amounts in 2013

To summarize the raw data in Table 7, a simple method is to organize the data into an array, in which the values are in ascending or descending order of magnitude [6]. Sorting these values in Excel produces Table 8. We also find that the mean of the data set is 9,991.63.

4.99 | 34.47 | 63.97 | 101.45 | 564.99 | 1700.99 | 2458.92 | 3121.30
4.99 | 34.98 | 68.49 | 103.48 | 726.27 | 1779.47 | 2466.32 | 3360.42
23.78 | 34.99 | 69.97 | 106.95 | 751.34 | 2294.99 | 2479.94 | 4217.30
23.78 | 37.27 | 69.99 | 123.21 | 782.99 | 2334.97 | 2492.32 | 68243.95
23.78 | 39.98 | 83.95 | 132.97 | 828.47 | 2365.94 | 2511.32 | 135970.05
24.99 | 42.28 | 86.45 | 323.99 | 1249.84 | 2369.97 | 2983.34 | 139462.79
33.98 | 49.97 | 96.46 | 553.97 | 1382.97 | 2457.33 | 3084.02 | 162689.36

Table 8 – Sorted total purchase amounts in 2013 (values ascend down the columns)

Table 8 reveals that the lowest value in the data set is 4.99 and the highest value is 162689.36. The range, which is the difference between the two extreme values, is 162684.37. The value at the halfway point, the median or P_50, lies between 553.97 and 564.99. The median can be taken as the average of the values at the two middle positions, so the median of this data set is (553.97 + 564.99)/2 = 559.48.

The 25th percentile is also called the first quartile, denoted Q1, and can be computed in two steps:

The first step is to find the position number of the percentile:

position of Q1 = 25 × (56 + 1)/100 = 14.25

The second step is to find the value at that position. The value at the 14th position in the array is 49.97 and the value at the 15th position is 63.97, thus:

Q1 = 49.97 + 0.25 × (63.97 – 49.97) = 53.47

The 75th percentile is also called the third quartile, denoted Q3, and can be computed in the same way as Q1:

position of Q3 = 75 × (56 + 1)/100 = 42.75

The value at the 42nd position in the array is 2457.33 and the value at the 43rd position is 2458.92, thus:

Q3 = 2457.33 + 0.75 × (2458.92 – 2457.33) = 2458.52

Here is the five-number summary of the sample data set:

  • the minimum value: 4.99
  • Q1: 53.47
  • median: 559.48
  • Q3: 2458.52
  • the maximum value: 162689.36

It is worth noting that two functions in R, summary() and fivenum(), can compute the five-number summary. Since R uses a different algorithm to estimate the underlying distribution, the Q1 and Q3 values they return may differ slightly from the results we computed by hand.

When a distribution has extremely high and low values, like the sample data set, it is more useful to focus on the range covered by the middle 50% of the observed values. The range where the middle 50% of the values lie is called the interquartile range (IQR), which is computed as:

IQR = Q3 – Q1

The interquartile range for the sample data is 2458.52 – 53.47 = 2405.05, which is much less than the range of the data set, 162684.37.
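The following sketch shows these functions in action, assuming the variable purchases holds the 56 sample values from Table 8; quantile() with type = 6 uses the x(N + 1)/100 position formula, matching our hand computation:

fivenum(purchases)   # Tukey's five-number summary
summary(purchases)   # minimum, Q1, median, mean, Q3, maximum

# quantile() supports several estimation algorithms; type 6 interpolates
# at position (N + 1)x/100, as in the hand computation above
quantile(purchases, probs = c(0.25, 0.50, 0.75), type = 6)
IQR(purchases, type = 6)   # interquartile range: Q3 - Q1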

4.2 Create a Boxplot

We have computed the five-number summary of the sample data by hand. A boxplot, as shown in Figure 8, is used to visualize this summary. In addition, a boxplot can identify outliers, values that are abnormally different from the other values in the data set.


Figure 8 – Boxplot components

4.2.1 Construct a Boxplot

Statistical tools like R and SAS can plot a boxplot from a data set. Unlike a line graph or a bar graph, the information represented by a boxplot is not obvious to a first-time user. To learn how to read and construct a boxplot, let's create one from the data in Table 8 by hand.

Step 1: Compute Q1, the median, Q3, and the IQR

We already know:

• Q1 = 53.47
• Median = 559.48
• Q3 = 2458.52
• IQR = 2405.05

Step 2: Compute the Lower Inner Fence, Upper Inner Fence, Lower Outer Fence, and Upper Outer Fence

• Lower Inner Fence = Q1 – 1.5 * IQR = -3554.11
• Upper Inner Fence = Q3 + 1.5 * IQR = 6066.10
• Lower Outer Fence = Q1 – 3.0 * IQR = -7161.68
• Upper Outer Fence = Q3 + 3.0 * IQR = 9673.67

Step 3: Identify the positions of the Lower Whisker and the Upper Whisker

• The Lower Whisker is drawn to the most extreme point that is greater than or equal to the Lower Inner Fence, i.e. -3554.11. Since Table 8 is sorted, the first value, 4.99, the least value in the data set, is greater than the Lower Inner Fence. Thus, the Lower Whisker is at 4.99.
• The Upper Whisker is drawn to the most extreme point that is less than or equal to the Upper Inner Fence, i.e. 6066.10. The value at the 52nd position, 4217.30, is the greatest value that is less than the Upper Inner Fence. Thus, the Upper Whisker is at 4217.30.

Step 4: Identify the extreme outliers

All values lying outside the outer fences are extreme outliers, marked by *. The data set in Table 8 has 4 extreme outliers: 68243.95, 135970.05, 139462.79, and 162689.36.

Step 5: Identify the mild outliers

All points beyond the inner fences but inside the outer fences are mild outliers. No mild outliers are found in this data set.

Step 6: Draw the boxplot

All the components in Figure 8 have now been identified, and we can use the model presented in that figure as a template to sketch the boxplot.
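As a check on the hand computation, this minimal R sketch reproduces Steps 1 through 5; it assumes the variable purchases holds the 56 sample values from Table 8:

q1 <- 53.47
q3 <- 2458.52
iqr <- q3 - q1                    # 2405.05

lower_inner <- q1 - 1.5 * iqr     # -3554.11
upper_inner <- q3 + 1.5 * iqr     #  6066.10
lower_outer <- q1 - 3.0 * iqr     # -7161.68
upper_outer <- q3 + 3.0 * iqr     #  9673.67

sorted <- sort(purchases)
lower_whisker <- min(sorted[sorted >= lower_inner])   # 4.99
upper_whisker <- max(sorted[sorted <= upper_inner])   # 4217.30

# Extreme outliers lie outside the outer fences
extreme_outliers <- sorted[sorted < lower_outer | sorted > upper_outer]

# Mild outliers lie beyond the inner fences but inside the outer fences
mild_outliers <- sorted[(sorted < lower_inner | sorted > upper_inner) &
                        sorted >= lower_outer & sorted <= upper_outer]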

It is worth noting that we should treat outliers carefully. Outliers do not always mean bad data; they may contain valuable information that is worth further investigation. After a careful investigation, I found that the SQL code I used to randomly select data from the database was not correct: some of the total purchase amounts returned by the query were made by stores rather than individual customers. This kind of outlier should be eliminated from any further analysis.

4.2.2 Plot a Boxplot in R

The R function "boxplot" in the base installation can be used to plot a boxplot. The following R code retrieves a data set from the database and then constructs the boxplot shown in Figure 9. Note that the boxplot produced by R is slightly different from the plot shown in Figure 8.

library(RODBC)
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel,
                         "
                            SELECT sum([SubTotal]) AS Total_Purachase_2013
                            FROM [Sales].[SalesOrderHeader] o INNER JOIN [Sales].[Customer] c
                                        ON o.CustomerID = c.CustomerID
                            WHERE c.PersonID is not null and c.StoreID is null
                                  and year([OrderDate]) = 2013
                            Group by c.PersonID
                         ")

odbcClose(channel)

# Plot a boxplot
boxplot(InputDataSet$Total_Purachase_2013,
        horizontal = TRUE,
        main = "Box plot",
        xlab = "Total Purchase of Each Customer in 2013")

# Print the statistics used to build the graph
boxplot.stats(InputDataSet$Total_Purachase_2013)


Figure 9 – A boxplot of the total purchases of each customer in 2013

Two extreme outliers can be identified in the boxplot. The R function "boxplot.stats" tells us that the data set has 53 outliers out of 10,541 observations. Before considering trimming these outliers, we should investigate whether they are valid values and whether similar values will continue to enter the system.

4.3 Construct a Frequency Distribution for a Ratio Variable

We have constructed frequency distributions for a nominal variable and an ordinal variable; those two kinds of variables can take on only a small number of discrete values. Since an interval or ratio variable usually has many distinct values, we will apply a different approach to construct its frequency distribution.

First, the 4 extreme outliers in Table 8 were eliminated because they resulted from the SQL query error described above. All valid values are presented in Table 9.

4.99 | 34.47 | 63.97 | 101.45 | 564.99 | 1700.99 | 2458.92 | 3121.30
4.99 | 34.98 | 68.49 | 103.48 | 726.27 | 1779.47 | 2466.32 | 3360.42
23.78 | 34.99 | 69.97 | 106.95 | 751.34 | 2294.99 | 2479.94 | 4217.30
23.78 | 37.27 | 69.99 | 123.21 | 782.99 | 2334.97 | 2492.32
23.78 | 39.98 | 83.95 | 132.97 | 828.47 | 2365.94 | 2511.32
24.99 | 42.28 | 86.45 | 323.99 | 1249.84 | 2369.97 | 2983.34
33.98 | 49.97 | 96.46 | 553.97 | 1382.97 | 2457.33 | 3084.02

Table 9 – Sample data of customer total purchase amounts in 2013

Next, we define classes, each with a lower limit and an upper limit; the difference between the lower limit and the upper limit is called the class width. We can group values into classes by comparing them to these limits. We are going to follow the procedure in [6] to construct a frequency distribution for a ratio variable. This procedure applies to continuous variables that can take on any value in a range.

Step 1: Determine the number of classes and the width of the class intervals

It is recommended that the number of classes be between 6 and 15; since we only have 52 values in the data set, we should keep the number of classes low [6]. The range of the values in Table 9 is 4217.30 – 4.99 = 4212.31. If we defined 6 classes, the class width would be 4212.31/6 = 702.05, so rounding to a convenient class width of 500 is rational. Since 4212.31/500 = 8.42, the number of classes will be 9.

Step 2: Construct the classes

Since the least value is 4.99, we can select 0 as the lower limit of the first class. The lower limits of the other classes will then be 500, 1000, 1500, 2000, 2500, 3000, 3500, and 4000.

Step 3: Count the class frequencies

The class frequency is the number of observations in the data set falling in a class interval. By counting the occurrences of observations in each class, we obtain the frequency table shown in Table 10.

Class Interval | Class Frequency
0 to under 500 | 27
500 to under 1000 | 6
1000 to under 1500 | 2
1500 to under 2000 | 2
2000 to under 2500 | 9
2500 to under 3000 | 2
3000 to under 3500 | 3
3500 to under 4000 | 0
4000 to under 4500 | 1

Table 10 – Frequency distribution of the sample data
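As a quick check, R's cut() function reproduces this grouping. A minimal sketch, assuming the variable purchases_clean holds the 52 values from Table 9:

# Class boundaries: lower limits 0, 500, ..., 4000 plus the final upper limit 4500
breaks <- seq(0, 4500, by = 500)

# right = FALSE makes the intervals [0, 500), [500, 1000), ...
classes <- cut(purchases_clean, breaks = breaks, right = FALSE)

table(classes)   # class frequencies, as in Table 10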

4.4 Create a Frequency Distribution Histogram in R

A histogram is constructed directly from a frequency distribution. It is a vertical bar graph in which the width of each bar represents the class width and the height of each bar represents the frequency or relative frequency of the class. One important feature of a histogram is that adjacent bars touch.

R provides the function "hist" to plot a histogram from a data set; the number of bars in the histogram is determined by the "breaks" parameter. The following R code loads a data set from the database and then plots the histogram:

library(RODBC)
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel,
                         "
                            SELECT sum([SubTotal]) AS Total_Purachase_2013
                            FROM [Sales].[SalesOrderHeader] o INNER JOIN [Sales].[Customer] c
                                        ON o.CustomerID = c.CustomerID
                            WHERE c.PersonID is not null and c.StoreID is null
                                  and year([OrderDate]) = 2013
                            Group by c.PersonID
                         ")
odbcClose(channel)

# Plot a histogram
hist(InputDataSet$Total_Purachase_2013,
     breaks = 9,
     col = "blue",
     main = "Frequency Distribution Histogram",
     xlab = "Total Purchase 2013")


Figure 10 – A frequency distribution histogram for customer total purchases in 2013

The boxplot shown in Figure 9 gives an overview of the data distribution; the histogram in Figure 10 gives more detailed information. First, the distribution is not symmetric: most customers spent less than $1,000 on AdventureWorks products or services in 2013, and only a few customers spent more than $5,000. We can also compare customer groups; for example, the number of customers whose total purchase amount falls in the range [2000, 3000] is about twice the number of customers in the range [1000, 2000].

5 – Pass Multiple Datasets to the Stored Procedure SP_EXECUTE_EXTERNAL_SCRIPT

One objective of this exercise is to construct a dashboard with the 4 graphs we created in the previous sections. The data used for these 4 graphs came from two different data sets, but according to [8], we can pass only one input dataset to the stored procedure "sp_execute_external_script". [8] implies that we could use R code to connect to SQL Server and retrieve an additional dataset; I do not think this is a preferable solution because we would need to specify a database connection string in the R code. I prefer a neater solution whereby we serialize a dataset first and then pass the serialized dataset as a stored procedure parameter. With this method, we need three stored procedures to construct the dashboard.
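The core of this method is R's serialize()/unserialize() pair, which converts a data frame to a raw byte vector and back; on the SQL Server side, the raw vector travels as a VARBINARY(MAX) parameter. A minimal sketch of the round trip, with illustrative data:

# Serialize a data frame to a raw vector and restore it unchanged
df <- data.frame(occupation = c("Manual", "Clerical"), frequency = c(3, 4))
raw_bytes <- serialize(df, NULL)       # raw vector; maps to VARBINARY(MAX) in SQL Server
df_restored <- unserialize(raw_bytes)
identical(df, df_restored)             # TRUE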

5.1 Create a Stored Procedure "sp_customer_demographics"

We create this stored procedure to retrieve customer demographic data from the database and define the serialized dataset as an output parameter of the stored procedure. Since we use the RTVS tool, one stored procedure consists of three source files.

5.1.1 sp_customer_demographics.Query.sql

-- Place SQL query retrieving data for the R stored procedure here

SELECT p.Occupation, rtrim(ltrim(p.Education)) AS Education
FROM [Sales].[vIndividualCustomer] c INNER JOIN [Sales].[vPersonDemographics] p
                                     ON c.BusinessEntityID = p.BusinessEntityID

5.1.2 sp_customer_demographics.R

# @InputDataSet: input data frame, result of SQL query execution
# @OutputDataSet: data frame to pass back to SQL

# Test code
# library(RODBC)
# channel <- odbcDriverConnect(dbConnection)
# InputDataSet <- sqlQuery(channel, )
# odbcClose(channel)

# Serialize the dataset
output_serialized_r <- serialize(InputDataSet, NULL)

5.1.3 sp_customer_demographics.Template.sql

CREATE PROCEDURE [sp_customer_demographics]
   @demographics_data_sql VARBINARY(MAX) output
AS
BEGIN
EXEC sp_execute_external_script @language = N'R'
    , @script = N'_RCODE_'
    , @input_data_1 = N'_INPUT_QUERY_'
    , @params = N'@output_serialized_r VARBINARY(MAX) OUTPUT'
    , @output_serialized_r = @demographics_data_sql OUT
--- Edit this line to handle the output data frame.
    WITH RESULT SETS NONE;
END;

This template file demonstrates how a variable in the R code passes its value back to a SQL variable.

5.2 Create a Stored Procedure "sp_customer_total_purachase_2013"

We create this stored procedure to retrieve the customer total purchase amounts from the sales data in the database and then define the serialized dataset as an output parameter of the stored procedure.

5.2.1 sp_customer_total_purachase_2013.Query.sql

-- Place SQL query retrieving data for the R stored procedure here
SELECT sum([SubTotal]) AS Total_Purachase_2013
    FROM [Sales].[SalesOrderHeader] o INNER JOIN [Sales].[Customer] c
            ON o.CustomerID = c.CustomerID
WHERE c.PersonID is not null and c.StoreID is null and year([OrderDate]) = 2013
Group by c.PersonID

5.2.2 sp_customer_total_purachase_2013.R

# @InputDataSet: input data frame, result of SQL query execution
# @OutputDataSet: data frame to pass back to SQL

# Test code
# library(RODBC)
# channel <- odbcDriverConnect(dbConnection)
# InputDataSet <- sqlQuery(channel, )
# odbcClose(channel)

# Serialize the dataset
output_serialized_r <- serialize(InputDataSet, NULL)

5.2.3 sp_customer_total_purachase_2013.Template.sql

CREATE PROCEDURE [sp_customer_total_purachase_2013]
   @total_purachase_data_sql VARBINARY(MAX) output
AS
BEGIN
EXEC sp_execute_external_script @language = N'R'
    , @script = N'_RCODE_'
    , @input_data_1 = N'_INPUT_QUERY_'
    , @params = N'@output_serialized_r VARBINARY(MAX) OUTPUT'
    , @output_serialized_r = @total_purachase_data_sql OUT
--- Edit this line to handle the output data frame.
    WITH RESULT SETS NONE;
END;

5.3 Create a Stored Procedure "sp_plot_customer_dashboard"

Each of the two stored procedures "sp_customer_demographics" and "sp_customer_total_purachase_2013" has an output parameter that contains a serialized data set retrieved from the database. We will create a new stored procedure, "sp_plot_customer_dashboard", to read the values in these two parameters and unserialize them. The new stored procedure also plots the 4 graphs and arranges them on one page. Since we obtain data from the other two stored procedures, there is no SQL query in the Query.sql file; we only need to edit the R code file and the template file.

5.3.1 sp_plot_customer_dashboard.R

# @InputDataSet: input data frame, result of SQL query execution
# @OutputDataSet: data frame to pass back to SQL

# Test code
# library(RODBC)
# channel <- odbcDriverConnect(dbConnection)
# InputDataSet <- sqlQuery(channel, )
# odbcClose(channel)

# Un-serialize the datasets passed in as VARBINARY(MAX) parameters
demographics_data <- unserialize(demographics_data_r)
purchase_data <- unserialize(purchase_data_r)

# Save the graphs to a pdf file
pdf(paste("C:\\Development\\workfolder\\", as.character(Sys.Date()), "dashboard.pdf"))

# Determine the layout of the dashboard
op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
          pty = "m") # generates the maximal plotting region

# Plot the occupation bar graph
plot(demographics_data$Occupation, main = "Occupation Frequency Distribution",
     ylab = "Number of Customers",
     las = 2,
     cex.names = 0.8,
     col = "blue")

# Plot the total purchase boxplot
boxplot(purchase_data$Total_Purachase_2013,
        horizontal = TRUE,
        main = "Customer Total Purchases Distribution",
        xlab = "Total Purchases of Each Customer in 2013")

## Compute the cumulative relative frequency of customer educational level

# R sorts strings in alphabetical order by default.
# We use the factor() function to inform R about the correct order
Education_Level <- factor(demographics_data$Education, order = TRUE,
                   levels = c("Partial High School", "High School", "Partial College", "Bachelors", "Graduate Degree"))

# Create one-way contingency table from one categorical variable
freq_table <- table(Education_Level)

# Get cumulative frequencies
cumulative_freq <- cumsum(freq_table)

# Get cumulative relative frequencies
cumulative_relative_freq <- cumulative_freq / nrow(demographics_data)

# Get the category names in the contingency table
category_name <- names(freq_table)

# Convert a table data to a vector
vector_cumulative_freq <- as.vector(cumulative_freq)

# Convert a table data to a vector
vector_cumulative_relative_freq <- as.vector(cumulative_relative_freq)

# Plot less-than cumulative relative frequencies against the upper class limit.
# Thus, we add a "Next Higher Degree" level, which is higher than "Graduate Degree"
# Construct the x-axis
Educational_experience_level <- seq(1, length(category_name) + 1)
Less_than_cumulative_relative_freq <- c(0, vector_cumulative_relative_freq)

# Plot less-than cumulative relative frequencies
plot(Educational_experience_level, Less_than_cumulative_relative_freq,
     main = paste("Less-than Cumulative Relative", "\nFrequency Diagram"),
     xlab = "Educational Experience Level",
     ylab = "Cumulative Relative Frequency")
lines(Educational_experience_level, Less_than_cumulative_relative_freq)
legend("topleft", legend = c("1: Partial High School",
                             "2: High School",
                             "3: Partial College",
                             "4: Bachelors",
                             "5: Graduate Degree"),
       cex = 0.6, title = "Levels", bg = "lightblue")

# Plot the frequency distribution histogram
hist(purchase_data$Total_Purachase_2013,
     breaks = 9,
     col = "blue",
     main = paste("Frequency Distribution Histogram", "\nfor Customer Total Purchases"),
     xlab = "Total Purchases in 2013")

# Shut down the current device
dev.off()

5.3.2 sp_plot_customer_dashboard.Template.sql

CREATE PROCEDURE [sp_plot_customer_dashboard]
AS
BEGIN
DECLARE @demographics_data_sql VARBINARY(MAX), @purchase_data_sql VARBINARY(MAX)

EXEC [dbo].[sp_customer_demographics] @demographics_data_sql OUTPUT
EXEC [dbo].[sp_customer_total_purachase_2013] @purchase_data_sql OUTPUT

EXEC sp_execute_external_script @language = N'R'
    , @script = N'_RCODE_'
    , @params = N'@demographics_data_r VARBINARY(MAX), @purchase_data_r VARBINARY(MAX)'
    , @demographics_data_r = @demographics_data_sql
    , @purchase_data_r = @purchase_data_sql
--- Edit this line to handle the output data frame.
    WITH RESULT SETS NONE;
END;

This template file demonstrates how the value of a SQL variable is passed to a variable in the R code.

6 – Put It All Together

We created three stored procedures through RTVS in the previous section. Next, we publish all of these stored procedures to the database and, finally, run the stored procedure "sp_plot_customer_dashboard" in SSMS. A PDF file will be generated in the folder "C:\Development\workfolder". The content of the file should look like Figure 11.


Figure 11 – The customer demographic dashboard

This dashboard gives a big-picture view of AdventureWorks customers. AdventureWorks serves a broad diversity of customers across occupations and educational experience levels. More than half of the customers are technical people whose occupations are professional or skilled manual. Roughly half of the customers do not have a bachelor's degree. We can also see that most customers spent less than $1,000 in 2013; only a few customers spent more than $5,000, and two customers spent more than $8,000.

Summary

We have compared the 4 types of variables: nominal, ordinal, interval, and ratio, and used Stevens' classification table [4] to list the permissible statistics for each type of variable. We focused on analyzing single variables of the customer demographic data in the AdventureWorks database: first the occupation variable, then the educational experience level variable, and finally the customer total purchase amounts in 2013. We combined the frequency distribution bar graph, the less-than cumulative relative frequency diagram, the boxplot, and the frequency distribution histogram into a dashboard that can provide valuable insight to business users.

References

[1] Sarka, D. (2018). Data Science with SQL Server Quick Start Guide. Birmingham, UK: Packt Publishing.

[2] Kabacoff, R. (2015). R in Action, Second Edition: Data Analysis and Graphics with R. Shelter Island, NY: Manning Publications.

[3] Kess, B. (2017, December 12). AdventureWorks sample databases. Retrieved from https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks.

[4] Stevens, S. S. (1946). On the Theory of Scales of Measurement. Science, 103(2684), 677-680. http://www.jstor.org/stable/1671815

[5] Mendenhall, W., & Sincich, T. (2012). A Second Course in Statistics: Regression Analysis (7th Edition). Boston, MA: Prentice Hall.

[6] Hummelbrunner, S. A., Rak, L. J., Fortura, P., & Taylor, P. (2003). Contemporary Business Statistics with Canadian Applications (3rd Edition). Toronto, ON: Prentice Hall.

[7] Brockschmidt, K., Hogenson, G., Warren, G., McGee, M., Jones, M., & Robertson, C. (2017, December 12). R Tools for Visual Studio sample projects. Retrieved from https://docs.microsoft.com/en-us/visualstudio/rtvs/getting-started-samples?view=vs-2017.

[8] Takaki, J., Steen, H., Rabeler, C., Mike, B., Kess, B., Hamilton, B., Roth, J., & Guyer, C. (2018, July 14). Quickstart: Handle inputs and outputs using R in SQL Server. Retrieved from https://docs.microsoft.com/en-us/sql/advanced-analytics/tutorials/rtsql-working-with-inputs-and-outputs?view=sql-server-2017.


About the author

Nai Biao Zhou is a Senior Software Developer with 20+ years of experience in software development, specializing in Data Warehousing, Business Intelligence, Data Mining and solution architecture design.

View all my tips
