SQL Server and R Integration – Part 2 – Running R Code

May 27, 2017

In a previous post I covered setting up SQL Server for R language integration. Now that we are set up, we can put this functionality to use.

Simple R Function Call:
First, we’ll get the average of a list of numbers retrieved from SQL Server using one of R’s built-in functions. To start, we’ll build a table of numbers.

drop table if exists dbo.RTest;
go
create table dbo.RTest (
Measure tinyint not null
);
go

insert into dbo.RTest(Measure) values (1);
insert into dbo.RTest(Measure) values (3);
insert into dbo.RTest(Measure) values (4);
insert into dbo.RTest(Measure) values (7);
insert into dbo.RTest(Measure) values (8);
insert into dbo.RTest(Measure) values (10);
go 10000

select count(*) as RecordCount from dbo.RTest;
go

This will give us a list of 60,000 numbers. Now it is easy enough to find the average of all of these numbers in T-SQL:

select avg(Measure) from dbo.RTest;
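One detail worth checking here: Measure is a tinyint, and AVG over an integer column returns an integer. The sketch below (the column aliases are only illustrative) puts the integer average next to one computed after casting to float, so the difference is visible.

```sql
-- The six inserted values (1, 3, 4, 7, 8, 10) sum to 33, so the true mean is 5.5.
-- AVG over a tinyint column does integer arithmetic and returns 5;
-- casting to float first keeps the fractional part.
select
    avg(Measure) as IntegerAverage,               -- 5
    avg(cast(Measure as float)) as FloatAverage   -- 5.5
from dbo.RTest;
```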

A few things about R before we run some R code. R is case sensitive. When results are returned from R to SQL Server, they need to be in a data frame.

Let’s use R to find the average of the ‘Measure’ values in our test table. Here is the stored procedure call:

EXEC sp_execute_external_script  
  @language=N'R',    
  @script=N'OutputDataSet <-data.frame(mean(InputDataSet$Measure))',      
  @input_data_1=N'SELECT Measure FROM dbo.RTest'   
  WITH RESULT SETS (([Average] tinyint NULL)); 

The ‘sp_execute_external_script’ stored procedure is used to execute R code. In the @language parameter we specify ‘R’, since we want to run R code. In the @input_data_1 parameter we specify the query whose results we want to pass in; in this case we select all of the records from our one-column table. The @script parameter is where we place the R code. Any data returned by the query in the @input_data_1 parameter is visible in our R code as ‘InputDataSet’ (we can use a different name by specifying it in the @input_data_1_name parameter, but we’ll go with the defaults to keep it simple).
Even though our input data set has only one attribute (Measure), it still has to be referenced by name: InputDataSet$Measure. We use the R function mean to find the mean of all of the Measure values. Even though a scalar value is being returned, it needs to be converted to a data frame, which is assigned to the default output variable OutputDataSet.
For the output, we use RESULT SETS to specify the name, data type and nullability of each attribute in the data set. In this case, we get the value 5 as ‘Average’. The actual mean of these values is 5.5; because the output column is declared as tinyint, the fractional part is lost. Declaring the column as float should return 5.5.
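As a variation on the call above, here is a minimal sketch that names the input data set explicitly and declares the output column as float. The name ‘Measures’ is only an illustration; @input_data_1_name is the parameter that overrides the default ‘InputDataSet’ name.

```sql
-- Sketch: same calculation, with a custom input data set name and a float output column.
-- 'Measures' is a hypothetical name chosen for this example.
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'OutputDataSet <- data.frame(mean(Measures$Measure))',
  @input_data_1 = N'SELECT Measure FROM dbo.RTest',
  @input_data_1_name = N'Measures'
  WITH RESULT SETS (([Average] float NULL));
```

With the float column, the result should come back as 5.5 rather than 5.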

Installing R Packages:
One of the powerful things about R is all of the external libraries of functionality that are freely available to use. If we want to explore machine learning or advanced statistical functions, we’ll need to find and install these packages.
RStudio Support has a list of the most popular R packages.
Once we find a package that we’d like to use, we’ll need to download and install it. We’ll install stringr, which has regular expression and other string manipulation functionality.
Running install.packages in the sp_execute_external_script proc didn’t work for me; I got a message that the package wasn’t available.
SQL Server Central – Installing R packages in SQL Server R Services gives some other options to install packages.
I ended up using the R command line option to install stringr. Since the package was dependent on two other packages, they were retrieved and installed as well.

Other Links:
MSDN – Using R Code in Transact-SQL
Simple Talk


Calendar Table And Marking Holidays

March 15, 2017

A lot of Data Marts or reporting databases will include a Calendar or Dates table to list all days within a certain range, along with data on those days (Day of the week, quarter, etc.).
I’ve posted my script on GitHub that will generate these records.
At a previous job, we wanted to track days the office was closed so that we could measure how many business days certain functions took. Determining weekends was easy, but calculating holidays took a little more effort. You could just manually enter the holidays for each year, but it is possible to calculate the holidays for each year in one pass.
Some holidays are the same date every year, like Christmas.

UPDATE dbo.Calendar SET IsHoliday = 1 WHERE [Month] = 12 AND [Day] = 25;

Others are on a specific day of the week, like Labor Day being the first Monday in September (This script is marking US holidays).

UPDATE c
SET IsHoliday = 1
FROM dbo.Calendar as c
JOIN (
	SELECT DateKey,
		ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY DateKey) AS RowNumber
	FROM dbo.Calendar
	WHERE [Month] = 9
		AND [WeekDay] = 2
) AS z
ON z.DateKey = c.DateKey
	AND z.RowNumber = 1;

This statement finds the first Monday of each September for the range of dates in the table (in this table, [WeekDay] = 2 is Monday) and marks it as a holiday.
Another method I saw used was to look for a Monday in September where the day was between 1 and 7, since the first Monday of the month will always be in that range.
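That alternative could be sketched as follows, reusing the same dbo.Calendar columns as the statement above and assuming, as before, that [WeekDay] = 2 means Monday.

```sql
-- Sketch: the first Monday of September always falls on day 1 through 7,
-- so no ROW_NUMBER() is needed.
UPDATE dbo.Calendar
SET IsHoliday = 1
WHERE [Month] = 9
	AND [WeekDay] = 2
	AND [Day] BETWEEN 1 AND 7;
```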
Of course, the hardest holiday to calculate is Easter. Luckily, I found an algorithm on the US Naval Observatory site to determine the date. I’ve included a function in the Calendar script to calculate the date.
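For readers without the Calendar script at hand, here is a minimal sketch of an Easter function. It uses a widely published integer-arithmetic algorithm for Easter in the Gregorian calendar; it is an assumption, not something confirmed here, that it matches the function in the script. The function name dbo.EasterDateSketch is hypothetical.

```sql
-- Sketch only: Gregorian Easter date by integer arithmetic.
-- All variables are int, so every division below is integer division.
CREATE FUNCTION dbo.EasterDateSketch (@Year int)
RETURNS date
AS
BEGIN
	DECLARE @c int, @n int, @k int, @i int, @j int, @l int, @m int, @d int;

	SET @c = @Year / 100;                        -- century
	SET @n = @Year - 19 * (@Year / 19);          -- position in the 19-year lunar cycle
	SET @k = (@c - 17) / 25;
	SET @i = @c - @c / 4 - (@c - @k) / 3 + 19 * @n + 15;
	SET @i = @i - 30 * (@i / 30);
	SET @i = @i - (@i / 28) * (1 - (@i / 28) * (29 / (@i + 1)) * ((21 - @n) / 11));
	SET @j = @Year + @Year / 4 + @i + 2 - @c + @c / 4;
	SET @j = @j - 7 * (@j / 7);                  -- weekday offset
	SET @l = @i - @j;
	SET @m = 3 + (@l + 40) / 44;                 -- month: 3 = March, 4 = April
	SET @d = @l + 28 - 31 * (@m / 4);            -- day of the month

	RETURN DATEFROMPARTS(@Year, @m, @d);
END;
GO

SELECT dbo.EasterDateSketch(2016) AS Easter2016,   -- 2016-03-27
	dbo.EasterDateSketch(2017) AS Easter2017;      -- 2017-04-16
```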
Another thing to consider (which I didn’t include in this script) is if a holiday falls on a weekend, a business may choose the nearest weekday to observe that holiday.


BETWEEN

March 6, 2017

While writing a query recently, I made use of the BETWEEN operator, which will match all values within a specified range. Without thinking, I put the greater value first:
WHERE RecordID BETWEEN 100 and 90

I was somewhat surprised that no records were returned, since I knew there should be matches. Reversing the order gave me the results I expected:
WHERE RecordID BETWEEN 90 and 100

After running the query and checking the execution plan, I saw that the WHERE clause was transformed to:
WHERE RecordID >= 90 and RecordID <= 100

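Seeing this, it becomes obvious why the order matters: with the values reversed, the condition becomes RecordID >= 100 and RecordID <= 90, which no value can satisfy.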
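The behavior is easy to reproduce without a real table. The sketch below uses an inline list of made-up values in place of the original table.

```sql
-- Sketch: an inline value list stands in for the real table.
-- Lower bound first: returns 90, 95 and 100.
SELECT v.RecordID
FROM (VALUES (85), (90), (95), (100), (105)) AS v(RecordID)
WHERE v.RecordID BETWEEN 90 AND 100;

-- Bounds reversed: equivalent to RecordID >= 100 AND RecordID <= 90, so no rows are returned.
SELECT v.RecordID
FROM (VALUES (85), (90), (95), (100), (105)) AS v(RecordID)
WHERE v.RecordID BETWEEN 100 AND 90;
```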

MSDN – BETWEEN


SQL Server and R Integration – Setup

February 27, 2017

A new feature in SQL Server 2016 is the ability to run R scripts within SQL Server. For those not familiar with the R language, it is an open source language used to perform statistical computations, to make use of machine learning algorithms, and for data visualization.
In this post, I’ll go through a quick explanation on installing and setting up the services.

Background:
The R language comes with a great deal of built-in functionality, but its real power comes from the hundreds of packages available, created by the R community to extend the functionality of R. R is an interpreted language, using vectors and data frames (similar to tables) as the primary data structures.
R can connect to SQL Server to retrieve data, but there are a few drawbacks. For one, R works with data in memory, so your dataset is limited to the amount of memory available in your workspace. Also, R is single-threaded, so we can’t take advantage of distributing the workload. Working with R within SQL Server will allow us to use a larger dataset as well as take care of

Installation:
The R components for SQL Server are installed from SQL Server setup. If the services weren’t installed with the initial installation of the database engine, you can go back and add the components.
One option is to install a standalone R Server, which would be used as a separate instance for analyzing data with R. There is also an option to install in-database R services. This option also installs an additional service, the SQL Server Launchpad, that allows integration with R. In my testing, I elected to work with the in-database component with my existing SQL Server instance.

Setup:
Once the components have been installed, the ‘External Scripts Enabled’ setting will need to be set to true.

EXEC sp_configure 'external scripts enabled', 1;
GO
RECONFIGURE;
GO

We’ll also need to restart the database instance to have this setting take effect. We also need to make sure that the SQL Server Launchpad service is running.
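To confirm that the setting took effect after the restart, one option is to query sys.configurations. This is only a quick check, not part of the required setup.

```sql
-- Sketch: 'value' is the configured setting, 'value_in_use' is what the running instance uses.
-- Both should show 1 once the instance has been restarted.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'external scripts enabled';
```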

Test:
We’ll make a call to the sp_execute_external_script stored procedure to make sure that R is properly set up and that we can make a call.
I found this sample code on MSDN that will allow us to test the R set-up without writing R code or retrieving data from tables right now.

EXEC sp_execute_external_script  
  @language =N'R',    
  @script=N'OutputDataSet<-InputDataSet',      
  @input_data_1 =N'SELECT 1 AS hello'    
  WITH RESULT SETS (([hello] int not null));    
GO 

If everything is set up correctly, we should get a single value of 1 returned with the column header ‘hello’.

Next Steps:
In the next post, I’ll put together some actual R code and look at some of the things we can do with SQL Server data and R.

Additional Information:
SQL Server Central: Introduction to Microsoft R Services in SQL Server 2016
Syncfusion E-Book: R Succinctly – Introduction to the R language – Free, login required
Coursera – Data Science: Series of courses in Data Science, several deal with learning R.


Integers vs Strings For Joins

January 31, 2017

One recurring argument that comes up with databases is determining a primary key for a table. Should a natural key (which may be a string value) be used, or should an integer surrogate key be generated? In terms of query performance, the common wisdom is that integer keys will perform better. I’ve always been doubtful about that claim, although I can see that performance would differ if the values were of different sizes.
I ran across this post that compared string vs integer key values for query performance.
I wanted to perform my own test to compare the performance. I’ve posted a script on GitHub to build out some test tables. We end up with a Customer table with a little over 2 million records. We’ll use two different queries to join to a State table, one joining on an integer StateId value, the second joining on a string StateCode. The StateId will be a smallint, and the StateCode a char(2) value, so that both columns are 2 bytes. The Customer table has a clustered index on the Customer ID, with nonclustered indexes on both the StateId and StateCode columns. The State table is a heap (it only has 62 records), since ordering by either StateId or StateCode may give an advantage to that type.
When the two queries are run together, the first (joining on StateId) accounts for 46% of the run time, and the StateCode string join for the other 54%. So the integer column key is faster in this case. However, on examining the query plan, Hash joins were used, so certainly we can get a more efficient plan.
I went back and added INCLUDE columns to the two Customer indexes, so that all of the returned attributes are in the index leaf level. Once I made this change, the query plan showed that the queries used Merge joins, and the query cost is the same for both queries.
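The change could be sketched roughly as below. The test script itself isn’t reproduced in this post, so the index names and the FirstName, LastName, StateName and CustomerId columns are hypothetical stand-ins for whatever the queries actually return.

```sql
-- Sketch with hypothetical names: covering indexes, so each join can be
-- answered from the index leaf level without touching the clustered index.
CREATE NONCLUSTERED INDEX IX_Customer_StateId_Covering
	ON dbo.Customer (StateId)
	INCLUDE (FirstName, LastName);

CREATE NONCLUSTERED INDEX IX_Customer_StateCode_Covering
	ON dbo.Customer (StateCode)
	INCLUDE (FirstName, LastName);
GO

-- The two joins being compared (the column list is assumed).
SELECT c.CustomerId, c.FirstName, c.LastName, s.StateName
FROM dbo.Customer AS c
JOIN dbo.State AS s ON s.StateId = c.StateId;

SELECT c.CustomerId, c.FirstName, c.LastName, s.StateName
FROM dbo.Customer AS c
JOIN dbo.State AS s ON s.StateCode = c.StateCode;
```

The clustered key (the Customer ID) is carried in every nonclustered index automatically, so it doesn’t need to be listed in INCLUDE.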
So my conclusion is that there is no difference in query performance between joining on strings and joining on integers, assuming that the values are of the same size. I need to do more research on Hash joins, but it appears that a string value hash may possibly be larger than an integer value hash, which would account for the difference in query performance.


SQL Server 2016 SP1: T-SQL Enhancement – CREATE OR ALTER

November 27, 2016

In Service Pack 1 for SQL Server 2016 (Download here), a new T-SQL enhancement has been added:
CREATE OR ALTER

This command can be used with stored procedures, functions, triggers and views. So it doesn’t matter if an object already exists or not, we can issue the same command for either case. No more checking for the existence of an object or dropping and recreating an object.
For example:

create or alter procedure dbo.GetPosition
	@PositionCode char(2)
as

select PositionCode, PositionDescription 
from dbo.Position
where PositionCode = @PositionCode;

This will definitely simplify any deployment process. I’ve never liked dropping and re-creating an object, since you lose any permissions that have been granted on that object.
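For comparison, here is a sketch of one common pattern used before CREATE OR ALTER was available: create a stub only if the procedure is missing, then always ALTER it. This is one of several approaches, shown here with the same example procedure.

```sql
-- Sketch of the older pattern: a stub is created only when the procedure doesn't exist,
-- so existing permissions survive because the object is never dropped.
IF OBJECT_ID(N'dbo.GetPosition', N'P') IS NULL
	EXEC (N'create procedure dbo.GetPosition as return 0;');
GO

alter procedure dbo.GetPosition
	@PositionCode char(2)
as

select PositionCode, PositionDescription
from dbo.Position
where PositionCode = @PositionCode;
GO
```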

MSDN Article: CREATE OR ALTER


Query Performance: PIVOT vs CASE

October 10, 2016

A discussion came up at my job on the most efficient way to return subtotals and totals in one row, something like:

Total A   Total B   Total C   Grand Total
1         2         3         6

I had assumed that PIVOT would be the most straightforward method, but others preferred using CASE expressions. So which method is more efficient?

I posted a script on GitHub to build a table with 6 million records to run both kinds of queries against. The results of each will be:

Red       Blue      White     Total
3000000   2000000   1000000   6000000

The PIVOT query:

select [Red], [Blue], [White], ([Red] + [Blue] + [White]) as Total
from (
select RecordColor from dbo.TestPivotCase
) as a
pivot (
count(RecordColor) for RecordColor in ([Red], [White], [Blue])
)as p;

The CASE query:

select
  sum(case when RecordColor = 'Red' then 1 else 0 end) as [Red],
  sum(case when RecordColor = 'Blue' then 1 else 0 end) as [Blue],
  sum(case when RecordColor = 'White' then 1 else 0 end) as [White],
  count(*) as [Total]
from dbo.TestPivotCase;

The two execution plans were very similar and nearly identical in cost, with the PIVOT query slightly lower (49.7% vs. 50.3%).
Case Plan:
[image: execution plan for the CASE query]

Pivot Plan:
[image: execution plan for the PIVOT query]

So both methods have a similar execution plan and run time, with a slight edge to the PIVOT method.
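Since the cost percentages shown in a plan are based on the optimizer’s estimates, it can help to look at measured time and I/O as well. Below is a minimal sketch using SET STATISTICS against the same test table; the figures appear in the Messages tab in SSMS and will vary by machine.

```sql
-- Sketch: report CPU/elapsed time and logical reads for each query.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- PIVOT version
select [Red], [Blue], [White], ([Red] + [Blue] + [White]) as Total
from (
	select RecordColor from dbo.TestPivotCase
) as a
pivot (
	count(RecordColor) for RecordColor in ([Red], [White], [Blue])
) as p;

-- CASE version
select
	sum(case when RecordColor = 'Red' then 1 else 0 end) as [Red],
	sum(case when RecordColor = 'Blue' then 1 else 0 end) as [Blue],
	sum(case when RecordColor = 'White' then 1 else 0 end) as [White],
	count(*) as [Total]
from dbo.TestPivotCase;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```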