Calendar Table And Marking Holidays

March 15, 2017

A lot of Data Marts or reporting databases will include a Calendar or Dates table to list all days within a certain range, along with data on those days (Day of the week, quarter, etc.).
I’ve posted my script on GitHub that will generate these records.
At a previous job, we wanted to track days the office was closed so that we could measure how many business days certain functions took. Determining weekends was easy, but calculating holidays took a little more effort. You could just manually enter the holidays for each year, but it is possible to calculate the holidays for each year in one pass.
Some holidays are the same date every year, like Christmas.

UPDATE dbo.Calendar SET IsHoliday = 1 WHERE [Month] = 12 AND [Day] = 25;

Others are on a specific day of the week, like Labor Day being the first Monday in September (This script is marking US holidays).

UPDATE c
SET IsHoliday = 1
FROM dbo.Calendar as c
JOIN (
	SELECT DateKey,
		ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY DateKey) AS RowNumber
	FROM dbo.Calendar
	WHERE [Month] = 9
		AND [WeekDay] = 2
) AS z
ON z.DateKey = c.DateKey
	AND z.RowNumber = 1;

This statement will find the first Monday of each September for the range of dates in the table, and mark it as a holiday.
Another method I saw used was to look for a Monday in September where the day was between 1 and 7, since the first Monday of the month will always be in that range.
Of course, the hardest holiday to calculate is Easter. Luckily, I found an algorithm on the US Naval Observatory site to determine the date. I’ve included a function in the Calendar script to calculate the date.
Another thing to consider (which I didn’t include in this script) is if a holiday falls on a weekend, a business may choose the nearest weekday to observe that holiday.


BETWEEN

March 6, 2017

While writing a query recently, I made use of the BETWEEN operator, which will match all values within a specified range. Without thinking, I put the greater value first:
WHERE RecordId BETWEEN 100 and 90

I was somewhat surprised that no records were returned, I knew there should be matches. Reversing the order gave me the results I expected:
WHERE RecordID BETWEEN 90 and 100.

So running the query and getting the execution plan, I saw that the WHERE clause was transformed to:
WHERE RecordID >= 90 and RecordID <= 100

Seeing this, it becomes obvious why the order would matter.

MSDN – BETWEEN


SQL Server and R Integration – Setup

February 27, 2017

A new feature in SQL Server 2016 is the ability to run R scripts within SSMS. For those not familiar with the R language, it is an open source language used to performed statistical computations, to make use of machine learning algorithms, and for data visualization.
In this post, I’ll go through a quick explanation on installing and setting up the services.

Background:
The R language comes with a great deal of built-in functionality, but its real power comes from the hundreds of packages available, created by the R community to extend the functionality of R. R is an interpreted language, using vectors and data frames (similar to tables) as the primary data structures.
R can connect to SQL Server to retrieve data, but there are a few drawbacks. For one, R works with data in memory, so your dataset is limited to the amount of memory available in your workspace. Also, R is single-threaded, so we can’t take advantage of distributing the workload. Working with R within SQL Server will allow us to use a larger dataset as well as take care of

Installation:
The R components for SQL Server are installed from SQL Server setup. If the services weren’t installed with the initial installation of the database engine, you can go back and add the components.
One option is to install a stand alone R Server, that would be used as a separate instance to use in analyzing data using R. There is also an option to install in-database R services. This optional also installs an additional service, the SQL Server Launchpad, that allows integration with R. In my testing, I elected to work with the in-database component with my existing SQL Server instance.

Setup:
Once the components have been installed, the ‘External Scripts Enabled’ setting will need to be set to true.

EXEC sp_configure 'external scripts enabled', 1;
GO
RECONFIGURE;
GO

We’ll also need to restart the database instance to have this setting take effect. We also need to make sure that the SQL Server Launchpad service is running.

Test:
We’ll make a call to the sp_execute_external_script stored procedure to make sure that R is properly set up and that we can make a call.
I found this sample code on MSDN that will allow us to test the R set-up without writing R code or retrieving data from tables right now.

EXEC sp_execute_external_script  
  @language =N'R',    
  @script=N'OutputDataSet<-InputDataSet',      
  @input_data_1 =N'SELECT 1 AS hello'    
  WITH RESULT SETS (([hello] int not null));    
GO 

If everything is setup correctly, we should get a single value of 1 returned with the column header ‘Hello’.

Next Steps:
In the next post, I’ll put together some actual R code and look at some of the things we can do with SQL Server data and R.

Additional Information:
SQL Server Central: Introduction to Microsoft R Services in SQL Server 2016
Syncfusion E-Book: R Succinctly – Introduction to the R language – Free, login required
Coursera – Data Science: Series of courses in Data Science, several deal with learning R.


Integers vs Strings For Joins

January 31, 2017

One recurring argument that comes up with databases is determining a primary key for a table. Should a natural key (which may be a string value) be used, or should a integer surrogate key be generated? In terms of query performance, the common wisdom is that integer keys will perform better. I’ve always been doubtful about that claim, I can see where if values were of different sizes then the performance would be different.
I ran across this post that compared string vs integer key values for query performance.
I wanted to perform my own test to compare the performance. I’ve posted a script on Github to build out some test tables. We end up with a Customer table with a little over 2 million records. We’ll use two different queries to join to a State table, one joining on an integer StateId value, the 2nd joining on a string StateCode. The StateId will be a smallint, and the StateCode a char(2) value, so that both columns are 2 bytes. The Customer table has a clustered index on the Customer ID, with nonclustered indexes on both the StateId and StateCode columns. The State table is a heap (it only has 62 records), since ordering by either StateId or StateCode may give an advantage to that type.
So with running the two queries, the first joining on StateId is 46% of the run time, with the StateCode string join as the other 54%. So the integer column key is faster in this case. However, on examining the query plan, Hash joins were used, so certainly we can get a more efficient plan.
I went back and added INCLUDE columns to the two Customer indexes, so that all of the returned attributes are in the index leaf level. Once I made this change, the query plan showed that the queries used Merge joins, and the query cost is the same for both queries.
So my conclusion is that there is no gain in query performance when joining on strings or integers, assuming that the values are of the same size. I need to do more research on Hash joins, but it appears that a string value hash may possibly be larger than a integer value hash, which would account for the difference in query performance.


Links – Database Theory

December 31, 2016

A few interesting links of database information I’ve run across:
Armstrong’s Axioms: Rules on functional dependencies.
Logic For Serious Database Folks: Chapters from a work in progress on Logic as it pertains to relational databases, no release date set for the complete work.
A Simple Guide to Five Normal Forms in Relational Database Theory: Covering the normal forms, most resources don’t go past Third Normal Form.
Comparison of different SQL implementations: Very comprehensive list of how several relational database products adhere or stray from the ANSI SQL standard.


Links – Computer Science Basics

December 31, 2016

Now that I’ve completed my Bachelor’s (non Computer Science) degree, I’m going back to fill in some gaps in my knowledge of the basics of Computer Science. Here are some interesting links I’ve run across. A lot of these are lists of items that a good programmer should know, and some actually cover some basic material.

Oregon State U – Post-Baccalaureate: Oregon State has a one year online program for students to earn a 2nd Bachelor’s degree. A bit pricey at 28K, but still interesting to see the curriculum.
Programmer Competency Matrix: A good list of what a good programmer should know at various levels of their career.
Interview Cake: The ‘Glossary’ link at the top of the page has links to good explainations of basic terms in Computer Science.
aGupieWare – Bachelor’s Curriculum: A list of the general areas that should be covered by a basic Computer Science education, with links to recommended resources to cover those areas.
Syncfusion: Data Structures I and Data Structures II: Two free e-Books from Syncfusion that cover the basic data structures (Registration required).
Steve Yegge – Five Essential Phone Screen Questions: A good list of the skills an interviewer expects during an interview.
GitHub Gist: Technical Interview Cheet Sheet: A short list to cover basic data structures and algorithms.
Computer Science From The Bottom Up: Good online book (PDF and ePub available) on Computer Science basics.
Rob Conery – The Imposter’s Handbook: A great book that covers Computer Science basics, like Big O notation, data structures, and so on. The e-book is $30, a printed version isn’t yet available.


Degree Completion

November 30, 2016

This post won’t be a technical one, but I recently went back and completed my bachelor’s degree, 29 years after I first entered college. I wanted to put together a write up of the path I took to complete, in case this information was useful to anyone else trying to complete their degree.

My Educational Background
Out of high school, I attended two different schools for 2 years each, so I had plenty of credits built up. I studied political science at the first school and philosophy at the second. I also had some accounting classes from a local tech school as well. To be honest, I was looking for the shortest route to completion, as I was working full time and was already deep into my career. Most schools have a residency requirement, meaning that you need to complete the last 1-2 years or so of your education at that school in order to graduate.

Selecting A School
Reputable US schools will have a regional accreditation. Georgia is covered by the Southern Association of Colleges and Schools. Most schools will require a school to be regionally accredited in order to accept transfer credits from them, or to enter a graduate program. So I was looking for a school with regional accreditation.
I also hoped to find a school that didn’t have a residency requirement, so that I would be able to use as many of my previously earned credits as possible, as well as shorten that amount of time I would spend in completion.
I ended up finding two schools that met these requirements and applied to both of those:
Excelsior College
Thomas Edison State University (TESU)

When I applied, the schools gave me a rough idea of which classes would be accepted as transfer, as well as which category they would be credited as. Once I enrolled at the school, I got an official evaluation of my credits.
Thomas Edison offered to transfer more credits so I enrolled with them.
I later learned of a third school that only requires two classes to be taken at their institution: Charter Oak State College

Enrolling
Once I enrolled at TESU, I met with a guidance counselor to go over my requirements. Everything laid out in my initial evaluation was approved. I had to pay an enrollment fee good for one year, as well as a fee for each class taken.

Selecting A Major
My credits were spread across several areas in the liberal arts, so after checking Thomas Edison’s requirements, I was the closest to earning a Liberal Studies degree.

Earning Credits
There was one Capstone class that I had to take at TESU, but I was free to earn the rest of the credits any way that I could, I didn’t necessarily have to take them from the school.
I had most of my pre-requisites completed, but one good resource for these classes is Georgia’s ECore site.
I ended up taking several classes from a third party company, Straighterline. Here you pay a monthly subscription fee ($99 when I took classes) plus a fee per class($50 or so). You take classes at your own pace, taking as long as you want or working through as quickly as you like. Each class used a textbook, which were easily bought used online. Readings in the text were assigned, and each module had a quiz that was administered online. Most classes had a midterm as well. Each class had a final exam, which was proctored by a third party company. You would take a final exam at home, but the proctor would monitor you via webcam as you tested, as well as monitor your desktop while you tested. You would schedule a time with the proctor, setup with your assigned proctor, and then take the test. Everything was graded immediately. Their site posted a list of all of their classes that TESU would accept, along with the class that it would be credited as, which was very useful in planning. Once you pass a class, you can have StraighterLine send a transcript to the school. Even though you are given a letter grade for the class, the credits will be transferred in as pass/fail. I found this the easiest and cheapest way to earn credits.
There are also several ways to earn credits by exam. For these, you’ll travel to a testing center to take the exam. The testing centers are mostly colleges that open their testing center to outside students (not all schools allow that). I took exams from two agencies:
DSST(formerly known as DANTES)
CLEP

Both sets of exams were under $100 an exam. As long as you receive a passing grade, you will receive credit. Since the exams are pass/fail you don’t get a letter grade, it will just show up as credit on your transcript.
TESU publishes lists of the exams that they accepted for credit, and the class that you would get credit for.
To study for the exams, I used a site Instacert to prepare for the exams. They are a subscription site ($20 a month when I used them) that have practice questions with answers, I found them very useful. The site Free CLEP Prep also had links to other resources as well.
The exams were tough, but with a good bit of studying I was able to pass every attempt. I stuck to areas that I already knew very well, like history.
TESU and Exclesior also offer their own sets of exams for credit, but I didn’t take any of those.
Most exams are credited for lower level (100 or 200 level) classes. I did find a few that would transfer in for upper level credit. The History of the Vietnam War DSST exam was one, as well as the Introduction to Religion StraighterLine class.

Capstone
After filling in my requirements, I was left with one class to take at TESU, their capstone class. Here, the sole activity of the class was to write a research paper, kind of a mini-thesis. We were free to select any topic in our major area. Computer Science fell under the Liberal Studies umbrella here, so I was free to select a programming-related topic. Each class has a professor assigned, who will help to formulate a topic. Every two weeks I would submit a chapter and would receive feedback on it. At the end of the class we would submit the full paper. It wasn’t a typical research paper, a lot of the writing was about my research methods as well. The paper had to be at least 25 pages, not including the bibliography and other supporting writings. A minimum grade of ‘C’ was needed.

Graduation
Before the class was over, I was eligible to go ahead and apply for graduation, of course pending on getting a high enough grade. Once I completed the class, I was eligible to graduate in September of 2016.