Azure Cosmos DB

July 17, 2017

Microsoft’s Azure Cosmos DB is a multi-model cloud database system, introduced in May 2017. DocumentDB previously served as Azure’s document database; it is now part of Cosmos DB.

Set Up:
To set up an instance, we can either connect to Azure or develop against a local emulator to avoid using Azure credits. For this run-through, we’ll connect to Azure.
To create a Cosmos DB instance in the Azure dashboard, click New, then Databases, then select Azure Cosmos DB. After specifying a unique ID, we have to pick the database model to use: Gremlin (graph), MongoDB (document), SQL (DocumentDB), or Table (key/value). For this example, I’ll go with DocumentDB.

Data Manipulation:
Once the instance has been created, the Azure dashboard will bring up the Quick Start page. From here we can create a Collection (the DocumentDB term for a table); this creates a database called ‘ToDoList’ with a collection named ‘Items’. From here we can also download a sample app that comes already configured to connect to the correct URL, along with an authorization key. Running the solution will start a web app for a simple to-do list.
The Quick Start also links to some sample code and additional documentation.
Within the Azure dashboard, we can also bring up the Data Explorer, which is a graphical tool to view, create, and edit documents. We can also create new collections here.
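For example, documents in the sample ‘Items’ collection can be queried with the DocumentDB SQL dialect. A minimal sketch, assuming the to-do documents carry an ‘isComplete’ property as in the sample app:

SELECT c.id, c.description
FROM Items c
WHERE c.isComplete = false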

Settings:
On the Azure dashboard, there are several settings that are of interest.
With ‘Replicate Data Globally’, we can create a read-only version of our instance in a different data center. For example, we could have the primary read-write instance in the US East data center, but also have a read-only instance in US West, or perhaps in a different country. This allows us to distribute our data closer to our users, and gives us a replicated data set in case we need to fail over to the backup instance.
Under ‘Default Consistency’, we can set the consistency level for our data. The levels range from Strong consistency, where any updates must be synchronously committed, to Eventual consistency, which gives the best performance but doesn’t guarantee that all copies of the data are up to date. The default setting is Session consistency, where a session sees strong consistency for its own updates but eventual consistency for updates from other sessions.
The ‘Keys’ page gives the connection strings our applications need to connect to the database, either as read-write or as read-only.

Additional Information:
Azure – Cosmos DB Introduction
Cosmos DB – Getting Started
Syncfusion E-Book – Cosmos DB


Number To Text

June 25, 2017

I ran across a programming challenge question on Reddit, asking for a program that takes a number and converts it to its text representation: 11 to Eleven, 103 to One Hundred Three, etc.

I’ve posted a script on Github to accomplish this in T-SQL.
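As a rough illustration of the approach, here is a minimal sketch that only handles 1 through 99 (the script linked above goes further):

create function dbo.NumberToText (@Number int)
returns varchar(50)
as
begin
	-- 1 through 19 map directly to a single word.
	if @Number between 1 and 19
		return choose(@Number, 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven',
			'Eight', 'Nine', 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen',
			'Fifteen', 'Sixteen', 'Seventeen', 'Eighteen', 'Nineteen');

	-- 20 through 99: a tens word, plus a ones word when the last digit isn't zero.
	declare @Tens varchar(10) = choose(@Number / 10 - 1, 'Twenty', 'Thirty', 'Forty',
		'Fifty', 'Sixty', 'Seventy', 'Eighty', 'Ninety');
	declare @Ones varchar(10) = choose(@Number % 10, 'One', 'Two', 'Three', 'Four',
		'Five', 'Six', 'Seven', 'Eight', 'Nine');

	return @Tens + isnull(' ' + @Ones, '');
end;
go

select dbo.NumberToText(11) as Result1, dbo.NumberToText(42) as Result2;  -- Eleven, Forty Two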

Previously, I had created a similar script to convert a Roman numeral to a number.


SSMS 2017 Table Properties

June 4, 2017

A few new things I’ve noticed under table properties in SSMS 2016 and 2017.
Table Properties – Microsoft Docs

1) I noticed a new page for ‘Security Predicates’ and wasn’t quite sure what it displays. I finally figured out that it refers to Row-Level Security. A security policy is created, which links a table to the function that determines whether a row will be displayed. As part of the policy, a predicate is created: either a Filter predicate (filters the rows that can be read) or a Block predicate (blocks writes).
This information is also available by querying the sys.security_predicates view.
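As a rough sketch of what creates an entry on that page (hypothetical table, column, and function names):

-- The predicate function: a row is visible when its Region matches the user name.
create function dbo.fn_FilterByRegion (@Region varchar(20))
returns table
with schemabinding
as
return select 1 as Result where @Region = user_name();
go

-- The security policy links the function to a table as a filter predicate.
create security policy dbo.RegionPolicy
add filter predicate dbo.fn_FilterByRegion(Region) on dbo.Sales
with (state = on);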

2) Under ‘General’, there is a ‘GraphDB’ section with two properties: ‘Table is a node table’ and ‘Table is an edge table’. The upcoming SQL Server 2017 release adds graph database functionality to the engine.
Microsoft Docs – SQL Graph
It looks like SQL Server will implement the graph model with separate tables for nodes and edges.
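Based on the linked docs, creating and querying these tables looks something like this sketch:

create table dbo.Person (
	PersonId int primary key,
	PersonName varchar(100)
) as node;

create table dbo.FriendOf as edge;

-- The engine generates $node_id, $from_id, and $to_id pseudo-columns.
insert into dbo.FriendOf ($from_id, $to_id)
values ((select $node_id from dbo.Person where PersonId = 1),
	(select $node_id from dbo.Person where PersonId = 2));

-- MATCH traverses the edge table.
select p2.PersonName
from dbo.Person as p1, dbo.FriendOf as f, dbo.Person as p2
where match(p1-(f)->p2)
	and p1.PersonId = 1;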


SQL Server and R Integration – Part 2 – Running R Code

May 27, 2017

In a previous post I covered setting up SQL Server for R language integration. Now that we are set up, we can put this functionality to use.

Simple R Function Call:
First we’ll use one of R’s built-in functions to get the average of a list of numbers retrieved from SQL Server. We’ll start by making a list of numbers.

drop table if exists dbo.RTest;
go
create table dbo.RTest (
Measure tinyint not null
);
go

insert into dbo.RTest(Measure) values (1);
insert into dbo.RTest(Measure) values (3);
insert into dbo.RTest(Measure) values (4);
insert into dbo.RTest(Measure) values (7);
insert into dbo.RTest(Measure) values (8);
insert into dbo.RTest(Measure) values (10);
go 10000

select count(*) as RecordCount from dbo.RTest;
go

This will give us a list of 60,000 numbers. Now it is easy enough to find the average of all of these numbers in T-SQL:

select avg(Measure) from dbo.RTest;

A few things about R before we run some R code: R is case sensitive, and when results are returned from R to SQL Server, they need to be in a data frame.

Let’s use R to find the average of the ‘Measure’ values in our test table. Here is the stored procedure call:

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'OutputDataSet <- data.frame(mean(InputDataSet$Measure))',
  @input_data_1 = N'SELECT Measure FROM dbo.RTest'
  WITH RESULT SETS (([Average] tinyint NULL));

The ‘sp_execute_external_script’ stored procedure is used to execute R code. In the @language parameter we specify ‘R’ since we want to run R code. In the @input_data_1 parameter, we specify the query for the data set we want to pass in; in this case we select all of the records from our one-column table. The @script parameter is where we place the R code. The result of the query in the @input_data_1 parameter is visible to our R code as ‘InputDataSet’ (we can use a different name by specifying it in the @input_data_1_name parameter, but we’ll go with the defaults to keep it simple).
Even though our input data set has only one attribute (Measure), it has to be referenced explicitly, as InputDataSet$Measure. We use the R function mean to find the mean of all of the Measure values. Even though the result is a scalar value, it needs to be converted to a data frame, which is assigned to the default output variable OutputDataSet.
For the output, we use RESULT SETS to specify the name, data type and nullability of each attribute in the data set. In this case, we get the value 5 as ‘Average’.
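For reference, here is the same call with explicit names for the input and output variables, set through the @input_data_1_name and @output_data_1_name parameters:

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'Averages <- data.frame(mean(Measures$Measure))',
  @input_data_1 = N'SELECT Measure FROM dbo.RTest',
  @input_data_1_name = N'Measures',
  @output_data_1_name = N'Averages'
  WITH RESULT SETS (([Average] tinyint NULL));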

Installing R Packages:
One of the powerful things about R is all of the external libraries of functionality that are freely available to use. If we want to explore machine learning or advanced statistical functions, we’ll need to find and install these packages.
RStudio Support has a list of the most popular R packages.
Once we find a package that we’d like to use, we’ll need to download and install it. We’ll install stringr, which has regular expression and other string manipulation functionality.
Running install.packages in the sp_execute_external_script proc didn’t work for me; I got a message that the package wasn’t available.
SQL Server Central – Installing R packages in SQL Server R Services gives some other options to install packages.
I ended up using the R command line to install stringr. Since the package depends on two other packages, they were retrieved and installed as well.
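Once the install finishes, a quick call can confirm that SQL Server’s R runtime can load the package. A minimal sketch using stringr’s str_detect to check a value for digits:

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'library(stringr);
OutputDataSet <- data.frame(str_detect(InputDataSet$Val, "[0-9]+"))',
  @input_data_1 = N'SELECT ''Measure 42'' AS Val'
  WITH RESULT SETS (([HasDigits] bit NULL));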

Other Links:
MSDN – Using R Code in Transact-SQL
Simple Talk


Multi-Model Databases

April 15, 2017

NoSQL databases offer alternatives for situations where the data being processed doesn’t fit neatly into a relational database. Different types of systems have evolved to meet different needs. Some have adopted polyglot persistence, using multiple database systems to meet specialized needs within a system. However, such a setup greatly increases programming and maintenance complexity.
Some relational systems have added features to deal with alternate data structures. Postgres supports a JSON data type, which allows you to replicate features of a document database, while SQL Server has supported XML for a while and added JSON support in SQL Server 2016.
One solution is multi-model databases, which combine several database types into one system. This gives you one system to install and maintain, one query that can access data from the various models, and one transaction to ensure consistency across those models.
Two systems that I’ve looked at are ArangoDB and OrientDB.
Both systems support Document, Key/Value, and Graph database types in one engine. OrientDB supports SQL, while ArangoDB has a SQL-like query language, although for selects the language resembles LINQ, with the table/collection specified first.
Both can operate as a single node or as a cluster, and both have built-in replication and sharding.
The multi-model system is an interesting idea and seems quite useful. I’ll follow up with a post with a deeper dive into one of these systems. It is interesting that neither system implements a column family model; perhaps that falls more into an analytical system that would be separate from the transactional system anyway.

Links:
10 Reasons To Consider A Multi-model Database
Data Modeling With Multi-model Databases
Datastax Believes Multi-model Databases Are The Future


SQL Antipatterns by Bill Karwin

March 27, 2017

One of my favorite database books is SQL Antipatterns by Bill Karwin. The book lists many suboptimal approaches taken in database design, querying, and development, and gives better ways to solve each issue.
For each issue, the author lists the objective, describes the antipattern and how to recognize it, notes legitimate use cases for the antipattern, and suggests an alternate way of achieving the objective.
Here is a list of the antipatterns described, along with the solution to each. You’ll want to read the book to get more detail on each item.

Logical Database Design

1. Storing comma separated lists of values.
Solution: Use an intersection table for a many-to-many relationship (see the sketch at the end of this list)

2. Hierarchy – storing a parent ID – Adjacency list
Solution: Use a path structure (1/5/7/)

3. Always having an incrementing integer as the PK
Solution: Consider natural and compound keys

4. Not using constraints, thinking that leads to simpler design
Solution: Use constraints

5. Using the Entity-Attribute-Value design
Solution: Each entity has its own table

6. Using a “Dual-Purpose foreign key” – Storing an ID that could refer to values from multiple tables, plus a 2nd column to store the table name.
Solution: Set up a proper foreign key relationship

7. Repeating columns – Column1, Column2, Column3
Solution: Set up one to many relationship

8. Splitting data in separate tables based on values – Sales2015, Sales2016, Sales2017
Solution: Use partitioning

Physical Database Design

1. Using the Float data type to store fractional values when exact values are important
Solution: Use the Numeric type

2. Specifying valid values for a column in a check constraint
Solution: Put valid values in a table and create a foreign key

3. Storing paths to external files – Stored outside the database where they can’t be managed by the database system.
Solution: Store data in a Blob, or use Filestream to manage the files

4. No indexes or creating too many indexes
Solution: Analyze usage data to understand where indexes are needed

Query

1. Not allowing NULLs and using a different value to represent missing data
Solution: Use NULL to represent missing data

2. Listing columns in a SELECT clause that aren’t either listed in a GROUP BY clause or contained in an aggregate function (Not valid in SQL Server, which will raise an error)
Solution: Follow the rules for GROUP BY

3. In trying to pick a record at random, sorting the dataset by a random number (like RAND or NEWID) and taking the first record.
Solution: Find other ways to pick a random value. In SQL Server, use TABLESAMPLE (1 ROWS).

4. Simulating a search engine by using LIKE ‘%SearchTerm%’.
Solution: In SQL Server, use full-text search

5. Trying to return complex datasets all in one query – Lots of times there are unintended results, like Cartesian joins.
Solution: Break a complex problem into smaller steps

6. Depending on column order, with SELECT * or INSERT without explicitly listing columns.
Solution: Avoid wildcard queries and explicitly list columns for INSERTs

Application Development

1. Storing passwords in plain text
Solution: Store a salted hash of the password

2. Adding unverified text to a SQL query – Open to SQL injection.
Solution: Filter input or use parameterized queries.

3. Backfilling any gaps in identity columns.
Solution: Treat keys as a unique identifier, not as a row number.

4. Ignoring return values from a database API in an effort to have simpler or less code.
Solution: Store any dynamically generated SQL and return values from executing queries to aid in troubleshooting.

5. Not treating SQL as code – No source control, tests, etc.
Solution: Treat SQL as any other application code.

6. MVC – Using the Model as an Active Record.
Solution: Decouple the model from the tables.
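As a sketch of the first logical design item, an intersection table replaces a comma separated list with one row per pairing (hypothetical product and tag tables):

create table dbo.Product (
	ProductId int primary key,
	ProductName varchar(100)
);

create table dbo.Tag (
	TagId int primary key,
	TagName varchar(50)
);

-- Instead of a comma separated Tags column on Product,
-- each product/tag pairing is a row, enforced by foreign keys.
create table dbo.ProductTag (
	ProductId int not null references dbo.Product (ProductId),
	TagId int not null references dbo.Tag (TagId),
	primary key (ProductId, TagId)
);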


Calendar Table And Marking Holidays

March 15, 2017

A lot of Data Marts or reporting databases will include a Calendar or Dates table to list all days within a certain range, along with data on those days (Day of the week, quarter, etc.).
I’ve posted my script on GitHub that will generate these records.
At a previous job, we wanted to track days the office was closed so that we could measure how many business days certain functions took. Determining weekends was easy, but calculating holidays took a little more effort. You could just manually enter the holidays for each year, but it is possible to calculate the holidays for each year in one pass.
Some holidays are the same date every year, like Christmas.

UPDATE dbo.Calendar SET IsHoliday = 1 WHERE [Month] = 12 AND [Day] = 25;

Others are on a specific day of the week, like Labor Day being the first Monday in September (This script is marking US holidays).

UPDATE c
SET IsHoliday = 1
FROM dbo.Calendar as c
JOIN (
	SELECT DateKey,
		ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY DateKey) AS RowNumber
	FROM dbo.Calendar
	WHERE [Month] = 9
		AND [WeekDay] = 2
) AS z
ON z.DateKey = c.DateKey
	AND z.RowNumber = 1;

This statement will find the first Monday of each September for the range of dates in the table, and mark it as a holiday.
Another method I saw used was to look for a Monday in September where the day was between 1 and 7, since the first Monday of the month will always be in that range.
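That approach needs no subquery. Assuming the same Calendar columns (with Monday as WeekDay 2), it would look like:

UPDATE dbo.Calendar
SET IsHoliday = 1
WHERE [Month] = 9
	AND [WeekDay] = 2
	AND [Day] BETWEEN 1 AND 7;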
Of course, the hardest holiday to calculate is Easter. Luckily, I found an algorithm on the US Naval Observatory site to determine the date. I’ve included a function in the Calendar script to calculate the date.
Another thing to consider (which I didn’t include in this script) is that if a holiday falls on a weekend, a business may choose the nearest weekday to observe that holiday.
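For example, marking the observed day for Christmas might look like this (a sketch, assuming WeekDay runs 1 = Sunday through 7 = Saturday, and that Friday or Monday is observed):

UPDATE c
SET IsHoliday = 1
FROM dbo.Calendar AS c
JOIN dbo.Calendar AS h
	ON h.[Year] = c.[Year]
WHERE h.[Month] = 12
	AND h.[Day] = 25
	AND ((h.[WeekDay] = 7 AND c.[Month] = 12 AND c.[Day] = 24)   -- Saturday: observe Friday the 24th
		OR (h.[WeekDay] = 1 AND c.[Month] = 12 AND c.[Day] = 26)); -- Sunday: observe Monday the 26th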