Running Totals

November 28, 2017

The guys from the SQL Server Radio podcast posted a blog post from their episode on calculating running totals in SQL Server. Their script (linked in the post) covers several different ways to calculate a running total in a query (the value in the current row added to the values in all of the rows that came before it). The best method, in my view, is the SUM function with the OVER clause; most of the other methods involve a cursor or some other messy approach.
I imagine most developers familiar with T-SQL have used the OVER clause with ROW_NUMBER or another windowing function, but the OVER clause can also be used with aggregate functions like COUNT, SUM, or AVG.
An example:

create table [dbo].[RunningTotal](
	RecordId int identity(1,1) NOT NULL,
	GroupId tinyint NOT NULL,
	Amount decimal(6, 2) NOT NULL
);

insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 56.78);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 55.66);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 33.33);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 22.10);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (3, 98.76);

select GroupId, Amount, 
	sum(Amount) over(order by RecordId) as RunningSum,
	sum(Amount) over(partition by GroupId order by RecordId) as RunningSumByGroup
from dbo.RunningTotal;

go

This example gives a running total over the whole dataset, plus a running total by Group (GroupId).
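For reference, here is the result set the query above returns, with the running sums worked out from the inserted values:

GroupId  Amount  RunningSum  RunningSumByGroup
1        12.34   12.34       12.34
1        56.78   69.12       69.12
1        55.66   124.78      124.78
2        33.33   158.11      33.33
2        22.10   180.21      55.43
2        12.34   192.55      67.77
3        98.76   291.31      98.76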


Microsoft SQL Operations Studio

November 27, 2017

Microsoft has released a new database management tool called Microsoft SQL Operations Studio. The new studio runs on Windows, Linux, and macOS, and is free and open source. It can manage on-premises SQL Servers as well as Azure instances. There’s no installer; you just unzip the files and run the EXE to start the studio.
SSMS users will be familiar with most features: Object Explorer, T-SQL code window with Intellisense, backup and restore, and so on.
There is also built-in support for Git source control. You can also create dashboard widgets to display performance and maintenance data, which are called Insight Widgets. Two examples are provided: one to display the slowest queries and one to show the space used by tables. The performance widget runs off of the Query Store data, so that feature must be enabled.
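Query Store is enabled per database; a minimal example, with the database name as a placeholder:

alter database [YourDatabase] set query_store = on;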
The most interesting feature was the Integrated Terminal, which allows running command-line tasks from within the studio, including the command prompt, PowerShell, and BCP.
Operations Studio is still well short of SSMS on features, although I did like the dashboards and the Integrated Terminal. The basics seem to be covered, and I would expect more features to be added over time. I imagine the main use case is for non-Windows users, although a non-admin Windows user who just needs to query data may also find the simpler tool useful.

Links:
GitHub
Download SQL Operations Studio
Insight Widgets with examples
Integrated Terminal


Apache Spark On Azure HDInsight – Overview and Deployment

October 31, 2017

Overview:
Apache Spark is an in-memory distributed data processing framework, developed as an open source project of the Apache Software Foundation. Spark was created to work with large datasets spread across multiple nodes. Developers can use Java, Python, Scala, or R to interact with and process large datasets. Spark is generally used as part of a Hadoop deployment, and can be used to query large datasets, process streaming data, or analyze data with its machine learning libraries.

Data Sets:
Previously Spark worked with Resilient Distributed Datasets (RDDs). RDDs are read-only, in-memory datasets; the data is distributed across several nodes, and the datasets are fault tolerant.
Datasets, introduced in Spark 1.6, build on RDDs. A DataFrame is a Dataset organized into named columns. Users of R will be familiar with data frames, which are used widely in that language.

Components:
Spark Core: The central component, handling task scheduling, memory management, etc.
Spark SQL: Works with structured data, allowing a mixture of SQL queries and data manipulation in DataFrames (see the example after this list).
Spark Streaming: Works with data in streams (as opposed to batch processing of data).
MLlib: Machine learning library.
GraphX: Graph data processing.
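As a quick illustration of the Spark SQL component, data registered as a table or temporary view can be queried with ordinary SQL (from the spark-sql shell, for example). This is only a sketch; the ‘sales’ view and its columns are made up for the example:

-- 'sales' is assumed to be a table or temporary view already registered with Spark
select GroupId, sum(Amount) as TotalAmount
from sales
group by GroupId;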

Deployment:
There are several options for deployment. We can create a standalone cluster, we can download a VM from Hortonworks to deploy in VirtualBox, VMware, or Docker, or we can set up a cluster in Azure, among other options. I chose to create a cluster in Azure.
To create the cluster in Azure, go to New, then Data + Analytics, then HDInsight. Here you run through a regular Azure setup: user name/password, resource group, and so on. Once we get to the Cluster Type, we can specify that we want a Spark setup. I chose the latest version available (Spark 2.1.0) on the Standard tier.
On the next page we specify storage options. We can choose Azure Storage (blob) or Data Lake storage; I went with Azure Storage. If you don’t already have a storage account, you can set one up here. When specifying a name for the storage account, you aren’t allowed to use upper case letters, only lower case letters and numbers.
Once you create the cluster, you are billed by the hour. The cluster I created came to $3.63 an hour, with 6 nodes and 40 cores. The deployment took about 15 minutes to complete.

In a future post I’ll run through a Spark demo job and look at some use cases.

Links:
Syncfusion E-book – Spark Succinctly
Introduction to Spark on HDInsight
Apache Spark


SQL Graph

September 19, 2017

One of the new features in SQL Server 2017 (I’m using RC2) is the addition of graph database capabilities.

What is a Graph Database?
The Neo4j site has a good basic explanation of what a graph database is. Neo4j is the leading graph database system available. The main concepts are nodes and edges: a node represents an entity and can have properties stored along with it, while edges represent relationships between nodes and can have properties as well.

Table Creation:
SQL Server implements nodes and edges as relational tables. As an example, I’ll create a Player table to store data on players for a football team. The only thing new here is the ‘As Node’ clause at the end of the table creation.

drop table if exists dbo.Player;

create table dbo.Player(
	Team char(3) not null,
	Number tinyint not null,
	FirstName varchar(50),
	LastName varchar(50),
	Position char(2),
	primary key(Team, Number)
) as Node;

insert into dbo.Player values ('ATL', 2, 'Matt', 'Ryan', 'QB');
insert into dbo.Player values ('ATL', 11, 'Julio', 'Jones', 'WR');
insert into dbo.Player values ('ATL', 21, 'Desmond', 'Trufant', 'CB');
insert into dbo.Player values ('ATL', 44, 'Vic', 'Beasley', 'LB');

select * from dbo.Player;

When we select from this Node table, notice that a ‘Node Id’ column has been added. This stores JSON data with some information on each record, including a numeric id for each record.

{"type":"node","schema":"dbo","table":"Player","id":0}

Edges are created the same way as a Node, with an ‘As Edge’ clause.

drop table if exists dbo.PlatoonsWith;

create table dbo.PlatoonsWith(
	PlatoonName varchar(10) not null
) as Edge;

select * from dbo.PlatoonsWith;

As an Edge table, several columns are added automatically when the table is created: an Edge Id, a From Id, and a To Id. The From and To columns hold the node IDs that are being related.
Once the tables have been created, they can be viewed in SSMS under ‘Tables’, under a ‘Graph Tables’ directory that is new for SQL Server 2017.

Creating Relationships:
We can create relationships between nodes by inserting records into the Edge table. We’ll insert the IDs for the two Nodes that we want to relate, plus any Edge properties. In this example, we’ll relate the two offensive players (Ryan and Jones).

declare @Node1 nvarchar(1000);
declare @Node2 nvarchar(1000);

select @Node1 = [$node_id_4FE94CB5420E440DA5CE8FE7781FA404]
from dbo.Player
where Team = 'ATL' and Number = 2;

select @Node2 = [$node_id_4FE94CB5420E440DA5CE8FE7781FA404]
from dbo.Player
where Team = 'ATL' and Number = 11;

insert into dbo.PlatoonsWith([$from_id_64FB3EBE4E3449D5A59D79FB0F8FA857], [$to_id_6A42816479484605BCD86836DFE8122B], PlatoonName)
values (@Node1, @Node2, 'Offense');

select * from dbo.PlatoonsWith;

Querying Graph Data:
The MATCH keyword was added to T-SQL to aid in querying graph data. The Node table can be joined to the Edge table without the usual JOIN..ON clause, instead using the old-style FROM Table1, Table2 syntax.
For an example, we’ll find the offensive player joined to Matt Ryan.

select p1.FirstName + ' ' + p1.LastName as Player1, p1.Position as Player1Position,
	pw.PlatoonName, p2.FirstName + ' ' + p2.LastName as Player2, p2.Position as Player2Position
from dbo.Player as p1, dbo.PlatoonsWith as pw, dbo.Player as p2
where match(p1-(pw)->p2)
	and p1.Number = 2
	and p2.Team = 'ATL';

Deleting Nodes:
There isn’t any sort of protection against deleting nodes. Even if an Edge record refers to a Node, you’re still allowed to delete that Node, and the Edge record referencing the now-missing Node is left in place.
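For a quick demonstration with the tables created above, deleting the Matt Ryan node leaves the PlatoonsWith record in place, still pointing at the deleted node:

delete from dbo.Player
where Team = 'ATL' and Number = 2;

-- The edge record inserted earlier is still returned
select * from dbo.PlatoonsWith;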

Wrap Up:
A script with all of the SQL from this post is posted on GitHub.
The Graph functions are definitely useful, but hopefully some improvements will be made. I’m not sure why the auto-generated columns include the extra characters appended to the column names; that does make them a little harder to query, since you have to go retrieve the column names after the table is created. It would also be useful if there were some sort of visualization tool to view the Nodes and their relations.

Links:
SQL Graph Overview – Microsoft
SQL Graph – Niko Neugebauer


Links – Database Articles

September 18, 2017

Links to a few interesting database articles I’ve run across.

Coding Geek: How does a relational database work?: A good overview on relational database architecture.

Simple Talk: SQL Code Smells: List of issues to look for in writing queries and in database design.

Carnegie Mellon: Advanced Database Systems 15-721: Readings from Andy Pavlo’s database internals course – Lectures are also available on YouTube.

Why SQL is beating NoSQL


SQL Server 2017 – New T-SQL Functions

August 14, 2017

Here are some new T-SQL functions added to SQL Server 2017. I’m running the RC 2 version.

TRIM:
SQL Server has separate functions for removing spaces at the start (LTRIM) and end (RTRIM) of a string. SQL Server 2017 adds the TRIM function to remove leading and trailing spaces together.

declare @Test varchar(20) = '  Test string   ';
select @Test as OriginalString, trim(@Test) as TrimmedString;

Result:
OriginalString        TrimmedString
  Test string         Test string

TRIM

CONCAT_WS:
CONCAT_WS stands for ‘concatenate with separator’. The first argument is a string that will be used as the separator. Any number of string arguments can follow, and these strings will be combined into one string, with the designated separator between each value.

select concat_ws(',', 'First', 'Second', 'Third', 'Fourth');

Result:
First,Second,Third,Fourth

CONCAT_WS

TRANSLATE:
TRANSLATE can be thought of as an extension of REPLACE. We can define a list of characters that we want replaced in a string, and then a list of the characters to replace them with.

select translate('[Test string]', '[]', '()');

Result: (Test string)
TRANSLATE

STRING_AGG:
STRING_AGG is similar to CONCAT_WS, except it will concatenate values from a column instead of a list of values. The second argument is the string to use as the value separator. We can also use the WITHIN GROUP clause to specify the order of the items in the delimited list.

drop table if exists #Test;
create table #Test(Team varchar(20) not null primary key);
insert into #Test values('Falcons'), ('Saints'), ('Panthers'), ('Buccaneers');
select string_agg(Team, ':') from #Test;
select string_agg(Team, ':') within group(order by Team desc) from #Test;

Result:
Buccaneers:Falcons:Panthers:Saints
Saints:Panthers:Falcons:Buccaneers

STRING_AGG

Links:
SQL Server 2017 New Features
SQL Shack – New 2017 String Functions


Azure Cosmos DB

July 17, 2017

Microsoft’s Azure Cosmos DB is a multi-model cloud database system, introduced in May 2017. DocumentDB previously served as Azure’s document database, and it is now part of Cosmos DB.

Set Up:
To set up an instance, we can either connect to Azure or use a local emulator to develop against, which avoids using Azure credits. For this run-through, we’ll connect to Azure.
To create a Cosmos DB instance in the Azure dashboard, click New, Databases, then select Azure Cosmos DB. After a unique ID is specified, we have to pick the database model to use: Gremlin (graph), MongoDB (document), SQL (DocumentDB), or Table (key/value). For this example, I’ll go with DocumentDB.

Data Manipulation:
Once the instance has been created, the Azure dashboard will bring up the Quick Start page. From here we can create a collection (the DocumentDB term for a table); this creates a database called ‘ToDoList’ with a collection named ‘Items’. We can then download a sample app that connects to this collection, already configured with the correct URL and an authorization key. Running the solution will start a web app for a simple to-do list.
The Quick Start also links to some sample code and additional documentation.
Within the Azure dashboard, we can also bring up the Data Explorer, which is a graphical tool to view, create, and edit documents. We can also create new collections here.
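Since this instance uses the SQL (DocumentDB) API, documents in the Items collection can also be queried here with a SQL-like syntax. A minimal sketch; the property names are illustrative and depend on what the documents actually contain:

select c.id, c.description, c.isComplete
from c
where c.isComplete = false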

Settings:
On the Azure dashboard, there are several settings that are of interest.
With ‘Replicate Data Globally’, we can create a read-only version of our instance in a different data center. For example, we could have the primary read-write instance in the US East data center, but also have a read-only instance in US West, or perhaps in a different country. This allows us to distribute our data closer to our users, as well as giving us a replicated data set in case we need to fail over to the backup instance.
Under ‘Default Consistency’, we can set the consistency level for our data. The levels range from Strong consistency, where any updates must be synchronously committed, to Eventual consistency, which gives the best performance but doesn’t guarantee that all copies of the data are up to date. The default setting is Session consistency, where a session sees strong consistency for its own updates but eventual consistency for updates made by other sessions.
The ‘Keys’ page gives the connection strings needed for our applications to connect to the database, either as read-write or as read-only.

Additional Information:
Azure – Cosmos DB Introduction
Cosmos DB – Getting Started
Syncfusion E-Book – Cosmos DB