SQL Server 2017 – In-Database Python Setup

January 15, 2018

With SQL Server 2016, the ability to execute R code in SQL Server was added. SQL Server 2017 added functionality to run Python code as well. This functionality for either language can be installed as part of the SQL Server installation, or added to an existing instance.

Issues:
My initial attempt at the Python installation didn’t go smoothly. Several install attempts hung without making progress. However, once I removed the existing Python 3 installation from the machine, the install was able to complete.
The Launchpad service is installed as part of the Python/R package as well. This service needs to be running in order to execute Python or R code. At first I created a new user for the service, but I ran into errors trying to execute code.
Msg 39021, Level 16, State 1, Line 55
Unable to launch runtime for 'Python' script. Please check the configuration of the 'Python' runtime.
Msg 39019, Level 16, State 2, Line 55
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).

I ended up giving the service an administrator account to run under, which cleared up this issue.
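
One way to check how the service is set up (a quick sketch; the Launchpad service shows up in sys.dm_server_services starting with SQL Server 2017):

select servicename, service_account, status_desc
from sys.dm_server_services;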

Configuration:
The ‘external scripts enabled’ setting needs to be set to 1 in order to run Python/R code.

EXEC sp_configure 'external scripts enabled', 1;
GO
RECONFIGURE;
GO
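
After reconfiguring (restarting the Launchpad service may also be necessary), a minimal smoke test confirms that external scripts can run at all:

exec sp_execute_external_script
	@language = N'Python',
	@script = N'print("Python in SQL Server is working")';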

Execution:
I’ll create a small table to have some test data.

drop table if exists dbo.Team;
create table dbo.Team(Team varchar(20) not null primary key);
insert into dbo.Team values('Falcons'), ('Saints'), ('Panthers'), ('Buccaneers');

And then run a simple loop to read the team names from the table and print them; the output appears in the Messages pane rather than the results grid.

exec sp_execute_external_script
	@language = N'Python',
	@script = N'for x in InputDataSet.Team: print(x)',
	@input_data_1 = N'select Team from dbo.Team';

InputDataSet is the default name for the data set read in from the database. The sp_execute_external_script proc has an @input_data_1_name parameter to set a different name, if so desired.
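
As a sketch, here is the same query with the input data set renamed to ‘Teams’ (an arbitrary name) via that parameter:

exec sp_execute_external_script
	@language = N'Python',
	@script = N'for x in Teams.Team: print(x)',
	@input_data_1 = N'select Team from dbo.Team',
	@input_data_1_name = N'Teams';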

Additional Links:
sp_execute_external_script
Set up Python Machine Learning Services

SQL Server Tools List

December 31, 2017

SQL Server Tools
Here is a collection of SQL Server tools I’ve run across. These are all free tools, some from Microsoft and some from the SQL Server community.

SQL Cop: Runs checks on a database to find potential issues, like tables missing primary keys, use of deprecated data types, etc.

Microsoft Assessment and Planning (MAP) Toolkit for SQL Server: Lists all SQL Server instances in an environment, including version and edition, as well as information on the host (OS, number of cores, etc.).

SQLQueryStress: Developed by Adam Machanic, this is a tool for load testing queries. It can spin up as many as 200 threads to generate a test load.

SQL Server Diagnostics: An add-in for SSMS (2016 or later). It can analyze memory dumps and give recommendations for issues. It also recommends Cumulative Updates that contain fixes for the specific features in use on an instance.

SQL Server Migration Assistant: Tools to move data into SQL Server from Access, Oracle, MySQL and other databases.

Database Experimentation Assistant: Testing program to identify potential issues when upgrading to a newer version of SQL Server.

Tiger Team GitHub: Various tools from the SQL Server engineering team. The SQL Performance Dashboard reports in particular are very useful.

DiskSpd: Tool for testing storage performance. It takes the place of the older SQLIO utility.

Running Totals

November 28, 2017

The hosts of the SQL Server Radio podcast put up a blog post to go with their episode on calculating running totals in SQL Server. Their script (linked in the post) covers several different ways to calculate a running total in a query (the value in the current row added to the sum of the values in all of the rows that came before it). The best method (in my view) is using the SUM function with the OVER clause; most of the other methods involve a cursor or some other messy approach.
I imagine most developers familiar with T-SQL have used the OVER clause with ROW_NUMBER or some other windowing function, but the OVER clause can also be used with aggregate functions like COUNT, SUM, or AVG.
An example:

create table [dbo].[RunningTotal](
	RecordId int identity(1,1) NOT NULL,
	GroupId tinyint NOT NULL,
	Amount decimal(6, 2) NOT NULL
);

insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 56.78);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 55.66);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 33.33);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 22.10);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (3, 98.76);

select GroupId, Amount, 
	sum(Amount) over(order by RecordId) as RunningSum,
	sum(Amount) over(partition by GroupId order by RecordId) as RunningSumByGroup
from dbo.RunningTotal;
go

This example gives a running total over the whole dataset, plus a running total by Group (GroupId).
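
One note: when an OVER clause has an ORDER BY but no frame specification, SQL Server defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Spelling out a ROWS frame generally performs better and behaves deterministically if the ordering column has ties; a sketch of the same running total with an explicit frame:

select GroupId, Amount,
	sum(Amount) over(order by RecordId
		rows between unbounded preceding and current row) as RunningSum
from dbo.RunningTotal;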


Microsoft SQL Operations Studio

November 27, 2017

Microsoft has released a new database management tool called Microsoft SQL Operations Studio. The new studio runs on Windows, Linux, and macOS, and is free and open source. It can manage on-premises SQL Servers as well as Azure instances. There’s no installer; you just unzip the files and run the EXE to start the studio.
SSMS users will be familiar with most features: Object Explorer, T-SQL code window with Intellisense, backup and restore, and so on.
There is built-in support for Git source control. You can also create dashboard widgets to display performance and maintenance data, which they call Insight Widgets. Two examples are provided: one to display the slowest queries and one to show the space used by tables. The performance widget runs off of Query Store data, so that feature must be enabled (see the snippet below).
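
Enabling Query Store is a one-line change (substitute the database name; [YourDatabase] here is just a placeholder):

alter database [YourDatabase] set query_store = on;
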
The most interesting feature was the Integrated Terminal, which allows running command-line tasks from within the studio, including the command prompt, PowerShell, and BCP.
Operations Studio is still way short of SSMS on features, although I did like the dashboards and the Integrated Terminal. The basics seem to be covered, and I would expect more features to be added over time. I imagine the main use case is non-Windows users, although a non-admin Windows user who just needs to query data may also find the lighter-weight tool useful.

Links:
GitHub
Download SQL Operations Studio
Insight Widgets with examples
Integrated Terminal


Apache Spark On Azure HDInsight – Overview and Deployment

October 31, 2017

Overview:
Apache Spark is an in-memory distributed data processing framework maintained by the Apache Software Foundation. Spark was created to work with large datasets across multiple nodes. Developers can use Java, Python, Scala, or R to interact with and process large datasets. Spark is generally used as part of a Hadoop deployment, whether to query large datasets, process streaming data, or analyze data with its machine learning libraries.

Data Sets:
Originally Spark worked with Resilient Distributed Datasets (RDDs). RDDs are read-only, in-memory datasets; the data is distributed across several nodes, and the data sets are fault tolerant.
Datasets, introduced in Spark 1.6, build on RDDs. A DataFrame is a Dataset organized into named columns. Users of R will be familiar with DataFrames, which are used widely in that language.

Components:
Spark Core: The central component, handling task scheduling, memory management, etc.
Spark SQL: Works with structured data, allowing a mixture of SQL queries and data manipulation in DataFrames.
Spark Streaming: Works with data in streams (as opposed to batch processing of data).
MLlib: Machine learning library.
GraphX: Graph data processing.

Deployment:
There are several options for deployment: we can create a standalone cluster, download a VM from Hortonworks to deploy in VirtualBox, VMware, or Docker, or set up a cluster on Azure, among other options. I chose to create a cluster in Azure.
To create the cluster in Azure, go to New, then Data + Analytics, then HDInsight. Here you run through the usual Azure setup: user name/password, resource group, etc. Once we get to Cluster Type, we can specify that we want a Spark setup. I chose the latest version available (Spark 2.1.0) on the Standard tier.
On the next page we specify storage options. We can choose Azure Storage (Blob) or Data Lake storage; I went with Azure Storage. If you don’t already have a storage account, you can set one up here. When specifying a name for the storage account, upper-case letters aren’t allowed, only lower-case letters and numbers.
Once you create the cluster, you are billed by the hour. The cluster I created came to $3.63 an hour, with 6 nodes and 40 cores. The deployment took about 15 minutes to complete.

In a future post I’ll run through a Spark demo job and look at some use cases.

Links:
Syncfusion E-book – Spark Succinctly
Introduction to Spark on HDInsight
Apache Spark


SQL Graph

September 19, 2017

One of the new features in SQL Server 2017 (I’m using RC2) is the addition of graph database capabilities.

What is a Graph Database?
The Neo4j site has a good basic explanation of what a graph database is. Neo4j is the leading graph database system available. The main concepts are nodes and edges: a node represents an entity and can have properties stored along with it, while edges represent relationships between nodes and can have properties as well.

Table Creation:
SQL Server implements nodes and edges as relational tables. As an example, I’ll create a Player table to store data on players for a football team. The only thing new here is the ‘as Node’ clause at the end of the CREATE TABLE statement.

drop table if exists dbo.Player;

create table dbo.Player(
	Team char(3) not null,
	Number tinyint not null,
	FirstName varchar(50),
	LastName varchar(50),
	Position char(2),
	primary key(Team, Number)
) as Node;

insert into dbo.Player values ('ATL', 2, 'Matt', 'Ryan', 'QB');
insert into dbo.Player values ('ATL', 11, 'Julio', 'Jones', 'WR');
insert into dbo.Player values ('ATL', 21, 'Desmond', 'Trufant', 'CB');
insert into dbo.Player values ('ATL', 44, 'Vic', 'Beasley', 'LB');

select * from dbo.Player;

When we select from this Node table, notice that a node id column has been added. It stores JSON data with some information on each record, including a numeric id for each record.

{"type":"node","schema":"dbo","table":"Player","id":0}

Edges are created the same way as a Node, with an ‘as Edge’ clause.

drop table if exists dbo.PlatoonsWith;

create table dbo.PlatoonsWith(
	PlatoonName varchar(10) not null
) as Edge;

select * from dbo.PlatoonsWith;

As an Edge table, several columns are added automatically when the table is created: an edge id, plus from id and to id columns. The from and to columns hold the node IDs that are being related.
Once the tables have been created, they can be viewed in SSMS under ‘Tables’, under a ‘Graph Tables’ directory that is new for SQL Server 2017.
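
The graph tables can also be found through the metadata views; sys.tables picked up is_node and is_edge columns in SQL Server 2017:

select name, is_node, is_edge
from sys.tables
where is_node = 1 or is_edge = 1;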

Creating Relationships:
We can create relationships between nodes by inserting records into the Edge table. We’ll insert the IDs for the two Nodes that we want to relate, plus any Edge properties. In this example, we’ll relate the two offensive players (Ryan and Jones).

declare @Node1 nvarchar(1000);
declare @Node2 nvarchar(1000);

select @Node1 = [$node_id_4FE94CB5420E440DA5CE8FE7781FA404]
from dbo.Player
where Team = 'ATL' and Number = 2;

select @Node2 = [$node_id_4FE94CB5420E440DA5CE8FE7781FA404]
from dbo.Player
where Team = 'ATL' and Number = 11;

insert into dbo.PlatoonsWith([$from_id_64FB3EBE4E3449D5A59D79FB0F8FA857], [$to_id_6A42816479484605BCD86836DFE8122B], PlatoonName)
values (@Node1, @Node2, 'Offense');

select * from dbo.PlatoonsWith;

Querying Graph Data:
The MATCH keyword was added to T-SQL to aid in querying graph data. The Node table can be joined to the Edge table without the usual JOIN..ON clause, instead using the old-style Table1, Table2 syntax.
For an example, we’ll find the offensive player joined to Matt Ryan.

select p1.FirstName + ' ' + p1.LastName as Player1, p1.Position as Player1Position,
	pw.PlatoonName, p2.FirstName + ' ' + p2.LastName as Player2, p2.Position as Player2Position
from dbo.Player as p1, dbo.PlatoonsWith as pw, dbo.Player as p2
where match(p1-(pw)->p2)
	and p1.Number = 2
	and p2.Team = 'ATL';

Deleting Nodes:
There isn’t any sort of protection against deleting nodes. Even if an Edge record refers to a Node, you’re still allowed to delete that Node; the Edge record with the orphaned node reference is left in place, as the sketch below shows.
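
A quick sketch of the behavior, using the tables from above:

delete from dbo.Player
where Team = 'ATL' and Number = 2;

-- The edge row remains, still referencing the deleted node.
select * from dbo.PlatoonsWith;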

Wrap Up:
A script with all of the SQL from this post is posted on GitHub.
The graph functions are definitely useful, but hopefully some improvements will be made. I’m not sure why the auto-generated columns include the extra characters appended to the column names; that makes them a little harder to query, since you have to go retrieve the column names after the table is created (one way is sketched below). It would also be useful if there were some sort of visualization tool to view the nodes and their relations.
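
One way to look the generated names up afterwards is to query sys.columns (a sketch; the underlying column names start with $node_id, $from_id, and $to_id):

select object_name(object_id) as TableName, name as ColumnName
from sys.columns
where name like '$node_id%'
	or name like '$from_id%'
	or name like '$to_id%';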

Links:
SQL Graph Overview – Microsoft
SQL Graph – Niko Neugebauer


Links – Database Articles

September 18, 2017

Links to a few interesting database articles I’ve run across.

Coding Geek: How does a relational database work?: A good overview on relational database architecture.

Simple Talk: SQL Code Smells: List of issues to look for in writing queries and in database design.

Carnegie Mellon: Advanced Database Systems 15-721: Readings from Andy Pavlo’s database internals course – Lectures are also available on YouTube.

Why SQL is beating NoSQL