SQL Antipatterns by Bill Karwin

March 27, 2017

One of my favorite database books is SQL Antipatterns by Bill Karwin. The book lists many approaches taken in database design, querying and development that may be suboptimal, and gives different approaches to solve a particular issue.
For each issue, the author will list the objective, describe the antipattern and how to recognize it, legitimate use cases for the antipattern, and a suggestion on an alternate way of achieving the objective.
Here is a list of the antipatterns described, along with the solution to each. You’ll want to read the book to get more detail on each item.

Logical Database Design

1. Storing comma separated list of values.
Solution: Use an intersection table for a many to one relationship

2. Hierarchy – storing a parent ID – Adjacency list
Solution: Use a path structure (1/5/7/)

3. Always having an incrementing integer as the PK
Solution: Consider natural and compound keys

4. Not using constraints, thinking that leads to simplier design
Solution: Use constraints

5. Using the Entity-Attribute-Value design
Solution: Each entity has its own table

6. Using a “Dual-Purpose foreign key” – Storing an ID that could refer to values from multiple tables, plus a 2nd column to store the table name.
Solution: Set up a proper foreign key relationship

7. Repeating columns – Column1, Column2, Column3
Solution: Set up one to many relationship

8. Splitting data in separate tables based on values – Sales2015, Sales2016, Sales2017
Solution: Use partitioning

Physical Database Design

1. Using the Float data type to store fractional values when exact values are important
Solution: Solution: Use the Numeric type

2. Specifying valid values for a column in a check constraint
Solution: Solution: Put valid values in a table and create a foreign key

3. Storing paths to external files – Stored outside the database where they can’t be managed by the database system.
Solution: Store data in a Blob, or use Filestream to manage the files

4. No indexes or creating too many indexes
Solution: Analyze usage data to understand where indexes are needed

Query

1. Not allowing NULLs and using a different value to represent missing data
Solution: Use NULL to represent missing data

2. Listing columns in a SELECT clause that aren’t either listed in a GROUP BY clause or contained in an aggregate function (Not valid in SQL Server, which will raise an error)
Solution: Follow the rules for GROUP BY

3. In trying to pick a record at random, sorting the dataset by a random number (Like RAND or NEWID) and taking the first record.
Solution: Find other ways to pick a random value. In SQL Server, use TABLESAMPLE (1 ROW).

4. Simulating a search engine by using LIKE ‘%SearchTerm%’.
Solution: In SQL Server, use full-test search

5. Trying to return complex datasets all in one query – Lots of times there are unintended results, like Cartesian joins.
Solution: Break a complex problem into smaller steps

6. Depending on column order, with SELECT * or INSERT without explicitly listing columns.
Solution: Avoid wildcard queries and explicitly list columns for INSERTs

Application Development

1. Storing passwords in plain text
Solution: Store a salted hash of the password

2. Added unverified text to a SQL query – Open to SQL injection.
Solution: Filter input or use parameterized queries.

3. Backfilling any gaps in identity columns.
Solution: Treat keys as a uniwue indentifier, not as a row number.

4. Ignoring return values from a database API in an effort to have simpler or less code.
Solution: Store any dynamically generated SQL and return values from executing queries to aid in troubleshooting.

5. Not treating SQL as code – No source control, tests, etc.
Solution: Treat SQL as any other application code.

6. MVC – Using the Model as an Active Record.
Solution: Decouple the model from the tables.

Advertisements

Data Science Laboratory System

October 24, 2014

Data Science Laboratory System:

Buck Woody has a great series of posts where he runs through the setup and use of several different types of database systems as well as other data processing tools.
The last post in the series is at:
Buck Woody: Data Science Laboratory System
Which links back to the previous posts as well.

In past months I’ve copied Mr Woody’s idea and worked my way through various other database systems:

New SQL – Distributed SQL Database(VoltDB)

Graph Database (Neo4j)

Document Database (MongoDB)

Key/Value Database (Redis)

Column Store Database (Cassandra)

SQL Database (PostGreSQL)


Beginning SQL

June 26, 2014

A friend was interested in learning SQL, so I wanted to put together a reference to set up a learning environment.

Online Execution:

The easiest course is to find an online resource to run SQL. A great resource is at SQL Fiddle. This will allow you to setup some tables in the Schema Panel, and then run commands in the query window.
I’ve posted a script at Github that will build and populate some simple tables to use as a data source. First, select the engine version to one of the SQL Server options (MS SQL Server 2012), the default is set to MySQL. Paste the code into the Schema Panel and click ‘Build Schema’. Once that’s completed, you can run SQL commands in the right-hand query window, and results will appear in the bottom pane.

SQL Server Express:

Another option would be to install the SQL Server tools locally to run commands.
SQL Server Express is available as a free download. From this page, select ‘Download’ and select ‘SQLEXPRWT_x64_ENU.exe’ (assuming you want the 64 bit install) then click ‘Next’. Make sure to get the edition with tools, to make sure you get SSMS. You can take most of the defaults, although I recommend choosing Mixed Authentication mode instead of Windows Authentication only.
Scott Hanselman also has a blog post to link to the download page.
Most examples seen for SQL Server are against the Adventureworks database. It is no longer included with the SQL Server install, but it is available to download.

Reference and Tutorials:

Zed Shaw has a great series of programming primers in the ‘Hard Way’ series. He has SQL resources available at SQL The Hard Way.

The best reference for T-SQL will be SQL Server Books Online.


VoltDB – Part 2 – Creating and Starting Database

March 7, 2014

My previous post on VoltDB went over installing the product. In this post, I was going to go over creating a database and working with data.

Creating and Starting database:
First, you put together a SQL script with the schema definition. For my test, I just created a couple of tables, using standard data types (int, varchar, etc.), then save the SQL script to my home directory.
Second, compile the schema definition into what VoltDB calls an ‘Application Catalog’. From the command line, run:
voltdb compile -o test.jar test.sql

where test.sql is the table definitions put together in step 1, and test.jar is the output file.
This step will also create insert, update and delete procedures for each table. Commands and file names in the terminal emulator are case-sensitive.

Third, the JAR database definition is used to start the database. From the command line, run:
voltdb create test.jar

Once you see ‘Server completed initialization’ then the database is ready to use. In the command window, select ‘File’ => ‘Open Tab’ to open a 2nd command window. Run command:
voltdb –version

to see the version number.

Working With Data:
In that same 2nd window, run the command:
sqlcmd

to start the SQL command interface. You should see ‘SQL Command :: localhost:21212’ (21212 is the default port) and then a new line with ‘1>’. From here you can execute SQL commands against the database. Be sure to use a semicolon at the end of each command

To leave the SQL Command window type:
exit

And to stop the database service:
voltadmin shutdown


VoltDB – Installation

February 27, 2014

Last month I wrote on some talks by Michael Stonebraker concerning the future of relational databases. Currently he works on the VoltDB project, so I was interesting in learning more about it.

Overview:
Volt is a distributed database system, which supports sharding data across multiple nodes, much like a NoSQL database such as MongoDB. However, it is a relational database, ACID compliant, and uses SQL for data access. It is designed to access data in-memory, avoiding locking behaviors found in a traditional RDBMS, but can still persist data to disk.

Prerequisites:
VoltDB runs on either 64 bit Linux or OS X. Here I detail my set up with a Linux OS.
The only other component I needed to install was the latest Java SDK. From the terminal emulator (command line):
sudo apt-get install openjdk-7-jdk

Installing VoltDB:
The VoltDB product comes in two editions: Enterprise and Community. I chose to download the Enterprise edition, which comes with a 30 day free trial, as a Debian package.
After downloading the package to the home directory, the product can be installed from the terminal emulator:
sudo dpkg -i voltdb_4.0.2.3-1_amd64.deb
where voltdb_4.0.2.3-1_amd64.deb is the full name of the downloaded Debian package.

The files were installed under the /usr/lib/voltdb directory.

I’ll continue with a part 2 on creating a database in VoltDB.


Amazon Web Services

December 25, 2012

Amazon Web Services
Updated: July 24, 2016

AWS – Amazon Web Services
EC2 – Elastic Compute Cloud – Virtual machines
S3 – Simple Storage Service – Can store files along with file metadata
SimpleDB – Key-value storage
DynamoDB – NoSQL Database – Stores attributes for an item in key-value pairs – SimpleDB is capped in size, plus Dynamo can run on SSDs – More expensive than SimpleDB, but is faster and can work on larger datasets.
Amazon EMR – Elastic MapReduce
Amazon RDS – Relational Database Service
Amazon ElastiCache – In-memory cache
Amazon Glacier – ‘Cold Data’ – archive of infrequently used data
Amazon EBS – Elastic Block Store
Amazon Redshift – Data warehouse service
Amazon Aurora – MySQL compatible DB


NuoDB

December 24, 2012

Another database system I’ve checked into is NuoDB. NuoDB describe themselves as an ‘Emergent’ database system, which they define as “simple and autonomous actions by components that produce complex, coordinated behaviors in the overall system”.
NuoDB is horizontally scalable, like many NoSQL systems, but supports SQL along with ACID transactions.It is asynchronous and uses peer-to-peer communication, with no single point of failure. NuoDB can use a normal file system for storage or can use a cloud service (Like Amazon S3).

In a deployment, at the top level is a domain, which is a collection of hosts.

There are three layers to a NuoDB deployment:
1) Management:
Broker – Manages the Transaction and Storage engines
Agent – runs on a host – Manages the processes running on a computer- Each host must have an agent.
2) SQL – Transaction engine
3) Storage – Storage Manager – co-ordinates the data storage to a file system or a cloud service

NuoDB also has an concept of a ‘chorus’, which is a database along with the processes that support it. A chorus can go across multiple domains.
NuoDB includes a web console to manage the different processes.
So processes can be added and taken offline on the fly, the broker will take care of routing the requests. A chorus can also have multiple brokers for redundancy.