Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality

Density: Density, when used to describe the data in a column, is a measure of how often duplicate values occur in that column. Another way to think of density is as a measure of the uniqueness of the data in a column: the higher the density, the less unique the data. Density values range from 0 to 1.0. There are different (but equivalent) ways to express density:

Density = 1/[# of distinct values in a column]

Density = Avg. number of duplicates for a given value / Total row count
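
To make the first formula concrete, here is one way to compute a column's density directly (a minimal sketch; table1 and column1 are the same hypothetical names used in the predicate examples below):

-- Density of column1 = 1 / (# of distinct values in column1)
SELECT 1.0 / COUNT(DISTINCT column1) AS density
FROM table1;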

Q: What is the density of a column that contains only a single value repeated in every row? 

A: 1.0 — the highest possible density.

Q: What is the density of a unique column, such as an IDENTITY or primary key column? 
A: 1/[row count]
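
In SQL Server you can also inspect the density values the optimizer actually stores for an index's statistics. A minimal sketch, assuming a table dbo.table1 with an index named IX_table1_column1 (both names hypothetical):

-- Show the density vector stored for the index's statistics
DBCC SHOW_STATISTICS ('dbo.table1', 'IX_table1_column1') WITH DENSITY_VECTOR;
-- The "All density" column equals 1 / [# of distinct values] for each left-based subset of the key columns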

Predicate: A predicate is an expression that evaluates to True or False. Predicates appear in a WHERE or JOIN clause in a database query. They are used to either filter out or qualify rows.

Here’s a query with two filter predicates:

SELECT *
FROM table1
WHERE column1 > 20 AND column2 IS NULL

And a query with one join predicate:

SELECT *
FROM table1
INNER JOIN table2 ON table1.column1 = table2.column1

Selectivity: The fraction of rows from the input set of the predicate that satisfy the predicate.

Selectivity is also a measure of uniqueness. High selectivity implies high uniqueness, or a small number of matching rows. Low selectivity implies low uniqueness, or a high percentage of matching rows.

Selectivity is most commonly used to describe a filter predicate.

Selectivity for a filter predicate against a base table can be calculated as “[# rows that pass the predicate]/[# rows in the table]”.

Note that a selectivity of 0.000001 is considered high even though the number is small, while 1.0 is considered low selectivity even though the number is larger.
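
As an illustration, the selectivity of the filter predicate from the earlier example could be measured like this (a minimal sketch reusing the hypothetical table1 and column1):

-- Selectivity = [# rows that pass the predicate] / [# rows in the table]
SELECT SUM(CASE WHEN column1 > 20 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS selectivity
FROM table1;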

Cardinality estimate: An estimate of the size of a result set. For example, if a table T has 100,000 rows and a query contains a selection predicate of the form T.a=10, and a histogram shows that the selectivity of T.a=10 is 10%, then the cardinality estimate for the fraction of rows of T that must be considered by the query is 10% * 100,000 = 10,000.

Cardinality can be thought of as the number of rows returned by a query operator.

Each operator in a query plan has an estimated cardinality (the number of rows the optimizer guessed that the operator would return) and an actual cardinality (the number of rows that the operator returned in the real world).

Sometimes the query optimizer cannot accurately predict the number of rows that a given operator will return. This can prevent SQL Server from estimating the cost of a query plan correctly, which can lead to the selection of a suboptimal plan. Cardinality estimation errors are one of the most common causes of slow query plans in SQL Server, so it is very important to know how to identify cardinality estimation problems in a query plan.
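
One simple way to spot such problems is to run the query with plan statistics enabled and compare the estimated and actual row counts for each operator. A minimal sketch, reusing the hypothetical table1:

SET STATISTICS PROFILE ON;   -- output includes a Rows (actual) and an EstimateRows (estimated) column per operator
SELECT * FROM table1 WHERE column1 > 20;
SET STATISTICS PROFILE OFF;

Large gaps between EstimateRows and Rows for an operator point to a cardinality estimation problem.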

To summarize:

  • Indexes have density, a measure of how unique the left-based subsets of their key columns are;
  • Predicates have selectivity, a measure of what portion of the table they affect;
  • Operators have cardinality, a measure of how many rows the operator processes.

References:

http://blogs.msdn.com/b/bartd/archive/2011/01/25/query_5f00_tuning_5f00_key_5f00_terms.aspx

http://msdn.microsoft.com/en-us/library/dd535534(v=sql.100).aspx

http://sqlinthewild.co.za/index.php/2015/10/06/index-selectivity-and-index-scans/


Basics of Normalization

From http://sqlmag.com/database-performance-tuning/sql-design-why-you-need-database-normalization

When you normalize a database, you have four goals:

  • Arranging data into logical groupings such that each group describes a small part of the whole
  • Minimizing the amount of duplicate data stored in a database
  • Organizing the data such that, when you modify it, you make the change in only one place
  • Building a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage.

A poorly normalized database and poorly normalized tables can cause problems ranging from excessive disk I/O and subsequent poor system performance to inaccurate data. An improperly normalized condition can result in extensive data redundancy, which puts a burden on all programs that modify the data.

First Normal Form

For a table to be in 1NF you need to ensure that the data is atomic, having no repeating groups.

A data item is atomic if only one item is in each cell of a table.

Ex: Attribute ShippedTo encompasses a street address, a city, a state, a postal code, and a country abbreviation. To render this data atomic, you separate this single attribute into several—ShipAddr, ShipCity, ShipState, ShipZip, and ShipCountry.

Repeating groups are cells that have more than one occurrence.

1NF also requires that a table have a primary key.
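
A minimal sketch of the ShippedTo example above, splitting the single attribute into atomic columns (the table names and data types are hypothetical):

-- Not in 1NF: one cell holds an entire shipping address
CREATE TABLE OrdersDenormalized (
    OrderID   INT PRIMARY KEY,
    ShippedTo VARCHAR(400)   -- street, city, state, postal code, and country crammed into one column
);

-- In 1NF: each cell holds a single, atomic value
CREATE TABLE Orders (
    OrderID     INT PRIMARY KEY,
    ShipAddr    VARCHAR(100),
    ShipCity    VARCHAR(60),
    ShipState   CHAR(2),
    ShipZip     VARCHAR(10),
    ShipCountry CHAR(3)
);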

Second Normal Form

2NF is a condition of full functional dependency on the whole primary key: every non-primary-key attribute must be fully functionally dependent on the whole primary key, not on just part of it.
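
A minimal sketch of a 2NF violation and its fix, using a hypothetical order-detail table with a composite primary key:

-- Violates 2NF: ProductName depends only on ProductID, which is just part of the composite key
CREATE TABLE OrderDetails_Bad (
    OrderID     INT,
    ProductID   INT,
    ProductName VARCHAR(100),
    Quantity    INT,
    PRIMARY KEY (OrderID, ProductID)
);

-- In 2NF: product attributes move to their own table, keyed by ProductID alone
CREATE TABLE Products (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(100)
);

CREATE TABLE OrderDetails (
    OrderID   INT,
    ProductID INT REFERENCES Products (ProductID),
    Quantity  INT,
    PRIMARY KEY (OrderID, ProductID)
);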

Third Normal Form

You achieve 3NF when you have resolved all transitive dependencies. You’ll have to test the attributes in each table to see whether, within a table, any non-key attribute determines the value of another non-key attribute. Such a determination defines a transitive dependency. Attributes that do not contribute to the description of the primary key are removed from the table.
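
A minimal sketch of a transitive dependency and its resolution, using a hypothetical employee table in which DeptName is determined by DeptID rather than by the primary key:

-- Violates 3NF: DeptName (non-key) is determined by DeptID (another non-key attribute)
CREATE TABLE Employees_Bad (
    EmpID    INT PRIMARY KEY,
    EmpName  VARCHAR(100),
    DeptID   INT,
    DeptName VARCHAR(100)
);

-- In 3NF: the transitive dependency is removed into its own table
CREATE TABLE Departments (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(100)
);

CREATE TABLE Employees (
    EmpID   INT PRIMARY KEY,
    EmpName VARCHAR(100),
    DeptID  INT REFERENCES Departments (DeptID)
);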

  • Each table is a flat file, or spreadsheet format, with all-atomic data items, no repeating groups, and a designated pkey.
  • Each table has all non-pkey attributes fully functionally dependent on the whole pkey.
  • All transitive dependencies are removed from each table.