This post is about validating data in a single table. We have gathered data from the Internet and we have stored it into a single table but we are not sure if the data is correct. We need to run a series of queries on the table to help us identify errors. In this post we will use the example of a table containing data on fruits. There are types of fruits and there are sub-types of fruits called varieties. Below is a screen shot of just a small part of the data with some sample data. When we are finished collecting the data we expect that there will be thousands of rows, possible tens of thousands or rows. We don’t want to go through row by row to find errors in the data, rather we want to devise a series of queries against the table that will help us identify possible errors so that we can correct them.
Another example of validation is discussed in the post called SQL Server Finding Bad Characters. We will not discuss the existence of unacceptable characters in data columns in this post.
Below is a screen shot of some fruit data in a table called Fruits. After doing some research, we have determined that the fruit experts will allow us to classify a fruit type as Oranges and a sub-type as Mandarin. For our purposes, it is okay that we have two rows of Mandarins with the sub-sub-types of Clementines and Tangarines. Normally however we will not have sub-sub-types.
We will look at two different cases. The first case is the more likely case.
The data in the following Fruit table is clean and acceptable. Later we will look at some data that is “dirty” and the queries we can use to identify those dirty rows.
The data in the following FruitsMandarin table is clean and acceptable. Later we will look at some data that is “dirty” and the queries we can use to identify those dirty rows.
- All columns must contain data. NULL in any one of the columns is not acceptable.
- There is a one-to-one relationship between the FruitName and the FruitCode. If you are given the fruit name you must return exactly one non-null fruit code. For each fruit code there is exactly one fruit name.
- There is a one-to-many relationship between the fruit code and fruit variety code. Each type of fruit has one or many sub-types, or varieties. Each fruit type must have at least one sub-type (variety). NULL varieties are not acceptable.
- The fruit variety code does not have to be unique in the table. Notice that 09 appears two times. That is acceptable.