SQLite Lesson 2
This lesson is adapted from information provided by Investigative Reporters and Editors. Much of the following includes IRE practice data for the purpose of training. I've included additional info to further clarify difficult concepts and to make it a little bit fun.
bit.ly/cunySQLp2
Queries in the movies
For part 2 of this intro to SQL, we'll tackle how to group, sum and count. Part 3 will cover how to join.
Keeping with the data about US dams from the previous lesson, let's think of a few additional questions and then translate them into queries, this time thinking about results related to quantities:
Which county has the most dams?
Which county has the greatest number of old dams?
Which river has the highest number of old dams (in all three states)?
How many dams are there in Virginia in each category of hazard?
Find the rivers and the inspection history of the dams in the county with the largest number of high hazard dams.
GROUP BY
Recall, when we ask SQL to group items, it organizes items that are alike (exactly identical) in the results that we see.
COUNT vs. SUM
Here's a great way to think about the difference between counting versus summing. Think about how many units of currency there are in circulation in the United States today.
If you count the units, they equal 14 -- we have the penny, nickel, dime, quarter, half-dollar, and two versions of the dollar as the current coins. And then the current treasury notes are the one, two, five, ten, twenty, fifty and hundred dollar bills.
Yet if you add together their values:
$0.01 + $0.05 + $0.10 + $0.25 + $0.50 + $1.00 + $1.00 + $1 + $2 + $5 + $10 + $20 + $50 + $100
their sum total amount is $190.91.
In one case you count, in the other case you sum and SQL can make the same distinction when you query your data. While you can COUNT both numeric and non-numeric data, you can only SUM numeric data. Why is this the case?
IMPORTANT: When you count and/or sum, the figures that SQL returns are not actual numbers within the data, rather they are virtual stats, the results of calculations that SQL performs on the fly based on the math that you imply in your query. Formally, they are called "aggregate functions." The other aggregate functions calculate the minimum, maximum and average results of given data.
SUM()
Here's an example of how you might use the SUM()
function in this dataset. Since we'd be looking to query the numeric values, i.e. integer data types, let's recall that one of the numeric fields explained in the definitions list was NID_Storage:
Field | Type | Description |
---|---|---|
NID_Storage | number | Accepted as general storage of the dam |
First, let's start simply, before we sum, and use just the basic SQL clauses that we already learned in the previous lesson, such as ORDER BY
, to see which dam has the most voluminous water storage capacity. Write a basic query that meets the following criteria:
1- Show me state, county, dam name and storage, using the dams table, grouped according to dam name, then ranked in order of storage capacity, highest to lowest.
Result: You should see 2392 individual dams, the largest of which, John H. Kerr Dam in Mecklenburg County, Virginia, has more than 3.3 million units of storage capacity.
Next, let's use the SUM(NID_Storage)
function, for instance, to help us add up the full amount of storage capacity held by each county. This will tell us how much water storage each county is responsible for.
2- Show me state, county, and a sum total of storage capacity, using the dams table, grouped according to county, then ranked in order of the storage managed by each county, highest to lowest.
Result: 128 individual counties, with Mecklenburg County, Virginia responsible for 3.4 million units in total capacity.
With that new knowledge, we can now use a WHERE
clause (in place of group by and order by) to query the following:
3- Show me the full list of dams that are just in Mecklenburg County along with each individual dam's storage capacity.
Result: You should get 24 records that show the given details about each dam.
Lastly...
4- Use
COUNT(dam_name)
and theWHERE
clause to round up just how many dams there are in Mecklenburg County. How would you write that query?
Result: You should see one row indicating that Mecklenburg County has 24 dams.
State | County | COUNT(dam_name) |
---|---|---|
VA | MECKLENBURG | 24 |
Answers
1-
SELECT state, county, dam_name, NID_Storage
FROM dams
GROUP BY dam_name
ORDER BY NID_Storage DESC;
2-
SELECT state, county, SUM(NID_Storage)
FROM dams
GROUP BY county
ORDER BY SUM(NID_Storage) DESC;
3-
SELECT state, county, dam_name, NID_Storage
FROM dams
WHERE county = "MECKLENBURG";
4-
SELECT state, county, COUNT(dam_name)
FROM dams
WHERE county = "MECKLENBURG";
For the rest of this lesson we're most concerned with counting.
COUNT()
One of SQL's built-in functions is COUNT()
. The count function is used within the SELECT
clause and takes columns as parameters within its parenthesis. These arguments ask SQL to count the number of rows identified within those columns.
Here's how the COUNT()
function's syntax works:
- COUNT(*)
tallies all rows
- COUNT(column)
tallies all the non-NULL values
- COUNT(DISTINCT column)
tallies all the unique non-NULL values
How would you write the following query:
5- Show me each state, plus a tally of all their rows (under the header "Total Dams"), using the dams table, then group each tally according to state and sort them greatest to least.
Result: You'll see three calculated results that are full tallies of all the dams within each state.
State | Total Dams |
---|---|
VA | 2063 |
MD | 336 |
DE | 83 |
6- Next, write a query for that returns the following result:
State | Distinct States |
---|---|
VA | 1 |
MD | 1 |
DE | 1 |
Result: This double-check will let us confirm that that this entire dataset is perhaps a subset of the national database as it only includes the various dams within the three Chesapeake region states of Virginia, Maryland and Delaware.
So, regarding this subset of data, let's find nulls as a quick integrity check. Let's use the NIDID
column and the ORDER BY
clause to check the unique ID key field for any missing values (recall, the default sort order is ascending so if there are any null values in this column then they will appear at the top of our results):
7- Show me
NIDID
, using the dams table, and sort the results in ascending order.
Result: We get all 2,482 records in the table so there are no null values in the NIDID
column.
Alternately, to double check, you could write the following:
8- Show me 'NIDID', using the dams table,
WHERE
the dam IDIS NULL
.
Result: We get 0 null values. Both queries help us conclude that there are no records with blank IDs.
So now, how do we count which county has the most dams?
Again, we use SQL's built-in COUNT()
function to tally rows.
9- Show me the state and the county fields (in case there are identical county names in multiple states), and a tally of all rows, using the dams table, then group results first by state, then by county, then sort the resulting figures in descending order.
Result: 136 rows, showing that Albemarle County, Virginia has 168 dams.
So which county has the highest number of old dams?
10- List, count, sort and rank dams that have a year of completion prior to 1974.
Result: 132 rows, showing that Albemarle County, Virginia has 55 dams built prior to 1974.
Which river has the largest number of old dams in Fauquier County?
SELECT river, COUNT(*) AS "Total"
FROM dams
WHERE county = "FAUQUIER" AND year_comp < 1974
GROUP by river
ORDER BY COUNT(*) DESC;
FYI, TR stands for tributary -- a stream that flows into a larger river or into a lake -- so the first result could indicate that there are four different tributaries into Cedar Run River. Therefore, Buck Run River is possibly the one river with the largest number of old dams on it, at three.
How many Virginia dams are there in each category of hazard?
SELECT NPDP_hazard, COUNT(*) AS "Total"
FROM dams
WHERE state = "VA"
GROUP BY NPDP_hazard
ORDER BY COUNT(*) DESC;
Result: While there are 162 high hazard dams, note that there are 874 with blank null values. This could mean the hazard assessment is undetermined (see definitions list), so you would want to check with the agency.
Which is the county with the most high hazard dams?
SELECT state, county, NPDP_hazard, COUNT(*) AS "Total"
FROM dams
WHERE NPDP_hazard = "HIGH"
GROUP BY state, county, NPDP_hazard
ORDER BY COUNT(*) DESC;
Result: 74 rows showing that Augusta County, Virginia has the highest count at 14 high hazard dams.
Answers
5-
SELECT state, COUNT(*) AS "Total Dams"
FROM dams
GROUP BY state
ORDER BY COUNT(*) DESC;
6-
SELECT state, COUNT(DISTINCT state) AS "Distinct States"
FROM dams
GROUP BY state;
7-
SELECT NIDID
FROM dams
ORDER BY NIDID;
8-
SELECT NIDID
FROM dams
WHERE NIDID IS NULL;
9-
SELECT state, county, COUNT(*)
FROM dams
GROUP by state, county
ORDER BY COUNT(*) DESC;
10-
SELECT state, county, COUNT(*) AS "Total"
FROM dams
WHERE year_comp < 1974
GROUP by state, county
ORDER BY COUNT(*) DESC;
JOIN
The next lesson will delve deeper into the JOIN
clause, but to give you an idea of how much of an impact these types of queries can have, here's an example ripped from the headlines.
In 2008, The New York Times published an investigation that drew largely from the queries its reporters ran that cross referenced LIRR retired employee data with data that recorded the institution's disability recipients. The probe exposed the alarming number retirees who were collecting unjustified disability payments from the state of New York.
A Disability Epidemic Among a Railroad's Retirees
Agents Raid Office in L.I.R.R. Disability Inquiry
Paterson Seeks Study of Benefits at L.I.R.R.
Heads rolled after this investigation appeared! The reporters accomplished their research by performing JOIN
queries, i.e. cross-referencing records and narrowing down accurate identities of individuals whose circumstances raised flags. For the next lesson, we will tackle a far less complex dataset about so-called deadbeat dads.
You might also be interested in these articles...
professed professor