If this page does not print out automatically, select Print from the File menu.

Hands on: SQL conditions and operators

We continue a tour of SQL querying and look at a solution to the HMRC’s woes

Mark Whitehorn, Personal Computer World 20 Mar 2008

As I said in a previous column, knowing some SQL gives a great boost to your querying skills, and so the theme continues here.

We’ve already met the WHERE clause, as in:
SELECT LastName, FirstName
FROM People
WHERE FirstName=”Bill”;
which returns anyone in the People table with the first name of Bill. The WHERE clause is called a ‘condition’: it defines the conditions under which we want rows returned. The equals sign that follows WHERE is an ‘operator’ and is one of many that can be used.

A condition is evaluated and the rows for which it evaluates as true are included in the answer table. Apart from using the = operator, I slipped one other into the previous column mentioned above:
SELECT FirstName, LastName
FROM People
WHERE (3<4);

In this query the condition is ‘where three is less than four’ and will always evaluate to true and therefore return all rows.

Given the table shown in the accompanying picture, we can find all people with more than three children:
SELECT FirstName, ChildNo
FROM People
WHERE ChildNo >3;
(Q01 – see below for an explanation of the numbers in brackets.)

This returns four rows. (The file DBCMAR08.MDB on the cover disc for the March 2008 issue of PCW contains all the queries here, each has a number (ie (Q01) shown after the query). To exclude people with six or more children, you’d alter the condition to:
WHERE ChildNo>3 AND ChildNo<6;
(Q02)
and two rows are returned. Alternatively you could use IN to identify the acceptable values, which returns the same two rows:
WHERE ChildNo IN (4,5);
(Q03)

IN also works with text:
WHERE HomeTown IN (‘London’,’Wigan’,’Bedford’);
(Q04)
finds people from these towns. For numbers of children of between two and four, it’ll be no surprise that you’d use the BETWEEN… AND operator:
WHERE ChildNo BETWEEN 2 AND 4;
(Q05)

Numbers of two, three and four are returned. This operator can also be used with text strings, for instance:
WHERE HomeTown BETWEEN 4
‘Edinburgh’ AND ‘London’;
(Q06)
returns all home towns that are alphabetically between, and including, Edinburgh and London. The six comparison operators work with numbers, text and dates, so to find all home towns after (alphabetically speaking) but not including Hereford you can use:
WHERE HomeTown>’Hereford’;
(Q07)
and this:
WHERE HomeTown>=’Hereford’;
(Q08)
includes Hereford, while this:
WHERE People.DateJoined>4
=#1/3/2004#;
(Q09)
will return all rows with join dates on or later than 1 March 2004.

To identify people for whom we don’t know certain information, IS NULL works with numeric or text values, like this:
WHERE ChildNo IS NULL;
(Q10)
WHERE HomeTown IS NULL;
(Q11)

Further operators
There are also three logical operators that are processed after any of the operators in the list above (though it is possible to alter the order in which they’re performed by the use of brackets). We could write:
WHERE ChildNo=2 OR ChildNo=4;
(Q12)
to find people with two or four children. We get an answer if either one of these conditions is met, as you can see in:
WHERE ChildNo=2 Or ChildNo=8;
(Q13)

There are no people with eight children but the query still returns the rows for those with two. Hopefully you should be able to work out what these next two queries do by now. Try to guess the output before running them:
WHERE ChildNo=3 AND 4
HomeTown=’London’;
(Q14)
WHERE ChildNo=3 AND NOT 4
HomeTown=’Leeds’;
(Q15)

As you can see, SQL operators and predicates can add immense flexibility to your queries.

Head for the hills…
O’Reilly’s Head First series aims to make the learning process engaging and entertaining – a truly laudable goal. But applying a formula – graphics, diagrams, annotations, crosswords, quizzes, questions and exhortations to ‘be the table’ – doesn’t guarantee to do the trick when it isn’t allied to accuracy.

This is apparent in Head First SQL by Lynn Beighley (O’Reilly, 0-596-52684-9) where in the introduction we read: “but we assume that if you’re holding this book, you really want to learn about project management”. Er, no; that would be SQL, actually.

And it gets worse. On page 164 we find the gem: “SQL is known as a Relational Database Management System or RDBMS. Don’t bother memorising it. We only care about the word RELATIONAL”. The second sentence is entirely accurate but its juxtaposition with the first one is fabulously ironic.

Please don’t memorise the information because it is completely wrong. SQL certainly is not an RDBMS – SQL is a language and an RDBMS is an application. It’s rather like confusing petrol with an internal combustion engine: they work together but are two fundamentally different things.

A lack of understanding at this basic level destroys any confidence in the author’s authority. Teaming the author with a good technical proofreader could have made all the difference.

And then there’s the style. Head First SQL has, for my tastes, been a little too enthusiastic in its embrace of the “gosh, this stuff is so difficult to learn we’re going to show you a picture of a man looking perplexed” school of book writing.

Books with character certainly help the learning process. I have several favourites on my shelves, which have silly illustrations and captions (I thought I’d mention this in case I sound too stuffy…) This one won’t be joining them because it doesn’t feel idiosyncratic, it feels like formulaic funkiness.

That database scandal…
As I write this, the furore surrounding the HMRC’s loss of 25 million people’s records is still in the news. This isn’t a political column but I thought it worth looking at the situation from a database point of view.

The National Audit Office (NAO – the body to which the lost data was en route) asked for records with bank and other details removed. Given the information presently to hand, we know that complete records were sent out on disc because it was deemed too expensive to subset the data.

Let’s assume that the data is held in a modern relational database engine. Saying that it is too expensive (presumably in time and effort) to subset it is nonsense.

One major reason for holding data in databases is to enable us to request subsets of data as and when we want them. Almost every query run against a database subsets the data in some way. In SQL we’d write, not:
SELECT *
FROM AllData
but
SELECT City, PostCode, 4
ClaimAmount, Whatever
FROM AllData
How tricky is that? 25 million rows isn’t a terrifyingly vast table these days – the operation could be performed on a laptop.

Alternatively, imagine the data is held, not in a relational engine, but in something like, say, IMS (IBM’s Information Management System, a transaction and hierarchical DBMS).

Presumably HMRC would have staff capable of writing a query to subset data or, at the very least, to output the entire table into a simpler format such as CSV (comma separated values) which could then be imported into any one of a large number of tools for subsetting.

In addition, the NAO requested a random selection of 100 records without sensitive details. Selecting 100 records is a trivial matter, but ensuring that they’re a random sample is rather more difficult. (It’s a subject I’ve addressed in earlier columns, but if you’d be interested to read more, drop me a line and I’ll revisit the subject).

Nevertheless, selecting an effectively random sample is possible. If the HMRC couldn’t do it, it could have (rather pathetically) supplied the data subsetted by column (the easy bit) as requested, and left the selection of the sample to the NAO. It didn’t. It took the lazy and incredibly foolish action of sending the whole lot.

www.pcw.co.uk/2212820
This article was printed from the Personal Computer World web site
© Incisive Media Ltd. 2008
Incisive Media Limited, Haymarket House, 28-29 Haymarket, London SW1Y 4RX, is a company registered in the United Kingdom with company registration number 04038503
Close this window to return to the website