Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

An easy way to think about database nulls is how you would answer a question in real life about unknowns.

I have a friend James, and another friend Jane. I don't know their ages. Are they of the same age?

Or Jane has a car. James has a blue car. Do car colours differ?

SQL provides human answers: explicitly answers "I don't know".

Other languages treat null as if it was just another value, providing very confusing answers. I wish all programming languages would use SQL style of null.



Eh... it's exceedingly common to want to determine whether or not two expressions both evaluate to "unknown" (e.g. null = null). I literally wrote a query that needed to respect null equality yesterday. Extending your examples it would be the equivalent of asking:

"Is the answer to 'Are they of the same age?' the same as the answer to 'Do car colors differ?'"

Both answers are "I don't know". So... are they the same answer? The human in me says "Yes".

Of course you will recognize the similarity between the above and your first example. The interesting part is in the subtle difference between:

"Are they of the same age?" --> "I don't know"

and

2. "Is the answer to 'What is James' age?' the same as the answer to 'What is Jane's age?'" --> "Yes"

The answer to number 2 relies on the fact that I _know_ I don't know _either_ of their ages. This extra piece of knowledge allows me to "lift" null into a value rather than the absence of. It's the difference between:

`Maybe<T>.value == Maybe<T>.value`

and

`Maybe<T> == Maybe<T>`


The answer to "Are they the same age?" is not really "I don't know". Try answering "I don't know" to a question during a quiz or an exam, it's very unlikely that your answer will be considered correct (even if it's actually true that you don't know it).

You don't know the answer to the question "Are they the same age" (but it's either true of false), and you don't know the answer to the question "Do car colors differ" (it's either true or false). If you don't know the answer to either question then you don't know whether the answers differ either.


I thought that was a good way to lay out the argument, so thanks for that!


The problem is that semantics of NULL in SQL are not that of "unknown". Well, they are in most cases, but there are also many when it's not true - e.g. SUM().

Even if they were, it's just a horrible name for the concept. You could argue that back when the keyword was introduced into SQL, it didn't yet have a well-established meaning that was different... but I'm not even sure that is true - some form of NIL or NULL has been around since the early Algol extensions, and it never meant "unknown".

The way to fix it would be to have MISSING and UNKNOWN as two different concepts. Then:

   UNKNOWN + 1 -> UNKNOWN
   MISSING + 1 -> *error*

   TRUE AND UNKNOWN -> UNKNOWN
   TRUE AND MISSING -> *error*

   UNKNOWN = UNKNOWN -> UNKNOWN
   MISSING = MISSING -> TRUE
Well, and fix the aggregates - e.g. SUM() of no values should just be 0.


> some form of NIL or NULL has been around since the early Algol extensions, and it never meant "unknown".

NULL doesn't really mean unknown; it means that instead of having a pointer to a result-set, you have no such pointer.

But in the semantics of answering questions, ala Prolog, "the absence of a result set" means that you don't know something, since those "results" are the answers to a question.

An alternative way to think of NULL is that it's a short-circuiting of what would otherwise be the database blocking on a promise for data to resolve, that never resolves—waiting for its deductive algorithm to get the data necessary to definitively answer the question.

> have MISSING and UNKNOWN as two different concepts

You already do. Just have a boolean column "foo_present" (or an enum column "foo_type", where one type means "intentionally absent") beside your nullable whatever-typed column "foo". In this case, a foo_present=TRUE together with foo=NULL means you're missing the foo value that you really expected to be there—it's a semantically-present, in-practice-unknown value (which probably indicates that e.g. the enrichment step hasn't gotten around to filling it out yet.) Meanwhile, if foo_present=FALSE, the value of foo itself doesn't matter (though it probably should be NULL) since you won't be looking at it in the first place; whatever value it has in practice, it's not meant to have a value semantically. It's pretty easy to construct queries around this paradigm.

Or, if you're using a SQL RDBMS that supports custom types, you can just declare a sum type, e.g. Pet = Present pet_id | Nothing.

Or, if you're using an SQL RDBMS that supports array product-types (e.g. "int[]"), you could just use an array-typed column with a CHECK constraint on it limiting it to have either 0 or 1 value. Then a value can be [], [foo], or NULL.

All of these allow you to see when the information of absence has been explicitly entered, separately from the query-response of unknown-ness which can come from a variety of places (such as e.g. joining several tables for sub-types of something together, where the columns that don't exist for a given sub-type will appear NULL.)


>> NULL doesn't really mean unknown; it means that instead of having a pointer to a result-set, you have no such pointer.

NULL in SQL has nothing to do with pointers.

It means that the value is missing or otherwise unknown.

Consider the column that is tracing the reading of a meter of some kind. And let's say that we could not get a reading -- due to some fault. So we did a reading, and we want to record that. What "value" should we record? 0 is no good - may make you think that we recorded a reading of 0 - which is not true.

NULL is the solution - it specifically tells you that the reading is missing or inapplicable.

Obviously, this complicates things, but incidentally translates well to other languages that have a (completely unrelated) concept of a null pointer.


Yes, relational algebra isn't a pointer machine. I was trying to give a mental image of what is going on—the resolution or lack thereof of tuples in a tuple-space.

Let me try another analogy, that might suit you better: if you think of each row-tuple in a relation in SQL as an asserted "fact" in a logic knowledge-base, e.g.

    INSERT INTO parents VALUES ('x', 'y')
equates to the logical assertion:

    parents('x', 'y').
...then a NULL result in SQL is the equivalent of asking your theorem-prover to look at this knowledge-base and deduce the truth-state of other fact you never told it anything about, e.g.:

    ?- parents('a', 'b').
...and getting back "unsat" — meaning that the theorem-prover couldn't prove your assertion true or false.

A theorem prover has the option to answer any question as Satisfied (Result) | Unsatisfied. This is what I meant by saying "a pointer to a result set, or a lack thereof." NULL is when you don't have such a pointer—which is why it's called NULL, by analogy to the DBMS's C implementation where the rowset pointer itself is likely set to be the pointer sentinel value ("NULL" or 0), rather than set to point to an empty rowset. The name NULL probably ended up in SQL after "bubbling up" from the implementations of the DBMS engines in this way.

The Weird Thing About SQL is that the grammar is phrased in terms of individual rows, whereas the query-planning occurs in terms of relations. So, when doing joins, you need a way to talk about what maps to "a join of a populated relation to an unpopulated relation", which in terms of individual rows, necessarily translates to "a real row-tuple value on one side of the join, along with some scalar sentinel row-tuple value [NULL] on the other side, standing in for 'a member of' the empty result-set."

That's the core semantics of NULL in SQL—to give you a 'scalar handle onto' the fact that there's no row on the other side of a join. That's where NULLs "come from", if you don't introduce them yourself (i.e. if you never make any of the columns in your tables nullable.) And so those semantics must be what you look at when figuring out what NULL "means." In joins, NULL doesn't quite mean "missing" or "unknown", but rather a combination of the two; it means "unsatisfied", just like in the theorem prover; it means that the query-planner has not yet been told enough to resolve your query fully.

(Yes, you can model an ingestion semantics where any row-tuple value that's not "in" a table, means that whatever predicate is represented by the presence of a value in the table should be considered to resolve false in the relation—but that's not really how SQL query-planning sees the world. It thinks of rows it doesn't have as "nobody bothered to tell me this yet"—unsatisfied—rather than "explicitly missing"—false. If you want false-ness semantics in SQL, it'd probably be better to define a view that generates rows with zero-values/false-values where the underlying table doesn't have values, and join to that instead.)


Philosophically that makes sense.. but in practice I honestly don't recall a time that I've found NULLs being incomparable to be useful. It's always been the case that I want NULL to match NULL.


If you GROUP BY some column, do you want the rows where the column is NULL to get grouped together? They're probably all distinct things.

For example, if your items represent tags, and you're grouping by parent_id to find subtags, the ones without parents (parent_id IS NULL) aren't all subtags of some root of a tag tree; they're all distinct root tags in a tag forest.


Exactly, NULL != NULL is surprising since pretty much every other language treats them equal and seems to serve no purpose other than being philosophically "right".

Even SQL databases that adhere to ANSI NULLS must treat NULL==NULL sometimes (group by, order by)

Obviously you need to compare nulls hence the crazy IS NOT DISTINCT FROM which is just another way of saying "equal" in English.


Exactly right. For this reason, I don't think the article should have chosen middle names as an example. NULL is appropriate if you don't know a person's middle name, but if you know they don't have a middle name, the empty string is better, assuming you're not using Oracle where empty strings are NULL.


That's fine, but it still doesn't quite intuit when you consider:

Do I know the same amount about whether their age or their car colors differ? : Yes

And that is the crux of the intuitive breakdown of 3VL IMO.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: