The Web Is Not A Normalized Relational Database

I had lunch with Stan James on Friday at Pasquini’s Pizzeria.  Stan is the creator of Outfoxed and was introduced to me by Seth Goldstein, one of the guys behind the recently launched Root Markets.  (Seth has a long essay up about Root Markets – Media Futures: From Theory to Practice – that is very interesting (and complex) if you are into this stuff.)

Stan’s moving to Boulder to be in the middle of the Internet software development universe (ok – he’s moving back here because it’s a much better place to live than Silicon Valley, but don’t tell anyone).  We spent a bunch of time getting to know each other, talked about the research he’d been doing for his master’s thesis in Cognitive Science at the University of Osnabrueck, and how this led to Outfoxed.  Oh – and we ate a huge delicious pizza.

I’d been playing with Outfoxed for a few days on my computer at home (I have a computer at home that I’ll install anything on) and was sort of getting it.  An hour with Stan helped a lot.  When I combine what Outfoxed is figuring out for me with the data I’m getting from Root’s Vault (my clickstream / attention data), I can see how this could be really useful to me once I’ve got enough data built up.  More in a few weeks.

We then started talking about something I’ve been thinking about for a while.  My first business was a software consulting business that built database application software.  As a result, the construct of a relational database was central to everything I did for a number of years.  In the mid 1990’s when I started doing web stuff, I was amazed at how little most people working on web and Internet software really understood about relational databases.  This has obviously changed (and improved) while evolving rapidly as a result of the semantic web, XML, and other data exchange approaches.  But – this shit got too complicated for me. Then Google entered the collective consciousness and put a very simple UI in front of all of this for search, eliminating the need for most of humanity to learn how to use a SELECT statement (ok – others – like the World Wide Web Wanderer by Matthew Gray (net.Genesis) and Yahoo – did it first, but Google was the tipping point).

I started noticing something about a year ago – the web was becoming massively denormalized.  If you know anything about relational databases, you know that you sometimes denormalize data to improve performance (usually because of a constraint of your underlying DBMS), but usually you want to keep your database normalized.  If you don’t know about databases, just think denormalization=bad.  As a result of the proliferation of user-generated content (and the ease with which it was created), services were appearing all over the place to capture the same data: reviews (books, movies, restaurants), people, jobs, stuff for sale.  “Smart” people were putting the data in multiple places (systems) – really smart people were writing software to automate this process.
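To make the normalized-vs-denormalized distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 (the table and column names are my own, purely illustrative): the normalized design stores each customer exactly once and has orders reference it, while the denormalized design copies the customer’s name into every order row.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: each customer is stored exactly once; orders reference it.
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
""")
db.execute("INSERT INTO customers VALUES (1, 'Stan')")
db.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 9.5), (2, 12.0)])

# Denormalized: the customer's name is copied into every order row.
# Reads are faster (no join), but renaming 'Stan' now means updating
# every copy -- exactly the redundancy normalization tries to avoid.
db.execute("CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, "
           "customer_name TEXT, amount REAL)")
db.execute("""
    INSERT INTO orders_denorm
    SELECT o.id, c.name, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.id
""")

rows = db.execute("SELECT customer_name, amount FROM orders_denorm").fetchall()
print(rows)  # [('Stan', 9.5), ('Stan', 12.0)]
```

The web version of this is the same trade: each copy of a restaurant review on a different service is another `orders_denorm` row with no master table behind it.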

Voila – the web is now a massively denormalized database.  I’m not sure if that’s good or bad (in the case of the web, denormalization does not necessarily equal bad).  However, I think it’s a construct that is worth pondering as the amount of denormalized data appears to me to be increasing geometrically right now. 

Stan and I talked about this for a while and he taught me a few things.  Stan is going to be a huge net-add to the Boulder software community – I’m looking forward to spending more time with him.

  • As a long-time software developer and current graduate student at MIT, I’ve been giving this issue some thought for a little while.

    On the continuum of normalized to denormalized data, I think with the advent of the web we have shifted a lot towards the “denormalized” side without quite thinking through the implications and trade-offs. I shudder at the thought of having some of our software systems from the 1980s and 1990s constructed around unstructured (denormalized) data. Though this would likely have made it easier for the users contributing the data, getting usable information out in a reliable way would be very difficult.

    Right now, Google is making up for the lack of structure by using a highly evolved “brute force” method of trying to figure out relevancy and such. But, queries that are trivial with a normalized database are very difficult to do on the web today.

    Back then, I could create a query that returned a list of all the customers from Massachusetts sorted in descending order by total sales in 2004. It wouldn’t take a particularly complicated database structure to pull this off.

    Today, even the simplest “structured” queries (which would be highly valuable) are simply not possible. We are limited to simple boolean searches that are still “keyword” driven.

    Wouldn’t it be nice to do SQL-like searches on the web? On the path we are headed down, this is nearly impossible until something like the Semantic Web becomes widely adopted and usable.

    As it stands, we’re trading off near-term user utility for long-term information usability.
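    For what it’s worth, the “customers from Massachusetts” query above really is trivial against even a toy schema. A hypothetical sketch in Python’s sqlite3 (table names, customer names, and figures are all made up for illustration):

    ```python
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
        CREATE TABLE sales (customer_id INTEGER, year INTEGER, amount REAL);
    """)
    db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "Acme", "MA"), (2, "Globex", "CA"), (3, "Initech", "MA")])
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [(1, 2004, 100.0), (3, 2004, 250.0), (1, 2004, 50.0),
                    (2, 2004, 999.0), (3, 2003, 40.0)])

    # All Massachusetts customers, sorted descending by total 2004 sales.
    rows = db.execute("""
        SELECT c.name, SUM(s.amount) AS total
        FROM customers c JOIN sales s ON s.customer_id = c.id
        WHERE c.state = 'MA' AND s.year = 2004
        GROUP BY c.name
        ORDER BY total DESC
    """).fetchall()
    print(rows)  # [('Initech', 250.0), ('Acme', 150.0)]
    ```

    There is simply no equivalent of that WHERE/GROUP BY/ORDER BY combination in a keyword search box today.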

  • Stan, welcome to Boulder. Glad to have another bright Web person here in town! You’ll find it a remarkably fertile field with a lot of smart cookies in the neighborhood. Just don’t tell those folk in Silicon Valley suffering through sky-high housing and terrible traffic just to breathe smog and drive for hours to get anywhere worth visiting. 🙂

  • Murray Priestley

    Outfoxed looks quite interesting. I look forward to seeing how it goes. Some challenges that I see:

    1. The Kevin Bacon Effect – where everyone in the world is connected by the classic six degrees of separation. Therefore it’s very likely that any single network of end-users will be global in no time, including everyone in the world.

    2. Given that the Kevin Bacon Effect exists, then site ratings could well be determined by people who have no business even turning on a computer.

    3. Good News Is No News. People will only negatively rate bad sites, and blissfully ignore the good ones.

    4. Finally, the Echo Chamber Effect – where you only end up reading stuff that you and your friends agree with. This is already a problem WRT political blogs/forums where typically blogs only link to other blogs which agree with their world-view. Not such a problem for non-political stuff of course.


  • Not sure if “denormalizing” is the right word here. Being “denormalized” means having a schema, but one that does not satisfy xNF, BCNF, DKNF…

    What we have on the Web is unstructured data. Sir Berners-Lee with the Semantic Web, folksonomies, and XML technologies all strive to bring some kind of semi-structure to these data.

    Normalized data and unstructured data are just two extreme ways of organizing information (and there are a lot of halftones in between) – and there is nothing wrong with either of them. I liken this to “human brain vs computer”. The rigid structure of a computer is way more efficient for getting the result of 2+2, but the human brain is able to perform tasks impossible for a computer. Computers are getting closer to the way humans think, but to do that, most of the time they have to trade structure off.

  • Interesting stuff. Posted about this on my blog and sent you a TrackBack but it doesn’t seem to have worked. Appreciate your comments.

  • Building better personalized search, filtering spam blogs

    Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:

    Individual users interests / search intent changes over time
    The click a…

  • Great blog! I would say that the web is full of normalized databases (yes, parts are denormalized, but it is largely normalized) – however, you get to those normalized databases via a search in a denormalized database (Google). But who’s to say what is normalized vs denormalized? Are you modeling a web page or the elements on the web page?


  • Derald Muniz

    A related article regarding the web and databases – check out this article in InfoWorld dated 11/28/05 by Jon Udell:

  • Interesting food (pizza?) for thought! One of the goals of normalization is to remove redundancies in the data. If you are storing the same information in several places, normalization of the data will centralize the data and reference it from those varied locations.

    One trend we are seeing on the web is the offering of APIs for information access. This means that information is being made available (and often updatable) by everyone and from everywhere. This is really a form of normalization, in the sense that the “master copy” of the data remains on the server that offers the API.

    If a service like Google Base helps to index a database and make it searchable by the world, it is acting more like a “cache” than a non-normalized database. Meaning, the original DB is still the “master copy” and updates to it will eventually “percolate” down into Google Base, which simply provides faster, easier access to this data.

  • As you mention, de-normalisation is a tactic used to improve performance (for queries) and it works very well in distributed environments. But de-normalisation is treated with caution in relational environments because of the performance and integrity implications when the data is changing. However, with widely distributed immutable data (data that does not change once it has been created), de-normalisation is the optimal solution.

  • Apologies: In my previous comment, the penultimate line should have read “If we ever try to NORMALISE the Web, we will be in trouble”. Doh!

  • Bob Devine

    As Artem wrote, you can’t really call the web “denormalized” because that implies there is a normalized version. Nope, the web is unstructured.

    In addition to the duplication of data, the web also “suffers” from conflicts, errors, and time-varying information. Sorta the opposite of “intelligent design”; the web is a mish-mash of documents slung together without rhyme or reason.

    For example, you would think that common facts could be clearly represented. But try to find the exact spelling, to pick a frequent search term, of Britney Spears. Do you mean “Brittany Speers” or maybe “Britainy Speirs”? You get the idea.

    One thing that most users of the web are not yet noticing is that there is a lot of old information. While the web is only a decade old, it has junk in it from the first day. How will the web look in another decade when you do a search for “best restaurant chicago” and find a list that is completely out of date?

    Bob Devine

  • On the web, speed means money, and if denormalization (in some cases) can offer speed to visitors at the cost of a little pain for developers, it is worth it. There is a big gain in this small pain.
