Friday, February 03, 2006

TQL: A Standard Syntax for Multi-Tag Queries

Introduction


Tagging is booming. Everywhere you look, it seems that these little free-text Post-Its have become the semantic glue of Web 2.0.

But therein lies the problem - the tech-heavy community of del.icio.us users tends to use the tag "java" to refer to the programming language Java, as is immediately apparent from the link above. The same tag on Flickr, however, shows lots of photos of the island of Java, and none relating to the programming language.

This is to some extent inescapable - a single tag is always context-sensitive, and taken out of that context, the inherent ambiguities of language become apparent.

These ambiguities can be resolved, or at least reduced, by using multiple tags to provide context. For instance, anything tagged with Java AND code is probably about the programming language, whereas something tagged with Java AND Indonesia probably refers to the place.

Or maybe not - according to Wikipedia, Java is also a type of coffee which originated on the island, and it's also a term for the Javanese language. The combination of java AND indonesia could plausibly be used for both of these subjects aswell. java AND coffee should drill down to the coffee-related posts, but java AND language ? Well, which language? The programming language or the spoken language?

So maybe we need a third tag? Something like java AND language AND javanese. But then we are relying on any item about the Javanese language being tagged with all three tags. This may not be true in many cases - if a post is tagged with just java AND javanese, our three-tag query above will not return it.

So how about java AND language OR javanese? But then we must be careful - does that mean that it must have java AND EITHER language-or-java, or does it mean EITHER java-AND-language OR javanese ?

There are some tools and APIs already starting to emerge which support combinations of tags - Del.icio.us supports tag "unions" via the + operator. Ultimate Tag Warrior, the tag plugin for WordPress, supports tag "unions" and "intersections" via the + and | operators. Technorati, on the other hand, uses a plain-text string of "tag1 OR tag2" in the URL, which gets URL-encoded to "tag1%20OR%20tag2".

So which is it to be?

A Proposed Solution

What we need, then, is a generalised boolean syntax for tag URLs. A Tag Query Language, if you like. It should probably have certain properties :

Properties

  1. It should be able to be used in a URL
    so no "/" characters, no "." characters, etc.
  2. It should be human-interpretable
    a person should be able to look at a TQL query string and work out what was intended without too much effort.

  3. It should be simple
    Web 2.0 is showing us that the simple solution is usually the best, and something which requires too much time to implement is just not going to be widely adopted.
  4. It should be easily translatable into standard SQL
  5. It must not expose the implementing website to potential SQL Injection attacks

So what form would such a syntax have? Maybe something like the following:

Operators

  • A boolean AND is indicated by the plus sign +
  • A boolean OR is indicated by the pipe symbol |
  • A boolean NOT is indicated by the exclamation mark !
  • Ambiguous logic (e.g. "x and y or z" ) can be resolved by grouping with braces ( )

Rules

  • To satisfy requirement 5, any character which is not an alphanumeric or one of the above qualifiers should be stripped out.
  • To satisfy requirement 3 - the KISS principle -
    • braces must not be nested.
      That way lies a whole world of nastiness...
    • all operators have equal precedence after grouping with braces ().
      In the absence of braces, they
      should be evaluated in straightforward left-to-right order.

To use our example above, this would allow a query for items relating to java, the spoken language to be written as follows:

java+(language|javanese)

Example SQL

Assume that :

  1. tags are stored in a column tagcolumn in a table tagtable
  2. the items we want to return are stored in a table itemtable
  3. items are related to tags in a many-to-many : one item can have many tags, and a tag can apply to many items.
  4. this link is accomplished by an intermediary linktable relating itemids to tagids
  5. we can represent the selection criteria for an item by the placeholder myothercriteria

Then our query above would translate to something like the following:

SELECT (columns)
FROM (itemtable)
WHERE
EXISTS (
SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria
WHERE tagcolumn = 'java' AND linktable.itemid = itemtable.itemid
)
AND
(
EXISTS ( SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria
WHERE tagcolumn = 'language' AND linktable.itemid = itemtable.itemid
)
OR EXISTS ( SELECT tagcolumn FROM tagTable INNER JOIN linktable ON linkcriteria
WHERE tagcolumn = 'javanese' AND linktable.itemid = itemtable.itemid
)
)

Of course, this SQL is only an example, and is written for clarity rather than performance. There are many optimisations which could be done in this scenario, and many platform-specific options to be considered. For instance, SQL Server allows the indexing of views, which could reap large benefits in this application. Coldfusion allows the re-querying of in-memory resultsets using SQL syntax, However, these optimisations are implementation-specific, and as such are beyond the scope of this article.

No comments: