Psychic Origami - Using raw SQL with SQLObject and keeping the object-y goodness

This is sort of a continuation of my little SQLObject performance guide. So it might be worth reading that too, if you are after hints about speeding up SQLObject. Anyway, on with the show...

It's possible to create raw (database agnostic) SQL queries with SQLObject. This can be really handy for those spots where you really need to speed things up. It's a bit like switching from Python to C for some performance-intensive part of an application.

However when using raw SQL, we lose some of the nice-ness of SQLObject. Results arrive as tuples and we may then have to do more work to make use of them. So I'm going to discuss an example of using raw SQL in SQLObject, but still keeping the objects around.

The Model Code

In my example there are two model objects:

from sqlobject import SQLObject, StringCol, ForeignKey, SQLMultipleJoin

class Entry(SQLObject):
    title=StringCol(length=255)
    body=StringCol()
    views=SQLMultipleJoin('EntryView')

class EntryView(SQLObject):
    entry=ForeignKey('Entry')

Entry is a blog entry and EntryView is an object that records a single view of an Entry. I've kept both objects free of details for this example, but obviously they could have all sorts of extra fields.

N+1 Queries

Now I want to get a list of all of the entries and how many views each entry has (sorted by number of views). So using regular SQLObject this looks like:

# class method on the Entry class
@classmethod
def get_entry_views(cls):
    entries=cls.select()

    # get the count for each entry
    entry_counts=[]
    for entry in entries:
        entry_counts.append((entry, entry.views.count()))

    # now sort the list into descending order
    entry_counts.sort(key=lambda item:item[1])
    entry_counts.reverse()
    return entry_counts

Which is pretty straightforward really and gives the following results (for some sample data):

[(<Entry 3 title='entry 3' body='body text 3'>, 5),
 (<Entry 1 title='hfdskhfks' body='fsdfsd'>, 2),
 (<Entry 2 title='hel' body='jjj'>, 0)]

(a list of tuples, each holding an Entry object followed by its view count).

However this causes the following SQL to be executed:

SELECT entry.id, entry.title, entry.body FROM entry WHERE 1 = 1
SELECT COUNT(*) FROM entry_view WHERE ((entry_view.entry_id) = (1))
SELECT COUNT(*) FROM entry_view WHERE ((entry_view.entry_id) = (2))
SELECT COUNT(*) FROM entry_view WHERE ((entry_view.entry_id) = (3))

Which seems a bit bad. In fact this is a classic example of the N+1 problem, where we run one initial query and then one query for each row in that result.
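
To see the same pattern outside SQLObject, here is a minimal sketch using Python's built-in sqlite3 module. The table layout mirrors the SQL above, but the sample rows and the statement counting via set_trace_callback are my own assumptions for illustration, not something SQLObject does for you.

```python
import sqlite3

# In-memory stand-in for the entry / entry_view tables (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE entry_view (id INTEGER PRIMARY KEY, entry_id INTEGER);
    INSERT INTO entry (id, title, body) VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
    INSERT INTO entry_view (entry_id) VALUES (1), (1), (3), (3), (3), (3), (3);
""")

# Record every statement sqlite runs from this point on.
statements = []
conn.set_trace_callback(statements.append)

# One query for the entries...
entries = conn.execute("SELECT id, title, body FROM entry ORDER BY id").fetchall()

# ...and then one more query per entry to count its views.
entry_counts = []
for entry_id, title, body in entries:
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM entry_view WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    entry_counts.append((title, count))

select_count = sum(1 for s in statements if s.upper().startswith("SELECT"))
print(select_count)  # 1 query for the list + 1 per entry
```

With three entries that comes to four SELECT statements, and the number grows with every entry added.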

2 queries

So now let's try making that a bit better, with this alternative method:

# needs: from sqlobject.sqlbuilder import Select, SQLConstant, INNERJOINOn
@classmethod
def get_entry_views2(cls):
    conn=cls._connection
    fields = [Entry.q.id,SQLConstant('COUNT(*)')]
    select = Select(
                    fields,
                    join=INNERJOINOn(Entry,EntryView,Entry.q.id==EntryView.q.entryID),
                    groupBy=Entry.q.id)
    sql=conn.sqlrepr(select)

    # get the counts via the raw
    # sql query
    counts={}
    for entry_id,count in conn.queryAll(sql):
        counts[entry_id]=count

    # now read in all of the entries
    # and match them with the counts
    entries=cls.select()
    entry_counts=[]
    for entry in entries:
        entry_counts.append((entry,counts.get(entry.id,0)))

    # now sort the list into descending order
    entry_counts.sort(key=lambda item:item[1])
    entry_counts.reverse()
    return entry_counts

This time I'm using a raw SQL query to get all of the (non-zero) view counts in one query and then using another query to get all of the Entry objects. Then, using a bit of Python, I stitch the results back together and sort them.

This generates the following SQL:

SELECT entry.id, COUNT(*) FROM  entry INNER JOIN entry_view ON ((entry.id) = (entry_view.entry_id)) GROUP BY entry.id
SELECT entry.id, entry.title, entry.body FROM entry WHERE 1 = 1

That's not as bad as before, but if we were using regular SQL we'd be doing this in a single query that also sorted the results by the count at the same time!

1 query

At the moment we basically need the 2nd query to get the actual objects. If we could use one raw SQL query to do the work for us and somehow use the results of that query to populate the relevant objects, we'd be golden. After a bit of digging around in the SQLObject source code I looked at the definition of the get class method:

# in main.py
class SQLObject(object):
    ...
    def get(cls, id, connection=None, selectResults=None):

Further examination showed that if I passed in selectResults (a list of field values) in the right order I could get an object instance either based on the results I passed in, or else the version of the object with the matching id in the cache. Excellent. So now we can have a method that works thus:

@classmethod
def get_entry_views3(cls):
    return select_with_count(cls,EntryView,Entry.q.id==EntryView.q.entryID,orderByDesc=True)

Where the juicy bit is here (to make it more reusable elsewhere):

def select_with_count(selectClass,joinClass,join_on,orderByDesc=False):
    conn=selectClass._connection
    fields = [selectClass.q.id]
    for col in selectClass.sqlmeta.columnList:
        fields.append(getattr(selectClass.q, col.name))

    # name we'll assign to the count
    # so we can sort on it
    count_field=("%s_count"%joinClass.__name__).lower()
    fields.append(SQLConstant('COUNT(%s) %s'%(joinClass.q.id, count_field)))

    orderBy=SQLConstant(count_field)
    if orderByDesc:
        orderBy=DESC(orderBy)

    select=Select(
            fields,
            join=LEFTJOINOn(selectClass,joinClass,join_on),
            groupBy=selectClass.q.id,
            orderBy=orderBy)
    sql=conn.sqlrepr(select)
    return read_from_results(conn.queryAll(sql),selectClass)

def read_from_results(results,selectClass):
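    # each row is: id, then one value per column in columnList, then any extras
    num_columns=len(selectClass.sqlmeta.columnList)+1  # +1 for the leading id
    items=[]
    for result in results:
        id,selectResults,extra=result[0],result[1:num_columns],result[num_columns:]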
        entry=selectClass.get(id,selectResults=selectResults)
        items.append((entry,)+extra)
    return items

select_with_count returns results in the same format as the original method and generates only one SQL query:

SELECT entry.id, entry.title, entry.body, COUNT(entry_view.id) entryview_count FROM  entry LEFT JOIN entry_view ON ((entry.id) = (entry_view.entry_id)) GROUP BY entry.id ORDER BY entryview_count DESC
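
Each row of that query comes back as one flat tuple: the id, then the object's columns, then the count. Here is a small sketch of the slicing involved, in plain Python with no SQLObject needed. The split_row helper and its column_count parameter are hypothetical names of mine; column_count stands in for len(selectClass.sqlmeta.columnList), which I'm assuming does not include the id.

```python
def split_row(row, column_count):
    """Split a raw result row into (id, column values, extras).

    column_count is the number of non-id columns, standing in for
    len(SomeClass.sqlmeta.columnList) on a SQLObject class.
    """
    values_end = 1 + column_count  # the id sits in position 0
    return row[0], row[1:values_end], row[values_end:]


# A row shaped like the query above: id, title, body, entryview_count.
row = (3, 'entry 3', 'body text 3', 5)
entry_id, select_results, extra = split_row(row, column_count=2)
print(entry_id, select_results, extra)
```

The middle slice is what would be handed to get as selectResults, and whatever is left over (here just the count) gets tacked on after the object.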

There are a few fiddly bits going on in select_with_count and read_from_results that I'll explain.

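Firstly I perform a LEFT JOIN and use COUNT(entry_view.id) so we can get results for entries that have no views. The LEFT JOIN keeps those entries in the result, and COUNT(entry_view.id) ignores the NULL id on their unmatched row, so they report 0 where COUNT(*) would report 1.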

Next, the order of the object fields has to match what SQLObject is expecting, which is the order defined by the class's sqlmeta.columnList, with the id selected first and passed to get separately.

Finally to be able to sort by the view count I have to provide a name for the count (entryview_count), which I create based on the EntryView class name.
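
The difference between the two kinds of COUNT under a LEFT JOIN is easy to check with Python's built-in sqlite3 module. This is only a sketch: the sample rows are made up, and star_count is a name I've added purely for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE entry_view (id INTEGER PRIMARY KEY, entry_id INTEGER);
    INSERT INTO entry (id, title) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO entry_view (entry_id) VALUES (1), (1), (3), (3), (3), (3), (3);
""")

# Entry 2 has no views, so the LEFT JOIN gives it one row of NULLs.
rows = conn.execute("""
    SELECT entry.id,
           COUNT(*) AS star_count,
           COUNT(entry_view.id) AS entryview_count
    FROM entry
    LEFT JOIN entry_view ON entry.id = entry_view.entry_id
    GROUP BY entry.id
    ORDER BY entryview_count DESC
""").fetchall()

for row in rows:
    print(row)
# (3, 5, 5)
# (1, 2, 2)
# (2, 1, 0)
```

Entry 2 has no views, yet COUNT(*) reports 1 for it because it counts the NULL-filled row. COUNT(entry_view.id) skips that row and gives the 0 we want.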

In conclusion

The example I gave was quite specific, but it does show that raw SQL queries can be integrated a little better with SQLObject. This means that it's possible to retain more of the easy-to-use nature of SQLObject when needing to speed up a few critical queries.

I suspect that with a bit of work it would be possible to create quite a nice library for performing generalised queries with SQLObject and getting nice objects back. For example, it may be possible to use such techniques to eagerly load objects in joins (much as you can do in SQLAlchemy or the Java Persistence API).