Haki Benita · 16 Apr 2019 · 11 min read
ORMs offer great utility for developers but abstracting access to the database has its costs. Developers who are willing to poke around the database and change some defaults often find that great improvements can be made.
Prior to Django 2.0 if we wanted to get something like the total number of users and the total number of active users we had to resort to conditional expressions:
from django.contrib.auth.models import User
from django.db.models import (
    Count,
    Sum,
    Case,
    When,
    Value,
    IntegerField,
)

User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Sum(Case(
        When(is_active=True, then=Value(1)),
        default=Value(0),
        output_field=IntegerField(),
    )),
)
In Django 2.0 a filter [argument to aggregate functions](https://docs.djangoproject.com/en/2.0/ref/models/querysets#id6) was added to make this a lot easier:
from django.contrib.auth.models import User
from django.db.models import Count, F

User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Count('id', filter=F('is_active')),
)
Nice, short and sweet.
If you are using PostgreSQL, the two queries will look like this:
SELECT
    COUNT(id) AS total_users,
    SUM(CASE WHEN is_active THEN 1 ELSE 0 END) AS total_active_users
FROM
    auth_users;

SELECT
    COUNT(id) AS total_users,
    COUNT(id) FILTER (WHERE is_active) AS total_active_users
FROM
    auth_users;
The second query uses the FILTER (WHERE …) clause.
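As a side note, the filter argument also accepts full Q objects, so more involved conditions work just as well. A quick sketch (the 2019 date_joined cutoff here is only for illustration):

from django.contrib.auth.models import User
from django.db.models import Count, Q

User.objects.aggregate(
    total_users=Count('id'),
    # Count only active users that joined during 2019.
    active_2019_users=Count(
        'id',
        filter=Q(is_active=True, date_joined__year=2019),
    ),
)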
I'm a big fan of namedtuples, and apparently starting with Django 2.0, so is the ORM.

In Django 2.0 a [new attribute called named was added to values_list](https://docs.djangoproject.com/en/2.0/ref/models/querysets#django.db.models.query.QuerySet.values_list). Setting named to True will return the queryset as a list of namedtuples:
>>> User.objects.values_list(
...     'first_name',
...     'last_name',
... )[0]
('Haki', 'Benita')

>>> user_names = User.objects.values_list(
...     'first_name',
...     'last_name',
...     named=True,
... )

>>> user_names[0]
Row(first_name='Haki', last_name='Benita')

>>> user_names[0].first_name
'Haki'

>>> user_names[0].last_name
'Benita'
Django ORM is very powerful and feature-rich but it can't possibly keep up with all database vendors. Luckily the ORM lets us extend it with custom functions.
Say we have a Report model with a duration field. We want to find the average duration of all reports:
>>> from django.db.models import Avg
>>> Report.objects.aggregate(avg_duration=Avg('duration'))
{'avg_duration': datetime.timedelta(0, 0, 55432)}
That's great, but average alone tells us very little. Let's try to fetch the standard deviation as well:
>>> from django.db.models import Avg, StdDev
>>> Report.objects.aggregate(
...     avg_duration=Avg('duration'),
...     std_duration=StdDev('duration'),
... )
ProgrammingError: function stddev_pop(interval) does not exist
LINE 1: SELECT STDDEV_POP("report"."duration") AS "std_dura...
^
HINT: No function matches the given name and argument types.
You might need to add explicit type casts.
Oops… PostgreSQL does not support stddev on an interval field - we need to convert the interval to a number before we can apply STDDEV_POP to it.
One option is to extract the epoch from the duration:
db=# SELECT
    AVG(duration),
    STDDEV_POP(EXTRACT(EPOCH FROM duration))
FROM
    report;

       avg       |    stddev_pop
-----------------+------------------
  00:00:00.55432 | 1.06310113695549
(1 row)
So how can we implement this in Django? You guessed it - [a custom function](https://docs.djangoproject.com/en/2.0/ref/models/expressions#func-expressions):
# common/db.py
from django.db.models import Func


class Epoch(Func):
    function = 'EXTRACT'
    template = "%(function)s('epoch' from %(expressions)s)"
And use our new function like this:
>>> from django.db.models import Avg, StdDev, F
>>> from common.db import Epoch
>>> Report.objects.aggregate(
...     avg_duration=Avg('duration'),
...     std_duration=StdDev(Epoch(F('duration'))),
... )
{'avg_duration': datetime.timedelta(0, 0, 55432),
 'std_duration': 1.06310113695549}
Notice the use of the [F expression](https://docs.djangoproject.com/en/2.0/ref/models/expressions#f-expressions) in the call to Epoch.
This is probably the easiest and most important tip I can give. We are all human and we make mistakes. We can't possibly handle each and every edge case, so we must set boundaries.
Unlike other non-blocking app servers such as Tornado, asyncio or even Node, Django usually uses synchronous worker processes. This means that when a user executes a long running operation, the worker process is blocked and no one else can use it until it is done.
I'm sure no one is really running Django in production with just one worker process, but we still want to make sure a single query is not hogging too many resources for too long.
In most Django apps the majority of time is spent waiting for database queries. So, setting a timeout on SQL queries is a good place to start.
We like setting a global timeout in our wsgi.py file like this:
# wsgi.py
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def setup_postgres(connection, **kwargs):
    if connection.vendor != 'postgresql':
        return

    # Timeout statements after 30 seconds.
    with connection.cursor() as cursor:
        cursor.execute("SET statement_timeout TO 30000;")
Why wsgi.py? This way it only affects worker processes and not out-of-band analytic queries, cron tasks, etc.
Hopefully, you are using [persistent database connections](https://docs.djangoproject.com/en/2.0/ref/databases#persistent-connections), so this per-connection setup should not add overhead to each request.
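If you're not using persistent connections yet, enabling them is a single setting. A minimal sketch, assuming PostgreSQL (the database name and the 10 minute value are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'app_db',  # placeholder
        # Keep connections open for up to 10 minutes instead of
        # opening a new one for every request.
        'CONN_MAX_AGE': 60 * 10,
    },
}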
The timeout can also be set at the user level:
db=#> alter user app_user set statement_timeout TO 30000;
ALTER ROLE
SIDE NOTE: The other common place where we spend a lot of time is networking. So when you call a remote service, make sure to [always set a timeout](http://docs.python-requests.org/en/master/user/quickstart#timeouts):
import requests

response = requests.get(
    'https://api.slow-as-hell.com',
    timeout=3,  # requests timeouts are in seconds
)
This is somewhat related to the last point about setting boundaries. Sometimes we want to let users produce reports and maybe export them to a spreadsheet. These types of views are usually the immediate suspects for any weird behavior in production.
It's not uncommon to encounter a user that thinks it's reasonable to export all sales since the dawn of time in the middle of the work day. It's also not uncommon for this same user to open another tab and try again when the first attempt "got stuck".
This is where LIMIT comes in.
Let's limit a certain query to no more than 100 rows:
# bad example
>>> data = list(Sale.objects.all())[:100]
This is the worst thing you can do. You just fetched a gazillion rows into memory only to return the first 100.
Let's try again:
data = Sale.objects.all()[:100]
This is better. Django will use the LIMIT clause in the SQL to fetch only 100 rows.
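If you want to see this for yourself, the SQL Django generates can be printed from the queryset's query attribute (the exact column list and table name depend on your app, so they are elided here):

>>> qs = Sale.objects.all()[:100]
>>> print(qs.query)
SELECT ... LIMIT 100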
Now let's say we added the limit, the users are under control and all is good. We still have one problem - the user asked for all the sales and we gave them 100. The user now thinks there are only 100 sales - this is wrong.
Instead of blindly returning the first 100 rows, let's make sure that if there are more than 100 rows (normally after filtering) we throw an exception:
LIMIT = 100

if Sale.objects.count() > LIMIT:
    raise ExceededLimit(LIMIT)

return Sale.objects.all()[:LIMIT]
This will work but we just added another query.
Can we do better? I think we can:
LIMIT = 100

data = Sale.objects.all()[:(LIMIT + 1)]
if len(data) > LIMIT:
    raise ExceededLimit(LIMIT)

return data
Instead of fetching 100 rows, we fetch 100 + 1 = 101 rows. If the 101st row exists, it's enough for us to know there are more than 100 rows. In other words, fetching LIMIT + 1 rows is the least we need to make sure there are no more than LIMIT rows in the query result.
Remember the LIMIT + 1 trick; it can come in pretty handy at times.
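If this pattern shows up in more than one view, it can be worth pulling into a small helper. A possible sketch (the ExceededLimit exception and the fetch_limited function are ours, not something Django provides):

class ExceededLimit(Exception):
    def __init__(self, limit):
        self.limit = limit
        super().__init__(f'Query returned more than {limit} rows')


def fetch_limited(queryset, limit=100):
    # Fetch LIMIT + 1 rows - the extra row only tells us the limit was crossed.
    data = list(queryset[:limit + 1])
    if len(data) > limit:
        raise ExceededLimit(limit)
    return data


# For example:
# sales = fetch_limited(Sale.objects.all(), limit=100)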
This one we learned the hard way. We started getting errors in the middle of the night about transactions timing out due to locks in the database.
A common pattern for manipulating a transaction in our code would look like this:
from django.db import transaction as db_transaction

# ...
with db_transaction.atomic():
    transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update()
        .get(uid=uid)
    )
    # ...
Manipulating the transaction usually involves some properties from the user and the product so we often use select_related to force a join and save some queries.
Updating the transaction also involves obtaining a lock to make sure it's not being manipulated by anyone else.
Now, do you see the problem? NO? Neither did we.
We had some ETL processes running at night performing maintenance on the product and user tables. These ETLs performed updates and inserts to the tables so they also obtained locks on the tables.
So what was the problem? When select_for_update is used along with select_related, Django will attempt to obtain a lock on all the tables in the query.
The code we used to fetch the transaction tried to obtain a lock on both the transaction table and the user, product and category tables. Once the ETL locked the last three tables in the middle of the night, transactions started to fail.
Once we had a better understanding of the problem we started looking for ways to lock only the necessary table - the transaction table. Luckily, [a new option to select_for_update just became available in Django 2.0](https://docs.djangoproject.com/en/2.0/releases/2.0#database-backend-api):
from django.db import transaction as db_transaction

# ...
with db_transaction.atomic():
    transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update(
            of=('self',)
        )
        .get(uid=uid)
    )
    # ...
The of option was added to [select_for_update](https://docs.djangoproject.com/en/2.0/ref/models/querysets#django.db.models.query.QuerySet.select_for_update). Using of we can explicitly state which tables we want to lock. self is a special keyword indicating that we want to lock the model we are working on, in this case the Transaction.
Currently, this feature is only available for the PostgreSQL and Oracle backends.
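For the curious, the difference is visible in the generated SQL. Without of, a plain FOR UPDATE asks PostgreSQL to lock rows from every table in the query; with of=('self',) the statement ends with something roughly like the sketch below (actual table names depend on your app labels):

SELECT ...
FROM transaction
INNER JOIN ...
FOR UPDATE OF transaction;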
When creating a model, Django will automatically create a B-Tree index on any foreign key. B-Tree indexes can get pretty heavy and sometimes they are not really necessary.
A classic example is a through model for an M2M relation:
class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)
In the model above Django will implicitly create two indexes - one for user and one for group.
Another common pattern in M2M models is adding a unique constraint on the two fields. In our case it means that a user can only be a member of the same group once:
class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)

    class Meta:
        unique_together = (
            ('group', 'user',)
        )
The unique_together will also create an index on both fields. So we get one model with two fields and three indexes.
Depending on the work we do with this model, many times we can dismiss the FK indexes and keep only the one created by the unique constraint:
class Membership(Model):
    group = ForeignKey(
        Group,
        db_index=False,
    )
    user = ForeignKey(
        User,
        db_index=False,
    )

    class Meta:
        unique_together = (
            ('group', 'user',)
        )
Removing redundant indexes will make inserts and updates faster. Plus, our database is now lighter, which is always a good thing.
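A quick way to double check which indexes actually ended up on the table is to ask PostgreSQL directly. A sketch (the table name below assumes an app called app, so adjust it to your own naming):

db=# SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = 'app_membership';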
Indexes with more than one column are called composite indexes. In B-Tree composite indexes the first column is indexed using a tree structure. From the leaves of the first level a new tree is created for the second level, and so on.
The order of the columns in the index is significant.
In the example above we would get a tree for groups first, and for each group another tree for all its users.
The rule of thumb for B-Tree composite indexes is to make the secondary indexes as small as possible. In other words, columns with high cardinality (more distinct values) should come first.
In our example it's reasonable to assume there are more users than groups, so putting the user column first will make the secondary index on group smaller.
class Membership(Model):
    group = ForeignKey(
        Group,
        db_index=False,
    )
    user = ForeignKey(
        User,
        db_index=False,
    )

    class Meta:
        unique_together = (
            ('user', 'group',)
        )
This is just a rule of thumb and it should be taken with a grain of salt. The final indexing should be optimized for the specific use case. The main point here is to be aware of implicit indexes and the significance of the column order in composite indexes.
A B-Tree index is structured like a tree. The cost of looking up a single value is the height of the tree + 1 for the random access to the table. This makes B-Tree indexes ideal for unique constraints and (some) range queries.
The disadvantage of a B-Tree index is its size - B-Tree indexes can get big.
It's easy to assume there are no alternatives, but databases offer other types of indexes for specific use cases.
Starting with Django 1.11 there is a new Meta option for creating indexes on a model. This gives us an opportunity to explore other types of indexes.
PostgreSQL has a very useful type of index called BRIN (Block Range Index). Under some circumstances BRIN indexes can be more efficient than B-Tree indexes.
Let's see what the official documentation has to say first:
BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table.
To understand this statement it's important to understand how a BRIN index works. As the name suggests, a BRIN index will create a mini index on a range of adjacent blocks in the table. The index is very small and it can only say if a certain value is definitely not in the range or if it might be in the range of indexed blocks.
Let's do a simplified example of how BRIN works to help us understand.
Say we have these values in a column, each is one block:
1, 2, 3, 4, 5, 6, 7, 8, 9
Let's create a range for every 3 adjacent blocks:
[1,2,3], [4,5,6], [7,8,9]
For each range we are going to keep the minimum and maximum value in the range:
[1–3], [4–6], [7–9]
Using this index, let's try to search for the value 5:
Using the index we limited our search to blocks 4–6.
Let's take another example; this time the values in the column are not going to be nicely sorted:
[2,9,5], [1,4,7], [3,8,6]
And this is our index with the minimum and maximum value in each range:
[2–9], [1–7], [3–8]
Let's try to search for the value 5:
The index is useless - not only did it not limit the search at all, we actually had to read more because we fetched both the index and the entire table.
Going back to the documentation:
columns have some natural correlation with their physical location within the table.
This is key for BRIN indexes. To get the most out of it, the values in the column must be roughly sorted or clustered on disk.
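If you're not sure how well a column correlates with its physical order on disk, PostgreSQL keeps that statistic in pg_stats; a value close to 1 (or -1) means the column is nearly sorted on disk. The table and column names below are placeholders:

db=# SELECT
    correlation
FROM
    pg_stats
WHERE
    tablename = 'somemodel'
    AND attname = 'created';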
Now back to Django, what field do we have that is often indexed and will most likely be naturally sorted on disk? That's right, I'm looking at you, [auto_now_add](https://docs.djangoproject.com/en/2.0/ref/models/fields#datefield).
A very common pattern in Django models is this:
class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )
When auto_now_add is used, Django will automatically populate the field with the current time when the row is created. A created field is usually also a great candidate for queries, so it's often indexed.
Let's add a BRIN index on created:
from django.contrib.postgres.indexes import BrinIndex


class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )

    class Meta:
        indexes = (
            BrinIndex(fields=['created']),
        )
To get a sense of the difference in size, I created a table with ~2M rows and a date field that is naturally sorted on disk.
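If you want to run a similar comparison on your own tables, PostgreSQL can report the size of each index. A sketch with placeholder index names (the real names are generated by Django and can be looked up in pg_indexes):

db=# SELECT
    pg_size_pretty(pg_relation_size('somemodel_created_btree_idx')) AS btree_size,
    pg_size_pretty(pg_relation_size('somemodel_created_brin_idx')) AS brin_size;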
There is a lot more to consider when creating indexes than just their size. But now, with Django 1.11's support for index definitions in Meta, we can easily integrate new types of indexes into our apps and make them lighter and faster.