Writing thread-safe django - get_or_create

By : Javed Khan

In this blog post, we'll discuss thread-safety, why it's important and how to write thread-safe django code, especially for bulk operations like management commands. We'll take a simple example - get or create.

Thread-safety:

Thread-safety means that our code can be run in multiple threads and still behave as expected. Code becomes unsafe when several threads manipulate shared state (e.g. a database) at the same time: there's a chance of a race condition, where the result depends on which thread gets there first, and that produces unexpected results.
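
To make the race concrete, here's a minimal sketch in plain Python 3 with no django involved. FakeTable, DuplicateKey and naive_get_or_create are hypothetical names invented for this illustration: the table stands in for a database table with a unique constraint, and a threading.Barrier forces both threads to finish their lookup before either one inserts.

```python
import threading


class DuplicateKey(Exception):
    """Stand-in for the database's IntegrityError."""


class FakeTable(object):
    """Hypothetical in-memory table with a unique key."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # each single operation is atomic

    def get(self, key):
        with self._lock:
            return self._rows.get(key)

    def create(self, key):
        with self._lock:
            if key in self._rows:
                raise DuplicateKey(key)
            self._rows[key] = object()
            return self._rows[key]


def naive_get_or_create(table, key, barrier, results):
    row = table.get(key)       # check ...
    barrier.wait()             # both threads have now seen "no row"
    if row is not None:
        results.append('found')
        return
    try:
        table.create(key)      # ... then act
        results.append('created')
    except DuplicateKey:
        results.append('duplicate')


def demo():
    table = FakeTable()
    barrier = threading.Barrier(2)
    results = []
    threads = [
        threading.Thread(target=naive_get_or_create,
                         args=(table, 1, barrier, results))
        for _ in range(2)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

With that interleaving, exactly one thread creates the row and the other hits the duplicate-key error, whichever of them inserts first. Each get and create is safe on its own; it's the gap between them that causes the trouble.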

To avoid this, we have the option of using read-write locks, transactions etc.

We'll look at some simple examples and try to understand these options.

The usual way:

Let's consider a management command that syncs data from another source (e.g. an API, a remote database etc.). The correct way to do this would be to use the built-in django utility - get_or_create:

Update: Updated the command to run each arg in a thread

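from threading import Thread

from django.core.management.base import BaseCommand

from myapp.models import MyModel  # assuming the app is called myapp


class MyThread(Thread):

    def __init__(self, my_id):
        super(MyThread, self).__init__(name=my_id)
        self.my_id = my_id

    def run(self):
        # Fetch the instance for this thread's my_id, creating it if missing
        instance, created = MyModel.objects.get_or_create(my_id=self.my_id)
        print('%s %s' % (instance.id, created))
        instance.delete()


class Command(BaseCommand):
    args = '<my_id my_id ...>'
    help = 'Get or create instance of mymodel with my_id'

    def handle(self, *args, **options):
        # Start one thread per command line arg
        for my_id in args:
            thread = MyThread(my_id=my_id)
            thread.start()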

In this command, we use the command line arg my_id to get a MyModel instance if it exists; otherwise, we create one.

Note: we'll be discarding the object for simplicity.

Now, this management command (let's call it sync_myapp) is thread-safe because we're just using the built-in get_or_create to do our database call.

To test this, run the command with the same arg repeated multiple times:

$> python manage.py sync_myapp 1 1 1 1 1

This will create five threads which will simultaneously try to get or create an instance with my_id = 1:

$> 1 True
$> 1 False
$> 1 False
$> 1 False

Note that all the threads are successful, even though they were working on the same database at the same time.

Now, even though get_or_create is the way to go, we might need to customize a few things which are outside the scope of get_or_create. For example, let's say we need to do something special just before creating a new instance.

The problem:

Let's assume our code was:

def run(self):
    created = False
    try:
        instance = MyModel.objects.get(my_id=self.my_id)
    except MyModel.DoesNotExist:
        # Another thread can create the same row between the get above
        # and the create below
        something_special()
        instance = MyModel.objects.create(my_id=self.my_id)
        created = True
    instance.delete()
    return

If we test this with the above command, we'll probably get:

Exception in thread 1:
Traceback (most recent call last):
....
IntegrityError: duplicate key value violates unique constraint "myapp_mymodel_my_id_key"
DETAIL:  Key (my_id)=(1) already exists.

This indicates that the try/except block isn't thread-safe: between the failed get and the create, another thread can insert a row with the same my_id.

Let's try to fix this problem.

Update: I've removed the section dealing with locks since it's only useful when dealing with shared memory in python processes, it's not applicable to databases.

Using database transactions:

We use django's built-in transactions module as follows:

Update: We only need to wrap the create command in a transaction

from django.db import IntegrityError, transaction

def run(self):
    created = False
    try:
        instance = MyModel.objects.get(my_id=self.my_id)
    except MyModel.DoesNotExist:
        try:
            something_special()
            # Commit if create succeeds, roll back if it raises
            with transaction.commit_on_success():
                instance = MyModel.objects.create(my_id=self.my_id)
            created = True
        except IntegrityError:
            # Another thread created the row first, so fetch that one
            instance = MyModel.objects.get(my_id=self.my_id)
    print('%s %s' % (instance.id, created))
    instance.delete()
    return

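Now, the transaction will be committed only if there's no error within the context block. The unique constraint on my_id ensures that only one thread's create call succeeds; any other thread gets an IntegrityError, its transaction is rolled back, and it falls back to fetching the instance that already exists.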
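
The same get, create, fall back to get pattern can be tried outside django. Here's a minimal sketch using the standard library's sqlite3 module. The table layout, the get_or_create helper and its before_create hook are assumptions made up for this example (the hook plays the role of something_special and lets us simulate another thread winning the race); this is not django's implementation.

```python
import sqlite3

SELECT_SQL = 'SELECT id FROM mymodel WHERE my_id = ?'
INSERT_SQL = 'INSERT INTO mymodel (my_id) VALUES (?)'


def get_or_create(conn, my_id, before_create=None):
    """Return (row_id, created), tolerating a rival insert."""
    row = conn.execute(SELECT_SQL, (my_id,)).fetchone()
    if row is not None:
        return row[0], False
    if before_create is not None:
        before_create()  # hypothetical hook, like something_special()
    try:
        # The connection as a context manager commits on success
        # and rolls back if the block raises
        with conn:
            cursor = conn.execute(INSERT_SQL, (my_id,))
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        # Someone else inserted the row first, so fetch theirs
        row = conn.execute(SELECT_SQL, (my_id,)).fetchone()
        return row[0], False


def demo():
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE mymodel (id INTEGER PRIMARY KEY, my_id TEXT UNIQUE)')

    def rival_insert():
        # Simulates another thread creating the row after our lookup
        with conn:
            conn.execute(INSERT_SQL, ('1',))

    lost_race = get_or_create(conn, '1', before_create=rival_insert)
    already_there = get_or_create(conn, '1')
    no_rival = get_or_create(conn, '2')
    return lost_race, already_there, no_rival
```

In demo, the first call loses the simulated race and still returns the existing row with created set to False, instead of blowing up with an IntegrityError.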

In addition to the context manager, django also has options for using savepoints, manual commits, rollbacks etc.

https://docs.djangoproject.com/en/1.5/topics/db/transactions
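
As a rough illustration of what a savepoint does, here's a sketch that again uses the standard library's sqlite3 module rather than django's transaction API. The table and the savepoint name before_dup are made up for the example, and driving BEGIN/SAVEPOINT by hand is just one way to show the idea. Note that SQLite only undoes the failed statement anyway, whereas on PostgreSQL a failed statement aborts the whole transaction unless you roll back to a savepoint.

```python
import sqlite3


def demo():
    # isolation_level=None turns off implicit transactions so that
    # BEGIN / SAVEPOINT / COMMIT can be issued by hand
    conn = sqlite3.connect(':memory:', isolation_level=None)
    conn.execute(
        'CREATE TABLE mymodel (id INTEGER PRIMARY KEY, my_id TEXT UNIQUE)')

    conn.execute('BEGIN')
    conn.execute("INSERT INTO mymodel (my_id) VALUES ('1')")

    # Mark a point we can return to without losing the first insert
    conn.execute('SAVEPOINT before_dup')
    try:
        conn.execute("INSERT INTO mymodel (my_id) VALUES ('1')")
    except sqlite3.IntegrityError:
        # Undo only the work done since the savepoint
        conn.execute('ROLLBACK TO SAVEPOINT before_dup')
    conn.execute('RELEASE SAVEPOINT before_dup')

    # The outer transaction is still usable
    conn.execute("INSERT INTO mymodel (my_id) VALUES ('2')")
    conn.execute('COMMIT')

    rows = conn.execute('SELECT my_id FROM mymodel ORDER BY id')
    return [row[0] for row in rows]
```

The duplicate insert is discarded, while the rows inserted before and after it are committed together.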

Caveats:

If you're using MySQL, refer to this open issue about a problem with get_or_create:

https://code.djangoproject.com/ticket/13906

Conclusion:

Using database transactions, we can avoid data integrity issues and write thread-safe code which can be run in parallel without any issues.


Topics : django python threads

Comments

Patryk Zawadzki 21st Aug., 2013

Except that… parallel does not use threads. What you test for is race conditions, not thread safety. Thread safety involves heaps of other problems like shared memory, locking and reentrancy.

Owais Lone

Like Patryk said, this is avoiding race conditions, not thread safety.

>>> Now, this management command (let's call it sync_myapp) is thread-safe because we're just using the built-in get_or_create to do our database call.

get_or_create is built in but it has no way of avoiding race conditions. What it does internally is fetch a row, and insert the row if nothing was fetched. It is the same as loading, catching DoesNotExist and then inserting inside the exception handler block.

Check https://github.com/django/django/blob/f7290581fe2106c08d97215ab93e27cf6b27e100/django/db/models/query.py#L408

To handle the race condition, you need to wrap get_or_create inside managed DB transactions or do some locking in the DB (row-level/table-level; whatever makes sense) itself and not in application code.

Javed Khan

@Patryk thanks, I've removed parallel and added threads for testing it. Yes, this blog post only covers dealing with race conditions.

@Owais There are transactions in place to avoid race conditions, see:

https://github.com/django/django/blob/f7290581fe2106c08d97215ab93e27cf6b27e100/django/db/models/query.py#L449

Though, it looks like it depends on the transaction level. See https://code.djangoproject.com/ticket/13906#comment:38

Owais Lone

Look closely. It is creating savepoints, not managing a transaction. It is only specifying how much to rollback when you (the developer) choose to rollback. You still need to manage transactions.

Read up on savepoints https://docs.djangoproject.com/en/dev/topics/db/transactions/#savepoints

Javed Khan

@Owais,

I should have mentioned I'm testing this on django 1.5 and postgresql 9.1. Here's the test I'm running:

https://gist.github.com/tuxcanfly/6339390

If you comment the get_or_create and uncomment the try..except block, you should see IntegrityErrors

I think, with get_or_create, we don't need to manage the transaction because of django's default commit behavior which commits on save(). In case of an IntegrityError, get_or_create rolls back to the savepoint, so it handles the race condition.

Under floden

Thanks a ton so much for just a incredibly intelligent and also refreshing write-up. Great job!

casey evans

Thread-safe django is very much helpful to us cause we have bulk operation. Thank You for this wonderful information and easy to understand command in your scenario.


© Agiliq, 2009-2012