Data Corruption and Concurrent Updates to Embedded Objects with MongoDB

We use MongoDB at Artsy as our primary data store via the Mongoid ODM. Eventually, we started noticing data corruption inside embedded objects at an alarming rate of 2-3 records a day. The number of occurrences increased rapidly with load as our user growth accelerated.

The root cause was not a HN-worthy sensational declaration about how MongoDB trashes data, but our lack of understanding of what can and cannot be concurrently written to the database, neatly hidden behind the object data mapping layer.

Data Model

Consider the following artwork model with embedded images.

class Artwork
  include Mongoid::Document
  field :title, type: String
  embeds_many :images
end

class Image
  include Mongoid::Document
  embedded_in :artwork
  field :filename, type: String
  field :width, type: Integer  
  field :height, type: Integer
end

Let’s create a few objects and examine the database queries executed when constructing this relationship by setting a DEBUG logger level on the Moped driver used underneath the ODM.

Moped.logger = Logger.new($stdout)
Moped.logger.level = Logger::DEBUG

# db.artworks.insert({
#   _id: ObjectId("510f22c5db8e540aab000001"),
#   title: "Mona Lisa"
# })
artwork = Artwork.create!(title: "Mona Lisa")

image1 = Image.new(filename: "framed.jpg")

# db.artworks.update(
#   { _id: ObjectId("510f22c5db8e540aab000001") },
#   { $push :
#     { images:
#       {
#         _id: ObjectId("510f22c5db8e540aab000002"),
#         filename: "framed.jpg"
#       }
#     }
#   }
# )
artwork.images << image1

image2 = Image.new(filename: "unframed.jpg")
# db.artworks.update(
#   { _id: ObjectId("510f22c5db8e540aab000001") },
#   { $push :
#     { images:
#       {
#         _id: ObjectId("510f22c5db8e540aab000003"),
#         filename: "unframed.jpg"
#       }
#     }
#   }
# )
artwork.images << image2

Here’s the artwork data in MongoDB retrieved from a mongo shell:

> db.artworks.findOne()
{
  "_id" : ObjectId("510f22c5db8e540aab000001"),
  "title" : "Mona Lisa",
  "images" : [
    {
      "_id" : ObjectId("510f22c5db8e540aab000002"),
      "filename" : "framed.jpg"
    },
    {
      "_id" : ObjectId("510f22c5db8e540aab000003"),
      "filename" : "unframed.jpg"
    }
  ]
}

We can modify the attributes of the second image.

# db.artworks.update(
#   { _id: ObjectId("510f22c5db8e540aab000001") },
#   { $set : { "images.1.width" : 30, "images.1.height" : 40 } }
# )
image2.update_attributes!(width: 30, height: 40)

The image has been updated correctly.

> db.artworks.findOne()
{
  "_id" : ObjectId("510f22c5db8e540aab000001"),
  "title" : "Mona Lisa",
  "images" : [
    {
      "_id" : ObjectId("510f22c5db8e540aab000002"),
      "filename" : "framed.jpg"
    },
    {
      "_id" : ObjectId("510f22c5db8e540aab000003"),
      "filename" : "unframed.jpg",
      "height" : 40,
      "width" : 30
    }
  ]
}

Incomplete Record Corruption

Examining the query you will notice that it uses a so-called “positional” operator, images.1.width to update the second record. Imagine what would happen if the first record was deleted from another process immediately before the update. That’s right, the update will be performed on a record that doesn’t exist, in which case the default MongoDB behavior is to create it!

We can simulate this by loading the object in Ruby, pulling the first record directly from the database and then performing the update.

artwork.images << image2

# pull the first artwork directly from the database
Artwork.collection.where(_id: artwork.id).update(
  "$pull" => { "images" => { _id: image1.id } })

image2.update_attributes!(width: 30, height: 40)

This yields a nasty surprise. We now have two records in the embedded collection, the second one missing an _id.

> db.artworks.findOne()
{
  "_id" : ObjectId("510f22c5db8e540aab000001"),
  "title" : "Mona Lisa",
  "images" : [
    {
      "_id" : ObjectId("510f22c5db8e540aab000003"),
      "filename" : "unframed.jpg"
    },
    {
      "height" : 40,
      "width" : 30
    }
  ]
}

When reloaded, Mongoid will assign an automatic _id to the second object, the correct height and width, but no filename.

Null Record Corruption

A similar scenario can play out by pulling both image records out of the embedded collection and making a positional update. This will create a null record, which is much worse, because Mongoid can’t even destroy it, attempting to pull a record with an _id that does not exist.

artwork.images << image2

Artwork.collection.where(_id: artwork.id).update(
  "$pull" => { "images" => { _id: image1.id } })
Artwork.collection.where(_id: artwork.id).update(
  "$pull" => { "images" => { _id: image2.id } })

image2.update_attributes!(width: 30, height: 40)

> db.artworks.findOne()
{
  "_id" : ObjectId("510f22c5db8e540aab000001"),
  "title" : "Mona Lisa"
  "images" : [
    null,
    {
      "height" : 40,
      "width" : 30
    }
  ],
}

Solutions

A first obvious solution is not to use embedded objects or to never modify them. Both $push and $pull are atomic operations, but not the positional update.

A general solution to this problem is to make all update operations transactional. You can take a lock on the parent model by using mongoid-locker. It works, but can be quite tedious depending on the complexity of your application.

Finally, MongoDB supports something called a “positional operator” for embedded objects. This means you can atomically update a record found by its embedded object’s field using a reference to the position of that embedded object. This solves our problem, as long as the object is not embedded below the first level. Mongoid 3.1 (currently HEAD) implements this behavior by default (see #2545 for details), adjusting the selector to look for the embedded object’s _id and replacing the position with a $ positional operator.

# db.artworks.update(
#   {
#     _id: ObjectId("510f22c5db8e540aab000001"),
#     "images._id" : ObjectId("510f22c5db8e540aab000003")
#   },
#   { $set : { "images.$.width" : 30, "images.$.height" : 40 }}
# )
image2.update_attributes!(width: 30, height: 40)

We’ve been successfully running this in production for a few weeks now, without any more data corruption issues.

While this is a huge step forward, covering all of our application’s scenarios, we would like complete native support for atomic updates inside MongoDB at all levels of nesting. Please add your +1 to SERVER-831.

Artsy

Engineering Blog