diff --git a/source/tutorial/usecase/cms-metadata-and-asset-management.txt b/source/tutorial/usecase/cms-metadata-and-asset-management.txt new file mode 100644 index 00000000000..e3a4886c3e0 --- /dev/null +++ b/source/tutorial/usecase/cms-metadata-and-asset-management.txt @@ -0,0 +1,453 @@ +================================== +CMS: Metadata and Asset Management +================================== + +Problem +======= + +You are designing a content management system (CMS) and you want to use +MongoDB to store the content of your sites. + +Solution Overview +================= + +The approach in this solution is inspired by the design of Drupal, an +open source CMS written in PHP on relational databases that is available +at `http://www.drupal.org `_. In this case, you +will take advantage of MongoDB's dynamically typed collections to +*polymorphically* store all your content nodes in the same collection. +Navigational information will be stored in its own collection since +it has relatively little in common with the content nodes, and is not covered in +this use case. + +The main node types which are covered here are: + +Basic page + Basic pages are useful for displaying + infrequently-changing text such as an 'about' page. With a basic + page, the salient information is the title and the + content. +Blog entry + Blog entries record a "stream" of posts from users + on the CMS and store title, author, content, and date as relevant + information. +Photo + Photos participate in photo galleries, and store title, + description, author, and date along with the actual photo binary + data. + +Schema Design +============= + +The node collection contains documents of various formats, but they +will all share a similar structure, with each document including an +``_id``, ``type``, ``section``, ``slug``, ``title``, ``created`` date, +``author``, and ``tags``. The +``section`` property is used to identify groupings of items (grouped to a +particular blog or photo gallery, for instance). The ``slug`` property is +a url-friendly representation of the node that is unique within its +section, and is used for mapping URLs to nodes. Each document also +contains a ``detail`` field which will vary per document type: + +.. code-block:: javascript + + { + _id: ObjectId(…), + nonce: ObjectId(…), + metadata: { + type: 'basic-page' + section: 'my-photos', + slug: 'about', + title: 'About Us', + created: ISODate(...), + author: { _id: ObjectId(…), name: 'Rick' }, + tags: [ ... ], + detail: { text: '# About Us\n…' } + } + } + +For the basic page above, the detail field might simply contain the text +of the page. In the case of a blog entry, the document might resemble +the following instead: + +.. code-block:: javascript + + { + … + metadata: { + … + type: 'blog-entry', + section: 'my-blog', + slug: '2012-03-noticed-the-news', + … + detail: { + publish_on: ISODate(…), + text: 'I noticed the news from Washington today…' + } + } + } + +Photos present something of a special case. Since you'll need to store +potentially very large photos, it's nice to be able to separate the binary +storage of photo data from the metadata storage. GridFS provides just such a +mechanism, splitting a "filesystem" of potentially very large "files" into +two collections, the ``files`` collection and the ``chunks`` collection. In +this case, the two collections will be called ``cms.assets.files`` and +``cms.assets.chunks``. 
Documents in the ``cms.assets.files`` +collection will be used to store the normal GridFS metadata as well as CMS node +metadata: + +.. code-block:: javascript + + { + _id: ObjectId(…), + length: 123..., + chunkSize: 262144, + uploadDate: ISODate(…), + contentType: 'image/jpeg', + md5: 'ba49a...', + metadata: { + nonce: ObjectId(…), + slug: '2012-03-invisible-bicycle', + type: 'photo', + section: 'my-album', + title: 'Kitteh', + created: ISODate(…), + author: { _id: ObjectId(…), name: 'Jared' }, + tags: [ … ], + detail: { + filename: 'kitteh_invisible_bike.jpg', + resolution: [ 1600, 1600 ], … } + } + } + +NOte that the "normal" node schema is embedded here in the photo schema, allowing +the use of the same code to manipulate nodes of all types. + +Operations +========== + +Here, some common queries and updates that you might need for your CMS are +described, paying particular attention to any "tweaks" necessary for the various +node types. The examples use the Python +programming language and the ``pymongo`` MongoDB driver, but implementations +would be similar in other languages as well. + +Create and Edit Content Nodes +----------------------------- + +The content producers using your CMS will be creating and editing content +most of the time. Most content-creation activities are relatively +straightforward: + +.. code-block:: python + + db.cms.nodes.insert({ + 'nonce': ObjectId(), + 'metadata': { + 'section': 'myblog', + 'slug': '2012-03-noticed-the-news', + 'type': 'blog-entry', + 'title': 'Noticed in the News', + 'created': datetime.utcnow(), + 'author': { 'id': user_id, 'name': 'Rick' }, + 'tags': [ 'news', 'musings' ], + 'detail': { + 'publish_on': datetime.utcnow(), + 'text': 'I noticed the news from Washington today…' } + } + }) + +Once the node is in the database, there is a potential problem with +multiple editors. In order to support this, the schema uses the special ``nonce`` +value to detect when another editor may have modified the document and +allow the application to resolve any conflicts: + +.. code-block:: python + + def update_text(section, slug, nonce, text): + result = db.cms.nodes.update( + { 'metadata.section': section, + 'metadata.slug': slug, + 'nonce': nonce }, + { '$set':{'metadata.detail.text': text, 'nonce': ObjectId() } }, + safe=True) + if not result['updatedExisting']: + raise ConflictError() + +You might also want to perform metadata edits to the item such as adding +tags: + +.. code-block:: python + + db.cms.nodes.update( + { 'metadata.section': section, 'metadata.slug': slug }, + { '$addToSet': { 'tags': { '$each': [ 'interesting', 'funny' ] } } }) + +In this case, you don't actually need to supply the nonce (nor update it) +since you're using the atomic ``$addToSet`` modifier in MongoDB. + +Index Support +~~~~~~~~~~~~~ + +Updates in this case are based on equality queries containing the +(``section``, ``slug``, and ``nonce``) values. To support these queries, you +*might* use the following index: + +.. code-block:: python + + >>> db.cms.nodes.ensure_index([ + ... ('metadata.section', 1), ('metadata.slug', 1), ('nonce', 1) ]) + +Also note, however, that you'd like to ensure that two editors don't +create two documents with the same section and slug. To support this, you need a +second index with a unique constraint: + +.. code-block:: python + + >>> db.cms.nodes.ensure_index([ + ... 
('metadata.section', 1), ('metadata.slug', 1)], unique=True)

In fact, since the expectation is that (``section``, ``slug``, ``nonce``) will
almost always be unique, the first index adds little value, and the second,
unique index can satisfy the update queries on its own.

Upload a Photo
--------------

Uploading photos shares some characteristics with node updates, but it also
has some extra nuances:

.. code-block:: python

    def upload_new_photo(
        input_file, section, slug, title, author, tags, detail):
        fs = GridFS(db, 'cms.assets')
        with fs.new_file(
            content_type='image/jpeg',
            metadata=dict(
                type='photo',
                locked=datetime.utcnow(),
                section=section,
                slug=slug,
                title=title,
                created=datetime.utcnow(),
                author=author,
                tags=tags,
                detail=detail)) as upload_file:
            while True:
                chunk = input_file.read(upload_file.chunk_size)
                if not chunk: break
                upload_file.write(chunk)
        # unlock the file
        db.cms.assets.files.update(
            {'_id': upload_file._id},
            {'$set': { 'metadata.locked': None } } )

Here, since uploading the photo is a non-atomic operation, you need to
"lock" the file during upload by writing the current datetime into the
record. This lets the application detect when a file upload may have
stalled, which is helpful when working with multiple editors. This solution
assumes that, for photo uploads, the last update wins:

.. code-block:: python

    def update_photo_content(input_file, section, slug):
        fs = GridFS(db, 'cms.assets')

        # Delete the old version if it's unlocked or was locked more than 5
        # minutes ago
        file_obj = db.cms.assets.files.find_one(
            { 'metadata.section': section,
              'metadata.slug': slug,
              'metadata.locked': None })
        if file_obj is None:
            threshold = datetime.utcnow() - timedelta(seconds=300)
            file_obj = db.cms.assets.files.find_one(
                { 'metadata.section': section,
                  'metadata.slug': slug,
                  'metadata.locked': { '$lt': threshold } })
        if file_obj is None: raise FileDoesNotExist()
        fs.delete(file_obj['_id'])

        # update content, keep metadata unchanged
        file_obj['metadata']['locked'] = datetime.utcnow()
        with fs.new_file(**file_obj) as upload_file:
            while True:
                chunk = input_file.read(upload_file.chunk_size)
                if not chunk: break
                upload_file.write(chunk)
        # unlock the file
        db.cms.assets.files.update(
            {'_id': upload_file._id},
            {'$set': { 'metadata.locked': None } } )

You can, of course, perform metadata edits to the item such as adding tags
without the extra complexity:

.. code-block:: python

    db.cms.assets.files.update(
        { 'metadata.section': section, 'metadata.slug': slug },
        { '$addToSet': {
            'metadata.tags': { '$each': [ 'interesting', 'funny' ] } } })

Index Support
~~~~~~~~~~~~~

Updates here are also based on equality queries containing the
(``section``, ``slug``) values, so you can use the same types of indexes as
in the "regular" node case. Note in particular that you need a unique
constraint on (``section``, ``slug``) to ensure that one of the calls to
``GridFS.new_file()`` will fail if multiple editors try to create or update
the file simultaneously:

.. code-block:: python

    >>> db.cms.assets.files.ensure_index([
    ...     ('metadata.section', 1), ('metadata.slug', 1)], unique=True)
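As a usage illustration, the following sketch drives ``upload_new_photo()``
with values taken from the photo schema shown earlier; the local file path
and tag list are hypothetical, and ``user_id`` is assumed to be the editor's
``_id`` as in the earlier node examples.

.. code-block:: python

    # Hypothetical call to upload_new_photo(); the file path, tags, and
    # resolution are illustrative values only.
    author = { '_id': user_id, 'name': 'Jared' }

    with open('kitteh_invisible_bike.jpg', 'rb') as input_file:
        upload_new_photo(
            input_file,
            section='my-album',
            slug='2012-03-invisible-bicycle',
            title='Kitteh',
            author=author,
            tags=['kittehs'],
            detail={ 'filename': 'kitteh_invisible_bike.jpg',
                     'resolution': [1600, 1600] })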
Locate and Render a Node
------------------------

You need to be able to locate a node based on its section and slug, which
have been extracted from the page definition and URL by some other
technology:

.. code-block:: python

    node = db.cms.nodes.find_one(
        {'metadata.section': section, 'metadata.slug': slug })

Index Support
~~~~~~~~~~~~~

The same index defined above on (``section``, ``slug``) makes this query
efficient.

Locate and Render a Photo
-------------------------

You want to locate an image based on its section and slug, which have been
extracted from the page definition and URL just as with other nodes:

.. code-block:: python

    fs = GridFS(db, 'cms.assets')
    with fs.get_version(
        **{'metadata.section': section, 'metadata.slug': slug }) as img_fp:
        # do something with the image file

Index Support
~~~~~~~~~~~~~

The same index defined above on (``section``, ``slug``) also makes this
query efficient.

Search for Nodes by Tag
-----------------------

You'd like to retrieve a list of nodes based on their tags:

.. code-block:: python

    nodes = db.cms.nodes.find({'metadata.tags': tag })

Index Support
~~~~~~~~~~~~~

To support searching efficiently, you should define indexes on any fields
you intend to use in your queries:

.. code-block:: python

    >>> db.cms.nodes.ensure_index('metadata.tags')

Search for Images by Tag
------------------------

Here, you'd like to retrieve a list of images based on their tags:

.. code-block:: python

    fs = GridFS(db, 'cms.assets')
    for image_file_object in db.cms.assets.files.find(
        {'metadata.tags': tag }):
        image_file = fs.get(image_file_object['_id'])
        # do something with the image file

Index Support
~~~~~~~~~~~~~

As above, in order to support searching efficiently, you should define
indexes on any fields you expect to use in the query:

.. code-block:: python

    >>> db.cms.assets.files.ensure_index('metadata.tags')

Generate a Feed of Recently Published Blog Articles
---------------------------------------------------

Here, you need to generate an RSS or Atom feed for your recently published
blog articles, sorted by publication date, descending:

.. code-block:: python

    articles = db.cms.nodes.find({
        'metadata.section': 'my-blog',
        'metadata.detail.publish_on': { '$lt': datetime.utcnow() } })
    articles = articles.sort('metadata.detail.publish_on', -1)

Index Support
~~~~~~~~~~~~~

In order to support this operation, you'll need an index on (``section``,
``publish_on``) so the items are 'in order' for the query. Note that in
cases where you're sorting or using range queries, as here, the field on
which you're sorting or performing a range query must be the final field in
the index:

.. code-block:: python

    >>> db.cms.nodes.ensure_index(
    ...     [ ('metadata.section', 1), ('metadata.detail.publish_on', -1) ])
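To show how the cursor above might feed a template, here is a minimal sketch
that caps the feed at 20 entries and pulls out only the fields a feed
renderer typically needs; the entry limit and the URL scheme are
illustrative assumptions, not part of the schema.

.. code-block:: python

    # Sketch: materialize feed entries from the `articles` cursor above.
    # The 20-entry cap and the URL format are arbitrary choices.
    feed_entries = [
        { 'title': a['metadata']['title'],
          'published': a['metadata']['detail']['publish_on'],
          'link': '/%s/%s' % (a['metadata']['section'], a['metadata']['slug']),
          'body': a['metadata']['detail']['text'] }
        for a in articles.limit(20) ]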
Sharding
========

In a CMS, read performance is generally much more important than write
performance, so you'll want to optimize the sharding setup for reads. In
order to achieve the best read performance, you need to ensure that queries
are *routeable* by the mongos process. A second consideration when sharding
is that unique indexes do not span shards, so the shard key must include the
uniquely indexed fields in order to keep the same semantics as described
above. Given these constraints, sharding the nodes and assets on
(``section``, ``slug``) is a reasonable approach:

.. code-block:: python

    >>> db.command('shardcollection', 'cms.nodes', {
    ...     'key': { 'metadata.section': 1, 'metadata.slug' : 1 } })
    { "collectionsharded" : "cms.nodes", "ok" : 1 }
    >>> db.command('shardcollection', 'cms.assets.files', {
    ...     'key': { 'metadata.section': 1, 'metadata.slug' : 1 } })
    { "collectionsharded" : "cms.assets.files", "ok" : 1 }

If you wish to shard the ``cms.assets.chunks`` collection, you need to shard
on the ``files_id`` field, since none of the node metadata is available on
the ``cms.assets.chunks`` collection in GridFS:

.. code-block:: python

    >>> db.command('shardcollection', 'cms.assets.chunks', {
    ...     'key': { 'files_id': 1 } })
    { "collectionsharded" : "cms.assets.chunks", "ok" : 1 }

This still maintains the query-routability constraint, since all reads from
GridFS must first look up the document in ``cms.assets.files`` and then look
up the chunks separately (though the GridFS API sometimes hides this
detail).

diff --git a/source/tutorial/usecase/cms-storing-comments.txt b/source/tutorial/usecase/cms-storing-comments.txt
new file mode 100644
index 00000000000..fd3551fa696
--- /dev/null
+++ b/source/tutorial/usecase/cms-storing-comments.txt
@@ -0,0 +1,600 @@
=====================
CMS: Storing Comments
=====================

Problem
=======

In your content management system (CMS), you would like to store
user-generated comments on the various types of content you generate.

Solution Overview
=================

Rather than describing the One True Way to implement comments, this use case
explores different options and the trade-offs of each. The three major
designs discussed here are:

One document per comment
    This provides the greatest degree of flexibility, as it is relatively
    straightforward to display the comments either threaded or
    chronologically. There are also no restrictions on the number of
    comments that can participate in a discussion.
All comments embedded
    In this design, all the comments are embedded in their parent document,
    whether that be a blog article, news story, or forum topic. This can be
    the highest-performance design, but it is also the most restrictive, as
    the display format of the comments is tied to the embedded structure.
    There are also potential problems with extremely active discussions
    where the total data (topic data + comments) exceeds the 16MB limit
    that MongoDB places on documents.
Hybrid design
    Here, you store comments separately from their parent topic, but
    aggregate them into a few documents, each containing many comments.

Another decision to consider in designing a commenting system is whether to
support threaded commenting (explicit replies to a parent comment). That
decision is discussed for each design below.

Schema Design: One Document Per Comment
=======================================

A comment in the one-document-per-comment format might have a structure
similar to the following:

.. code-block:: javascript

    {
        _id: ObjectId(...),
        discussion_id: ObjectId(...),
        slug: '34db',
        posted: ISODate(...),
        author: { id: ObjectId(...), name: 'Rick' },
        text: 'This is so bogus ... '
    }

The format above is really only suitable for chronological display of
commentary. It maintains a reference to the discussion in which the comment
participates, a url-friendly ``slug`` to identify it, the ``posted`` time
and ``author``, and the comment's ``text``. If you want to support threading
in this format, you need to maintain some notion of hierarchy in the comment
model as well:

.. 
code-block:: javascript + + { + _id: ObjectId(...), + discussion_id: ObjectId(...), + parent_id: ObjectId(...), + slug: '34db/8bda', + full_slug: '34db:2012.02.08.12.21.08/8bda:2012.02.09.22.19.16', + posted: ISODateTime(...), + author: { id: ObjectId(...), name: 'Rick' }, + text: 'This is so bogus ... ' + } + +Here, the schema includes some extra information into the document that +represents this document's position in the hierarchy. In addition to +maintaining the ``parent_id`` for the comment, the slug format has been modified +and a new field ``full_slug`` has been added. The slug is now a path +consisting of the parent's slug plus the comment's unique slug portion. +The ``full_slug`` is also included to facilitate sorting documents in a +threaded discussion by posting date. + +Operations: One Comment Per Document +==================================== + +Here, some common operations that you might need for your CMS are +described in the context of the single comment per document schema. All of the +following examples use the Python programming language and the ``pymongo`` +MongoDB driver, but implementations would be similar in other languages as well. + +Post a New Comment +------------------ + +In order to post a new comment in a chronologically ordered (unthreaded) +system, all you need to do is ``insert()``: + +.. code-block:: python + + slug = generate_psuedorandom_slug() + db.comments.insert({ + 'discussion_id': discussion_id, + 'slug': slug, + 'posted': datetime.utcnow(), + 'author': author_info, + 'text': comment_text }) + +In the case of a threaded discussion, there is a bit more work to do in +order to generate a "pathed" ``slug`` and ``full_slug``: + +.. code-block:: python + + posted = datetime.utcnow() + + # generate the unique portions of the slug and full_slug + slug_part = generate_psuedorandom_slug() + full_slug_part = slug_part + ':' + posted.strftime( + '%Y.%m.%d.%H.%M.%S') + + # load the parent comment (if any) + if parent_slug: + parent = db.comments.find_one( + {'discussion_id': discussion_id, 'slug': parent_slug }) + slug = parent['slug'] + '/' + slug_part + full_slug = parent['full_slug'] + '/' + full_slug_part + else: + slug = slug_part + full_slug = full_slug_part + + # actually insert the comment + db.comments.insert({ + 'discussion_id': discussion_id, + 'slug': slug, 'full_slug': full_slug, + 'posted': posted, + 'author': author_info, + 'text': comment_text }) + +View the (Paginated) Comments for a Discussion +---------------------------------------------- + +To actually view the comments in the non-threaded design, you need merely +to select all comments participating in a discussion, sorted by ``posted``: + +.. code-block:: python + + cursor = db.comments.find({'discussion_id': discussion_id}) + cursor = cursor.sort('posted') + cursor = cursor.skip(page_num * page_size) + cursor = cursor.limit(page_size) + +Since the ``full_slug`` embeds both hierarchical information via the path +and chronological information, you can use a simple sort on the +``full_slug`` property to retrieve a threaded view: + +.. code-block:: python + + cursor = db.comments.find({'discussion_id': discussion_id}) + cursor = cursor.sort('full_slug') + cursor = cursor.skip(page_num * page_size) + cursor = cursor.limit(page_size) + +Index Support +~~~~~~~~~~~~~ + +In order to efficiently support the queries above, you should maintain +two compound indexes, one on (``discussion_id``, ``posted``), and the other on +(``discussion_id``, ``full_slug``): + +.. 
code-block:: python

    >>> db.comments.ensure_index([
    ...     ('discussion_id', 1), ('posted', 1)])
    >>> db.comments.ensure_index([
    ...     ('discussion_id', 1), ('full_slug', 1)])

Note that you must ensure that the final element in a compound index is
the field by which you are sorting to ensure efficient performance of
these queries.

Retrieve a Comment Via Slug ("Permalink")
-----------------------------------------

Suppose you wish to directly retrieve a comment (i.e. *not* requiring
paging through all preceding pages of commentary). In this case, you'd
simply use the ``slug``:

.. code-block:: python

    comment = db.comments.find_one({
        'discussion_id': discussion_id,
        'slug': comment_slug})

You can also retrieve a sub-discussion (a comment and all of its
descendants recursively) by performing a prefix query on the ``full_slug``
field:

.. code-block:: python

    subdiscussion = db.comments.find({
        'discussion_id': discussion_id,
        'full_slug': re.compile('^' + re.escape(parent_slug)) })
    subdiscussion = subdiscussion.sort('full_slug')

Index Support
~~~~~~~~~~~~~

Since you already have the index on (``discussion_id``, ``full_slug``)
necessary to support retrieval of sub-discussions, all you need to add here
is an index on (``discussion_id``, ``slug``) to efficiently support
retrieval of a comment by 'permalink':

.. code-block:: python

    >>> db.comments.ensure_index([
    ...     ('discussion_id', 1), ('slug', 1)])

Schema Design: All Comments Embedded
====================================

In this design, you wish to embed an entire discussion within its topic
document, be it a blog article, news story, or discussion thread. A topic
document, then, might look something like the following:

.. code-block:: javascript

    {
        _id: ObjectId(...),
        ... lots of topic data ...
        comments: [
            { posted: ISODate(...),
              author: { id: ObjectId(...), name: 'Rick' },
              text: 'This is so bogus ... ' },
            ... ]
    }

The format above is really only suitable for chronological display of
commentary. The comments are embedded in chronological order, with their
posting date, author, and text. Note that, since you're storing the
comments in sorted order, there is no longer any need to maintain a slug
per comment. If you wanted to support threading in the embedded format,
you'd need to embed comments within comments:

.. code-block:: javascript

    {
        _id: ObjectId(...),
        ... lots of topic data ...
        replies: [
            { posted: ISODate(...),
              author: { id: ObjectId(...), name: 'Rick' },
              text: 'This is so bogus ... ',
              replies: [
                  { author: { ... }, ... },
                  ... ]
            },
            ... ]
    }

Here, a ``replies`` property is added to each comment, which can hold
sub-comments and so on. One thing in particular to note about the embedded
document formats is that you give up some flexibility when embedding the
comments, effectively "baking in" the decisions made about the proper
display format. If you (or your users) someday wish to switch from
chronological to threaded display, or vice-versa, this schema makes such a
migration quite expensive.

In popular discussions, you also might have an issue with document size.
If you have a particularly avid discussion, for example, it might outgrow
the 16MB limit that MongoDB places on document size. You can also run into
scaling issues, particularly in the threaded design, as documents need to
be frequently moved on disk as they outgrow the space allocated to them.
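To make the document-growth concern concrete, you can measure how large a
discussion document has become before deciding whether to keep embedding.
This is only a sketch: it assumes PyMongo's ``bson`` package and uses an
arbitrary 12MB warning threshold well under MongoDB's 16MB cap.

.. code-block:: python

    import bson

    DOC_LIMIT = 16 * 1024 * 1024       # MongoDB's per-document limit
    WARN_THRESHOLD = 12 * 1024 * 1024  # arbitrary safety margin

    def discussion_bson_size(discussion_id):
        # Encode the full topic document to measure its current BSON size
        doc = db.discussion.find_one({'_id': discussion_id})
        return len(bson.BSON.encode(doc))

    if discussion_bson_size(discussion_id) > WARN_THRESHOLD:
        # approaching the 16MB cap; consider the hybrid design described below
        pass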
Operations: All Comments Embedded
=================================

Here, some common operations that you might need for your CMS are described
in the context of the embedded-comment schema. Once again, the examples are
in Python. Note that, in all the cases below, there is no need for
additional indexes since all the operations are intra-document, and the
document itself (the "discussion") is retrieved by its ``_id`` field, which
is automatically indexed by MongoDB anyway.

Post a New Comment
------------------

In order to post a new comment in a chronologically ordered (unthreaded)
system, you need the following ``update()``:

.. code-block:: python

    db.discussion.update(
        { '_id': discussion_id },
        { '$push': { 'comments': {
            'posted': datetime.utcnow(),
            'author': author_info,
            'text': comment_text } } } )

Note that since you used the ``$push`` operator, the comments are appended
in their correct chronological order. In the case of a threaded discussion,
there is a good bit more work to do. In order to reply to a comment, the
code below assumes that it has access to the 'path' to the comment you're
replying to as a list of positions:

.. code-block:: python

    if path != []:
        str_path = '.'.join('replies.%d' % part for part in path)
        str_path += '.replies'
    else:
        str_path = 'replies'
    db.discussion.update(
        { '_id': discussion_id },
        { '$push': {
            str_path: {
                'posted': datetime.utcnow(),
                'author': author_info,
                'text': comment_text } } } )

Here, you first construct a field name of the form
``replies.0.replies.2...`` as ``str_path`` and then use it to ``$push`` the
new comment into its parent comment's ``replies`` array.

View the (Paginated) Comments for a Discussion
----------------------------------------------

To actually view the comments in the non-threaded design, you need to use
the ``$slice`` operator:

.. code-block:: python

    discussion = db.discussion.find_one(
        {'_id': discussion_id},
        { ... some fields relevant to your page from the root discussion ...,
          'comments': { '$slice': [ page_num * page_size, page_size ] }
        })

If you wish to view paginated comments for the threaded design, you need
to retrieve the whole document and paginate in your application:

.. code-block:: python

    discussion = db.discussion.find_one({'_id': discussion_id})

    def iter_comments(obj):
        for reply in obj['replies']:
            yield reply
            for subreply in iter_comments(reply):
                yield subreply

    paginated_comments = itertools.islice(
        iter_comments(discussion),
        page_size * page_num,
        page_size * (page_num + 1))
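If the display layer also needs to indent replies, a small variant of the
generator above can carry the nesting depth along with each comment. This
is only a sketch of one way to do it; the ``(depth, comment)`` convention
is an illustrative choice.

.. code-block:: python

    import itertools

    def iter_comments_with_depth(obj, depth=0):
        # Walk the embedded reply tree in display order, yielding
        # (depth, comment) pairs so a template can indent each reply.
        for reply in obj.get('replies', []):
            yield depth, reply
            for pair in iter_comments_with_depth(reply, depth + 1):
                yield pair

    paginated_comments = itertools.islice(
        iter_comments_with_depth(discussion),
        page_size * page_num,
        page_size * (page_num + 1))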
Retrieve a Comment Via Position or Path ("Permalink")
-----------------------------------------------------

Instead of using slugs as above, this example retrieves comments by their
position in the comment list or tree. In the case of the chronological
(non-threaded) design, you simply need to use the ``$slice`` operator to
extract the correct comment:

.. code-block:: python

    discussion = db.discussion.find_one(
        {'_id': discussion_id},
        {'comments': { '$slice': [ position, 1 ] } })
    comment = discussion['comments'][0]

In the case of the threaded design, you're faced with the task of finding
the correct path through the tree in your application:

.. code-block:: python

    discussion = db.discussion.find_one({'_id': discussion_id})
    current = discussion
    for part in path:
        current = current['replies'][part]
    comment = current

Note that, since the replies to comments are embedded in their parents,
you have actually retrieved the entire sub-discussion rooted in the comment
you were looking for as well.

Schema Design: Hybrid
=====================

Comments in the hybrid format are stored in 'buckets' of about 100 comments
each:

.. code-block:: javascript

    {
        _id: ObjectId(...),
        discussion_id: ObjectId(...),
        page: 1,
        count: 42,
        comments: [ {
            slug: '34db',
            posted: ISODate(...),
            author: { id: ObjectId(...), name: 'Rick' },
            text: 'This is so bogus ... ' },
        ... ]
    }

Here, you maintain a "page" of comment data, containing a bit of metadata
about the page (in particular, the page number and the comment count), as
well as the comment bodies themselves. Using the hybrid format makes
storing comments hierarchically quite complex, so that approach is not
covered in this document.

Note that in this design, 100 comments is a 'soft' limit on the number of
comments per page, chosen mainly for performance reasons and to ensure
that a comment page never grows beyond the 16MB limit MongoDB imposes on
document size. There may be occasions when the number of comments is
slightly larger than 100, but this does not affect the correctness of the
design.

Operations: Hybrid
==================

Here, some common operations that you might need for your CMS are described
in the context of 100-comment "pages". Once again, the examples are in
Python.

Post a New Comment
------------------

In order to post a new comment, you need to ``$push`` the comment onto the
last page and ``$inc`` that page's comment count. If the page then has more
than 100 comments, you must insert a new page as well. This operation
starts with a reference to the discussion document, and assumes that the
discussion document has a property that tracks the number of pages:

.. code-block:: python

    page = db.comment_pages.find_and_modify(
        { 'discussion_id': discussion['_id'],
          'page': discussion['num_pages'] },
        { '$inc': { 'count': 1 },
          '$push': {
              'comments': { 'slug': slug, ... } } },
        fields={'count':1},
        upsert=True,
        new=True )

Note that the ``find_and_modify()`` above is written as an upsert
operation; if MongoDB doesn't find the requested page number, the
``find_and_modify()`` will create it for you, initialized with appropriate
values for ``count`` and ``comments``. Since you're limiting the number of
comments per page to around 100, you also need to create new pages as they
become necessary:

.. code-block:: python

    if page['count'] > 100:
        db.discussion.update(
            { '_id': discussion['_id'],
              'num_pages': discussion['num_pages'] },
            { '$inc': { 'num_pages': 1 } } )

The update here includes the last known number of pages in the query in
order to ensure that you don't have a race condition where the number of
pages is double-incremented, resulting in a nearly or totally empty page.
If some other process has already incremented the number of pages in the
discussion, then the update above simply does nothing.
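Putting the two steps together, a single helper function along the
following lines can keep the calling code simple. This is a sketch built
from the operations above; the function name and the exact comment fields
pushed are illustrative.

.. code-block:: python

    from datetime import datetime

    def post_comment(discussion, slug, author_info, comment_text):
        # Push onto the discussion's last page, creating it if necessary
        page = db.comment_pages.find_and_modify(
            { 'discussion_id': discussion['_id'],
              'page': discussion['num_pages'] },
            { '$inc': { 'count': 1 },
              '$push': { 'comments': {
                  'slug': slug,
                  'posted': datetime.utcnow(),
                  'author': author_info,
                  'text': comment_text } } },
            fields={'count': 1},
            upsert=True,
            new=True)

        # If the page just passed the soft limit, allocate the next page
        if page['count'] > 100:
            db.discussion.update(
                { '_id': discussion['_id'],
                  'num_pages': discussion['num_pages'] },
                { '$inc': { 'num_pages': 1 } } )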
Index Support
~~~~~~~~~~~~~

In order to efficiently support the ``find_and_modify()`` and ``update()``
operations above, you need to maintain a compound index on
(``discussion_id``, ``page``) in the ``comment_pages`` collection:

.. code-block:: python

    >>> db.comment_pages.ensure_index([
    ...     ('discussion_id', 1), ('page', 1)])

View the (Paginated) Comments for a Discussion
----------------------------------------------

In order to paginate comments with a fixed page size (i.e. not the roughly
100 comments in a database "page"), you need to do a bit of extra work in
Python:

.. code-block:: python

    def find_comments(discussion_id, skip, limit):
        result = []
        page_query = db.comment_pages.find(
            { 'discussion_id': discussion_id },
            { 'count': 1, 'comments': { '$slice': [ skip, limit ] } })
        page_query = page_query.sort('page')
        for page in page_query:
            result += page['comments']
            skip = max(0, skip - page['count'])
            limit -= len(page['comments'])
            if limit == 0: break
        return result

Here, the ``$slice`` operator is used to pull out comments from each page,
but *only* if the ``skip`` requirement is satisfied. An example helps
illustrate the logic. Suppose you have four pages with 100, 102, 100, and
22 comments, respectively, and you wish to retrieve comments with skip=300
and limit=50. The algorithm proceeds as follows:

+-------+-------+-------------------------------------------------------+
| Skip  | Limit | Description                                           |
+=======+=======+=======================================================+
| 300   | 50    | ``{$slice: [ 300, 50 ] }`` matches nothing in page    |
|       |       | #1; subtract page #1's ``count`` from ``skip`` and    |
|       |       | continue.                                             |
+-------+-------+-------------------------------------------------------+
| 200   | 50    | ``{$slice: [ 200, 50 ] }`` matches nothing in page    |
|       |       | #2; subtract page #2's ``count`` from ``skip`` and    |
|       |       | continue.                                             |
+-------+-------+-------------------------------------------------------+
| 98    | 50    | ``{$slice: [ 98, 50 ] }`` matches 2 comments in page  |
|       |       | #3; subtract page #3's ``count`` from ``skip``        |
|       |       | (saturating at 0), subtract 2 from ``limit``, and     |
|       |       | continue.                                             |
+-------+-------+-------------------------------------------------------+
| 0     | 48    | ``{$slice: [ 0, 48 ] }`` matches all 22 comments in   |
|       |       | page #4; subtract 22 from ``limit`` and continue.     |
+-------+-------+-------------------------------------------------------+
| 0     | 26    | There are no more pages; terminate the loop.          |
+-------+-------+-------------------------------------------------------+

Index Support
~~~~~~~~~~~~~

Since you already have an index on (``discussion_id``, ``page``) in your
``comment_pages`` collection, MongoDB can satisfy these queries
efficiently.

Retrieve a Comment Via Slug ("Permalink")
-----------------------------------------

Suppose you wish to directly retrieve a comment (i.e. *not* requiring
paging through all preceding pages of commentary). In this case, you can
use the slug to find the correct page, and then use the application to
find the correct comment:

.. code-block:: python

    page = db.comment_pages.find_one(
        { 'discussion_id': discussion_id,
          'comments.slug': comment_slug},
        { 'comments': 1 })
    for comment in page['comments']:
        if comment['slug'] == comment_slug:
            break

Index Support
~~~~~~~~~~~~~

Here, you'll need a new index on (``discussion_id``, ``comments.slug``) to
efficiently support retrieving the page containing a comment by its slug:

.. code-block:: python

    >>> db.comment_pages.ensure_index([
    ... 
('discussion_id', 1), ('comments.slug', 1)]) + +Sharding +======== + +In each of the cases above, it's likely that your ``discussion_id`` will at +least participate in the shard key if you should choose to shard. + +In the case of the one document per comment approach, it would be nice +to use the ``slug`` (or ``full_slug``, in the case of threaded comments) as +part of the shard key to allow routing of requests by ``slug``: + +.. code-block:: python + + >>> db.command('shardcollection', 'comments', { + ... key : { 'discussion_id' : 1, 'full_slug': 1 } }) + { "collectionsharded" : "comments", "ok" : 1 } + +In the case of the fully-embedded comments, of course, the discussion is +the only thing needed to shard, and its shard key will probably be +determined by concerns outside the scope of this document. + +In the case of hybrid documents, you'll want to use the page number of the +comment page in the shard key as well as the ``discussion_id`` to allow MongoDB +to split popular discussions among different shards: + +.. code-block:: python + + >>> db.command('shardcollection', 'comment_pages', { + ... key : { 'discussion_id' : 1, 'page': 1 } }) + { "collectionsharded" : "comment_pages", "ok" : 1 } + diff --git a/source/tutorial/usecase/ecommerce-category-hierarchy.txt b/source/tutorial/usecase/ecommerce-category-hierarchy.txt new file mode 100644 index 00000000000..9b03bae98f8 --- /dev/null +++ b/source/tutorial/usecase/ecommerce-category-hierarchy.txt @@ -0,0 +1,249 @@ +============================== +E-Commerce: Category Hierarchy +============================== + +Problem +======= + +You have a product hierarchy for an e-commerce site that you want to +query frequently and update somewhat frequently. + +Solution Overview +================= + +This solution keeps each category in its own document, along with a list of its +ancestors. The category hierarchy used in this example will be +based on different categories of music: + +.. figure:: img/ecommerce-category1.png + :align: center + :alt: Initial category hierarchy + + Initial category hierarchy + +Since categories change relatively infrequently, the focus here will be on the +operations needed to keep the hierarchy up-to-date and less on the performance +aspects of updating the hierarchy. + +Schema Design +============= + +Each category in the hierarchy will be represented by a document. That +document will be identified by an ``ObjectId`` for internal +cross-referencing as well as a human-readable name and a url-friendly +``slug`` property. Additionally, the schema stores an ancestors list along +with each document to facilitate displaying a category along with all +its ancestors in a single query. + +.. code-block:: javascript + + { "_id" : ObjectId("4f5ec858eb03303a11000002"), + "name" : "Modal Jazz", + "parent" : ObjectId("4f5ec858eb03303a11000001"), + "slug" : "modal-jazz", + "ancestors" : [ + { "_id" : ObjectId("4f5ec858eb03303a11000001"), + "slug" : "bop", + "name" : "Bop" }, + { "_id" : ObjectId("4f5ec858eb03303a11000000"), + "slug" : "ragtime", + "name" : "Ragtime" } ] + } + +Operations +========== + +Here, the various category manipulations you may need in an ecommerce site are +described as they would occur using the schema above. The examples use the Python +programming language and the ``pymongo`` MongoDB driver, but implementations +would be similar in other languages as well. + +Read and Display a Category +--------------------------- + +The simplest operation is reading and displaying a hierarchy. 
In this +case, you might want to display a category along with a list of "bread +crumbs" leading back up the hierarchy. In an E-commerce site, you'll +most likely have the slug of the category available for your query, as it can be +parsed from the URL. + +.. code-block:: python + + category = db.categories.find( + {'slug':slug}, + {'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 }) + +Here, the slug is used to retrieve the category, fetching only those +fields needed for display. + +Index Support +~~~~~~~~~~~~~ + +In order to support this common operation efficiently, you'll need an index +on the 'slug' field. Since slug is also intended to be unique, the index over it +should be unique as well: + +.. code-block:: python + + db.categories.ensure_index('slug', unique=True) + +Add a Category to the Hierarchy +------------------------------- + +Adding a category to a hierarchy is relatively simple. Suppose you wish +to add a new category 'Swing' as a child of 'Ragtime': + +.. figure:: img/ecommerce-category2.png + :align: center + :alt: Adding a category + + Adding a category + +In this case, the initial insert is simple enough, but after this +insert, the "Swing" category is still missing its ancestors array. To define +this, you'll need a helper function to build the ancestor list: + +.. code-block:: python + + def build_ancestors(_id, parent_id): + parent = db.categories.find_one( + {'_id': parent_id}, + {'name': 1, 'slug': 1, 'ancestors':1}) + parent_ancestors = parent.pop('ancestors') + ancestors = [ parent ] + parent_ancestors + db.categories.update( + {'_id': _id}, + {'$set': { 'ancestors': ancestors } }) + +Note that you only need to travel one level in the hierarchy to get the +ragtime's ancestors and build swing's entire ancestor list. Now you can +actually perform the insert and rebuild the ancestor list: + +.. code-block:: python + + doc = dict(name='Swing', slug='swing', parent=ragtime_id) + swing_id = db.categories.insert(doc) + build_ancestors(swing_id, ragtime_id) + +Index Support +~~~~~~~~~~~~~ + +Since these queries and updates all selected based on ``_id``, you only need +the default MongoDB-supplied index on ``_id`` to support this operation +efficiently. + +Change the Ancestry of a Category +--------------------------------- + +Suppose you wish to reorganize the hierarchy by moving 'bop' under +'swing': + +.. figure:: img/ecommerce-category3.png + :align: center + :alt: Change the parent of a category + + Change the parent of a category + +The initial update is straightforward: + +.. code-block:: python + + db.categories.update( + {'_id':bop_id}, {'$set': { 'parent': swing_id } } ) + +Now, you still need to update the ancestor list for bop and all its +descendants. In this case, you can't guarantee that the ancestor list of +the parent category is always correct, since MongoDB may +process the categories out-of-order. To handle this, you'll need a new +ancestor-building function: + +.. code-block:: python + + def build_ancestors_full(_id, parent_id): + ancestors = [] + while parent_id is not None: + parent = db.categories.find_one( + {'_id': parent_id}, + {'parent': 1, 'name': 1, 'slug': 1, 'ancestors':1}) + parent_id = parent.pop('parent') + ancestors.append(parent) + db.categories.update( + {'_id': _id}, + {'$set': { 'ancestors': ancestors } }) + +Now, at the expense of a few more queries up the hierarchy, you can +easily reconstruct all the descendants of 'bop': + +.. 
code-block:: python + + for cat in db.categories.find( + {'ancestors._id': bop_id}, + {'parent_id': 1}): + build_ancestors_full(cat['_id'], cat['parent_id']) + +Index Support +~~~~~~~~~~~~~ + +In this case, an index on ``ancestors._id`` would be helpful in +determining which descendants need to be updated: + +.. code-block:: python + + db.categories.ensure_index('ancestors._id') + +Rename a Category +----------------- + +Renaming a category would normally be an extremely quick operation, but +in this case due to denormalization, you also need to update the +descendants. Suppose you need to rename "Bop" to "BeBop:" + +.. figure:: img/ecommerce-category4.png + :align: center + :alt: Rename a category + + Rename a category + +First, you need to update the category name itself: + +.. code-block:: python + + db.categories.update( + {'_id':bop_id}, {'$set': { 'name': 'BeBop' } } ) + +Next, you need to update each descendant's ancestors list: + +.. code-block:: python + + db.categories.update( + {'ancestors._id': bop_id}, + {'$set': { 'ancestors.$.name': 'BeBop' } }, + multi=True) + +Here, you can use the positional operation ``$`` to match the exact "ancestor" +entry that matches the query, as well as the ``multi`` option on the +update to ensure the rename operation occurs in a single server +round-trip. + +Index Support +~~~~~~~~~~~~~ + +In this case, the index you have already defined on ``ancestors._id`` is +sufficient to ensure good performance. + +Sharding +======== + +In this solution, it is unlikely that you would want to shard the +collection since it's likely to be quite small. If you *should* decide to +shard, the use of an ``_id`` field for most updates makes it an +ideal sharding candidate. The sharding commands you'd use to shard +the category collection would then be the following: + +.. code-block:: python + + >>> db.command('shardcollection', 'categories') + { "collectionsharded" : "categories", "ok" : 1 } + +Note that there is no need to specify the shard key, as MongoDB will +default to using ``_id`` as a shard key. diff --git a/source/tutorial/usecase/ecommerce-inventory-management.txt b/source/tutorial/usecase/ecommerce-inventory-management.txt new file mode 100644 index 00000000000..0c1b065a27e --- /dev/null +++ b/source/tutorial/usecase/ecommerce-inventory-management.txt @@ -0,0 +1,392 @@ +================================ +E-Commerce: Inventory Management +================================ + +Problem +======= + +You have a product catalog and you would like to maintain an accurate +inventory count as users shop your online store, adding and removing +things from their cart. + +Solution Overview +================= + +In an ideal world, consumers would begin browsing an online store, add +items to their shopping cart, and proceed in a timely manner to checkout +where their credit cards would always be successfully validated and +charged. In the real world, however, customers often add or remove items +from their shopping cart, change quantities, abandon the cart, and have +problems at checkout time. + +This solution keeps the traditional metaphor of the shopping cart, but +the shopping cart will *age* . Once a shopping cart has not been active +for a certain period of time, all the items in the cart once again +become part of available inventory and the cart is cleared. The state +transition diagram for a shopping cart is below: + +.. 
figure:: img/ecommerce-inventory1.png
   :align: center
   :alt: Shopping cart state transitions

Schema Design
=============

In your inventory collection, you need to maintain the current available
inventory of each stock-keeping unit (SKU) as well as a list of 'carted'
items that may be released back to available inventory if their shopping
cart times out:

.. code-block:: javascript

    {
        _id: '00e8da9b',
        qty: 16,
        carted: [
            { qty: 1, cart_id: 42,
              timestamp: ISODate("2012-03-09T20:55:36Z") },
            { qty: 2, cart_id: 43,
              timestamp: ISODate("2012-03-09T21:55:36Z") }
        ]
    }

(Note that, while in an actual implementation you might choose to merge this
schema with the product catalog schema described in
:doc:`E-Commerce: Product Catalog </tutorial/usecase/ecommerce-product-catalog>`,
the inventory schema is simplified here for brevity.) Continuing the
metaphor of the brick-and-mortar store, the SKU above has 16 items on the
shelf, 1 in one cart, and 2 in another, for a total of 19 unsold items of
merchandise.

For the shopping cart model, you need to maintain a list of (``sku``,
``quantity``, ``price``) line items:

.. code-block:: javascript

    {
        _id: 42,
        last_modified: ISODate("2012-03-09T20:55:36Z"),
        status: 'active',
        items: [
            { sku: '00e8da9b', qty: 1, item_details: {...} },
            { sku: '0ab42f88', qty: 4, item_details: {...} }
        ]
    }

Note that the cart model includes item details in each line item. This
allows your app to display the contents of the cart to the user without
needing a second query back to the catalog collection to fetch the details.

Operations
==========

Here, the various inventory-related operations in an ecommerce site are
described as they would occur using the schema above. The examples use the
Python programming language and the ``pymongo`` MongoDB driver, but
implementations would be similar in other languages as well.

Add an Item to a Shopping Cart
------------------------------

The most basic operation is moving an item off the "shelf" and into the
"cart." The constraint is that you would like to guarantee that you never
move an unavailable item off the shelf into the cart. To solve this
problem, this solution ensures that inventory is only updated if there is
sufficient inventory to satisfy the request:

.. code-block:: python

    def add_item_to_cart(cart_id, sku, qty, details):
        now = datetime.utcnow()

        # Make sure the cart is still active and add the line item
        result = db.cart.update(
            {'_id': cart_id, 'status': 'active' },
            { '$set': { 'last_modified': now },
              '$push': {
                  'items': {'sku': sku, 'qty': qty, 'details': details } }
            },
            safe=True)
        if not result['updatedExisting']:
            raise CartInactive()

        # Update the inventory
        result = db.inventory.update(
            {'_id': sku, 'qty': {'$gte': qty}},
            {'$inc': {'qty': -qty},
             '$push': {
                 'carted': { 'qty': qty, 'cart_id': cart_id,
                             'timestamp': now } } },
            safe=True)
        if not result['updatedExisting']:
            # Roll back our cart update
            db.cart.update(
                {'_id': cart_id },
                { '$pull': { 'items': {'sku': sku } } }
            )
            raise InadequateInventory()

Note here in particular that the system does not trust that the request is
satisfiable. The first check makes sure that the cart is still "active"
(more on inactive carts below) before adding a line item. The next check
verifies that sufficient inventory exists to satisfy the request before
decrementing inventory. In the case of inadequate inventory, the system
*compensates* for the non-transactional nature of MongoDB by removing the
cart update. Using ``safe=True`` and checking the result of these two
updates allows you to report an error back to the user if the cart has
become inactive or the available quantity is insufficient to satisfy the
request.
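As a usage sketch, the calling code can translate these exceptions into
user-facing messages; the SKU, the item details, and the
``display_error()`` helper below are hypothetical.

.. code-block:: python

    # Hypothetical view-layer code calling add_item_to_cart()
    try:
        add_item_to_cart(cart_id, sku='00e8da9b', qty=1,
                         details={'title': 'A Love Supreme', 'price': 1100})
    except CartInactive:
        display_error("Your cart has expired. Please start a new one.")
    except InadequateInventory:
        display_error("Sorry, we don't have enough of that item in stock.")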
Index Support
~~~~~~~~~~~~~

To support this operation efficiently, all you really need is the index on
``_id``, which MongoDB provides by default.

Modifying the Quantity in the Cart
----------------------------------

Here, you'd like to allow the user to adjust the quantity of items in their
cart. The system must ensure that when a quantity is adjusted upward, there
is sufficient inventory to cover the increase, and it must also update the
particular ``carted`` entry for the user's cart:

.. code-block:: python

    def update_quantity(cart_id, sku, old_qty, new_qty):
        now = datetime.utcnow()
        delta_qty = new_qty - old_qty

        # Make sure the cart is still active and update the line item
        result = db.cart.update(
            {'_id': cart_id, 'status': 'active', 'items.sku': sku },
            {'$set': {
                'last_modified': now,
                'items.$.qty': new_qty },
            },
            safe=True)
        if not result['updatedExisting']:
            raise CartInactive()

        # Update the inventory
        result = db.inventory.update(
            {'_id': sku,
             'carted.cart_id': cart_id,
             'qty': {'$gte': delta_qty} },
            {'$inc': {'qty': -delta_qty },
             '$set': { 'carted.$.qty': new_qty,
                       'carted.$.timestamp': now } },
            safe=True)
        if not result['updatedExisting']:
            # Roll back our cart update
            db.cart.update(
                {'_id': cart_id, 'items.sku': sku },
                {'$set': { 'items.$.qty': old_qty }
                })
            raise InadequateInventory()

Note in particular the use of the positional operator ``$`` to update the
particular ``carted`` entry and line item that matched the query. This
allows the system to update the inventory and keep track of the data
necessary to "roll back" the cart in a single atomic operation. The code
above also ensures the cart is active and timestamps it, as in the case of
adding items to the cart.

Index Support
~~~~~~~~~~~~~

To support this operation efficiently, again all you really need is the
default index on ``_id``.

Checking Out
------------

During checkout, you'd like to validate the method of payment and remove
the various ``carted`` items after the transaction has succeeded:

.. code-block:: python

    def checkout(cart_id):
        now = datetime.utcnow()

        # Make sure the cart is still active and set to 'pending'. Also
        # fetch the cart details so we can calculate the checkout price
        cart = db.cart.find_and_modify(
            {'_id': cart_id, 'status': 'active' },
            update={'$set': { 'status': 'pending', 'last_modified': now } } )
        if cart is None:
            raise CartInactive()

        # Validate payment details; collect payment
        try:
            collect_payment(cart)
            db.cart.update(
                {'_id': cart_id },
                {'$set': { 'status': 'complete' } } )
            db.inventory.update(
                {'carted.cart_id': cart_id},
                {'$pull': { 'carted': {'cart_id': cart_id} } },
                multi=True)
        except:
            db.cart.update(
                {'_id': cart_id },
                {'$set': { 'status': 'active' } } )
            raise

Here, the cart is first "locked" by setting its status to "pending"
(disabling any modifications). MongoDB's ``findAndModify`` command is used
to verify that the cart is still active and to atomically update it,
returning its details so you can collect payment. If the payment is
successful, you then remove the ``carted`` entries from the individual
items' inventory documents and set the cart to "complete." If payment is
unsuccessful, you unlock the cart by setting its status back to "active"
and report a payment error.
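From the caller's point of view, a payment failure surfaces as whatever
exception ``collect_payment()`` raised, with the cart already returned to
the ``active`` state. A sketch of the calling code follows; the
``display_error()`` helper and the broad exception handling are
illustrative only.

.. code-block:: python

    # Hypothetical order-processing code calling checkout()
    try:
        checkout(cart_id)
    except CartInactive:
        display_error("This cart has already been checked out or has expired.")
    except Exception:
        # collect_payment() failed; the cart status is back to 'active'
        display_error("Payment failed. Please check your payment details.")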
Index Support
~~~~~~~~~~~~~

Once again, the default index on ``_id`` is enough to make this operation
efficient.

Returning Timed-Out Items to Inventory
--------------------------------------

Periodically, you'd like to expire carts that have been inactive for a
given number of seconds, returning their line items to available
inventory:

.. code-block:: python

    def expire_carts(timeout):
        now = datetime.utcnow()
        threshold = now - timedelta(seconds=timeout)

        # Lock and find all the expiring carts
        db.cart.update(
            {'status': 'active', 'last_modified': { '$lt': threshold } },
            {'$set': { 'status': 'expiring' } },
            multi=True )

        # Actually expire each cart
        for cart in db.cart.find({'status': 'expiring'}):

            # Return all line items to inventory
            for item in cart['items']:
                db.inventory.update(
                    { '_id': item['sku'],
                      'carted.cart_id': cart['_id'],
                      'carted.qty': item['qty']
                    },
                    {'$inc': { 'qty': item['qty'] },
                     '$pull': { 'carted': { 'cart_id': cart['_id'] } } })

            db.cart.update(
                {'_id': cart['_id'] },
                {'$set': { 'status': 'expired' } } )

Here, you first find all carts to be expired and then, for each cart,
return its items to inventory. Once all items have been returned to
inventory, the cart is moved to the 'expired' state.

Index Support
~~~~~~~~~~~~~

In this case, you need to be able to efficiently query carts based on
their ``status`` and ``last_modified`` values, so an index on these fields
helps the performance of the periodic expiration process:

.. code-block:: python

    >>> db.cart.ensure_index([('status', 1), ('last_modified', 1)])

Note in particular the order in which the index is defined: in order to
efficiently support range queries (``$lt`` in this case), the ranged field
must be the last field in the index. Also note that there is no need to
define an index on the ``status`` field alone, as any queries on status
can use the compound index defined here.

Error Handling
--------------

There is one failure mode that has not been handled adequately thus far:
the case of an exception that occurs after updating the inventory
collection but before updating the shopping cart. The result of this
failure mode is a shopping cart that may be absent or expired while the
'carted' items in the inventory collection have never been returned to
available inventory. To account for this case, you'll need to run a
cleanup method periodically that finds old ``carted`` items and checks the
status of their cart:

.. 
code-block:: python + + def cleanup_inventory(timeout): + now = datetime.utcnow() + threshold = now - timedelta(seconds=timeout) + + # Find all the expiring carted items + for item in db.inventory.find( + {'carted.timestamp': {'$lt': threshold }}): + + # Find all the carted items that matched + carted = dict( + (carted_item['cart_id'], carted_item) + for carted_item in item['carted'] + if carted_item['timestamp'] < threshold) + + # Find any carts that are active and refresh the carted items + for cart in db.cart.find( + { '_id': {'$in': carted.keys() }, + 'status':'active'}): + cart = carted[cart['_id']] + db.inventory.update( + { '_id': item['_id'], + 'carted.cart_id': cart['_id'] }, + { '$set': {'carted.$.timestamp': now } }) + del carted[cart['_id']] + + # All the carted items left in the dict need to now be + # returned to inventory + for cart_id, carted_item in carted.items(): + db.inventory.update( + { '_id': item['_id'], + 'carted.cart_id': cart_id, + 'carted.qty': carted_item['qty'] }, + { '$inc': { 'qty': carted_item['qty'] }, + '$pull': { 'carted': { 'cart_id': cart_id } } }) + +Note that the function above is safe, as it checks to be sure the cart +is expired or expiring before removing items from the cart and returning +them to inventory. This function could, however, be slow as well as +slowing down other updates and queries, so it should be used +infrequently. + +Sharding +======== + +If you choose to shard this system, the use of an ``_id`` field for most of +our updates makes ``_id`` an ideal sharding candidate, for both carts and +products. Using ``_id`` as your shard key allows all updates that query on +``_id`` to be routed to a single mongod process. There are two potential +drawbacks with using ``_id`` as a shard key, however. + +- If the cart collection's ``_id`` is generated in a generally increasing + order, new carts will all initially be assigned to a single shard. +- Cart expiration and inventory adjustment requires several broadcast + queries and updates if ``_id`` is used as a shard key. + +It turns out you can mitigate the first pitfall by choosing a random +value (perhaps the sha-1 hash of an ``ObjectId``) as the ``_id`` of each cart +as it is created. The second objection is valid, but relatively +unimportant, as the expiration function runs relatively infrequently and can be +slowed down by the judicious use of ``sleep()`` calls in order to +minimize server load. + +The sharding commands you'd use to shard the cart and inventory +collections, then, would be the following: + +.. code-block:: python + + >>> db.command('shardcollection', 'inventory') + { "collectionsharded" : "inventory", "ok" : 1 } + >>> db.command('shardcollection', 'cart') + { "collectionsharded" : "cart", "ok" : 1 } + +Note that there is no need to specify the shard key, as MongoDB will +default to using ``_id`` as a shard key. diff --git a/source/tutorial/usecase/ecommerce-product-catalog.txt b/source/tutorial/usecase/ecommerce-product-catalog.txt new file mode 100644 index 00000000000..d225bfd2b67 --- /dev/null +++ b/source/tutorial/usecase/ecommerce-product-catalog.txt @@ -0,0 +1,516 @@ +=========================== +E-Commerce: Product Catalog +=========================== + +Problem +======= + +You have a product catalog that you would like to store in MongoDB with +products of various types and various relevant attributes. 
+ +Solution Overview +================= + +In the relational database world, there are several solutions of varying +performance characteristics used to solve this problem. This section +examines a few options and then describes the solution enabled by MongoDB. + +One approach ("concrete table inheritance") to solving this problem is +to create a table for each product category: + +.. code-block:: sql + + CREATE TABLE `product_audio_album` ( + `sku` char(8) NOT NULL, + ... + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + ..., + PRIMARY KEY(`sku`)) + ... + CREATE TABLE `product_film` ( + `sku` char(8) NOT NULL, + ... + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + ..., + PRIMARY KEY(`sku`)) + ... + +The main problem with this approach is a lack of flexibility. Each time +you add a new product category, you need to create a new table. +Furthermore, queries must be tailored to the exact type of product +expected. + +Another approach ("single table inheritance") is to use a single +table for all products and add new columns each time you need to store +a new type of product: + +.. code-block:: sql + + CREATE TABLE `product` ( + `sku` char(8) NOT NULL, + ... + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + ... + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + ..., + PRIMARY KEY(`sku`)) + +This is more flexible, allowing queries to span different types of +product, but it's quite wasteful of space. One possible space +optimization would be to name the columns generically (``str_0``, ``str_1``, +etc.,) but then you lose visibility into the meaning of the actual data in +the columns. + +Multiple table inheritance is yet another approach where common attributes are +represented in a generic 'product' table and the variations in individual +category product tables: + +.. code-block:: sql + + CREATE TABLE `product` ( + `sku` char(8) NOT NULL, + `title` varchar(255) DEFAULT NULL, + `description` varchar(255) DEFAULT NULL, + `price`, ... + PRIMARY KEY(`sku`)) + + CREATE TABLE `product_audio_album` ( + `sku` char(8) NOT NULL, + ... + `artist` varchar(255) DEFAULT NULL, + `genre_0` varchar(255) DEFAULT NULL, + `genre_1` varchar(255) DEFAULT NULL, + ..., + PRIMARY KEY(`sku`), + FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) + ... + CREATE TABLE `product_film` ( + `sku` char(8) NOT NULL, + ... + `title` varchar(255) DEFAULT NULL, + `rating` char(8) DEFAULT NULL, + ..., + PRIMARY KEY(`sku`), + FOREIGN KEY(`sku`) REFERENCES `product`(`sku`)) + ... + +This is more space-efficient than single-table inheritance and somewhat +more flexible than concrete-table inheritance, but it does require a +minimum of one join to actually obtain all the attributes relevant to a +product. + +Entity-attribute-value schemas are yet another solution, basically +creating a meta-model for your product data. In this approach, you +maintain a table with (``entity_id``, ``attribute_id``, ``value``) triples that +describe each product. For instance, suppose you are describing an audio +album. 
In that case you might have a series of rows representing the +following relationships: + ++-----------------+-------------+------------------+ +| Entity | Attribute | Value | ++=================+=============+==================+ +| sku_00e8da9b | type | Audio Album | ++-----------------+-------------+------------------+ +| sku_00e8da9b | title | A Love Supreme | ++-----------------+-------------+------------------+ +| sku_00e8da9b | ... | ... | ++-----------------+-------------+------------------+ +| sku_00e8da9b | artist | John Coltrane | ++-----------------+-------------+------------------+ +| sku_00e8da9b | genre | Jazz | ++-----------------+-------------+------------------+ +| sku_00e8da9b | genre | General | ++-----------------+-------------+------------------+ +| ... | ... | ... | ++-----------------+-------------+------------------+ + +This schema has the advantage of being completely flexible; any entity +can have any set of any attributes. New product categories do not +require *any* changes in the DDL for your database. The downside to this +schema is that any nontrivial query requires large numbers of join +operations, which results in a large performance penalty. + +One other approach that has been used in relational world is to "punt" +so to speak on the product details and serialize them all into a ``BLOB`` +column. The problem with this approach is that the details become +difficult to search and sort by. (One exception is with Oracle's ``XMLTYPE`` +columns, which actually resemble a NoSQL document database.) + +The approach best suited to MongoDB is to use a single collection to store all +the product data, similar to single-table inheritance. Due to MongoDB's +dynamic schema, however, you need not conform each document to the same +schema. This allows you to tailor each product's document to only contain +attributes relevant to that product category. + +Schema Design +============= + +Your schema should contain general product information that needs to be +searchable across all products at the beginning of each document, with +properties that vary from category to category encapsulated in a +'details' property. Thus an audio album might look like the following: + +.. code-block:: javascript + + { + sku: "00e8da9b", + type: "Audio Album", + title: "A Love Supreme", + description: "by John Coltrane", + asin: "B0000A118M", + + shipping: { + weight: 6, + dimensions: { + width: 10, + height: 10, + depth: 1 + }, + }, + + pricing: { + list: 1200, + retail: 1100, + savings: 100, + pct_savings: 8 + }, + + details: { + title: "A Love Supreme [Original Recording Reissued]", + artist: "John Coltrane", + genre: [ "Jazz", "General" ], + ... + tracks: [ + "A Love Supreme Part I: Acknowledgement", + "A Love Supreme Part II - Resolution", + "A Love Supreme, Part III: Pursuance", + "A Love Supreme, Part IV-Psalm" + ], + }, + } + +A movie title would have the same fields stored for general product +information, shipping, and pricing, but have quite a different details +attribute: + +.. code-block:: javascript + + { + sku: "00e8da9d", + type: "Film", + ..., + asin: "B000P0J0AQ", + + shipping: { ... }, + + pricing: { ... 
},
+
+    details: {
+      title: "The Matrix",
+      director: [ "Andy Wachowski", "Larry Wachowski" ],
+      writer: [ "Andy Wachowski", "Larry Wachowski" ],
+      ...,
+      aspect_ratio: "1.66:1"
+    },
+  }
+
+Another thing to note in the MongoDB schema is that you can have
+multi-valued attributes without any arbitrary restriction on the number
+of values (as you might have if you had ``genre_0`` and ``genre_1``
+columns in a relational database, for instance) and without the need for
+a join (as you might have if you normalized the many-to-many "genre"
+relation).
+
+Operations
+==========
+
+You'll be using the product catalog mainly to perform search
+operations. Thus the focus in this section will be on the various types
+of queries you might want to support in an e-commerce site. These
+examples will be written in the Python programming language using the
+``pymongo`` driver, but other language/driver combinations should be
+similar.
+
+Find All Jazz Albums, Sorted by Year Produced
+---------------------------------------------
+
+Here, you'd like to see a group of products with a particular genre,
+sorted by the year in which they were produced:
+
+.. code-block:: python
+
+    query = db.products.find({'type':'Audio Album',
+                              'details.genre': 'Jazz'})
+    query = query.sort([('details.issue_date', -1)])
+
+Index Support
+~~~~~~~~~~~~~
+
+In order to efficiently support this type of query, you need to create a
+compound index on all the properties used in the filter and in the sort:
+
+.. code-block:: python
+
+    db.products.ensure_index([
+        ('type', 1),
+        ('details.genre', 1),
+        ('details.issue_date', -1)])
+
+Note here that the final component of the index is the sort field. This allows
+MongoDB to traverse the index in the order in which the data is to be returned,
+rather than performing a slow in-memory sort of the data.
+
+Find All Products Sorted by Percentage Discount Descending
+----------------------------------------------------------
+
+While most searches would be for a particular type of product (audio
+album or movie, for instance), there may be cases where you'd like to
+find all products in a certain price range, perhaps for a "best daily
+deals" section of your website. In this case, you'll use the pricing
+information that exists in all products to find the products with the
+highest percentage discount:
+
+.. code-block:: python
+
+    query = db.products.find({ 'pricing.pct_savings': { '$gt': 25 } })
+    query = query.sort([('pricing.pct_savings', -1)])
+
+Index Support
+~~~~~~~~~~~~~
+
+In order to efficiently support this type of query, you'll need an index on the
+percentage savings:
+
+.. code-block:: python
+
+    db.products.ensure_index('pricing.pct_savings')
+
+Since the index is only on a single key, it does not matter in which
+order the index is sorted. Note that, had you wanted to perform a range
+query (say all products over $25 retail) and sort by another property
+(perhaps percentage savings), MongoDB would not have been able to use an
+index as effectively. A range query or a sort must always be on the *last*
+property in a compound index if you want to avoid scanning entirely. Thus
+using a different property for a range query and a sort requires some
+degree of scanning, slowing down your query.
+
+Find All Movies in Which Keanu Reeves Acted
+-------------------------------------------
+
+In this case, you want to search inside the details of a particular type
+of product (a movie) to find all movies containing Keanu Reeves, sorted
+by date descending:
+
+..
code-block:: python + + query = db.products.find({'type': 'Film', + 'details.actor': 'Keanu Reeves'}) + query = query.sort([('details.issue_date', -1)]) + +Index Support +~~~~~~~~~~~~~ + +Here, you wish to once again index by type first, followed the details +you're interested in: + +.. code-block:: python + + db.products.ensure_index([ + ('type', 1), + ('details.actor', 1), + ('details.issue_date', -1)]) + +And once again, the final component of the index is the sort field. + +Find All Movies With the Word "Hacker" in the Title +--------------------------------------------------- + +Those experienced with relational databases may shudder at this +operation, since it implies an inefficient LIKE query. In fact, without +a full-text search engine, some scanning will always be required to +satisfy this query. In the case of MongoDB, the solution is to use a regular +expression. In Python, you can use the ``re`` module to construct the query: + +.. code-block:: python + + import re + re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE) + + + query = db.products.find({'type': 'Film', 'title': re_hacker}) + query = query.sort([('details.issue_date', -1)]) + +Although this is fairly convenient, MongoDB also provides the option to +use a special syntax rather than importing the Python ``re`` module: + +.. code-block:: python + + query = db.products.find({ + 'type': 'Film', + 'title': {'$regex': '.*hacker.*', '$options':'i'}}) + query = query.sort([('details.issue_date', -1)]) + +Index Support +~~~~~~~~~~~~~ + +Here, the best index diverges a bit from the previous index orders: + +.. code-block:: python + + db.products.ensure_index([ + ('type', 1), + ('details.issue_date', -1), + ('title', 1)]) + +You may be wondering why you should include the title field in the index +if MongoDB has to scan anyway. The reason is that there are two types of +scans: index scans and document scans. Document scans require entire +documents to be loaded into memory, while index scans only require index +entries to be loaded. So while an index scan on title isn't as efficient +as a direct lookup, it is certainly faster than a document scan. + +The order in which you include the index keys is also different than what +you might expect. This is once again due to the fact that you're +scanning. Since the results need to be in sorted order by +``'details.issue_date``, you should make sure that's the order in which +MongoDB scans titles. You can observe the difference looking at the +query plans for different orderings. If you use the (``type``, ``title``, +``details.issue_date``) index, you get the following plan: + +.. code-block:: python + :emphasize-lines: 11,17 + + {u'allPlans': [...], + u'cursor': u'BtreeCursor type_1_title_1_details.issue_date_-1 multi', + u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1}, + {u'$minElement': 1}]], + u'title': [[u'', {}], + [<_sre.SRE_Pattern object at 0x2147cd8>, + <_sre.SRE_Pattern object at 0x2147cd8>]], + u'type': [[u'Film', u'Film']]}, + u'indexOnly': False, + u'isMultiKey': False, + u'millis': 208, + u'n': 0, + u'nChunkSkips': 0, + u'nYields': 0, + u'nscanned': 10000, + u'nscannedObjects': 0, + u'scanAndOrder': True} + +If, however, you use the (``type``, ``details.issue_date``, ``title``) index, you get +the following plan: + +.. 
code-block:: python + :emphasize-lines: 11 + + {u'allPlans': [...], + u'cursor': u'BtreeCursor type_1_details.issue_date_-1_title_1 multi', + u'indexBounds': {u'details.issue_date': [[{u'$maxElement': 1}, + {u'$minElement': 1}]], + u'title': [[u'', {}], + [<_sre.SRE_Pattern object at 0x2147cd8>, + <_sre.SRE_Pattern object at 0x2147cd8>]], + u'type': [[u'Film', u'Film']]}, + u'indexOnly': False, + u'isMultiKey': False, + u'millis': 157, + u'n': 0, + u'nChunkSkips': 0, + u'nYields': 0, + u'nscanned': 10000, + u'nscannedObjects': 0} + +The two salient features to note are a) the absence of the +``scanAndOrder: True`` in the optmal query and b) the difference in time +(208ms for the suboptimal query versus 157ms for the optimal one). The +lesson learned here is that if you absolutely have to scan, you should +make the elements you're scanning the *least* significant part of the +index (even after the sort). + +Sharding +======== + +Though the performance in this system is highly dependent on the indexes, +sharding can enhance that performance further by allowing +MongoDB to keep larger portions of those indexes in RAM. In order to maximize +your read scaling, it's also nice to choose a shard key that allows +mongos to route queries to only one or a few shards rather than all the +shards globally. + +Since most of the queries in this system include type, it should probably be +included in the shard key. You may note that most of the +queries also included ``details.issue_date``, so there may be a +temptation to include it in the shard key, but this actually wouldn't +help much since none of the queries were *selective* by date. + +Since this schema is so flexible, it's hard to say *a priori* what the +ideal shard key would be, but a reasonable guess would be to include the +``type`` field, one or more detail fields that are commonly queried, and +one final random-ish field to ensure you don't get large unsplittable +chunks. For this example, assuming that ``details.genre`` is the +second-most queried field after ``type``, the sharding setup +would be as follows: + +.. code-block:: python + + >>> db.command('shardcollection', 'product', { + ... key : { 'type': 1, 'details.genre' : 1, 'sku':1 } }) + { "collectionsharded" : "details.genre", "ok" : 1 } + +One important note here is that, even if you choose a shard key that +requires all queries to be broadcast to all shards, you still get some +benefits from sharding due to a) the larger amount of memory available +to store indexes and b) the fact that searches will be parallelized +across shards, reducing search latency. + +Scaling Queries with ``read_preference`` +---------------------------------------- + +Although sharding is the best way to scale reads and writes, it's not +always possible to partition your data so that the queries can be routed +by mongos to a subset of shards. In this case, ``mongos`` will broadcast the +query to all shards and then accumulate the results before returning to +the client. In cases like this, you can still scale query performance +by allowing ``mongos`` to read from the secondary servers in a replica set. +This is achieved via the ``read_preference`` argument, and can be set at +the connection or individual query level. For instance, to allow all +reads on a connection to go to a secondary, the syntax is: + +.. code-block:: python + + conn = pymongo.Connection(read_preference=pymongo.SECONDARY) + +or + +.. 
code-block:: python + + conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY) + +In the first instance, reads will be distributed among all the +secondaries and the primary, whereas in the second reads will only be +sent to the secondary. To allow queries to go to a secondary on a +per-query basis, you can also specify a ``read_preference``: + +.. code-block:: python + + results = db.product.find(..., read_preference=pymongo.SECONDARY) + +or + +.. code-block:: python + + results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY) + +It is important to note that reading from a secondary can introduce a +lag between when inserts and updates occur and when they become visible +to queries. In the case of a product catalog, however, where queries +happen frequently and updates happen infrequently, such eventual +consistency (updates visible within a few seconds but not immediately) +is usually tolerable. diff --git a/source/tutorial/usecase/img/ecommerce-category1.png b/source/tutorial/usecase/img/ecommerce-category1.png new file mode 100644 index 00000000000..0483e3785be Binary files /dev/null and b/source/tutorial/usecase/img/ecommerce-category1.png differ diff --git a/source/tutorial/usecase/img/ecommerce-category2.png b/source/tutorial/usecase/img/ecommerce-category2.png new file mode 100644 index 00000000000..399f29b81d6 Binary files /dev/null and b/source/tutorial/usecase/img/ecommerce-category2.png differ diff --git a/source/tutorial/usecase/img/ecommerce-category3.png b/source/tutorial/usecase/img/ecommerce-category3.png new file mode 100644 index 00000000000..472fb06555e Binary files /dev/null and b/source/tutorial/usecase/img/ecommerce-category3.png differ diff --git a/source/tutorial/usecase/img/ecommerce-category4.png b/source/tutorial/usecase/img/ecommerce-category4.png new file mode 100644 index 00000000000..6ec01d17af0 Binary files /dev/null and b/source/tutorial/usecase/img/ecommerce-category4.png differ diff --git a/source/tutorial/usecase/img/ecommerce-inventory1.png b/source/tutorial/usecase/img/ecommerce-inventory1.png new file mode 100644 index 00000000000..3ac5a359de1 Binary files /dev/null and b/source/tutorial/usecase/img/ecommerce-inventory1.png differ diff --git a/source/tutorial/usecase/img/rta-hierarchy1.png b/source/tutorial/usecase/img/rta-hierarchy1.png new file mode 100644 index 00000000000..68cf17dfcb6 Binary files /dev/null and b/source/tutorial/usecase/img/rta-hierarchy1.png differ diff --git a/source/tutorial/usecase/img/rta-preagg1.png b/source/tutorial/usecase/img/rta-preagg1.png new file mode 100644 index 00000000000..8a5b2d27532 Binary files /dev/null and b/source/tutorial/usecase/img/rta-preagg1.png differ diff --git a/source/tutorial/usecase/img/rta-preagg2.png b/source/tutorial/usecase/img/rta-preagg2.png new file mode 100644 index 00000000000..a8af8112f0e Binary files /dev/null and b/source/tutorial/usecase/img/rta-preagg2.png differ diff --git a/source/tutorial/usecase/index.txt b/source/tutorial/usecase/index.txt new file mode 100644 index 00000000000..b517cf60214 --- /dev/null +++ b/source/tutorial/usecase/index.txt @@ -0,0 +1,14 @@ +Use Cases +============ + +.. 
toctree:: + :maxdepth: 1 + + real-time-analytics-storing-log-data + real-time-analytics-preaggregated-reports + real-time-analytics-hierarchical-aggregation + ecommerce-product-catalog + ecommerce-inventory-management + ecommerce-category-hierarchy + cms-metadata-and-asset-management + cms-storing-comments diff --git a/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt new file mode 100644 index 00000000000..7f149827e48 --- /dev/null +++ b/source/tutorial/usecase/real-time-analytics-hierarchical-aggregation.txt @@ -0,0 +1,513 @@ +============================================= +Real Time Analytics: Hierarchical Aggregation +============================================= + +Problem +======= + +You have a large amount of event data that you want to analyze at +multiple levels of aggregation. + +Solution Overview +================= + +This solution assumes that the incoming event data is already +stored in an incoming ``events`` collection. For details on how you might +get the event data into the events collection, please see :doc:`Real Time +Analytics: Storing Log Data `. + +Once the event data is in the events collection, you need to aggregate +event data to the finest time granularity you're interested in. Once that +data is aggregated, you'll use it to aggregate up to the next level of +the hierarchy, and so on. To perform the aggregations, you'll use +MongoDB's ``mapreduce`` command. The schema will use several collections: +the raw data (event) logs and collections for statistics aggregated +hourly, daily, weekly, monthly, and yearly. This solution uses a hierarchical +approach to running your map-reduce jobs. The input and output of each +job is illustrated below: + +.. figure:: img/rta-hierarchy1.png + :align: center + :alt: Hierarchy + + Hierarchy of statistics collected + +Note that the events rolling into the hourly collection is qualitatively +different than the hourly statistics rolling into the daily collection. + +.. note:: + + **Map/reduce** is a popular aggregation algorithm that is optimized for + embarrassingly parallel problems. The psuedocode (in Python) of the + map/reduce algorithm appears below. Note that this psuedocode is for a + particular type of map/reduce where the results of the + map/reduce operation are *reduced* into the result collection, allowing + you to perform incremental aggregation which you'll need in this case. + + .. 
code-block:: python + + def map_reduce(icollection, query, + mapf, reducef, finalizef, ocollection): + '''Psuedocode for map/reduce with output type="reduce" in MongoDB''' + map_results = defaultdict(list) + def emit(key, value): + '''helper function used inside mapf''' + map_results[key].append(value) + + + # The map phase + for doc in icollection.find(query): + mapf(doc) + + + # Pull in documents from the output collection for + # output type='reduce' + for doc in ocollection.find({'_id': {'$in': map_results.keys() } }): + map_results[doc['_id']].append(doc['value']) + + + # The reduce phase + for key, values in map_results.items(): + reduce_results[key] = reducef(key, values) + + + # Finalize and save the results back + for key, value in reduce_results.items(): + final_value = finalizef(key, value) + ocollection.save({'_id': key, 'value': final_value}) + + The embarrassingly parallel part of the map/reduce algorithm lies in the + fact that each invocation of mapf, reducef, and finalizef are + independent of each other and can, in fact, be distributed to different + servers. In the case of MongoDB, this parallelism can be achieved by + using sharding on the collection on you're are performing map/reduce. + +Schema Design +============= + +When designing the schema for event storage, it's important to track whichevents +which have been included in your aggregations and events which have not yet been +included. A simple approach in a relational database would be to use an auto-increment +integer primary key, but this introduces a big performance penalty to +your event logging process as it has to fetch event keys one-by one. + +If you're able to batch up your inserts into the event table, you can +still use an auto-increment primary key by using the ``find_and_modify`` +command to generate your ``_id`` values: + +.. code-block:: python + + >>> obj = db.my_sequence.find_and_modify( + ... query={'_id':0}, + ... update={'$inc': {'inc': 50}} + ... upsert=True, + ... new=True) + >>> batch_of_ids = range(obj['inc']-50, obj['inc']) + +In most cases, however, it's sufficient to include a timestamp with +each event that you can use as a marker of which events have been +processed and which ones remain to be processed. + +This use case assumes that you +are calculating average session length for +logged-in users on a website. Your event format will thus be the +following: + +.. code-block:: javascript + + { + "userid": "rick", + "ts": ISODate('2010-10-10T14:17:22Z'), + "length":95 + } + +You want to calculate total and average session times for each user at +the hour, day, week, month, and year. In each case, you will also store +the number of sessions to enable MongoDB to incrementally recompute the +average session times. Each of your aggregate documents, then, looks like +the following: + +.. code-block:: javascript + + { + _id: { u: "rick", d: ISODate("2010-10-10T14:00:00Z") }, + value: { + ts: ISODate('2010-10-10T15:01:00Z'), + total: 254, + count: 10, + mean: 25.4 } + } + +Note in particular the timestamp field in the aggregate document. This allows you +to incrementally update the various levels of the hierarchy. + +Operations +========== + +In the discussion below, it is assumed that all the events have been +inserted and appropriately timestamped, so your main operations are +aggregating from events into the smallest aggregate (the hourly totals) +and aggregating from smaller granularity to larger granularity. 
In each case, the last time the particular aggregation is run is stored
+in a ``last_run`` variable. (This variable might be loaded from MongoDB
+or another persistence mechanism; a sketch of one way to persist it
+appears at the end of the following section.)
+
+Aggregate From Events to the Hourly Level
+-----------------------------------------
+
+Here, you want to load all the events since your last run until one minute
+ago (to allow for some lag in logging events). The first thing you
+need to do is create your map function. Even though this solution uses Python
+and ``pymongo`` to interface with the MongoDB server, note that the various
+functions (``mapf``, ``reducef``, and ``finalizef``) that are passed to the
+``mapreduce`` command must be JavaScript functions. The map function appears below:
+
+.. code-block:: python
+
+    mapf_hour = bson.Code('''function() {
+        var key = {
+            u: this.userid,
+            d: new Date(
+                this.ts.getFullYear(),
+                this.ts.getMonth(),
+                this.ts.getDate(),
+                this.ts.getHours(),
+                0, 0, 0) };
+        emit(
+            key,
+            {
+                total: this.length,
+                count: 1,
+                mean: 0,
+                ts: new Date() });
+    }''')
+
+In this case, the map function emits (key, value) pairs which contain the
+statistics you want to aggregate, as you'd expect, but it also emits a ``ts``
+value. This will be used in the cascaded aggregations
+(hour to day, etc.) to determine when a particular hourly aggregation
+was performed.
+
+The reduce function is also fairly straightforward:
+
+.. code-block:: python
+
+    reducef = bson.Code('''function(key, values) {
+        var r = { total: 0, count: 0, mean: 0, ts: null };
+        values.forEach(function(v) {
+            r.total += v.total;
+            r.count += v.count;
+        });
+        return r;
+    }''')
+
+A few things are notable here. First of all, note that the returned
+document from the reduce function has the same format as the result of
+map. This is a characteristic of map/reduce that it's nice
+to maintain, as differences in structure between map, reduce, and
+finalize results can lead to difficult-to-debug errors. Also note that
+the ``mean`` and ``ts`` values are ignored in the ``reduce`` function. These will be
+computed in the 'finalize' step:
+
+.. code-block:: python
+
+    finalizef = bson.Code('''function(key, value) {
+        if(value.count > 0) {
+            value.mean = value.total / value.count;
+        }
+        value.ts = new Date();
+        return value;
+    }''')
+
+The finalize function computes the mean value as well as the timestamp you'll
+use to write back to the output collection. Now, to bind it all together, here
+is the Python code to invoke the ``mapreduce`` command:
+
+.. code-block:: python
+
+    cutoff = datetime.utcnow() - timedelta(seconds=60)
+    query = { 'ts': { '$gt': last_run, '$lt': cutoff } }
+
+
+    db.events.map_reduce(
+        map=mapf_hour,
+        reduce=reducef,
+        finalize=finalizef,
+        query=query,
+        out={ 'reduce': 'stats.hourly' })
+
+
+    last_run = cutoff
+
+Through the use of the 'reduce' option on your output, you can safely run this
+aggregation as often as you like so long as you update the ``last_run`` variable
+each time.
+
+Index Support
+~~~~~~~~~~~~~
+
+Since you'll be running the initial query on the input events
+frequently, you'd benefit significantly from an index on the
+timestamp of incoming events:
+
+.. code-block:: python
+
+    >>> db.events.ensure_index('ts')
+
+Since you're always reading and writing the most recent events, this
+index has the advantage of being right-aligned, which basically means MongoDB
+only needs a thin slice of the index (the most recent values) in RAM to
+achieve good performance.
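+
+Because the aggregation is incremental, the ``last_run`` marker has to
+survive between invocations. As a minimal sketch of one way to keep it in
+MongoDB itself (the ``aggregation_state`` collection name and the two
+helper functions below are illustrative assumptions, not part of the
+schema above), you could store one small bookkeeping document per job:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    def load_last_run(db, job_name, default=datetime(1970, 1, 1)):
+        # Fetch the saved marker for this job, or a default if none exists
+        doc = db.aggregation_state.find_one({'_id': job_name})
+        return doc['last_run'] if doc else default
+
+    def save_last_run(db, job_name, last_run):
+        # Upsert the marker so the next run picks up where this one stopped
+        db.aggregation_state.update(
+            {'_id': job_name},
+            {'$set': {'last_run': last_run}},
+            upsert=True)
+
+With helpers like these, you might call ``load_last_run(db, 'hourly')``
+before running the hourly aggregation and ``save_last_run(db, 'hourly',
+cutoff)`` once it completes successfully.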
+
+Aggregate from Hour to Day
+--------------------------
+
+In calculating the daily statistics, you'll use the hourly statistics
+as input. The daily map function looks quite similar to the hourly map
+function:
+
+.. code-block:: python
+
+    mapf_day = bson.Code('''function() {
+        var key = {
+            u: this._id.u,
+            d: new Date(
+                this._id.d.getFullYear(),
+                this._id.d.getMonth(),
+                this._id.d.getDate(),
+                0, 0, 0, 0) };
+        emit(
+            key,
+            {
+                total: this.value.total,
+                count: this.value.count,
+                mean: 0,
+                ts: null });
+    }''')
+
+There are a few differences to note here. First of all, the aggregation key is
+(userid, date) rather than (userid, hour), to allow
+for daily aggregation. Secondly, note that the values ``emit``\ ted
+are actually the total and count values from the hourly aggregates
+rather than properties from event documents. This will be the case in
+all the higher-level hierarchical aggregations.
+
+Since you're using the same format for map output as was used in the
+hourly aggregations, you can, in fact, use the same reduce and finalize
+functions. The actual Python code driving this level of aggregation is
+as follows:
+
+.. code-block:: python
+
+    cutoff = datetime.utcnow() - timedelta(seconds=60)
+    query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } }
+
+
+    db.stats.hourly.map_reduce(
+        map=mapf_day,
+        reduce=reducef,
+        finalize=finalizef,
+        query=query,
+        out={ 'reduce': 'stats.daily' })
+
+
+    last_run = cutoff
+
+There are a couple of things to note here. First of all, the query is
+not on ``ts`` now, but ``value.ts``, the timestamp written during the
+finalization of the hourly aggregates. Also note that you are, in fact,
+aggregating from the ``stats.hourly`` collection into the ``stats.daily``
+collection.
+
+Index Support
+~~~~~~~~~~~~~
+
+Since you're going to be running the initial query on the hourly
+statistics collection frequently, an index on 'value.ts' would be nice
+to have:
+
+.. code-block:: python
+
+    >>> db.stats.hourly.ensure_index('value.ts')
+
+Once again, this is a right-aligned index that will use very little RAM
+for efficient operation.
+
+Other Aggregations
+------------------
+
+Once you have your daily statistics, you can use them to calculate your
+weekly and monthly statistics. The weekly map function is as follows:
+
+.. code-block:: python
+
+    mapf_week = bson.Code('''function() {
+        var key = {
+            u: this._id.u,
+            d: new Date(
+                this._id.d.valueOf()
+                - this._id.d.getDay()*24*60*60*1000) };
+        emit(
+            key,
+            {
+                total: this.value.total,
+                count: this.value.count,
+                mean: 0,
+                ts: null });
+    }''')
+
+Here, in order to get the group key, you simply take the date and subtract
+days until you get to the beginning of the week. In the
+monthly map function, you'll use the first day of the month as the
+group key:
+
+.. code-block:: python
+
+    mapf_month = bson.Code('''function() {
+        var key = {
+            u: this._id.u,
+            d: new Date(
+                this._id.d.getFullYear(),
+                this._id.d.getMonth(),
+                1, 0, 0, 0, 0) };
+        emit(
+            key,
+            {
+                total: this.value.total,
+                count: this.value.count,
+                mean: 0,
+                ts: null });
+    }''')
+
+One thing in particular to notice about these map functions is that they
+are identical to one another except for the date calculation. You can use
+Python's string interpolation to refactor the map function definitions
+as follows:
+
+..
code-block:: python + + mapf_hierarchical = '''function() { + var key = { + u: this._id.u, + d: %s }; + emit( + key, + { + total: this.value.total, + count: this.value.count, + mean: 0, + ts: null }); + }''' + + + mapf_day = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + this._id.d.getDate(), + 0, 0, 0, 0)''') + + + mapf_week = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.valueOf() + - dt.getDay()*24*60*60*1000)''') + + + mapf_month = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + this._id.d.getMonth(), + 1, 0, 0, 0, 0)''') + + + mapf_year = bson.Code( + mapf_hierarchical % '''new Date( + this._id.d.getFullYear(), + 1, 1, 0, 0, 0, 0)''') + +The Python driver can also be refactored so there is much less code +duplication: + +.. code-block:: python + + def aggregate(icollection, ocollection, mapf, cutoff, last_run): + query = { 'value.ts': { '$gt': last_run, '$lt': cutoff } } + icollection.map_reduce( + map=mapf, + reduce=reducef, + finalize=finalizef, + query=query, + out={ 'reduce': ocollection.name }) + +Once this is defined, you can perform all the aggregations as follows: + +.. code-block:: python + + cutoff = datetime.utcnow() - timedelta(seconds=60) + aggregate(db.events, db.stats.hourly, mapf_hour, cutoff, last_run) + aggregate(db.stats.hourly, db.stats.daily, mapf_day, cutoff, last_run) + aggregate(db.stats.daily, db.stats.weekly, mapf_week, cutoff, last_run) + aggregate(db.stats.daily, db.stats.monthly, mapf_month, cutoff, + last_run) + aggregate(db.stats.monthly, db.stats.yearly, mapf_year, cutoff, + last_run) + last_run = cutoff + +So long as you save/restore the ``last_run`` variable between +aggregations, you can run these aggregations as often as you like since +each aggregation individually is incremental. + +Index Support +~~~~~~~~~~~~~ + +Your indexes will continue to be on the value's timestamp to ensure +efficient operation of the next level of the aggregation (and they +continue to be right-aligned): + +.. code-block:: python + + >>> db.stats.daily.ensure_index('value.ts') + >>> db.stats.monthly.ensure_index('value.ts') + +Sharding +======== + +To take advantage of distinct shards when performing map/reduce, your +input collections should be sharded. In order to achieve good balancing +between nodes, you should make sure that the shard key is not +simply the incoming timestamp, but rather something that varies +significantly in the most recent documents. In this case, the username +makes sense as the most significant part of the shard key. + +In order to prevent a single, active user from creating a large, +unsplittable chunk, it's best to use a compound shard key with (username, +timestamp) on the events collection. + +.. code-block:: python + + >>> db.command('shardcollection','events', { + ... key : { 'userid': 1, 'ts' : 1} } ) + { "collectionsharded": "events", "ok" : 1 } + +In order to take advantage of sharding on +the aggregate collections, you *must* shard on the ``_id`` field (if you decide +to shard these collections:) + +.. 
code-block:: python + + >>> db.command('shardcollection', 'stats.daily') + { "collectionsharded" : "stats.daily", "ok": 1 } + >>> db.command('shardcollection', 'stats.weekly') + { "collectionsharded" : "stats.weekly", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly') + { "collectionsharded" : "stats.monthly", "ok" : 1 } + >>> db.command('shardcollection', 'stats.yearly') + { "collectionsharded" : "stats.yearly", "ok" : 1 } + +You should also update your map/reduce driver so that it notes the output +should be sharded. This is accomplished by adding 'sharded':True to the +output argument: + +.. code-block:: python + + ... out={ 'reduce': ocollection.name, 'sharded': True })... + diff --git a/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt new file mode 100644 index 00000000000..6f13b9a088b --- /dev/null +++ b/source/tutorial/usecase/real-time-analytics-preaggregated-reports.txt @@ -0,0 +1,543 @@ +=========================================== +Real Time Analytics: Pre-Aggregated Reports +=========================================== + +Problem +======= + +You have one or more servers generating events for which you want +real-time statistical information in a MongoDB collection. + +Solution Overview +================= + +This solution assumes the following: + +- There is no need to retain transactional event data in MongoDB, or + that retention is handled outside the scope of this use case. +- You need statistical data to be up-to-the minute (or up-to-the-second, + if possible.) +- The queries to retrieve time series of statistical data need to be as + fast as possible. + +The general approach is to use upserts and increments to generate the +statistics and simple range-based queries and filters to draw the time +series charts of the aggregated data. + +This use case assumes a simple scenario where you +want to count the number of hits to a collection of web site at various +levels of time-granularity (by minute, hour, day, week, and month) as +well as by path. + +Schema Design +============= + +There are two important considerations when designing the schema for a +real-time analytics system: the ease & speed of updates and the ease & +speed of queries. In particular, you want to avoid the following +performance-killing circumstances: + +- documents changing in size significantly, causing reallocations on + disk +- queries that require large numbers of disk seeks in order to be satisfied +- document structures that make accessing a particular field slow + +One approach you *could* use to make updates easier would be to keep your +hit counts in individual documents, one document per +minute/hour/day/etc. This approach, however, requires you to query +several documents for nontrivial time range queries, slowing down your +queries significantly. In order to keep your queries fast, it's actually better +to use somewhat more complex documents, keeping several aggregate +values in each document. + +In order to illustrate some of the other issues you might encounter, the +following are several schema designs that you might try as well as discussion of +the problems each of with them. + +Design 0: One Document Per Page/Day +----------------------------------- + +Initially, you might try putting all the statistics you need into a single +document per page: + +.. 
code-block:: javascript + + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: 5468426, + hourly: { + "0": 227850, + "1": 210231, + ... + "23": 20457 }, + minute: { + "0": 3612, + "1": 3241, + ... + "1439": 2819 } + } + +This approach has a couple of advantages: + +- It only requires a single update per hit to the website. +- Intra-day reports for a single page require fetching only a single document. + +There are, however, significant +problems with this approach. The biggest problem is that, as you upsert +data into the ``hy`` and ``mn`` properties, the document grows. Although +MongoDB attempts to pad the space required for documents, it will still +end up needing to reallocate these documents multiple times throughout +the day, copying the documents to areas with more space and impacting performance. + +Design #0.5: Preallocate Documents +---------------------------------- + +In order to mitigate the repeated copying of documents, you can tweak your +approach slightly by adding a process which will preallocate a document +with initial zeros during the previous day. In order to avoid a +situation where you preallocate documents *en masse* at midnight, it's best to +randomly (with a low probability) upsert the next day's document each +time you update the current day's statistics. This requires some tuning; +you'd like to have almost all the documents preallocated by the end of +the day, without spending much time on extraneous upserts (preallocating +a document that's already there). A reasonable first guess would be to +look at the average number of hits per day (call it *hits*) and +preallocate with a probability of *1/hits*. + +Preallocating helps performance mainly by ensuring that all the various 'buckets' +are initialized with 0 hits. Once the document is initialized, then, it +will never dynamically grow, meaning a) there is no need to perform the +reallocations that could slow us down in design #0 and b) MongoDB +doesn't need to pad the records, leading to a more compact +representation and better usage of your memory. + +Design #1: Add Intra-Document Hierarchy +--------------------------------------- + +One thing to be aware of with BSON is that documents are stored as a +sequence of (key, value) pairs, *not* as a hash table. What this means +for us is that writing to ``stats.mn.0`` is *much* faster than writing to +``stats.mn.1439``. + +.. figure:: img/rta-preagg1.png + :align: center + :alt: BSON memory layout (unoptimized) + + In order to update the value in minute #1349, MongoDB must skip over all 1349 + entries before it. + +In order to speed this up, you can introduce some intra-document +hierarchy. In particular, you can split the ``mn`` field up into 24 hourly +fields: + +.. code-block:: javascript + + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: 5468426, + hourly: { + "0": 227850, + "1": 210231, + ... + "23": 20457 }, + minute: { + "0": { + "0": 3612, + "1": 3241, + ... + "59": 2130 }, + "1": { + "60": ... , + }, + ... + "23": { + ... + "1439": 2819 } + } + } + +This allows MongoDB to "skip forward" when updating the minute +statistics later in the day, making your performance more uniform and +generally faster. + +.. 
figure:: img/rta-preagg2.png + :align: center + :alt: BSON memory layout (optimized) + + To update the value in minute #1349, MongoDB first skips the first 23 hours and + then skips 59 minutes for only 82 skips as opposed to 1439 skips in the + previous schema. + +Design #2: Create separate documents for different granularities +---------------------------------------------------------------- + +Design #1 is certainly a reasonable design for storing intraday +statistics, but what happens when you want to draw a historical chart +over a month or two? In that case, you need to fetch 30+ individual +documents containing or daily statistics. A better approach would be to +store daily statistics in a separate document, aggregated to the month. +This does introduce a second upsert to the statistics generation side of +your system, but the reduction in disk seeks on the query side should +more than make up for it. At this point, the document structure is as +follows: + +Daily Statistics +~~~~~~~~~~~~~~~~ + +.. code-block:: javascript + + { + _id: "20101010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-10T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + hourly: { + "0": 227850, + "1": 210231, + ... + "23": 20457 }, + minute: { + "0": { + "0": 3612, + "1": 3241, + ... + "59": 2130 }, + "1": { + "0": ..., + }, + ... + "23": { + "59": 2819 } + } + } + +Monthly Statistics +~~~~~~~~~~~~~~~~~~ + +.. code-block: javascript:: + + { + _id: "201010/site-1/apache_pb.gif", + metadata: { + date: ISODate("2000-10-00T00:00:00Z"), + site: "site-1", + page: "/apache_pb.gif" }, + daily: { + "1": 5445326, + "2": 5214121, + ... } + } + +This is actually the schema design that this use case uses, since it allows for a +good balance between update efficiency and query performance. + +Operations +========== + +In this system, you'd like to balance between read performance and write +(upsert) performance. This section will describe each of the major +operations you might perform, using the Python programming language and the +pymongo MongoDB driver. These operations would be similar in other +languages as well. + +Log a Hit to a Page +------------------- + +Logging a hit to a page in your website is the main "write" activity in +your system. In order to maximize performance, you'll be doing in-place +updates with the upsert operation: + +.. 
code-block:: python + + from datetime import datetime, time + + + def log_hit(db, dt_utc, site, page): + + + # Update daily stats doc + id_daily = dt_utc.strftime('%Y%m%d/') + site + page + hour = dt_utc.hour + minute = dt_utc.minute + + + # Get a datetime that only includes date info + d = datetime.combine(dt_utc.date(), time.min) + query = { + '_id': id_daily, + 'metadata': { 'date': d, 'site': site, 'page': page } } + update = { '$inc': { + 'hourly.%d' % (hour,): 1, + 'minute.%d.%d' % (hour,minute): 1 } } + db.stats.daily.update(query, update, upsert=True) + + + # Update monthly stats document + id_monthly = dt_utc.strftime('%Y%m/') + site + page + day_of_month = dt_utc.day + query = { + '_id': id_monthly, + 'metadata': { + 'date': d.replace(day=1), + 'site': site, + 'page': page } } + update = { '$inc': { + 'daily.%d' % day_of_month: 1} } + db.stats.monthly.update(query, update, upsert=True) + +Since you're using the upsert operation, this function will perform +correctly whether the document is already present or not, which is +important, as your preallocation (the next operation) will only +preallocate documents with a high probability, not 100% certainty. Note however, +that without preallocation, you end up with a dynamically growing document, +slowing down your upserts significantly as documents are moved in order +to grow them. + +Preallocate +----------- + +In order to keep your documents from growing, you can preallocate them +before they are needed. When preallocating, you set all the statistics to +zero for all time periods so that later, so that the document doesn't need to +grow to accomodate the upserts. Here, is this preallocation as its +own function: + +.. code-block:: python + + def preallocate(db, dt_utc, site, page): + + + # Get id values + id_daily = dt_utc.strftime('%Y%m%d/') + site + page + id_monthly = dt_utc.strftime('%Y%m/') + site + page + + + # Get daily metadata + daily_metadata = { + 'date': datetime.combine(dt_utc.date(), time.min), + 'site': site, + 'page': page } + # Get monthly metadata + monthly_metadata = { + 'date': daily_m['d'].replace(day=1), + 'site': site, + 'page': page } + + + # Initial zeros for statistics + hourly = dict((str(i), 0) for i in range(24)) + minute = dict( + (str(i), dict((str(j), 0) for j in range(60))) + for i in range(24)) + daily = dict((str(i), 0) for i in range(1, 32)) + + + # Perform upserts, setting metadata + db.stats.daily.update( + { + '_id': id_daily, + 'hourly': hourly, + 'minute': minute}, + { '$set': { 'metadata': daily_metadata }}, + upsert=True) + db.stats.monthly.update( + { + '_id': id_monthly, + 'daily': daily }, + { '$set': { 'm': monthly_metadata }}, + upsert=True) + +In this case, note that the function went ahead and preallocated the monthly +document while preallocating the daily document. While you could +split this into its own function and preallocated monthly documents +less frequently that daily documents, the performance difference is +negligible, so the above code is reasonable solution. + +The next question you must answer is *when* you should preallocate. You'd +like to have a high likelihood of the document being preallocated before +it is needed, but you don't want to preallocate all at once (say at +midnight) to ensure you don't create a spike in activity and a +corresponding increase in latency. 
The solution here is to
+probabilistically preallocate each time a hit is logged, with a probability
+tuned to make preallocation likely without performing too many
+unnecessary calls to preallocate:
+
+.. code-block:: python
+
+    import random
+    from datetime import datetime, timedelta, time
+
+
+    # Example probability based on 500k hits per day per page
+    prob_preallocate = 1.0 / 500000
+
+
+    def log_hit(db, dt_utc, site, page):
+        if random.random() < prob_preallocate:
+            preallocate(db, dt_utc + timedelta(days=1), site, page)
+        # Update daily stats doc
+        ...
+
+Now with a high probability, you'll preallocate each document before
+it's used, preventing the midnight spike as well as eliminating the
+movement of dynamically growing documents.
+
+Get Data for a Real-Time Chart
+------------------------------
+
+One chart that you might be interested in seeing would be the number of
+hits to a particular page over the last hour. In that case, the query is
+fairly straightforward:
+
+.. code-block:: python
+
+    >>> db.stats.daily.find_one(
+    ...     {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}},
+    ...     { 'minute': 1 })
+
+Likewise, you can get the number of hits to a page over the last day,
+with hourly granularity:
+
+.. code-block:: python
+
+    >>> db.stats.daily.find_one(
+    ...     {'metadata': {'date':dt, 'site':'site-1', 'page':'/foo.gif'}},
+    ...     { 'hourly': 1 })
+
+If you want a few days' worth of hourly data, you can get it using the
+following query:
+
+.. code-block:: python
+
+    >>> db.stats.daily.find(
+    ...     {
+    ...         'metadata.date': { '$gte': dt1, '$lte': dt2 },
+    ...         'metadata.site': 'site-1',
+    ...         'metadata.page': '/foo.gif'},
+    ...     { 'metadata.date': 1, 'hourly': 1 },
+    ...     sort=[('metadata.date', 1)])
+
+In this case, you're retrieving the date along with the statistics since
+it's possible (though highly unlikely) that you could have a gap of one
+day where a) you didn't happen to preallocate that day and b) there were
+no hits to the document on that day.
+
+Index Support
+~~~~~~~~~~~~~
+
+These operations would benefit significantly from indexes on the
+metadata of the daily statistics:
+
+.. code-block:: python
+
+    >>> db.stats.daily.ensure_index([
+    ...     ('metadata.site', 1),
+    ...     ('metadata.page', 1),
+    ...     ('metadata.date', 1)])
+
+Note in particular that the index is first on site, then page, then date. This
+allows you to perform the third query above (a single page over a range
+of days) quite efficiently. Having any compound index on page and date,
+of course, allows you to look up a single day's statistics efficiently.
+
+Get Data for a Historical Chart
+-------------------------------
+
+In order to retrieve daily data for a single month, you can perform the
+following query:
+
+.. code-block:: python
+
+    >>> db.stats.monthly.find_one(
+    ...     {'metadata':
+    ...         {'date':dt,
+    ...          'site': 'site-1',
+    ...          'page':'/foo.gif'}},
+    ...     { 'daily': 1 })
+
+If you want several months' worth of daily data, of course, you can do the
+same trick as above:
+
+.. code-block:: python
+
+    >>> db.stats.monthly.find(
+    ...     {
+    ...         'metadata.date': { '$gte': dt1, '$lte': dt2 },
+    ...         'metadata.site': 'site-1',
+    ...         'metadata.page': '/foo.gif'},
+    ...     { 'metadata.date': 1, 'daily': 1 },
+    ...     sort=[('metadata.date', 1)])
+
+Index Support
+~~~~~~~~~~~~~
+
+Once again, these operations would benefit significantly from indexes on
+the metadata of the monthly statistics:
+
+.. code-block:: python
+
+    >>> db.stats.monthly.ensure_index([
+    ...     ('metadata.site', 1),
+    ...
('metadata.page', 1), + ... ('metadata.date', 1)]) + +The order of the index is once again designed to efficiently support +range queries for a single page over several months, as above. + +Sharding +======== + +The performance of this system will be limited by the number of shards +in your cluster as well as the choice of your shard key. The ideal shard +key balances upserts beteen the shards evenly while routing any +individual query to a single shard (or a small number of shards). A +reasonable shard key for this solution would thus be (``metadata.site``, +``metadata.page``), the site-page combination for which you're calculating +statistics: + +.. code-block:: python + + >>> db.command('shardcollection', 'stats.daily', { + ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) + { "collectionsharded" : "stats.daily", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly', { + ... key : { 'metadata.site': 1, 'metadata.page' : 1 } }) + { "collectionsharded" : "stats.monthly", "ok" : 1 } + +One downside to using (``metadata.site``,``metadata.page``) as your shard +key is that, if one page dominates all your traffic, all updates to that +page will go to a single shard. The problem, however, is largely +unavoidable, since all update for a single page are going to a single +*document*. + +You also have the problem using only (``metadata.site``, ``metadata.page``) +shard key that, if a high percentage of your queries go to the same page, +these will all be handled by the same shard. A (slightly) better shard +key would the include the date as well as the site/page so that you could +serve different historical ranges with different shards: + +.. code-block:: python + + >>> db.command('shardcollection', 'stats.daily', { + ... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}}) + { "collectionsharded" : "stats.daily", "ok" : 1 } + >>> db.command('shardcollection', 'stats.monthly', { + ... key:{'metadata.site':1,'metadata.page':1,'metadata.date':1}}) + { "collectionsharded" : "stats.monthly", "ok" : 1 } + +It is worth noting in this discussion of sharding that, depending on the +number of sites/pages you are tracking and the number of hits per page, +you're talking about a fairly small set of data with modest performance +requirements, so sharding may be overkill. In the case of the MongoDB +Monitoring Service (MMS), a single shard is able to keep up with the +totality of traffic generated by all the customers using this (free) +service. diff --git a/source/tutorial/usecase/real-time-analytics-storing-log-data.txt b/source/tutorial/usecase/real-time-analytics-storing-log-data.txt new file mode 100644 index 00000000000..9e24872de2c --- /dev/null +++ b/source/tutorial/usecase/real-time-analytics-storing-log-data.txt @@ -0,0 +1,587 @@ +===================================== +Real Time Analytics: Storing Log Data +===================================== + +Problem +======= + +You have one or more servers generating events that you would like to +persist to a MongoDB collection. + +Solution Overview +================= + +This solution will assume that each server generating events, as well as the +consumer of the event data, has access to the MongoDB server(s). Furthermore, +this design will optimize based on the assumption that the query +rate is (substantially) lower than the insert rate (as is most often the +case when logging a high-bandwidth event stream). 
+ +Schema Design +============= + +The schema design in this case will depend largely on the particular +format of the event data you want to store. For a simple example, let's +take standard request logs from the Apache web server using the combined +log format. This example assumes you're using an uncapped +collection to store the event data. A line from such a log file might +look like the following: + +.. code-block:: text + + 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)" + +The simplest approach to storing the log data would be putting the exact +text of the log record into a document: + +.. code-block:: javascript + + { + _id: ObjectId('4f442120eb03305789000000'), + line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "[http://www.example.com/start.html](http://www.example.com/start.html)" "Mozilla/4.08 [en] (Win98; I ;Nav)"' + } + +While this is a possible solution, it's not likely to be the optimal +solution. For instance, if you decided you wanted to find events that hit +the same page, you'd need to use a regular expression query, which +would require a full collection scan. A better approach would be to +extract the relevant fields into individual properties. When doing the +extraction, it's important to pay attention to the choice of data types for the +various fields. For instance, the date field in the log line +``[10/Oct/2000:13:55:36 -0700]`` is 28 bytes long. If you instead store +this as a UTC timestamp, it shrinks to 8 bytes. Storing the date as a +timestamp also allows you to make date range +queries, whereas comparing two date *strings* is nearly useless. A +similar argument applies to numeric fields; storing them as strings is +suboptimal, taking up more space and making the appropriate types of +queries much more difficult. + +It's also important to consider what information you might want to omit from the +log record. For instance, if you wanted to record exactly what was in the +log record, you might create a document like the following: + +.. code-block:: javascript + + { + _id: ObjectId('4f442120eb03305789000000'), + host: "127.0.0.1", + logname: null, + user: 'frank', + time: , + request: "GET /apache_pb.gif HTTP/1.0", + status: 200, + response_size: 2326, + referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + } + + +In most cases, however, you're probably are only interested in a subset of +the data about the request. Here, you may want to keep the host, time, +path, user agent, and referer for a web analytics application: + +.. code-block:: javascript + + { + _id: ObjectId('4f442120eb03305789000000'), + host: "127.0.0.1", + time: ISODate("2000-10-10T20:55:36Z"), + path: "/apache_pb.gif", + referer: "[http://www.example.com/start.html](http://www.example.com/start.html)", + user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)" + } + +It might even be possible to remove the time, since ``ObjectId``\ s embed +the time they are created: + +.. 
+
+It might even be possible to remove the time, since ``ObjectId``\ s embed
+the time they are created:
+
+.. code-block:: javascript
+
+   {
+     _id: ObjectId('4f442120eb03305789000000'),
+     host: "127.0.0.1",
+     path: "/apache_pb.gif",
+     referer: "http://www.example.com/start.html",
+     user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
+   }
+
+System Architecture
+-------------------
+
+Event logging systems are mainly concerned with two performance
+questions: 1) how many inserts per second the system can perform (this
+limits its event throughput) and 2) how the system will manage the growth
+of event data. Concerning insert performance, the best way to scale the
+architecture is via sharding.
+
+Operations
+==========
+
+The performance-critical operation in storing an event log is insertion.
+However, you also need to be able to query the event data for relevant
+statistics. This section describes each of these operations, using the
+Python programming language and the ``pymongo`` MongoDB driver. These
+operations would be similar in other languages as well.
+
+Inserting a Log Record
+----------------------
+
+In many event logging applications, you might accept some degree of risk
+when it comes to dropping events. In others, you need to be absolutely
+sure you don't drop any events. MongoDB supports both models. In the case
+where you can tolerate a risk of loss, you can insert records
+*asynchronously* using a fire-and-forget model:
+
+.. code-block:: python
+
+   >>> import bson
+   >>> import pymongo
+   >>> from datetime import datetime
+   >>> conn = pymongo.Connection()
+   >>> db = conn.event_db
+   >>> event = {
+   ...     '_id': bson.ObjectId(),
+   ...     'host': "127.0.0.1",
+   ...     'time': datetime(2000,10,10,20,55,36),
+   ...     'path': "/apache_pb.gif",
+   ...     'referer': "http://www.example.com/start.html",
+   ...     'user_agent': "Mozilla/4.08 [en] (Win98; I ;Nav)"
+   ... }
+   >>> db.events.insert(event, safe=False)
+
+This is the fastest approach, as this code doesn't even require a
+round-trip to the MongoDB server to ensure that the insert was received.
+It is thus also the riskiest approach, as network failures and server
+errors (such as a ``DuplicateKeyError`` on a unique index) will go
+undetected. If you want to make sure you have an acknowledgement from the
+server that your insertion succeeded (for some definition of success), you
+can pass ``safe=True``:
+
+.. code-block:: python
+
+   >>> db.events.insert(event, safe=True)
+
+If your tolerance for data loss risk is somewhat lower, you can require
+that the server to which you write the data has committed the event to
+the on-disk journal before you continue operation (``safe=True`` is
+implied by all of the following options):
+
+.. code-block:: python
+
+   >>> db.events.insert(event, j=True)
+
+Finally, if you have *extremely low* tolerance for event data loss, you
+can require the data to be replicated to additional members of the
+replica set before returning:
+
+.. code-block:: python
+
+   >>> db.events.insert(event, w=2)
+
+In this case, you will get acknowledgement only after the data has been
+written to two members of the replica set. You can combine options as
+well:
+
+.. code-block:: python
+
+   >>> db.events.insert(event, j=True, w=2)
+
+In this case, the insert waits on both a journal commit *and* a
+replication acknowledgement. Although this is the safest option, it is
+also the slowest, so you should be aware of the trade-off when
+performing your inserts.
+
+.. note::
+
+   If at all possible in your application architecture, you should
+   consider using bulk inserts to insert event data. All the options
+   discussed above apply to bulk inserts, but you can pass multiple
+   events as the first parameter to ``insert()``. By passing multiple
+   documents in a single ``insert()`` call, MongoDB is able to amortize
+   the performance penalty you incur by using 'safe' options such as
+   ``j=True`` or ``w=2``.
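+
+As a concrete illustration of the note above, a bulk insert might look
+like the following; the batch contents are invented for the example, and
+any of the write-concern options shown earlier apply to the whole batch:
+
+.. code-block:: python
+
+   >>> # A hypothetical batch of pre-parsed event documents
+   >>> batch = [
+   ...     { 'host': '127.0.0.1', 'time': datetime(2000,10,10,20,55,36),
+   ...       'path': '/apache_pb.gif' },
+   ...     { 'host': '127.0.0.1', 'time': datetime(2000,10,10,20,55,39),
+   ...       'path': '/index.html' },
+   ... ]
+   >>> # A single call (and a single journal commit) covers the whole batch
+   >>> db.events.insert(batch, j=True)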
+
+Finding All the Events for a Particular Page
+--------------------------------------------
+
+For a web analytics-type application, getting the logs for a particular
+web page might be a common operation for which you would want to
+optimize. In this case, the query would be as follows:
+
+.. code-block:: python
+
+   >>> q_events = db.events.find({'path': '/apache_pb.gif'})
+
+Note that the sharding setup you use (should you decide to shard this
+collection) has performance implications for this operation. For
+instance, if you shard on the 'path' property, then this query will be
+handled by a single shard, whereas if you shard on some other property or
+combination of properties, the ``mongos`` instance will be forced to do a
+scatter/gather operation which involves *all* the shards.
+
+Index Support
+~~~~~~~~~~~~~
+
+This operation would benefit significantly from an index on the 'path'
+attribute:
+
+.. code-block:: python
+
+   >>> db.events.ensure_index('path')
+
+One potential downside to this index is that it is relatively randomly
+distributed, meaning that for efficient operation the entire index
+should be resident in RAM. Since there is likely to be a relatively
+small number of distinct paths in the index, however, this will probably
+not be a problem.
+
+Finding All the Events for a Particular Date
+--------------------------------------------
+
+You may also want to find all the events for a particular date. In this
+case, you'd perform the following query:
+
+.. code-block:: python
+
+   >>> q_events = db.events.find({'time':
+   ...     { '$gte': datetime(2000,10,10), '$lt': datetime(2000,10,11) } })
+
+Index Support
+~~~~~~~~~~~~~
+
+In this case, an index on 'time' would provide optimal performance:
+
+.. code-block:: python
+
+   >>> db.events.ensure_index('time')
+
+One of the nice things about this index is that it is *right-aligned*.
+Since you are always inserting events in ascending time order, the
+right-most slice of the B-tree will always be resident in RAM. So long
+as your queries focus mainly on recent events, the *only* part of the
+index that needs to be resident in RAM is the right-most slice of the
+B-tree, allowing MongoDB to keep quite a large index without using up
+much of the system memory.
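+
+For example, a dashboard that only examines the last hour of traffic
+touches nothing but that right-most slice of the index. The one-hour
+window below is an arbitrary illustration:
+
+.. code-block:: python
+
+   >>> # A hypothetical "recent events" query over an arbitrary window
+   >>> from datetime import timedelta
+   >>> one_hour_ago = datetime.utcnow() - timedelta(hours=1)
+   >>> recent = db.events.find({'time': {'$gte': one_hour_ago}}).sort('time', -1)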
+
+Finding All the Events for a Particular Host/Date
+-------------------------------------------------
+
+You might also want to analyze the behavior of a particular host on a
+particular day, perhaps for analyzing suspicious behavior by a
+particular IP address. In that case, you'd write a query such as:
+
+.. code-block:: python
+
+   >>> q_events = db.events.find({
+   ...     'host': '127.0.0.1',
+   ...     'time': {'$gte': datetime(2000,10,10), '$lt': datetime(2000,10,11)}
+   ... })
+
+Index Support
+~~~~~~~~~~~~~
+
+Once again, your choice of indexes affects the performance
+characteristics of this query significantly. For instance, suppose you
+create a compound index on (time, host):
+
+.. code-block:: python
+
+   >>> db.events.ensure_index([('time', 1), ('host', 1)])
+
+In this case, the query plan would be the following (retrieved via
+``q_events.explain()``):
+
+.. code-block:: python
+
+   { ...
+     u'cursor': u'BtreeCursor time_1_host_1',
+     u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],
+                      u'time': [
+                          [ datetime.datetime(2000, 10, 10, 0, 0),
+                            datetime.datetime(2000, 10, 11, 0, 0)]]
+     },
+     ...
+     u'millis': 4,
+     u'n': 11,
+     u'nscanned': 1296,
+     u'nscannedObjects': 11,
+     ... }
+
+If, however, you create a compound index on (host, time)...
+
+.. code-block:: python
+
+   >>> db.events.ensure_index([('host', 1), ('time', 1)])
+
+...then you get a much more efficient query plan and much better
+performance:
+
+.. code-block:: python
+
+   { ...
+     u'cursor': u'BtreeCursor host_1_time_1',
+     u'indexBounds': {u'host': [[u'127.0.0.1', u'127.0.0.1']],
+                      u'time': [[datetime.datetime(2000, 10, 10, 0, 0),
+                                 datetime.datetime(2000, 10, 11, 0, 0)]]},
+     ...
+     u'millis': 0,
+     u'n': 11,
+     ...
+     u'nscanned': 11,
+     u'nscannedObjects': 11,
+     ...
+   }
+
+In this case, MongoDB is able to visit just 11 entries in the index to
+satisfy the query, whereas in the first case it needed to visit 1296
+entries. This is because the query using (host, time) only needs to scan
+the index range from ``('127.0.0.1', datetime(2000,10,10))`` to
+``('127.0.0.1', datetime(2000,10,11))``, whereas the query using
+(time, host) must scan the range from ``(datetime(2000,10,10), MIN_KEY)``
+to ``(datetime(2000,10,11), MAX_KEY)``, a much larger range (in this
+case, 1296 entries) that yields correspondingly slower performance.
+
+Although the index order has an impact on the performance of the query,
+one thing to keep in mind is that an index scan is still *much* faster
+than a collection scan. Using a (time, host) index would thus still be
+much faster than an index on time alone, which would force MongoDB to
+fetch and examine 1296 *documents* rather than 1296 index entries. There
+is also the issue of right-alignedness to consider: the (time, host)
+index will be right-aligned but the (host, time) index will not, and it's
+possible that the right-alignedness of a (time, host) index will make up
+for the increased number of index entries that need to be visited to
+satisfy this query.
+
+Counting the Number of Requests by Day and Page
+-----------------------------------------------
+
+MongoDB 2.1 introduced a new aggregation framework that allows you to
+perform queries that aggregate large numbers of documents significantly
+faster than the old 'mapreduce' and 'group' commands in prior versions
+of MongoDB. Suppose, for instance, you'd like to find out how many
+requests there were for each day and page over the last month. In this
+case, you could build up the following aggregation pipeline:
+
+.. code-block:: python
+
+   >>> result = db.command('aggregate', 'events', pipeline=[
+   ...     { '$match': {
+   ...         'time': {
+   ...             '$gte': datetime(2000,10,1),
+   ...             '$lt': datetime(2000,11,1) } } },
+   ...     { '$project': {
+   ...         'path': 1,
+   ...         'date': {
+   ...             'y': { '$year': '$time' },
+   ...             'm': { '$month': '$time' },
+   ...             'd': { '$dayOfMonth': '$time' } } } },
+   ...     { '$group': {
+   ...         '_id': {
+   ...             'p':'$path',
+   ...             'y': '$date.y',
+   ...             'm': '$date.m',
+   ...             'd': '$date.d' },
+   ...         'hits': { '$sum': 1 } } },
+   ...     ])
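+
+The command returns its output in the ``result`` field of the response
+document, with one document per distinct (path, day) combination. The
+values below are invented purely for illustration:
+
+.. code-block:: python
+
+   >>> result['result']
+   [ { u'_id': { u'p': u'/apache_pb.gif', u'y': 2000, u'm': 10, u'd': 10 },
+       u'hits': 3 },
+     ... ]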
+
+The performance of this aggregation is dependent, of course, on your
+choice of shard key if the collection is sharded. What you'd like to
+ensure is that all the items in a particular 'group' end up on the same
+server, which you can do by sharding on date (probably not wise, as
+discussed below) or path (possibly a good idea).
+
+Index Support
+~~~~~~~~~~~~~
+
+In this case, you want to make sure you have an index to support the
+initial ``$match`` query:
+
+.. code-block:: python
+
+   >>> db.events.ensure_index('time')
+
+If you already have an index on (time, host) as discussed above,
+however, there is no need to create a separate index on 'time' alone,
+since the (time, host) index can be used to satisfy range queries on
+'time' alone.
+
+Sharding
+========
+
+Your insertion rate is going to be limited by the number of shards you
+maintain in your cluster as well as by the choice of a shard key. The
+choice of a shard key is important because MongoDB uses *range-based
+sharding*. What you *want* is for the insertions to be balanced evenly
+among the shards, so you'd like to avoid using something like a
+timestamp, sequence number, or ``ObjectId`` as a shard key, as new
+inserts would tend to cluster around the same values (and thus the same
+shard). But you also *want* each of your queries to be routed to a
+single shard. The following sections weigh the pros and cons of several
+approaches.
+
+Option 0: Shard on Time
+-----------------------
+
+Although an ``ObjectId`` or timestamp might seem to be an attractive
+sharding key at first, particularly given the right-alignedness of the
+index, it turns out to provide the worst of all worlds when it comes to
+read and write performance. In this case, all of your inserts will always
+flow to the same shard, so sharding provides no write-side performance
+benefit. Your reads will also tend to cluster in the same shard, so you
+would get no read-side performance benefit either.
+
+Option 1: Shard On a Random(ish) Key
+------------------------------------
+
+Suppose instead that you decided to shard on a key with a random
+distribution, say the md5 or sha1 hash of the ``_id`` field:
+
+.. code-block:: python
+
+   >>> from bson import Binary
+   >>> from hashlib import sha1
+   >>>
+   >>> # Introduce the synthetic shard key (this should actually be done at
+   >>> # event insertion time)
+   >>>
+   >>> for ev in db.events.find({}, {'_id': 1}):
+   ...     ssk = Binary(sha1(str(ev['_id'])).digest())
+   ...     db.events.update({'_id': ev['_id']}, {'$set': {'ssk': ssk} })
+   ...
+   >>> db.command('shardcollection', 'events', {
+   ...     key : { 'ssk' : 1 } })
+   { "collectionsharded" : "events", "ok" : 1 }
+
+This does introduce some complexity into your application in order to
+generate the random key, but it provides linear scaling on your inserts,
+so 5 shards should yield roughly a 5x speedup in inserting. The
+downsides to using a random shard key are the following: a) the shard
+key's index will tend to take up more space (and you need an index to
+determine where to place each new insert), and b) queries (unless they
+include the synthetic, random-ish shard key) will need to be distributed
+to all your shards in parallel. This may be acceptable, since in this
+scenario write performance is much more important than read
+performance, but you should be aware of the downsides to using a random
+key distribution.
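+
+Since the comment in the listing above notes that the synthetic key
+should really be generated at insertion time, here is a sketch of that
+variant; the field name ``ssk`` is simply carried over from the example
+above:
+
+.. code-block:: python
+
+   >>> from bson import Binary, ObjectId
+   >>> from hashlib import sha1
+   >>>
+   >>> # Hypothetical insert-time variant of the synthetic shard key
+   >>> event['_id'] = ObjectId()
+   >>> event['ssk'] = Binary(sha1(str(event['_id'])).digest())
+   >>> db.events.insert(event, j=True)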
+
+Option 2: Shard On a Naturally Evenly-Distributed Key
+-----------------------------------------------------
+
+In this case, you might choose to shard on the 'path' attribute, since it
+seems to be relatively evenly distributed:
+
+.. code-block:: python
+
+   >>> db.command('shardcollection', 'events', {
+   ...     key : { 'path' : 1 } })
+   { "collectionsharded" : "events", "ok" : 1 }
+
+This has a couple of advantages: a) writes tend to be evenly balanced,
+and b) reads tend to be selective (assuming they include the 'path'
+attribute in the query). There is a potential downside to this approach,
+however, particularly in the case where there are a limited number of
+distinct values for the path. In that case, you can end up with large
+shard 'chunks' that cannot be split or rebalanced because they contain
+only a single shard key value. The rule of thumb here is that you should
+not pick a shard key that allows large numbers of documents to share the
+same shard key value, since this prevents the cluster from splitting and
+rebalancing those chunks.
+
+Option 3: Combine a Natural and Synthetic Key
+---------------------------------------------
+
+This approach is perhaps the best combination of read and write
+performance for the application. You can define the shard key to be
+(path, sha1(_id)):
+
+.. code-block:: python
+
+   >>> db.command('shardcollection', 'events', {
+   ...     key : { 'path' : 1, 'ssk': 1 } })
+   { "collectionsharded" : "events", "ok" : 1 }
+
+You still need to calculate a synthetic key in the application client,
+but in return you get good write balancing as well as good read
+selectivity.
+
+Sharding Conclusion: Test With Your Own Data
+--------------------------------------------
+
+Picking a good shard key is unfortunately still one of those decisions
+that is simultaneously difficult to make, high-impact, and difficult to
+change once made. The particular mix of reading and writing, as well as
+the particular queries used, have a large impact on the performance of
+different sharding configurations. Although you can choose a reasonable
+shard key based on the considerations above, the best approach is to
+analyze the actual insertions and queries you are using in your own
+application.
+
+Variation: Capped Collections
+=============================
+
+One variation that you may want to consider, based on your data retention
+requirements, is whether you might be able to use a `capped
+collection `_ to store your events. Capped collections might be a good
+choice if you know you will process the event documents in a timely
+manner and you don't have exacting data retention requirements on the
+event data. Capped collections have the advantage of never growing beyond
+a certain size (they are allocated as a circular buffer on disk) and
+having documents 'fall out' of the buffer in their insertion order.
+Uncapped collections (the default) will persist documents until they are
+explicitly removed from the collection or the collection is dropped.
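+
+If a capped collection fits your retention requirements, you might create
+one along the following lines; the 1 GB size here is purely illustrative:
+
+.. code-block:: python
+
+   >>> # Allocate a fixed-size circular buffer for events (size is in bytes)
+   >>> db.create_collection('events', capped=True, size=2**30)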
+
+Appendix: Managing Event Data Growth
+====================================
+
+MongoDB databases, in the course of normal operation, never relinquish
+disk space back to the file system. This can create difficulties if you
+don't manage the size of your databases up front. For event data, you
+have a few options for managing the data growth:
+
+Single Collection
+-----------------
+
+This is the simplest option: keep all events in a single collection,
+periodically removing documents that you don't need any more. The
+advantage of simplicity, however, is offset by some performance
+considerations. First, when you execute the remove, MongoDB will actually
+bring the documents being removed into memory. Since these are documents
+that you presumably haven't touched in a while (that's why you're
+removing them), this will push more frequently used data out of memory.
+Second, in order to do a reasonably fast remove operation, you probably
+want to keep an index on a timestamp field. This will tend to slow down
+your inserts, as each insert has to update the index as well as write the
+event data. Finally, removing data periodically is also the option with
+the most potential for fragmenting the database, as MongoDB attempts to
+reuse the space freed by the remove operations for new events.
+
+Multiple Collections, Single Database
+-------------------------------------
+
+The next option is to periodically *rename* your event collection,
+rotating collections in much the same way you might rotate log files. You
+would then drop the oldest collection from the database. This has
+several advantages over the single collection approach. First,
+collection renames are both fast and atomic. Second, you don't actually
+have to touch any of the documents to drop a collection. Finally, since
+MongoDB allocates storage in *extents* that are owned by collections,
+dropping a collection will free up entire extents, mitigating the
+fragmentation risk. The downside to using multiple collections is
+increased complexity, since you will probably need to query both the
+current event collection and the previous event collection for any data
+analysis you perform.
+
+Multiple Databases
+------------------
+
+In the multiple database option, you take the multiple collection option
+a step further: rather than rotating your collections, you rotate your
+databases. At the cost of somewhat increased complexity in both
+insertions and queries, you gain one important benefit: as databases are
+dropped, disk space is returned to the operating system. This option only
+really makes sense if you have extremely variable event insertion rates
+or variable data retention requirements. For instance, if you are
+performing a large backfill of event data and want to make sure that the
+entire set of event data for 90 days is available during the backfill,
+but can be reduced to 30 days in ongoing operations, you might consider
+using multiple databases. The complexity cost for multiple databases,
+however, is significant, so this option should only be taken after
+thorough analysis.
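+
+A minimal sketch of the multiple-database rotation, assuming one database
+per month and a made-up naming scheme:
+
+.. code-block:: python
+
+   >>> # Write current events to this month's database (hypothetical names)...
+   >>> current = conn['events_2012_03']
+   >>> current.events.insert(event, j=True)
+   >>> # ...and reclaim disk space by dropping databases that have aged out
+   >>> # of the retention window
+   >>> conn.drop_database('events_2011_12')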