From 3a283e1303762287946ddd853acdfd2035ff8a4e Mon Sep 17 00:00:00 2001
From: Tyler Brock
Date: Thu, 7 Jun 2012 18:36:09 -0400
Subject: [PATCH] Updates to sharding documentation draft

---
 draft/core/sharding.txt | 173 ++++++++++++++++++++-------------------
 1 file changed, 88 insertions(+), 85 deletions(-)

diff --git a/draft/core/sharding.txt b/draft/core/sharding.txt
index 0f83335573f..1d0ebd03611 100644
--- a/draft/core/sharding.txt
+++ b/draft/core/sharding.txt
@@ -7,12 +7,13 @@ Sharding Fundamentals

 .. default-domain:: mongodb

-MognoDB's sharding allows users to :term:`partition` the data of a
-:term:`collection` within a database so that the documents are
-automatically distributed among a number of :program:`mongod`
-instances or :term:`shards `. These clusters increase write
-capacity and allow a single database instance to have a larger working
-set and total data size than a single instance could provide.
+MongoDB's sharding system allows users to :term:`partition` the data of a
+:term:`collection` within a database so that the documents are distributed
+across a number of :program:`mongod` instances or :term:`shards `.
+
+Sharded clusters increase write capacity, support larger working sets,
+and raise the limit on total data size beyond the physical resources of
+a single node.

 This document provides an overview of the fundamental concepts and
 operation of sharding with MongoDB.

@@ -40,7 +41,7 @@ MongoDB has the following features:

 .. glossary::

-   Range-based Sharding
+   Range-based Partitioning of Data
      MongoDB distributes documents among :term:`shards ` based
      on the value of the :ref:`shard key `. Each
      :term:`chunk` represents a block of :term:`documents `
@@ -49,7 +50,7 @@ MongoDB has the following features:
      divides the chunks into smaller chunks (i.e. :term:`splitting
      `) based on the shard key.

-   Automatic Sharding
+   Automatic Distribution of Data Volume
      The sharding system automatically balances data across the
      cluster without intervention from the application
      layer. Effective automatic sharding depends on a well chosen
@@ -57,59 +58,57 @@ MongoDB has the following features:
      additional complexity, modifications, or intervention from
      developers.

-   Transparent Sharding
+   Transparent Routing of Queries
      Sharding is completely transparent to the application layer,
-      because all connections to a sharded cluster go through the
-      :program:`mongos` instances. Sharding in MongoDB requires some
+      because all connections to a sharded cluster go through
+      :program:`mongos`. Sharding in MongoDB requires some
      :ref:`basic initial configuration `, but ongoing
      function is entirely transparent to the application.

-   Sharding Capacity
+   Horizontally Scalable Capacity
      Sharding increases capacity in two ways:

-      #. Given an even distribution of data with an effective
-         :term:`shard key`, sharding can provide additional write
-         capacity by increasing the number of :program:`mongod`
-         instances.
+      #. Effective partitioning of data can provide additional write
+         capacity by distributing the write load over a number of
+         :program:`mongod` instances.

-      #. Give a shard key with sufficient :ref:`cardinality
-         `, sharding makes it possible
-         to distribute data among a collection of :program:`mongod`
-         instances, and increase the potential amount of data to mange
+      #. Given a shard key with sufficient :ref:`cardinality
+         `, partitioning data allows
+         users to increase the potential amount of data to manage
         with MongoDB and expand the :term:`working set`.
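+To make the configuration workflow concrete, consider the following
+sketch, run from a :program:`mongo` shell connected to a
+:program:`mongos`. The ``records`` database, ``people`` collection, and
+``zipcode`` field are hypothetical examples:
+
+.. code-block:: javascript
+
+   // enable sharding for the database, then shard the collection
+   sh.enableSharding( "records" )
+   sh.shardCollection( "records.people", { zipcode: 1 } )
+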
-A typical :term:`shard cluster` consists of the config servers that
-provide metadata that maps :term:`chunks ` to shards, the
-:program:`mongod` instances that hold the data (i.e the :term:`shards
+A typical :term:`shard cluster` consists of config servers that
+store metadata that maps :term:`chunks ` to shards, the
+:program:`mongod` instances that hold data (i.e. the :term:`shards
 `,) and lightweight routing processes, :doc:`mongos
-`, that routes operations to the correct shard
-based on the operation and the cluster metadata.
+`, which route operations to the correct shard
+based on the cluster configuration.

 Indications
 ~~~~~~~~~~~

 While sharding is a powerful and compelling feature, it comes with
 significant :ref:`infrastructure requirements `
-and some limited complexity costs. As a result its important to use
+and some limited complexity costs. As a result, it's important to use
 sharding only as necessary, and when indicated by actual operational
-requirements. Consider the following overview of indications, which is
-a simple "*when you should shard,*" guide.
+requirements. The following overview outlines indications that it may
+be time to consider sharding.

-You should consider deploying a :term:`shard cluster`, if:
+You should consider deploying a :term:`shard cluster` if:

-- your data set exceeds the storage capacity of a single node in your
-  system.
+- your data set is approaching or exceeds the storage capacity of a single
+  node in your system.

 - the size of your system's active :term:`working set` *will soon*
   exceed the capacity of the *maximum* amount of RAM for your system.

-- your system has a large amount of write activity, and a single
+- your system has a large amount of write activity, a single
   MongoDB instance cannot write data fast enough to meet demand, and
-  all other approaches have not reduced the contention.
+  all other approaches have not reduced contention.

-If these are not true of your system, sharding may add too much
-complexity to your system without providing much benefit to your
-system. If you do plan to shard your data eventually, you should also
+If these attributes are not present in your system, sharding will only
+add complexity to your system without providing much benefit.
+If you do plan to eventually partition your data, you should also
 give some thought to which collections you'll want to shard along with
 the corresponding shard keys.

@@ -120,7 +119,7 @@ the corresponding shard keys.
    difficult time deploying sharding without impacting your
    application.

-   As a result if you know you're going to need sharding eventually,
-   its crucial that you **do not** wait until your system is
+   As a result, if you think you're going to need sharding eventually,
+   it's crucial that you **do not** wait until your system is
    overcapacity to enable sharding.
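+When weighing these indications, it can help to compare your current
+data and storage sizes against the RAM and disk of a single node. As a
+minimal sketch in the :program:`mongo` shell, assuming the hypothetical
+``records`` database:
+
+.. code-block:: javascript
+
+   use records
+
+   // dataSize, storageSize, and indexSize are reported in bytes
+   db.stats()
+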
@@ -147,15 +146,20 @@ A :term:`shard cluster` has the following components:

 For testing purposes you may deploy a shard cluster with a single
 configuration server, but this is not recommended for production.

+   .. warning::
+
+      If you choose to run a single config server and it becomes
+      inoperable for any reason, the cluster will be left in an
+      unusable state.
+
 - Two or more :program:`mongod` instances, to hold data.

   These are "normal," :program:`mongod` instances that hold all of
   the actual data for the cluster.

-  Typically a :term:`replica sets ` provides each
-  individual shard. The members of the replica set provide redundancy
-  for all data and increase the overall reliability and robustness of
-  the cluster.
+  Typically a :term:`replica set` consisting of multiple
+  :program:`mongod` instances composes a shard. The members of the
+  replica set provide redundancy for the data and increase the overall
+  reliability and robustness of the cluster.

 .. warning::

@@ -167,8 +171,8 @@ A :term:`shard cluster` has the following components:

 - One or more :program:`mongos` instances.

   These nodes cache cluster metadata from the config servers and
-  direct queries from the application layer to the :program:`mongod`
-  instances that hold the data.
+  direct queries from the application layer to the shards that hold
+  the data.

 .. note::

@@ -176,7 +180,7 @@ A :term:`shard cluster` has the following components:
      resources, and you can run them on your application servers
      without impacting application performance. However, if you use
      the :term:`aggregation framework` some processing may occur on
-      the :program:`mongos` instances which will cause them to require
+      the :program:`mongos` instances that will cause them to require
      more system resources.

 Data
 ----

@@ -185,16 +189,16 @@ Data

 Your cluster must manage a significant quantity of data for sharding
 to have an effect on your collection. The default :term:`chunk` size
 is 64 megabytes, [#chunk-size]_ and the :ref:`balancer
-` will not kick in until the shard with the
-greatest number of chunks has *8 more* chunks than the shard with
-least number of chunks.
+` will not begin moving data until the shard with
+the greatest number of chunks has *8 more* chunks than the shard with
+the least number of chunks.

 Practically, this means that unless there is 512 megabytes of data,
 all of the data will remain on the same shard. You can set a smaller
-chunk size, :ref:`manually create splits in your collection
-` using the :func:`sh.splitFind()` or
-:func:`sh.splitAt()` operations in the :program:`mongo`
-shell. Remember that the default chunk size and migration threshold
+chunk size, or :ref:`manually create splits in your collection
+` using the :func:`sh.splitFind()` and
+:func:`sh.splitAt()` operations in the :program:`mongo` shell.
+Remember that the default chunk size and migration threshold
 are explicitly configured to prevent unnecessary splitting or
 migrations.

@@ -204,7 +208,7 @@
 complexity added by sharding is not worth the operational costs unless
 you need the additional concurrency/capacity for some reason. If you
 have a small data set, the chances are that a properly configured
 single MongoDB instance or replica set will be more than sufficient
-for your data service needs.
+for your persistence layer needs.

 .. [#chunk-size] While the default chunk size is 64 megabytes, the
    size is :option:`user configurable `. When
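+As an illustration of manual splitting, the following sketch uses the
+hypothetical ``records.people`` collection from the earlier example,
+sharded on ``zipcode``:
+
+.. code-block:: javascript
+
+   // split the chunk containing the matching document at its median point
+   sh.splitFind( "records.people", { zipcode: "90210" } )
+
+   // split the chunk containing this shard key value at that exact value
+   sh.splitAt( "records.people", { zipcode: "60610" } )
+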
@@ -234,17 +238,19 @@ documents among the :term:`shards`.

 Shard keys, like :term:`indexes`, can be either a single field, or
 may be a compound key, consisting of multiple fields.

-Remember, MonoDB's sharding is range-based: each :term:`chunk` holds
-documents with "shard key" within a specific range. Thus, choosing the
-correct shard key can have a great impact on the performance,
-capability, and of functioning your database and cluster.
+Remember, MongoDB's sharding is range-based: each :term:`chunk` holds
+documents whose "shard key" values fall within a specific range. Thus,
+choosing the correct shard key can have a great impact on the performance,
+capability, and functioning of your database and cluster.

-Choosing a shard key depends on the schema of your data and the way
-that your application uses the database. The ideal shard key:
+Appropriate shard key choice depends on the schema of your data and the way
+that your application uses the database (query patterns).
+
+The ideal shard key:

 - is easily divisible which makes it easy for MongoDB to distribute
   content among the shards. Shard keys that have a limited number of
-  possible values are un-ideal, as they can result in some shards that
+  possible values are not ideal, as they can result in some chunks that
   are "un-splitable." See the ":ref:`sharding-shard-key-cardinality`"
   section for more information.

@@ -258,17 +264,15 @@ that your application uses the database. The ideal shard key:

 - will make it possible for the :program:`mongos` to return most query
   operations directly from a single *specific* :program:`mongod`
   instance. Your shard key should be the primary field used by your
-  queries, and fields with a high amount of "randomness" are poor
+  queries, and fields with a high degree of "randomness" are poor
   choices for this reason. See the ":ref:`sharding-shard-key-query-isolation`"
   section for specific examples.

-The challenge when selecting the shard key is that there is not always
-an obvious shard key, and that it's unlikely that a single naturally
-occurring field in your collection will satisfy all
-requirements. Computing a special-purpose shard key in an additional
-field, or using a compound shard key can help you find a more ideal
-shard key, but choosing a shard key requires some degree of
-compromise.
+The challenge when selecting a shard key is that there is not always
+an obvious choice. Often, no existing field in your collection is an
+optimal key. In those situations, computing a special-purpose
+shard key in an additional field or using a compound shard key may
+help produce one that is more ideal.
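+For example, this sketch declares a compound shard key on a hypothetical
+``records.events`` collection, pairing a grouping field with a
+finer-grained one to improve both distribution and query isolation:
+
+.. code-block:: javascript
+
+   // userid groups related documents for query isolation, while
+   // created adds cardinality so chunks remain splittable
+   sh.shardCollection( "records.events", { userid: 1, created: 1 } )
+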
 .. index:: sharding; config servers
 .. index:: config servers
 .. _sharding-config-server:

 Config Servers
 --------------

 The configuration servers store the shard metadata that tracks the
 relationship between the range that defines a :term:`chunk` and the
 :program:`mongod` instance (typically a :term:`replica set`) or
 :term:`shard` where that data resides. Without a config server, the
-:program:`mongos` instances are unable to route queries and write
-operations, and the cluster. This section describes their operation
+:program:`mongos` instances would be unable to route queries or write
+operations within the cluster. This section describes their operation
 and use.

 Config servers *do not* run as replica sets. Instead, a :term:`shard
 cluster` operates with a group of *three* config servers that use a
 two-phase commit process that ensures immediate consistency and
 reliability. Because the :program:`mongos` instances all maintain
-caches of the config server data, the actual traffic on the config
+caches of the config server data, the actual load on the config
 servers is small. MongoDB will write data to the config server only
 when:

@@ -312,9 +316,9 @@ server data is only read in the following situations:

 If all three config servers are inaccessible, you can continue to use
 the cluster as long as you don't restart the :program:`mongos`
-instances until the after config servers are accessible again.
+instances until after the config servers are accessible again. If
+the :program:`mongos` instances were restarted and there are no accessible
+config servers, they would be unable to direct queries or write operations
 to the cluster.

 Because the configuration data is small relative to the amount of data
-stored in a cluster, the amount activity is relatively low, and 100%
+stored in a cluster, the amount of activity is relatively low, and 100%
 up time is not required for a functioning shard cluster. As a result,
 backing up the config servers is not difficult. Backups of config
 servers are crucial as shard clusters become totally inoperable when
-you loose all configuration instances and data, precautions to ensure
-that the config servers remain available and intact are totally
-crucial.
+you lose all configuration instances and data. Precautions to ensure
+that the config servers remain available and intact are critical.

 .. index:: mongos
 .. _sharding-mongos:

 The :program:`mongos` provides a single unified interface to a sharded
 cluster for applications using MongoDB. Except for the selection of a
 :term:`shard key`, application developers and administrators need not
 consider any of the :ref:`internal details of sharding
-` because of the :program:`mongos`.
+`.

 :program:`mongos` caches data from the :ref:`config server
-`, and uses this to route operations from the
-applications and clients to the :program:`mongod`
-instances. :program:`mongos` have no *persistent* state and consume
+`, and uses this to route operations from
+applications and clients to the :program:`mongod` instances.
+:program:`mongos` instances have no *persistent* state and consume
 minimal system resources.

 The most common practice is to run :program:`mongos` instances on the
 same systems as your application servers, but you can maintain
-:program:`mongos` instances on the shard or on other dedicated
+:program:`mongos` instances on the shards or on other dedicated
 resources.

 .. note::

@@ -422,20 +425,20 @@ Balancing and Distribution
 --------------------------

 Balancing refers to the process that MongoDB uses to redistribute data
-among a :term:`shard cluster` when some :term:`shards ` have a
+within a :term:`shard cluster` when some :term:`shards ` have a
 greater number of :term:`chunks ` than other shards.

 The balancing process attempts to minimize the impact that balancing
 can have on the cluster, by:

 - Only moving one chunk at a time.

-- Only imitating a balancing round if there is a difference of *more
-  than* 8 chunks between the shard with the greatest and least number
-  of chunks.
+- Only initiating a balancing round if there is a difference of *more
+  than* 8 chunks between the shard with the greatest and the shard with
+  the least number of chunks.

 Additionally, it's possible to disable the balancer on a temporary
-basis for maintenance or to limit the window to prevent to prevent the
-balancer process from interacting with production traffic.
+basis for maintenance and limit the window during which it runs to
+prevent the balancing process from interfering with production traffic.

 .. seealso:: The ":ref:`Balancing Internals `" and :ref:`Balancer Operations
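+As a sketch of the balancer controls described above, both operations
+below modify the balancer document in the ``settings`` collection of the
+config database, from a :program:`mongo` shell connected to a
+:program:`mongos`; the window times are examples only:
+
+.. code-block:: javascript
+
+   use config
+
+   // disable the balancer temporarily, e.g. during maintenance
+   db.settings.update( { _id: "balancer" }, { $set: { stopped: true } }, true )
+
+   // re-enable it, but only allow balancing between 11pm and 6am
+   db.settings.update( { _id: "balancer" },
+                       { $set: { stopped: false,
+                                 activeWindow: { start: "23:00", stop: "6:00" } } },
+                       true )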