diff --git a/draft/faq-sharding-addition.txt b/draft/faq-sharding-addition.txt
index 7d2986f1249..f6c0b60c08c 100644
--- a/draft/faq-sharding-addition.txt
+++ b/draft/faq-sharding-addition.txt
@@ -1,54 +1,48 @@
-What are the best ways to successfully insert larger volumes of data into as sharded collection?
--------------------------------------------------------------------------------------------------
+For high-volume inserts, when is it necessary to first pre-split data?
+----------------------------------------------------------------------
 
-- what is pre-splitting
+Whether to pre-split before a high-volume insert depends on the
+:term:`shard key`, the existing distribution of :term:`chunks <chunk>`,
+and how evenly distributed the insert operation is.
 
-  In sharded environments, MongoDB distributes data into :term:`chunks
-  <chunk>`, each defined by a range of shard key values. Pre-splitting is a command run
-  prior to data insertion that specifies the shard key values on which to split up chunks.
+In the following cases, we recommend pre-splitting before a large insert:
 
-- Pre-splitting is useful before large inserts into a sharded collection when:
+- Inserting data into an empty collection
 
-1. inserting data into an empty collection
+  If a collection is empty, the database takes time to determine the
+  optimal key distribution. If you insert many documents in rapid
+  succession, MongoDB initially directs writes to a single chunk, which
+  can affect performance. Predefining splits improves write performance
+  in the early stages of a bulk import by eliminating the database's
+  "learning" period.
 
-If a collection is empty, the database takes time to determine the optimal key
-distribution. If you insert many documents in rapid succession, MongoDB will initially
-direct writes to a single chunk, potentially having significant impacts on performance.
-Predefining splits may improve write performance in the early stages of a bulk import by
-eliminating the database's "learning" period.
 
+- Data is not evenly distributed
 
-2. data is not evenly distributed
+  Even if the sharded collection contains existing documents balanced
+  over multiple chunks, :term:`pre-splitting` is beneficial if the write
+  operation itself isn't evenly distributed, i.e., if the inserts
+  include shard-key values that are contained on only a small number of
+  chunks. Pre-splitting those chunks can prevent the writes from
+  monopolizing a single :term:`shard`.
 
-Even if the sharded collection was previously populated with documents and contains multiple
-chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed, in
-other words, if the inserts have shard keys values contained on a small number of chunks.
 
+- Monotonically increasing shard key
 
-3. monotomically increasing shard key
+  If you attempt to insert data with monotonically increasing shard
+  keys, the writes will always occur on the last chunk in the
+  collection. Predefining splits helps to cycle a large write operation
+  around the cluster; however, pre-splitting in this instance will not
+  prevent consecutive inserts from hitting a single shard.
 
-If you attempt to insert data with monotonically increasing shard keys, the writes will
-always hit the last chunk in the collection. Predefining splits may help to cycle a large
-write operation around the cluster; however, pre-splitting in this instance will not
-prevent consecutive inserts from hitting a single shard.
+Pre-splitting might *not* be necessary in the following cases:
 
-- when does it not matter
+- If data insertion is not rapid, MongoDB may have enough time to split
+  and migrate chunks without affecting performance.
 
-If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without
-impacts on performance. In addition, if the collection already has chunks with an even key
-distribution, pre-splitting may not be necessary.
+- If the collection already has chunks with an even key distribution,
+  writes should already spread evenly, and pre-splitting adds little
+  benefit.
 
-See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more
-information.
+For more information, see :doc:`/tutorial/inserting-documents-into-a-sharded-collection`.
 
-
-Is it necessary to pre-split data before high volume inserts into a sharded collection?
-----------------------------------------------------------------------------------------
-
-The answer depends on the shard key, the existing distribution of chunks, and how
-evenly distributed the insert operation is. If a collection is empty prior to a
-bulk insert, the database will take time to determine the optimal key
-distribution. Predefining splits improves write performance in the early stages
-of a bulk import.
-
-Pre-splitting is also important if the write operation isn't evenly distributed.
-When using an increasing shard key, for example, pre-splitting data can prevent
-writes from hammering a single shard.
+.. SK, I flipped the above sentence, which could instead read:
+.. See :doc:`/tutorial/inserting-documents-into-a-sharded-collection` for more information.
+.. I prefer the former, but I think you prefer the latter. Let me know. -BG
diff --git a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt
index d0e751c7bef..d620103f068 100644
--- a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt
+++ b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt
@@ -2,160 +2,260 @@
 Inserting Documents into a Sharded Collection
 =============================================
 
-Shard Keys
-----------
+Synopsis
+--------
+
+When inserting documents into a :term:`sharded <sharding>`
+:term:`collection`, you must consider how MongoDB will distribute the
+inserted data and how the distribution will affect performance. You must
+also consider whether you first need to :term:`pre-split
+<pre-splitting>` the data. This document describes types of distribution
+and how to pre-split data.
+
+.. seealso::
+
+   - :ref:`chunk-management`
+   - :doc:`/core/sharding-internals`
+   - :doc:`/core/sharding`
+
+.. _sharding-distribution-types:
 
-.. TODO
+Types of Distribution
+---------------------
 
-   outline the ways that insert operations work given shard keys of the following types
+How MongoDB distributes inserted data depends on the following factors:
+the distribution of the existing :term:`chunks <chunk>` in the
+:term:`collection`, whether split points have been defined for the data
+(i.e., whether the data is :term:`pre-split <pre-splitting>`), the
+distribution of the inserted data, and the volume of the inserted data.
+
+Depending on these factors, MongoDB distributes the inserted data in one
+of three ways:
+
+- MongoDB distributes the data evenly around the cluster. For details
+  see :ref:`sharding-even-distribution`.
 
+- MongoDB distributes data unevenly around the cluster. For details see
+  :ref:`sharding-uneven-distribution`.
+
+- MongoDB inserts all data into the last chunk in the cluster. For
+  details see :ref:`sharding-monotonic-distribution`.
+
+.. _sharding-even-distribution:
 
 Even Distribution
 ~~~~~~~~~~~~~~~~~
 
-If the data's distribution of keys is evenly, MongoDB should be able to distribute writes
-evenly around a the cluster once the chunk key ranges are established. MongoDB will
-automatically split chunks when they grow to a certain size (~64 MB by default) and will
-balance the number of chunks across shards.
+In even distribution, MongoDB balances writes among all :term:`chunks
+<chunk>` and :term:`shards <shard>` in the cluster. Even distribution
+provides the best performance and occurs in the following cases:
+
+- The insert data has been :term:`pre-split <pre-splitting>`, i.e., you
+  have specified in advance the :term:`shard key` values on which to
+  split chunks. When you insert the data, MongoDB uses these split
+  points to distribute writes evenly. It does not matter whether the
+  existing :term:`collection` is evenly distributed. For details on
+  pre-splitting data, see the :ref:`sharding-pre-splitting` section
+  below.
+
+- The :term:`sharded <sharding>` collection contains existing documents
+  balanced over multiple chunks *and* the inserted data is either low
+  volume or already evenly distributed. If the data is large and
+  unevenly distributed, the write operation becomes imbalanced and
+  monopolizes certain chunks.
 
-When inserting data into a new collection, it may be important to pre-split the key ranges.
-See the section below on pre-splitting and manually moving chunks.
+.. _sharding-uneven-distribution:
 
 Uneven Distribution
 ~~~~~~~~~~~~~~~~~~~
 
-If you need to insert data into a collection that isn't evenly distributed, or if the shard
-keys of the data being inserted aren't evenly distributed, you may need to pre-split your
-data before doing high volume inserts. As an example, consider a collection sharded by last
-name with the following key distribution:
+In uneven distribution, MongoDB focuses write operations on a small
+number of :term:`chunks <chunk>` instead of balancing writes across
+chunks. This increases the likelihood that chunks will fill and that
+MongoDB must move chunks between :term:`shards <shard>`, an operation
+that slows performance.
 
-(-∞, "Henri")
-["Henri", "Peters")
-["Peters",+∞)
+To avoid uneven distribution, :term:`pre-split <pre-splitting>` your
+data, as described in the :ref:`sharding-pre-splitting` section below.
 
-Although the chunk ranges may be split evenly, inserting lots of users with with a common
-last name such as "Smith" will potentially hammer a single shard. Making the chunk range
-more granular in this portion of the alphabet may improve write performance.
+Uneven distribution occurs in the following cases:
 
-(-∞, "Henri")
-["Henri", "Peters")
-["Peters", "Smith"]
-["Smith", "Tyler"]
-["Tyler",+∞)
+- You insert a large volume of data that is not evenly distributed. Even
+  if the :term:`sharded cluster` contains existing documents balanced
+  over multiple chunks, the inserted data might include values that
+  write disproportionately to a small number of chunks.
 
-Monotonically Increasing Values
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+- The collection is empty, *and* the inserted data is not evenly
+  distributed. MongoDB fills one chunk before creating the next and
+  eventually must rebalance and move chunks between shards, slowing
+  performance.
 
-.. what will happen if you try to do inserts.
+- Neither the collection nor the inserted data is evenly distributed.
+  MongoDB fills certain chunks too soon and eventually must rebalance
+  and move chunks between shards, slowing performance.
 
-Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always
-be inserted into the last chunk in a collection. To illustrate why, consider a sharded
-collection with two chunks, the second of which has an unbounded upper limit.
+.. _sharding-monotonic-distribution:
 
-(-∞, 100)
-[100,+∞)
+All Writes Go to Last Chunk (Monotonic)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-If the data being inserted has an increasing key, at any given time writes will always hit
-the shard containing the chunk with the unbounded upper limit, a problem that is not
-alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the
-cluster's performance by placing a significant load on a single shard.
+When you insert documents with monotonically increasing
+:term:`shard keys <shard key>`, such as the BSON ObjectID, MongoDB
+inserts all data into the last :term:`chunk` in the collection.
 
-If, however, a single shard can handle the write volume, an increasing shard key may have
-some advantages. For example, if you need to do queries based on document insertion time,
-sharding on the ObjectID ensures that documents created around the same time exist on the
-same shard. Data locality helps to improve query performance.
+For example, consider a :term:`sharded <sharding>` collection with
+two chunks, the second of which has an unbounded upper limit. The "("
+and ")" symbols indicate a non-inclusive value. The "[" and "]" symbols
+indicate an inclusive value:
 
-If you decide to use an monotonically increasing shard key and anticipate large inserts,
-one solution may be to store the hash of the shard key as a separate field. Hashing may
-prevent the need to balance chunks by distributing data equally around the cluster. You can
-create a hash client-side. In the future, MongoDB may support automatic hashing:
-https://jira.mongodb.org/browse/SERVER-2001
+.. code-block:: sh
 
-Operations
-----------
+   (-infinity, 100)
+   [100, +infinity)
+
+If the data being inserted has an increasing key, then writes always go
+to the shard containing the chunk with the unbounded upper limit.
+
+Inserting into the last chunk can hinder the cluster's performance by
+placing a significant load on a single :term:`shard`.
+
+If, however, a single shard can handle the write volume, an increasing
+shard key may have some advantages. For example, if you need to do
+queries based on document-insertion time, sharding on the ObjectID
+ensures that documents created around the same time exist on the same
+shard. Data locality helps to improve query performance.
+
+If you decide to use a monotonically increasing shard key and
+anticipate large inserts, one solution may be to store the hash of the
+shard key as a separate field. Hashing may prevent the need to balance
+chunks by distributing data equally around the cluster. You can create a
+hash client-side.
 
-.. TODO
+.. _sharding-pre-splitting:
 
-   outline the procedures and rationale for each process.
+Operation
+---------
 
 Pre-Splitting
 ~~~~~~~~~~~~~
 
-.. when to do this
-.. procedure for this process
+Pre-splitting is the process of specifying :term:`shard key` ranges for
+:term:`chunks <chunk>` prior to data insertion in order to ensure
+MongoDB distributes the data evenly around the cluster. You should
+consider pre-splitting if:
 
-Pre-splitting is the process of specifying shard key ranges for chunks prior to data
-insertion. This process may be important prior to large inserts into a sharded collection
-as a way of ensuring that the write operation is evenly spread around the cluster. You
-should consider pre-splitting before large inserts if the sharded collection is empty, if
-the collection's data or the data being inserted is not evenly distributed, or if the shard
-key is monotonically increasing.
+- You are doing high volume inserts.
 
-In the example below the pre-split command splits the chunk where the _id 99 would reside
-using that key as the split point. Note that a key need not exist for a chunk to use it in
-its range. The chunk may even be empty.
+- The :term:`sharded <sharding>` collection is empty.
 
-The first step is to create a sharded collection to contain the data, which can be done in
-three steps:
+- Either the collection's data or the data being inserted is not evenly
+  distributed.
 
-> use admin
-> db.runCommand({ enableSharding : "foo" })
+- The shard key is monotonically increasing.
 
-Next, we add a unique index to the collection "foo.bar" which is required for the shard
-key.
+For more information on how MongoDB distributes inserted data, see
+:ref:`sharding-distribution-types`.
 
-> use foo
-> db.bar.ensureIndex({ _id : 1 }, { unique : true })
+As an example of when you might choose to pre-split, consider a
+collection sharded by last name with the following key distribution. The
+"(" and ")" symbols indicate a non-inclusive value. The "[" and "]"
+symbols indicate an inclusive value:
 
-Finally we shard the collection (which contains no data) using the _id value.
+.. code-block:: sh
 
-> use admin
-switched to db admin
-> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )
+   ["A", "Jones")
+   ["Jones", "Smith")
+   ["Smith", "zzz")
 
-Once the key range is specified, chunks can be moved around the cluster using the moveChunk
-command.
+Although the chunk ranges in the collection may be evenly split,
+inserting a large number of new users with a common last name, such as
+"Smith" or "Jones", will write disproportionately to a single shard,
+concentrating writes and reads on that shard.
 
-> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } )
+To avoid this situation, you would make the chunk range more granular:
 
-You can repeat these steps as many times as necessary to create or move chunks around the
-cluster. To get information about the two chunks created in this example:
+.. code-block:: sh
 
-> db.printShardingStatus()
----- Sharding Status ---
-  sharding version: { "_id" : 1, "version" : 3 }
-  shards:
-      { "_id" : "shard0000", "host" : "localhost:30000" }
-      { "_id" : "shard0001", "host" : "localhost:30001" }
-  databases:
-      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
-      { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
-      test.foo chunks:
-          shard0001    1
-          shard0000    1
-          { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
-          { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }
+   ["A", "Jones")
+   ["Jones", "Parker")
+   ["Parker", "Smith")
+   ["Smith", "Tyler")
+   ["Tyler", "zzz")
+
+Procedures
+----------
+
+Pre-Splitting
+~~~~~~~~~~~~~
+
+This procedure assumes:
+
+- You have a database "test".
+
+- The "test" database has a collection "foo".
+
+- The "foo" collection has a unique index on _id, which is required for
+  the shard key.
+
+- The "foo" collection, which contains no data, is sharded on _id, and
+  you will split its chunk at the value _id : 99.
+
+Note that the key value used as a split point need not exist in the
+data for a chunk to use it in its range. The chunk may even be empty.
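+
+The following is a minimal sketch of the commands you might run before
+the ``moveChunk`` step below, based on the steps from the earlier draft
+of this tutorial. It uses the "test" database and empty "foo" collection
+assumed above; the names and the split value _id : 99 are examples only:
+
+.. code-block:: javascript
+
+   // From a mongos: enable sharding on the database and shard the
+   // empty collection on its _id field.
+   use admin
+   db.runCommand( { enableSharding : "test" } )
+   db.runCommand( { shardCollection : "test.foo" , key : { _id : 1 } } )
+
+   // Split the chunk where the value _id : 99 would reside, using
+   // that value as the split point.
+   db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )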
+
+Once the key range is specified, chunks can be moved around the cluster
+using the moveChunk command.
+
+.. code-block:: javascript
+
+   db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } )
+
+You can repeat these steps as many times as necessary to create or move
+chunks around the cluster. To get information about the two chunks
+created in this example:
+
+.. code-block:: javascript
+
+   db.printShardingStatus()
+   --- Sharding Status ---
+     sharding version: { "_id" : 1, "version" : 3 }
+     shards:
+         { "_id" : "shard0000", "host" : "localhost:30000" }
+         { "_id" : "shard0001", "host" : "localhost:30001" }
+     databases:
+         { "_id" : "admin", "partitioned" : false, "primary" : "config" }
+         { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
+         test.foo chunks:
+             shard0001    1
+             shard0000    1
+             { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
+             { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }
 
 Once the chunks and the key ranges are evenly distributed, you can
 proceed with a high volume insert.
 
-Changing Shard Key
+.. _sharding-changing-shard-key:
+
+Changing Shard Key
 ~~~~~~~~~~~~~~~~~~
 
-There is no automatic support for changing the shard key for a collection. In addition,
-since a document's location within the cluster is determined by its shard key value,
-changing the shard key could force data to move from machine to machine, potentially a
-highly expensive operation.
+There is no automatic support for changing the shard key for a
+collection. In addition, since a document's location within the cluster
+is determined by its shard key value, changing the shard key could force
+data to move from machine to machine, potentially a highly expensive
+operation.
 
 Thus it is very important to choose the right shard key up front.
 
-If you do need to change a shard key, an export and import is likely the best solution.
-Create a new pre-sharded collection, and then import the exported data to it. If desired
-use a dedicated mongos for the export and the import.
+If you do need to change a shard key, an export and import is likely the
+best solution. Create a new pre-sharded collection, and then import the
+exported data to it. If desired, use a dedicated :program:`mongos` for
+the export and the import.
+
+.. :issue:`SERVER-4000`
 
-https://jira.mongodb.org/browse/SERVER-4000
+.. _sharding-pre-allocating-documents:
 
 Pre-allocating Documents
 ~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/#pre-allocate
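+
+Pre-allocation means inserting a document at its full eventual size so
+that later in-place updates do not grow the document and force MongoDB
+to move it within a shard. The use case linked in the comment above
+describes this pattern for pre-aggregated reports. The following is a
+minimal sketch of the pattern; the collection name and field layout are
+illustrative only:
+
+.. code-block:: javascript
+
+   // Hypothetical example: pre-allocate a day's worth of hourly
+   // counters, all initialized to zero.
+   var hourly = {};
+   for ( var h = 0; h < 24; h++ ) {
+       hourly[h] = 0;
+   }
+   db.stats.insert( { _id : "site-1/20120101" , hourly : hourly } );
+
+   // Later writes update fields that already exist, so the document
+   // does not grow:
+   db.stats.update( { _id : "site-1/20120101" } ,
+                    { $inc : { "hourly.5" : 1 } } );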