From ba2a824965fb5b277e622cb46fe376fcc9fa26ad Mon Sep 17 00:00:00 2001
From: Bob Grabar
Date: Wed, 29 Aug 2012 14:01:31 -0400
Subject: [PATCH 1/4] DOCS-458 addition for faq sharding

---
 draft/faq-sharding-addition.txt | 76 +++++++++++++++------------------
 1 file changed, 35 insertions(+), 41 deletions(-)

diff --git a/draft/faq-sharding-addition.txt b/draft/faq-sharding-addition.txt
index 7d2986f1249..f6c0b60c08c 100644
--- a/draft/faq-sharding-addition.txt
+++ b/draft/faq-sharding-addition.txt
@@ -1,54 +1,48 @@
-What are the best ways to successfully insert larger volumes of data into as sharded collection?
--------------------------------------------------------------------------------------------------
+For high-volume inserts, when is it necessary to first pre-split data?
+----------------------------------------------------------------------

-- what is pre-splitting
+Whether to pre-split before a high-volume insert depends on the
+:term:`shard key`, the existing distribution of :term:`chunks `,
+and how evenly distributed the insert operation is.

-  In sharded environments, MongoDB distributes data into :term:`chunks
-  `, each defined by a range of shard key values. Pre-splitting is a command run
-  prior to data insertion that specifies the shard key values on which to split up chunks.
+In the following cases, we recommend pre-splitting before a large insert:

-- Pre-splitting is useful before large inserts into a sharded collection when:
+- Inserting data into an empty collection

-1. inserting data into an empty collection
+  If a collection is empty, the database takes time to determine the
+  optimal key distribution. If you insert many documents in rapid
+  succession, MongoDB initially directs writes to a single chunk, which
+  can affect performance. Predefining splits improves write performance
+  in the early stages of a bulk import by eliminating the database's
+  "learning" period.

-If a collection is empty, the database takes time to determine the optimal key
-distribution. If you insert many documents in rapid succession, MongoDB will initially
-direct writes to a single chunk, potentially having significant impacts on performance.
-Predefining splits may improve write performance in the early stages of a bulk import by
-eliminating the database's "learning" period.
+- Data is not evenly distributed

-2. data is not evenly distributed
+  Even if the sharded collection contains existing documents balanced
+  over multiple chunks, :term:`pre-splitting` is beneficial if the write
+  operation itself isn't evenly distributed, i.e., if the inserts
+  include shard-key values that are contained on only a small number of
+  chunks. By pre-splitting and using an increasing shard key, you can
+  prevent writes from monopolizing a single :term:`shard`.

-Even if the sharded collection was previously populated with documents and contains multiple
-chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed, in
-other words, if the inserts have shard keys values contained on a small number of chunks.
+- Monotonically increasing shard key

-3. monotomically increasing shard key
+  If you attempt to insert data with monotonically increasing shard
+  keys, the writes will always occur on the last chunk in the
+  collection. Predefining splits helps to cycle a large write operation
+  around the cluster; however, pre-splitting in this instance will not
+  prevent consecutive inserts from hitting a single shard (a manual
+  pre-split is sketched below).
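+
+For illustration, a manual pre-split might look like the following
+sketch. The ``records.people`` namespace, the ``zipcode`` shard key,
+and the split point are hypothetical:
+
+.. code-block:: javascript
+
+   use admin
+   // define a chunk boundary at a chosen shard key value
+   db.runCommand( { split : "records.people" , middle : { zipcode : "47591" } } )
+   // move the chunk containing that value to a specific shard
+   db.runCommand( { moveChunk : "records.people" , find : { zipcode : "47591" } ,
+                    to : "shard0001" } )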
-If you attempt to insert data with monotonically increasing shard keys, the writes will -always hit the last chunk in the collection. Predefining splits may help to cycle a large -write operation around the cluster; however, pre-splitting in this instance will not -prevent consecutive inserts from hitting a single shard. +Pre-splitting might *not* be necessary in the following cases: -- when does it not matter +- If data insertion is not rapid, MongoDB may have enough time to split + and migrate chunks without affecting performance. -If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without -impacts on performance. In addition, if the collection already has chunks with an even key -distribution, pre-splitting may not be necessary. +- If the collection already has chunks with an even key distribution, + pre-splitting may not be necessary. -See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more -information. +For more information, see :doc:`/tutorial/inserting-documents-into-a-sharded-collection`. - -Is it necessary to pre-split data before high volume inserts into a sharded collection? ---------------------------------------------------------------------------------------- - -The answer depends on the shard key, the existing distribution of chunks, and how -evenly distributed the insert operation is. If a collection is empty prior to a -bulk insert, the database will take time to determine the optimal key -distribution. Predefining splits improves write performance in the early stages -of a bulk import. - -Pre-splitting is also important if the write operation isn't evenly distributed. -When using an increasing shard key, for example, pre-splitting data can prevent -writes from hammering a single shard. +.. SK, I flipped the above sentence, which could instead read: +.. See :doc:`/tutorial/inserting-documents-into-a-sharded-collection` for more information. +.. I prefer the former, but I think you prefer the latter. Let me know. -BG From 8ecece2593ca78a597b93e3bd42b8dc864b7a7d4 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Wed, 29 Aug 2012 17:56:59 -0400 Subject: [PATCH 2/4] DOCS-458 ongoing edits to pre-splitting doc --- ...ng-documents-into-a-sharded-collection.txt | 234 ++++++++++++------ 1 file changed, 152 insertions(+), 82 deletions(-) diff --git a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt index d0e751c7bef..5caa0a26e95 100644 --- a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -2,83 +2,112 @@ Inserting Documents into a Sharded Collection ============================================= -Shard Keys ----------- +Types of Distribution +--------------------- -.. TODO +MongoDB distributes inserted data in one of three ways, depending on a +combination of :term:`shard keys `, the distribution of +existing :term:`chunks `, and the distribution and volume of the +inserted data. MongoDB distributes data on one of the following ways: - outline the ways that insert operations work given shard keys of the following types +- MongoDB distributes the data evenly around the cluster. For details + see :ref:`sharding-even-distribution`. +- MongoDB directs writes unevenly or directs all writes to a single + chunk. For details see :ref:`sharding-uneven-distribution`. +- MongoDB inserts the data into the last chunk in the cluster. 
For + details see :ref:`sharding-monotonically-increasing-keys`. +.. _sharding-even-distribution: Even Distribution ~~~~~~~~~~~~~~~~~ -If the data's distribution of keys is evenly, MongoDB should be able to distribute writes -evenly around a the cluster once the chunk key ranges are established. MongoDB will -automatically split chunks when they grow to a certain size (~64 MB by default) and will -balance the number of chunks across shards. +MongoDB distributes writes evenly around the cluster in the following +cases: + +- Data has been pre-split with evenly distributed :term:`shard keys `, + which MongoDB then establishes and then uses to distribute writes + evenly. For details on pre-splitting data with evenly distributed + keys, see the :ref:`sharding-pre-splitting` section below. + +- The sharded collection contains existing documents balanced over + multiple chunks *and* the inserted data is either low volume or evenly + distributed. + +In even distributions, MongoDB automatically splits chunks when they +grow to a certain size (~64 MB by default) and automatically balances +chunks across shards, allowing no more than eight chunks on a shard. -When inserting data into a new collection, it may be important to pre-split the key ranges. -See the section below on pre-splitting and manually moving chunks. +.. _sharding-uneven-distribution: Uneven Distribution ~~~~~~~~~~~~~~~~~~~ -If you need to insert data into a collection that isn't evenly distributed, or if the shard -keys of the data being inserted aren't evenly distributed, you may need to pre-split your -data before doing high volume inserts. As an example, consider a collection sharded by last -name with the following key distribution: +MongoDB writes data unevenly in the following cases. To avoid each of +these cases, pre-split your data, as described the +:ref:`sharding-pre-splitting` section below: -(-∞, "Henri") -["Henri", "Peters") -["Peters",+∞) +- You insert a large volume of data that isn't evenly distributed. In + this case, even if the sharded cluster contains existing documents + balanced over multiple chunks, the inserts might include values that + are contained only on a small number of chunks. -Although the chunk ranges may be split evenly, inserting lots of users with with a common -last name such as "Smith" will potentially hammer a single shard. Making the chunk range -more granular in this portion of the alphabet may improve write performance. +- The collection is empty, and the data is not evenly distributed. + MongoDB fills one chunk before creating the next and maximizes the + number of chunks on a shard (8), before creating a new shard. -(-∞, "Henri") -["Henri", "Peters") -["Peters", "Smith"] -["Smith", "Tyler"] -["Tyler",+∞) +- Neither the collection nor the data are evenly distributed. MongoDB + might write to chunks unevenly. -Monotonically Increasing Values -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- The sharded collection contains existing documents balanced over + multiple chunks *and* the inserted data is either low volume or evenly + distributed. -.. what will happen if you try to do inserts. -Documents with monotonically increasing shard keys, such as the BSON ObjectID, will always -be inserted into the last chunk in a collection. To illustrate why, consider a sharded -collection with two chunks, the second of which has an unbounded upper limit. +.. 
_sharding-monotonically-increasing-keys: -(-∞, 100) -[100,+∞) - -If the data being inserted has an increasing key, at any given time writes will always hit -the shard containing the chunk with the unbounded upper limit, a problem that is not -alleviated by splitting the "hot" chunk. High volume inserts, therefore, could hinder the -cluster's performance by placing a significant load on a single shard. +Monotonically Increasing Values +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -If, however, a single shard can handle the write volume, an increasing shard key may have -some advantages. For example, if you need to do queries based on document insertion time, -sharding on the ObjectID ensures that documents created around the same time exist on the -same shard. Data locality helps to improve query performance. +.. what will happen if you try to do inserts. -If you decide to use an monotonically increasing shard key and anticipate large inserts, -one solution may be to store the hash of the shard key as a separate field. Hashing may -prevent the need to balance chunks by distributing data equally around the cluster. You can -create a hash client-side. In the future, MongoDB may support automatic hashing: +Documents with monotonically increasing shard keys, such as the BSON +ObjectID, will always be inserted into the last chunk in a collection. +To illustrate why, consider a sharded collection with two chunks, the +second of which has an unbounded upper limit. + +(-∞, 100) +[100, +∞) + +If the data being inserted has an increasing key, at any given time +writes will always hit the shard containing the chunk with the unbounded +upper limit, a problem that is not alleviated by splitting the "hot" +chunk. High volume inserts, therefore, could hinder the cluster's +performance by placing a significant load on a single shard. + +If, however, a single shard can handle the write volume, an increasing +shard key may have some advantages. For example, if you need to do +queries based on document insertion time, sharding on the ObjectID +ensures that documents created around the same time exist on the same +shard. Data locality helps to improve query performance. + +If you decide to use an monotonically increasing shard key and +anticipate large inserts, one solution may be to store the hash of the +shard key as a separate field. Hashing may prevent the need to balance +chunks by distributing data equally around the cluster. You can create a +hash client-side. In the future, MongoDB may support automatic hashing: https://jira.mongodb.org/browse/SERVER-2001 Operations ---------- -.. TODO +.. TODO + +.. outline the procedures and rationale for each process. - outline the procedures and rationale for each process. +.. _sharding-pre-splitting: Pre-Splitting ~~~~~~~~~~~~~ @@ -86,42 +115,74 @@ Pre-Splitting .. when to do this .. procedure for this process -Pre-splitting is the process of specifying shard key ranges for chunks prior to data -insertion. This process may be important prior to large inserts into a sharded collection -as a way of ensuring that the write operation is evenly spread around the cluster. You -should consider pre-splitting before large inserts if the sharded collection is empty, if -the collection's data or the data being inserted is not evenly distributed, or if the shard -key is monotonically increasing. +Pre-splitting is the process of specifying shard key ranges for chunks +prior to data insertion. Pre-splitting ensures the write operation is +evenly spread around the cluster. 
You should consider pre-splitting if: + +- You are doing high volume inserts. -In the example below the pre-split command splits the chunk where the _id 99 would reside -using that key as the split point. Note that a key need not exist for a chunk to use it in -its range. The chunk may even be empty. +- The sharded collection is empty. -The first step is to create a sharded collection to contain the data, which can be done in -three steps: +- Either the collection's data or the data being inserted is not evenly distributed. + +- The shard key is monotonically increasing. + +As an example of pre-splitting, consider a collection sharded by last +name with the following key distribution. The "(" and ")" symbols +indicate a non-inclusive value. The "[" and "]" symbols indicate an +inclusive value: + +["A", "Jones") +["Jones", "Smith") +["Smith", "zzz") + +Although the chunk ranges may be split evenly, inserting lots of users +with with a common last name, such as "Jones" or "Smith", will +potentially monopolize a single shard. Making the chunk range more +granular in these portions of the alphabet may improve write +performance. + +["A", "Jones") +["Jones", "Peters") +["Peters", "Smith") +["Smith", "Tyler") +["Tyler", "zzz"] + +Procedure +--------- + +In the example below the pre-split command splits the chunk where the +_id 99 would reside using that key as the split point. Note that a key +need not exist for a chunk to use it in its range. The chunk may even be +empty. + +The first step is to create a sharded collection to contain the data, +which can be done in three steps: > use admin > db.runCommand({ enableSharding : "foo" }) -Next, we add a unique index to the collection "foo.bar" which is required for the shard -key. +Next, we add a unique index to the collection "foo.bar" which is +required for the shard key. > use foo > db.bar.ensureIndex({ _id : 1 }, { unique : true }) -Finally we shard the collection (which contains no data) using the _id value. +Finally we shard the collection (which contains no data) using the _id +value. > use admin switched to db admin > db.runCommand( { split : "test.foo" , middle : { _id : 99 } } ) -Once the key range is specified, chunks can be moved around the cluster using the moveChunk -command. +Once the key range is specified, chunks can be moved around the cluster +using the moveChunk command. > db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } ) -You can repeat these steps as many times as necessary to create or move chunks around the -cluster. To get information about the two chunks created in this example: +You can repeat these steps as many times as necessary to create or move +chunks around the cluster. To get information about the two chunks +created in this example: > db.printShardingStatus() --- Sharding Status --- @@ -130,32 +191,41 @@ cluster. 
To get information about the two chunks created in this example: { "_id" : "shard0000", "host" : "localhost:30000" } { "_id" : "shard0001", "host" : "localhost:30001" } databases: - { "_id" : "admin", "partitioned" : false, "primary" : "config" } - { "_id" : "test", "partitioned" : true, "primary" : "shard0001" } - test.foo chunks: - shard0001 1 - shard0000 1 - { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 } - { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 } + { "_id" : "admin", "partitioned" : false, "primary" : "config" } + { "_id" : "test", "partitioned" : true, "primary" : "shard0001" } + test.foo chunks: + shard0001 1 + shard0000 1 + { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 } + { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 } Once the chunks and the key ranges are evenly distributed, you can proceed with a high volume insert. -Changing Shard Key +.. _sharding-changing-shard-key: + +Changing Shard Key ~~~~~~~~~~~~~~~~~~ -There is no automatic support for changing the shard key for a collection. In addition, -since a document's location within the cluster is determined by its shard key value, -changing the shard key could force data to move from machine to machine, potentially a -highly expensive operation. +There is no automatic support for changing the shard key for a +collection. In addition, since a document's location within the cluster +is determined by its shard key value, changing the shard key could force +data to move from machine to machine, potentially a highly expensive +operation. Thus it is very important to choose the right shard key up front. -If you do need to change a shard key, an export and import is likely the best solution. -Create a new pre-sharded collection, and then import the exported data to it. If desired -use a dedicated mongos for the export and the import. +If you do need to change a shard key, an export and import is likely the +best solution. Create a new pre-sharded collection, and then import the +exported data to it. If desired use a dedicated mongos for the export +and the import. -https://jira.mongodb.org/browse/SERVER-4000 +.. :issue:`SERVER-4000` + +.. _sharding-pre-allocating-documents: Pre-allocating Documents ~~~~~~~~~~~~~~~~~~~~~~~~ + +.. http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/#pre-allocate + From 05ac055e19e048801aaccdb86a8a6048f288a5c2 Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Thu, 30 Aug 2012 13:43:35 -0400 Subject: [PATCH 3/4] DOCS-458 pre-split, ongoing edits --- ...ng-documents-into-a-sharded-collection.txt | 138 ++++++++++-------- 1 file changed, 80 insertions(+), 58 deletions(-) diff --git a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt index 5caa0a26e95..b57fc3dc4e2 100644 --- a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -2,109 +2,131 @@ Inserting Documents into a Sharded Collection ============================================= +Synopsis +-------- + +When inserting documents into a :term:`sharded ` +:term:`collection`, you must consider how MongoDB will distribute the +inserted data and how the distribution will affect performance. You must +also consider whether you need first to :term:`pre-split +` the data. 
This document describes types of distribution
+and how to pre-split data.
+
 Types of Distribution
 ---------------------

-MongoDB distributes inserted data in one of three ways, depending on a
-combination of :term:`shard keys `, the distribution of
-existing :term:`chunks `, and the distribution and volume of the
-inserted data. MongoDB distributes data on one of the following ways:
+MongoDB distributes inserted data in one of three ways, depending on
+the following factors: the distribution of the :term:`collection's `
+existing :term:`chunks `, the existence of
+:term:`shard keys ` on the write operation (i.e., whether
+data is :term:`pre-split `), the distribution of the
+inserted data, and the volume of the inserted data.
+
+Depending on the above, MongoDB distributes inserted data in one of the
+following ways:

 - MongoDB distributes the data evenly around the cluster. For details
   see :ref:`sharding-even-distribution`.

-- MongoDB directs writes unevenly or directs all writes to a single
-  chunk. For details see :ref:`sharding-uneven-distribution`.
+- MongoDB distributes data unevenly around the cluster. For details see
+  :ref:`sharding-uneven-distribution`.

-- MongoDB inserts the data into the last chunk in the cluster. For
-  details see :ref:`sharding-monotonically-increasing-keys`.
+- MongoDB inserts all data into the last chunk in the cluster. For
+  details see :ref:`sharding-monotonic-distribution`.

 .. _sharding-even-distribution:

 Even Distribution
 ~~~~~~~~~~~~~~~~~

-MongoDB distributes writes evenly around the cluster in the following
-cases:
-
-- Data has been pre-split with evenly distributed :term:`shard keys `,
-  which MongoDB then establishes and then uses to distribute writes
-  evenly. For details on pre-splitting data with evenly distributed
-  keys, see the :ref:`sharding-pre-splitting` section below.
+In even distribution, MongoDB balances writes among all :term:`chunks `
+and :term:`shards` in the cluster. Even distribution provides the best
+performance and occurs in the following cases:

-- The sharded collection contains existing documents balanced over
-  multiple chunks *and* the inserted data is either low volume or evenly
-  distributed.
+- The inserted data has been :term:`pre-split ` by specifying
+  :term:`shard keys ` in the write operation. When you insert
+  the data, MongoDB uses the keys to distribute writes evenly. It does
+  not matter whether the existing :term:`collection` is evenly distributed. For
+  details on pre-splitting data, see the :ref:`sharding-pre-splitting`
+  section below.

-In even distributions, MongoDB automatically splits chunks when they
-grow to a certain size (~64 MB by default) and automatically balances
-chunks across shards, allowing no more than eight chunks on a shard.
+- The :term:`sharded ` collection contains existing documents
+  balanced over multiple chunks *and* the inserted data is either low
+  volume or already evenly distributed. If the data is large and
+  unevenly distributed, the write operation becomes imbalanced and
+  monopolizes certain chunks.

 .. _sharding-uneven-distribution:

 Uneven Distribution
 ~~~~~~~~~~~~~~~~~~~

-MongoDB writes data unevenly in the following cases. To avoid each of
-these cases, pre-split your data, as described the
-:ref:`sharding-pre-splitting` section below:
+In uneven distribution, MongoDB focuses write operations on a small
+number of :term:`chunks ` instead of balancing writes across
+chunks.
This increases the likelihood that chunks will fill and that
+MongoDB must move chunks between :term:`shards`, an operation that slows
+performance.
+
+To avoid uneven distribution, :term:`pre-split` your data, as described
+in the :ref:`sharding-pre-splitting` section below.

-- You insert a large volume of data that isn't evenly distributed. In
-  this case, even if the sharded cluster contains existing documents
-  balanced over multiple chunks, the inserts might include values that
-  are contained only on a small number of chunks.
+Uneven distribution occurs in the following cases:

-- The collection is empty, and the data is not evenly distributed.
-  MongoDB fills one chunk before creating the next and maximizes the
-  number of chunks on a shard (8), before creating a new shard.
+- You insert a large volume of data that is not evenly distributed. Even
+  if the :term:`sharded cluster ` contains existing
+  documents balanced over multiple chunks, the inserted data might
+  include values that write disproportionately to a small number of
+  chunks.

-- Neither the collection nor the data are evenly distributed. MongoDB
-  might write to chunks unevenly.
+- The collection is empty, *and* the inserted data is not evenly
+  distributed. MongoDB fills one chunk before creating the next and
+  eventually must rebalance and move chunks between shards. Moving
+  chunks between shards affects performance.

-- The sharded collection contains existing documents balanced over
-  multiple chunks *and* the inserted data is either low volume or evenly
-  distributed.
+- Neither the collection nor the inserted data are evenly distributed.
+  MongoDB fills certain chunks too soon and eventually must rebalance
+  and move chunks between shards. Moving chunks between shards affects
+  performance.

+.. _sharding-monotonic-distribution:

-.. _sharding-monotonically-increasing-keys:
+All Inserts are to Last Chunk (Monotonic)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Monotonically Increasing Values
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When you insert documents with monotonically increasing
+:term:`shard keys `, such as the BSON ObjectID, MongoDB
+inserts all data into the last :term:`chunk` in the collection.

-.. what will happen if you try to do inserts.
+For an example, consider a :term:`sharded ` collection with
+two chunks, the second of which has an unbounded upper limit. The "("
+and ")" symbols indicate a non-inclusive value. The "[" and "]" symbols
+indicate an inclusive value:

-Documents with monotonically increasing shard keys, such as the BSON
-ObjectID, will always be inserted into the last chunk in a collection.
-To illustrate why, consider a sharded collection with two chunks, the
-second of which has an unbounded upper limit.
+(-infinity, 100)
+[100, +infinity)

-(-∞, 100)
-[100, +∞)
+If the data being inserted has an increasing key, then writes always
+go to the shard containing the chunk with the unbounded upper limit.

-If the data being inserted has an increasing key, at any given time
-writes will always hit the shard containing the chunk with the unbounded
-upper limit, a problem that is not alleviated by splitting the "hot"
-chunk. High volume inserts, therefore, could hinder the cluster's
-performance by placing a significant load on a single shard.
+Inserting to the last chunk can hinder the cluster's performance by placing a
+significant load on a single :term:`shard `.

 If, however, a single shard can handle the write volume, an increasing
 shard key may have some advantages.
For example, if you need to do -queries based on document insertion time, sharding on the ObjectID +queries based on document-insertion time, sharding on the ObjectID ensures that documents created around the same time exist on the same shard. Data locality helps to improve query performance. -If you decide to use an monotonically increasing shard key and +If you decide to use a monotonically increasing shard key and anticipate large inserts, one solution may be to store the hash of the shard key as a separate field. Hashing may prevent the need to balance chunks by distributing data equally around the cluster. You can create a -hash client-side. In the future, MongoDB may support automatic hashing: -https://jira.mongodb.org/browse/SERVER-2001 +hash client-side. Operations ---------- .. TODO - .. outline the procedures and rationale for each process. .. _sharding-pre-splitting: @@ -112,10 +134,10 @@ Operations Pre-Splitting ~~~~~~~~~~~~~ -.. when to do this +.. TODO .. procedure for this process -Pre-splitting is the process of specifying shard key ranges for chunks +Pre-splitting is the process of specifying shard key ranges for :term:`chunks ` prior to data insertion. Pre-splitting ensures the write operation is evenly spread around the cluster. You should consider pre-splitting if: From 183d102176d350c02328ca0973cf3a4cb240f08a Mon Sep 17 00:00:00 2001 From: Bob Grabar Date: Thu, 30 Aug 2012 17:52:10 -0400 Subject: [PATCH 4/4] DOCS-458 draft pre-splitting document --- ...ng-documents-into-a-sharded-collection.txt | 158 +++++++++--------- 1 file changed, 83 insertions(+), 75 deletions(-) diff --git a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt index b57fc3dc4e2..d620103f068 100644 --- a/draft/tutorial/inserting-documents-into-a-sharded-collection.txt +++ b/draft/tutorial/inserting-documents-into-a-sharded-collection.txt @@ -12,6 +12,14 @@ also consider whether you need first to :term:`pre-split ` the data. This document describes types of distribution and how to pre-split data. +.. seealso:: + + - :ref:`chunk-management` + - :doc:`/core/sharding-internals` + - :doc:`/core/sharding` + +.. _sharding-distribution-types: + Types of Distribution --------------------- @@ -80,30 +88,31 @@ Uneven distribution occurs in the following cases: - The collection is empty, *and* the inserted data is not evenly distributed. MongoDB fills one chunk before creating the next and - eventually must rebalance and move chunks between shards. Moving - chunks between shards affects performance. + eventually must rebalance and move chunks between shards, slowing + performance. - Neither the collection nor the inserted data are evenly distributed. MongoDB fills certain chunks too soon and eventually must rebalance - and move chunks between shards. Moving chunks between shards affects - performance. + and move chunks between shards, slowing performance. .. _sharding-monotonic-distribution: -All Inserts are to Last Chunk (Monotonic) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +All Writes Go to Last Chunk (Monotonic) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When you insert documents with monotonically increasing :term:`shard keys `, such as the BSON ObjectID, MongoDB inserts all data into the last :term:`chunk` in the collection. -For an example, consider a :term:`sharded ` collection with +For example, consider a :term:`sharded ` collection with two chunks, the second of which has an unbounded upper limit. 
The "(" and ")" symbols indicate a non-inclusive value. The "[" and "]" symbols indicate an inclusive value: -(-infinity, 100) -[100, +infinity) +.. code-block:: sh + + (-infinity, 100) + [100, +infinity) If the data being inserted has an increasing key, then writes always direct to the shard containing the chunk with the unbounded upper limit. @@ -123,103 +132,102 @@ shard key as a separate field. Hashing may prevent the need to balance chunks by distributing data equally around the cluster. You can create a hash client-side. -Operations ----------- - -.. TODO -.. outline the procedures and rationale for each process. - .. _sharding-pre-splitting: +Operation +--------- + Pre-Splitting ~~~~~~~~~~~~~ -.. TODO -.. procedure for this process - -Pre-splitting is the process of specifying shard key ranges for :term:`chunks ` -prior to data insertion. Pre-splitting ensures the write operation is -evenly spread around the cluster. You should consider pre-splitting if: +Pre-splitting is the process of specifying :term:`shard key` ranges for +:term:`chunks ` prior to data insertion in order to ensure +MongoDB distributes the data evenly around the cluster. You should +consider pre-splitting if: - You are doing high volume inserts. -- The sharded collection is empty. +- The :term:`sharded ` collection is empty. - Either the collection's data or the data being inserted is not evenly distributed. - The shard key is monotonically increasing. -As an example of pre-splitting, consider a collection sharded by last -name with the following key distribution. The "(" and ")" symbols -indicate a non-inclusive value. The "[" and "]" symbols indicate an -inclusive value: +For more information on how MongoDB distributes inserted data, see +:ref:`sharding-distribution-types`. -["A", "Jones") -["Jones", "Smith") -["Smith", "zzz") +As an example of when you might choose to pre-split, consider a +collection sharded by last name with the following key distribution. The +"(" and ")" symbols indicate a non-inclusive value. The "[" and "]" +symbols indicate an inclusive value: -Although the chunk ranges may be split evenly, inserting lots of users -with with a common last name, such as "Jones" or "Smith", will -potentially monopolize a single shard. Making the chunk range more -granular in these portions of the alphabet may improve write -performance. +.. code-block:: sh -["A", "Jones") -["Jones", "Peters") -["Peters", "Smith") -["Smith", "Tyler") -["Tyler", "zzz"] + ["A", "Jones") + ["Jones", "Smith") + ["Smith", "zzz") -Procedure ---------- +Although the chunk ranges in the collection may be evenly split, +inserting a large number of new users with a common last name, such as +"Smith" or "Jones", will write disproportionately to a single shard, +monopolizing writes and reads to that shard. -In the example below the pre-split command splits the chunk where the -_id 99 would reside using that key as the split point. Note that a key -need not exist for a chunk to use it in its range. The chunk may even be -empty. +To avoid this situation, you would make the chunk range more granular: -The first step is to create a sharded collection to contain the data, -which can be done in three steps: +.. code-block:: sh -> use admin -> db.runCommand({ enableSharding : "foo" }) + ["A", "Jones") + ["Jones", "Parker") + ["Parker", "Smith") + ["Smith", "Tyler") + ["Tyler", "zzz"] -Next, we add a unique index to the collection "foo.bar" which is -required for the shard key. 
+Procedures
+----------

-> use admin
-> db.runCommand({ enableSharding : "foo" })
+Pre-Splitting
+~~~~~~~~~~~~~

-Next, we add a unique index to the collection "foo.bar" which is
-required for the shard key.
+This procedure assumes:

-> use foo
-> db.bar.ensureIndex({ _id : 1 }, { unique : true })
+- You have a database "test".
+
+- The "test" database has a collection "foo".
+
+- The "foo" collection has a unique index on the shard key.
+
+- The "foo" collection, which contains no data, is sharded on the _id
+  field and has been split at the shard key value _id : 99.

-Finally we shard the collection (which contains no data) using the _id
-value.
+Note that the value used as a split point need not exist in any
+document; the chunk may even be empty.

-> use admin
-switched to db admin
-> db.runCommand( { split : "test.foo" , middle : { _id : 99 } } )

 Once the key range is specified, chunks can be moved around the cluster
 using the moveChunk command.

-> db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } )
+.. code-block:: javascript
+
+   db.runCommand( { moveChunk : "test.foo" , find : { _id : 99 } , to : "shard1" } )

 You can repeat these steps as many times as necessary to create or move
 chunks around the cluster. To get information about the two chunks
 created in this example:

-> db.printShardingStatus()
---- Sharding Status ---
-  sharding version: { "_id" : 1, "version" : 3 }
-  shards:
-      { "_id" : "shard0000", "host" : "localhost:30000" }
-      { "_id" : "shard0001", "host" : "localhost:30001" }
-  databases:
-        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
-        { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
-        test.foo chunks:
-                shard0001 1
-                shard0000 1
-            { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
-            { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }
+.. code-block:: javascript
+
+   db.printShardingStatus()
+   --- Sharding Status ---
+   sharding version: { "_id" : 1, "version" : 3 }
+   shards:
+      { "_id" : "shard0000", "host" : "localhost:30000" }
+      { "_id" : "shard0001", "host" : "localhost:30001" }
+   databases:
+        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
+        { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
+        test.foo chunks:
+                shard0001 1
+                shard0000 1
+            { "_id" : { "$MinKey" : true } } -->> { "_id" : "99" } on : shard0001 { "t" : 2000, "i" : 1 }
+            { "_id" : "99" } -->> { "_id" : { "$MaxKey" : true } } on : shard0000 { "t" : 2000, "i" : 0 }

 Once the chunks and the key ranges are evenly distributed, you can
 proceed with a high volume insert.

@@ -239,8 +247,8 @@ Thus it is very important to choose the right shard key up front.

 If you do need to change a shard key, an export and import is likely the
 best solution. Create a new pre-sharded collection, and then import the
-exported data to it. If desired use a dedicated mongos for the export
-and the import.
+exported data to it. If desired, use a dedicated :program:`mongos` for
+the export and the import.

 .. :issue:`SERVER-4000`
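+
+As an illustrative sketch of this approach, the following copies an
+existing collection into a new collection sharded on a different key.
+The names ``test.foo``, ``test.fooNew``, and ``newKey`` are
+hypothetical, and an in-shell copy stands in here for a dedicated
+export and import:
+
+.. code-block:: javascript
+
+   // shard the new, empty collection on the new key; this assumes
+   // sharding is already enabled for the "test" database
+   use admin
+   db.runCommand( { shardCollection : "test.fooNew" , key : { newKey : 1 } } )
+
+   // pre-split the new collection as described above, then copy the
+   // documents across
+   use test
+   db.foo.find().forEach( function(doc) { db.fooNew.insert(doc) } )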