diff --git a/draft/administration/production-notes.txt b/draft/administration/production-notes.txt
new file mode 100644
index 00000000000..454ffccf55e
--- /dev/null
+++ b/draft/administration/production-notes.txt
@@ -0,0 +1,632 @@
================
Production Notes
================

.. default-domain:: mongodb

Overview
--------

This page details system configurations that affect MongoDB,
especially in production.

Backups
-------

To make backups of your MongoDB database, refer to the
:ref:`backups section <backups>`.

Port Numbers
------------

MongoDB uses only TCP ports for network connections. The following is
a listing of MongoDB programs and the default ports they use.

.. list-table:: Default ports used by MongoDB
   :header-rows: 1

   * - Description
     - Process
     - Port Number
   * - Standalone MongoDB server
     - :program:`mongod`
     - 27017
   * - Shard Router
     - :program:`mongos`
     - 27017
   * - Shard Server
     - :setting:`mongod --shardsvr <shardsvr>`
     - 27018
   * - Config Server
     - :setting:`mongod --configsvr <configsvr>`
     - 27019
   * - mongod Web Status Page
     - :setting:`mongod --rest <rest>`
     - 28017

Securing the Network
--------------------

MongoDB is not designed to be exposed on an open network; we highly
recommend operating it within a secure network. The following details
the system configuration needed to run MongoDB on a secure network.

.. TODO link to MongoDB security section when available.

.. _production-firewall:

Firewall Rules
~~~~~~~~~~~~~~

.. list-table:: Firewall Rules Required for MongoDB
   :header-rows: 1

   * - Process
     - Direction
     - Ports
     - Notes
   * - :program:`mongod`
     - Incoming
     - 27017/*
     - for client and application server connections
   * - :program:`mongod`
     - Outgoing
     - */27017
     - for :term:`replication` from other members and shard migrations
   * - :program:`mongos`
     - Incoming
     - 27017/*
     - for client and application server connections
   * - :program:`mongos`
     - Outgoing
     - */[27019, 27018]
     - for connections to :term:`shards <shard>` and :term:`config servers`

.. note - do we need to have the terms linked in the table when the
   notes below would be more 'natural' to have such links?

In a sharded environment:

- All of the MongoDB processes in the cluster (:program:`mongos`,
  :program:`mongod`, and :setting:`mongod --configsvr <configsvr>`)
  must be able to connect to each other.

- Clients must be able to connect to the :program:`mongos` processes;
  however, they can be blocked from the :program:`mongod` processes.

In a non-sharded replica set environment:

- All clients must be able to connect to all non-hidden replica set
  members.

- All members of a replica set (that is, the :program:`mongod`
  processes) must be able to communicate with each other.

IP Address Binding
~~~~~~~~~~~~~~~~~~

By default, a :program:`mongod` server listens on all available IP
addresses on a machine. You can restrict this to a single IP address
with the :setting:`bind_ip` configuration option for
:program:`mongod`.

For example, you could set this option to ``127.0.0.1``, the loopback
interface, to make :program:`mongod` listen only to requests from the
same machine (``localhost``). Or, on a machine with two interfaces,
you might want to listen only on the private network.

To enable listening on all interfaces, remove the ``bind_ip`` option
from your server configuration file.
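The following invocation is a minimal sketch (the data directory path
is illustrative) that starts :program:`mongod` listening only on the
loopback interface:

.. code-block:: bash

   # Listen on localhost only; clients on other hosts cannot connect.
   mongod --bind_ip 127.0.0.1 --port 27017 --dbpath /var/lib/mongodb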
.. note::

   To accept requests on external interfaces you may also have to
   modify your computer's firewall configuration to allow access to
   the ports used by MongoDB.

   .. seealso:: :ref:`production-firewall`

Linux Notes
-----------

Most versions of the Linux kernel support MongoDB, but there are a
few key issues to keep in mind:

- The MongoDB user community has found Linux kernel 2.6.36 to be a
  good candidate for running MongoDB in production.

- Some users have reported problems with the 2.6.33-31 and 2.6.32
  kernels, at least on Amazon EC2.

MongoDB preallocates its database files before using them, so a file
system with kernel-supported preallocation, such as ext4 or XFS, is
ideal:

- If you are using the ext4 file system, Linux kernel 2.6.23 or newer
  is required for efficient filesystem preallocation.

- If you are using the XFS file system, Linux kernel 2.6.25 or newer
  is required for efficient file preallocation.

General system configuration recommendations:

- Turn off ``atime`` for the storage volume holding the
  :term:`database files <dbpath>`.

- Set the file descriptor limit and the user process limit to 20,000
  (see ``/etc/limits`` and :term:`ulimit`, and the sketch after this
  list). A low ulimit will affect MongoDB under heavy use and can
  produce strange errors.

- Do not use large virtual memory pages; MongoDB performs better with
  smaller virtual memory pages.

- Disable NUMA in your BIOS. If that is not possible, see :ref:`NUMA
  and MongoDB <production-numa>`.

- Ensure that the readahead settings for the block devices that store
  the database files are appropriate. See the :ref:`Readahead
  <production-readahead>` section.

- Use NTP to synchronize time between your hosts. MongoDB uses
  distributed locks that require hosts to be synchronized, especially
  :term:`config servers`.
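To raise the limits described above before starting
:program:`mongod`, a sketch (the configuration file path is
hypothetical) might look like this:

.. code-block:: bash

   # Raise the open file and user process limits for this shell,
   # then start mongod so that it inherits them.
   ulimit -n 20000
   ulimit -u 20000
   mongod --config /etc/mongodb.conf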
.. _production-numa:

NUMA and MongoDB
----------------

NUMA (Non-Uniform Memory Access) and MongoDB do not work well
together. If you are running MongoDB on NUMA hardware, we recommend
disabling NUMA for MongoDB and running with an interleaved memory
policy. Operational problems will manifest in strange ways, such as
slow performance for periods of time or high system processor usage.

.. note::

   On Linux, :program:`mongod` v2.0+ checks these settings on startup
   and prints a warning if the system is NUMA-based.

To turn off NUMA for MongoDB, use the ``numactl`` command to start
:program:`mongod` in the following manner:

.. code-block:: bash

   numactl --interleave=all /usr/local/bin/mongod

Then adjust the ``proc`` settings using the following command:

.. code-block:: bash

   echo 0 > /proc/sys/vm/zone_reclaim_mode

You can change ``zone_reclaim_mode`` without restarting
:program:`mongod`. For more information, see the kernel documentation
for ``/proc/sys/vm``.

.. TODO the following is needed? or is just generally good reading material?

Further reading
~~~~~~~~~~~~~~~

The blog post "The MySQL 'swap insanity' problem and the effects of
the NUMA architecture" describes the effects of NUMA on databases.
The post was aimed at the problems NUMA created for MySQL, but the
issues are similar for MongoDB. It describes the NUMA architecture
and its goals, and how these are incompatible with production
databases.

Virtualization
--------------

Generally MongoDB works very well in virtualized environments, with
the exception of OpenVZ.

EC2
~~~

Compatible. No special configuration requirements.

VMware
~~~~~~

Some users suggest avoiding memory overcommitment, as it may cause
issues. Otherwise VMware is compatible.

Cloning a VM is possible. For example, you might use cloning to spin
up a new virtual host to add as a member of a replica set. If
journaling is enabled, the clone snapshot will be consistent. If you
are not using journaling, stop :program:`mongod`, clone, and then
restart.

OpenVZ
~~~~~~

Users have reported issues running MongoDB on OpenVZ.

iostat
------

On Linux, use the ``iostat`` command to check if disk I/O is a
bottleneck for your database. We generally find the following form to
work well:

.. code-block:: bash

   iostat -xm 2

(Specify a number of seconds with ``iostat``; otherwise it will
display statistics accumulated since server boot, which is not very
useful.)

Use the ``mount`` command to see what device your :term:`data
directory <dbpath>` resides on.

.. _production-readahead:

Readahead
---------

Readahead can improve the performance of many applications, but in a
system with highly random access, like a database, setting readahead
too high can cause serious problems, specifically in terms of how
memory is used and how often the disk is accessed.

Readahead Configuration
~~~~~~~~~~~~~~~~~~~~~~~

In Linux, you can check the current settings for readahead by running
``blockdev --report``, which prints a table with one row for each
disk device:

.. code-block:: bash

   RO    RA   SSZ   BSZ   StartSec          Size   Device
   rw   256   512  4096          0   80026361856   /dev/sda
   rw   256   512  4096       2048   80025223168   /dev/sda1

The RA column contains the value for readahead. That value is the
number of 512-byte blocks that are read on every disk access. In the
example above, the readahead value is 128KB (256 blocks * 512
bytes/block). You can set the readahead for a given disk device by
running ``blockdev --setra <value> <device>``.
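As a quick sketch (``/dev/sda`` is illustrative), you can also query
and set the readahead for a single device directly:

.. code-block:: bash

   # Query readahead, in 512-byte blocks, for one device.
   blockdev --getra /dev/sda    # e.g. 256 blocks * 512 bytes = 128KB
   # Set readahead to 32 blocks (16KB); this does not persist across
   # reboots, so add it to a startup script if needed.
   blockdev --setra 32 /dev/sda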
When using a software-based RAID system, make sure to set the
readahead on each disk device as well as on the device that
corresponds to the RAID controller.

Once the readahead setting has been changed, restart the
:program:`mongod` process.

Readahead Misconfiguration
~~~~~~~~~~~~~~~~~~~~~~~~~~

Figuring out the appropriate value for readahead is tricky. If
readahead is set too high, MongoDB won't be able to use all the RAM
on the machine effectively: each disk read pulls in a lot of extra
data, which takes up room in memory but is never accessed before
being paged out to make room for another read operation. For more
information on how to detect when readahead is too high, see the
discussion of diagnosing high readahead.

Setting readahead too low can also cause problems: if you are
accessing sequential data, you can wind up not paging in enough on
each read, causing more :term:`page faults <page fault>` and
additional disk accesses to read the data you need.

Whether readahead is too high or too low, the result is excessive
page faulting and increased disk utilization.

The Best Readahead Value
~~~~~~~~~~~~~~~~~~~~~~~~

The right readahead value depends on your storage device, the size of
your documents, and your access patterns.

Disk
++++

Readahead is most helpful on spinning disks, where sequential access
is comparatively fast. On SSDs, where random access is just as fast
as sequential access, you don't gain much from a large readahead
value, so you'll generally want a fairly small readahead value when
using SSDs.

Document size
+++++++++++++

You also want readahead to be high enough to pull in a full document
in one read. Say your average document size is 4KB: since blocks on
disk are generally 512 bytes, it would take 8 disk accesses to read a
whole document with no readahead. If you had a readahead of 8 blocks
or more, you would read the whole document in only one trip to disk.
Since index buckets are 8KB, you will never want to set readahead
below 16 blocks, or it will take 2 disk accesses to read one index
bucket (16 is a very low value for readahead, so it should be rare
that you would want to do that anyway).

Stripe size (RAID only)
+++++++++++++++++++++++

While disk block sizes are generally predictable, the stripe size of
a RAID array is configurable (generally some multiple of 16KB). Since
the stripe size represents the minimum amount of data you can read
from the array, set readahead to a multiple of that stripe size. The
stripe size also gives you a handy minimum for this particular type
of configuration, since it is likely to be significantly larger than
the minimum values discussed previously.

Access patterns
+++++++++++++++

The last, and arguably most important, factor in determining a good
readahead value is your access pattern. If you do many range queries
or sequential accesses of data, readahead can be quite beneficial,
allowing you to read a large amount of data that you are going to
need anyway with a single disk access. If you do mostly random
access, however, you could wind up pulling in a lot of extra data
that you will never use, wasting both RAM and disk time.

Readahead on EC2
~~~~~~~~~~~~~~~~

Some of the Amazon Machine Images on EC2 have very high default
values for disk readahead (sometimes as much as several thousand
blocks), which can have a severe impact on your memory usage and disk
utilization. For MongoDB instances running on EC2, ensure that the
readahead settings are not too high, and turn them down if necessary.

Another problem with readahead on EC2 is that EBS does not guarantee
that sequential blocks are actually sequential on the underlying
physical storage. Even if your application does mostly sequential
access, that access might not be sequential on the underlying
hardware, so multiple disk seeks may be necessary for one logical
disk access, eliminating most of the benefits of readahead and
possibly causing unnecessarily high disk utilization. For this
reason, we usually recommend a lower-than-usual readahead value when
running on an EBS volume, something in the range of 16 to 64 blocks.

RAID on EC2
~~~~~~~~~~~

If you are following our Amazon EC2 guidelines and running with
multiple EBS volumes in a RAID configuration, set the readahead to 0
on the volumes that make up the RAID, and set the readahead on the
RAID volume itself to a multiple of your RAID stripe size (with the
minimum being the stripe size itself).
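As a sketch only (the device names are hypothetical, and we assume an
``mdadm`` RAID-10 at ``/dev/md0`` built from four EBS volumes with a
256KB stripe), the settings might look like this:

.. code-block:: bash

   # Disable readahead on the underlying EBS volumes.
   blockdev --setra 0 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
   # 256KB stripe = 512 blocks of 512 bytes; set one stripe of
   # readahead on the RAID device itself.
   blockdev --setra 512 /dev/md0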
Hardware
--------

MongoDB tends to run well on virtually all hardware. In fact, it was
designed with commodity hardware in mind (to facilitate cloud
computing), though it also works well on very large servers. If you
are about to buy hardware, here are a few suggestions:

- More RAM is good.

- Fast CPU clock speed is helpful.

- Many cores help, but the marginal return is low, so don't spend
  heavily on them. (This is both a consequence of the design of the
  program and of the fact that memory bandwidth can be a limiting
  factor; there isn't necessarily a lot of computation happening
  inside a database.)

- Non-NUMA hardware is recommended, as NUMA is not very helpful when
  memory access is not very localized, as in a database. If you must
  run MongoDB on a NUMA system, see :ref:`NUMA and MongoDB
  <production-numa>`.

- SSDs are good. We have had good results and seen good
  price/performance with SATA SSDs; the (typically) more upscale PCI
  SSDs work fine too.

- Commodity (SATA) spinning drives are often a good option, as the
  speed increase in random I/O for more expensive drives is not that
  dramatic (only on the order of 2x); spending that money on SSDs or
  RAM may be more effective.

Solid State Disks
-----------------

Multiple MongoDB users have reported good success running MongoDB
databases on solid state drives.

Write Endurance
~~~~~~~~~~~~~~~

Write endurance varies among solid state drives. SLC drives have
higher endurance, but newer-generation MLC (and eMLC) drives are
getting better.

As an example, the MLC Intel 320 drives specify an endurance of
20GB/day of writes for five years. If you are doing small or
medium-size random reads and writes, this is sufficient. The Intel
710 series is the enterprise-class line and has higher endurance.

If you intend to write a full drive's worth of data per day (and
every day for a long time), this level of endurance would be
insufficient. For large sequential operations (for example, very
large map/reduces), one could write far more than 20GB/day.
Traditional hard drives are quite good at sequential I/O and thus may
be better for that use case.

Reserve some unpartitioned space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some users report good results when leaving 20% of their drives
completely unpartitioned. The drive then knows it can use that space
as working space. Note that formatted but empty space may or may not
be available to the drive, depending on TRIM support, which is often
lacking.

smartctl
~~~~~~~~

On some devices, ``smartctl -A`` will show you the
Media_Wearout_Indicator:

.. code-block:: bash

   $ sudo smartctl -A /dev/sda | grep Wearout
   233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0

Speed
~~~~~

A paper in ACM Transactions on Storage (September 2010) listed the
following results for measured 4KB peak random direct I/O on some
popular devices:

.. list-table:: SSD Read and Write Performance
   :header-rows: 1

   * - Device
     - Read IOPS
     - Write IOPS
   * - Intel X25-E
     - 33,400
     - 3,120
   * - FusionIO ioDrive
     - 98,800
     - 75,100

Intel's larger drives seem to have higher write IOPS than the smaller
ones (up to 23,000 claimed for the 320 series).

Real-world results should be lower, but the numbers are still
impressive.

Reliability
~~~~~~~~~~~

Some manufacturers specify reliability stats indicating failure rates
of approximately 0.6% per year. This is better than traditional
drives (2% per year failure rate or higher) but still quite high, so
mirroring will be important. (And, of course, manufacturer specs
could be optimistic.)

Random reads vs. random writes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Random access I/O is the sweet spot for SSDs. Historically, random
reads on SSD drives have been much faster than random writes. That
said, random writes are still an order of magnitude faster than on
spinning disks.

Recently, new drives have been released with much higher random write
performance. For example, the Intel 320 series, particularly the
larger-capacity drives, has much higher random write performance than
the older Intel X25 series drives.

PCI vs. SATA
~~~~~~~~~~~~

SSDs are available both as PCI cards and SATA drives. PCI cards are
oriented towards the high end of the market.

Some SATA SSD drives now support 6Gbps SATA transfer rates, yet at
the time of this writing many controllers shipped with servers are
3Gbps. For random-I/O-oriented applications this is likely
sufficient, but it is worth considering.

RAM vs. SSD
~~~~~~~~~~~

Even though SSDs are fast, RAM is still faster. Thus, for the highest
possible performance, having enough RAM to contain the working set of
data from the database is optimal. However, it is common to have a
request rate that is easily met by the speed of random I/O with SSDs,
and SSD cost per byte is lower than RAM (and persistent too).

A system with less RAM and SSDs will likely outperform a system with
more RAM and spinning disks. For example, a system with SSD drives
and 64GB RAM will often outperform a system with 128GB RAM and
spinning disks. (Results will vary by use case, of course.)

.. TODO this is basically a 'soft page fault'

One helpful characteristic of SSDs is that they can facilitate a fast
"preheat" of RAM after a hardware restart. On a restart, a system's
RAM file system cache must be repopulated. On a box with 64GB RAM or
more, this can take a considerable amount of time: over ten minutes
at 100MB/sec, and much longer when the requests are random I/O to
spinning disks.

FlashCache
~~~~~~~~~~

FlashCache is a write-back block cache for Linux, created by
Facebook. Installation is a bit of work, as you have to build and
install a kernel module. As of September 2011 it is still new; if you
use it, please report your results in the MongoDB forum, as everyone
will be curious how well it works:

http://www.facebook.com/note.php?note_id=388112370932

OS scheduler
~~~~~~~~~~~~

One user reports good results with the noop I/O scheduler under
certain configurations of their system. As always, caution is
recommended with nonstandard configurations, which never receive as
much testing.

Run mongoperf
~~~~~~~~~~~~~

``mongoperf`` is a disk performance stress utility. It is not part of
the MongoDB database; it is simply a disk-exercising program. We
recommend testing your SSD setup with ``mongoperf``. Note that the
random writes it generates are a worst-case scenario; in many cases
MongoDB can do writes that are much larger.
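As a quick sketch (the field values are illustrative), ``mongoperf``
reads a JSON configuration document from standard input:

.. code-block:: bash

   # 16 threads doing both random reads and writes against a 1GB
   # test file in the current directory.
   echo "{ nThreads: 16, fileSizeMB: 1000, r: true, w: true }" | mongoperf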
Redundant Array of Independent Disks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We typically recommend RAID-10.

.. TODO define (or link to an appropriate article defining) all RAID
   types.

RAID-5 and RAID-6 can be slow and are not suggested.

RAID-0 provides good write performance, but limited availability and
reduced read performance (particularly on EBS). It is not suggested.

See also the EC2 page for comments on EBS striping.

Remote File Systems
~~~~~~~~~~~~~~~~~~~

Amazon Elastic Block Store (EBS) seems to work well up to its
intrinsic performance characteristics when configured well.

We have found that some versions of NFS perform very poorly, and we
do not recommend using NFS with MongoDB.