| 1 | +============================== |
| 2 | +Intro To Aggregation Framework |
| 3 | +============================== |
| 4 | + |
| 5 | +.. default-domain:: mongodb |
| 6 | + |
| 7 | +Overview |
| 8 | +-------- |
| 9 | + |
| 10 | +This document uses a data set of zip codes to explore MongoDB's aggregation framework. To follow along, download `this data set <#>`_ and import it with :program:`mongoimport`. The following examples use the documents in the ``zipcodes`` collection. |
| 11 | + |
| 12 | +Requirements |
| 13 | +------------ |
| 14 | + |
| 15 | +#. :program:`mongod` and :program:`mongo`, version 2.1.x or later |
| 16 | + |
| 17 | +#. The zipcode data set |
| 18 | + |
| 19 | +Data Model |
| 20 | +---------- |
| 21 | + |
| 22 | +The individual documents in this set look like: |
| 23 | + |
| 24 | +.. code-block:: javascript |
| 25 | + |
| 26 | + { |
| 27 | + "city" : "ACMAR", |
| 28 | + "loc" : [ |
| 29 | + -86.51557, |
| 30 | + 33.584132 |
| 31 | + ], |
| 32 | + "pop" : 6055, |
| 33 | + "state" : "AL", |
| 34 | + "_id" : "35004" |
| 35 | + } |
| 36 | + |
| 37 | +- ``loc`` holds the location as a longitude, latitude pair |
| 38 | + |
| 39 | +- ``pop`` holds the population |
| 40 | + |
| 41 | +- ``_id`` holds the zipcode as a string |
| 42 | + |
| 43 | +- ``city`` holds the city |
| 44 | + |
| 45 | +- ``state`` holds the two letter state abbreviation |
| 46 | + |
| 47 | +Cities With Populations Over One Million |
| 48 | +---------------------------------------- |
| 49 | + |
| 50 | +To return all cities with a population of one million or more, use the aggregation framework as follows: |
| 51 | + |
| 52 | +.. code-block:: javascript |
| 53 | + |
| 54 | + db.zipcodes.aggregate( [ |
| 55 | + { $group : |
| 56 | + { _id : "$city", |
| 57 | + totalpop : { $sum : "$pop" } } }, |
| 58 | + { $match : {totalpop : { $gte : 1000000 } } } |
| 59 | + ] ); |
| 60 | + |
| 61 | +``aggregate`` takes only one argument: the pipeline of operations that will be performed on the documents in the collection. |
| 62 | + |
| 63 | +The first object in the pipeline is a :agg:pipeline:`$group` that is used to collect and compress documents with the same ``city``, transforming documents about zipcodes into documents about cities. ``totalpop`` is the only other field in the resulting documents. It is defined as the :agg:expression:`$sum` of the population fields of the documents being grouped together. After the :agg:pipeline:`$group` portion of this aggregation, the documents in the pipeline look like: |
| 64 | + |
| 65 | +.. code-block:: javascript |
| 66 | + |
| 67 | + { |
| 68 | + "_id" : "HILLISBURG", |
| 69 | + "totalpop" : 20713 |
| 70 | + } |
| 71 | + |
| 72 | +The second and final pipeline object is a :agg:pipeline:`$match` used to obtain only the documents where the ``totalpop`` is greater than or equal to one million. As :agg:pipeline:`$match` does not alter the format of the documents in the pipe, the final result contains the same documents as the result of :agg:pipeline:`$group`, but without the entries where the population is less than one million. |
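To make the two stages concrete, here is a minimal plain JavaScript sketch that mimics them on an in-memory array. It is an illustration only, not how the server implements the pipeline. The sample documents, their populations, and the helper names ``groupByCity`` and ``matchLargeCities`` are hypothetical.

```javascript
// Hypothetical sample documents shaped like those in the zipcodes collection.
// City names and populations are made up for illustration.
const zipcodes = [
  { _id: "00001", city: "ALPHA", state: "AA", pop: 600000 },
  { _id: "00002", city: "ALPHA", state: "AA", pop: 500000 },
  { _id: "00003", city: "BETA", state: "BB", pop: 20000 },
];

// Mimics { $group : { _id : "$city", totalpop : { $sum : "$pop" } } }:
// one output document per distinct city, with the populations summed.
function groupByCity(docs) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc.city, (totals.get(doc.city) || 0) + doc.pop);
  }
  return Array.from(totals, ([city, totalpop]) => ({ _id: city, totalpop }));
}

// Mimics { $match : { totalpop : { $gte : 1000000 } } }:
// documents pass through unchanged, or are dropped.
function matchLargeCities(docs) {
  return docs.filter((doc) => doc.totalpop >= 1000000);
}

const largeCities = matchLargeCities(groupByCity(zipcodes));
console.log(largeCities);
```

In this sketch the two ``ALPHA`` zip codes collapse into a single city document, and ``BETA`` is filtered out because its total falls below the threshold.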
| 73 | + |
| 74 | +Largest and Smallest Cities by State |
| 75 | +------------------------------------ |
| 76 | + |
| 77 | +To find each state's largest and smallest cities by population using the aggregation framework, use: |
| 78 | + |
| 79 | +.. code-block:: javascript |
| 80 | + |
| 81 | + db.zipcodes.aggregate( [ |
| 82 | + { $group : |
| 83 | + { _id : { state : "$state", city : "$city" }, |
| 84 | + pop : { $sum : "$pop" } } }, |
| 85 | + { $sort : { pop : 1 } }, |
| 86 | + { $group : |
| 87 | + { _id : "$_id.state", |
| 88 | + biggestcity : { $last : "$_id.city" }, |
| 89 | + biggestpop : { $last : "$pop" }, |
| 90 | + smallestcity : { $first : "$_id.city" }, |
| 91 | + smallestpop : { $first : "$pop" } } }, |
| 92 | + { $project : |
| 93 | + { _id : 0, |
| 94 | + state : "$_id", |
| 95 | + biggestCity : { name : "$biggestcity", pop: "$biggestpop" }, |
| 96 | + smallestCity : { name : "$smallestcity", pop : "$smallestpop" } } } |
| 97 | + ] ); |
| 98 | + |
| 99 | +The first :agg:pipeline:`$group` groups by both ``city`` and ``state`` by choosing ``_id`` to be an object containing both of them. This preserves the ``state`` for later. The documents it creates have only one other field, ``pop``, which is the :agg:expression:`$sum` of the population fields. The documents now look like: |
| 100 | + |
| 101 | +.. code-block:: javascript |
| 102 | + |
| 103 | + { |
| 104 | + "_id" : { |
| 105 | + "state" : "CO", |
| 106 | + "city" : "EDGEWATER" |
| 107 | + }, |
| 108 | + "pop" : 13154 |
| 109 | + } |
| 110 | + |
| 111 | +:agg:pipeline:`$sort` arranges the documents in the stream in increasing order of ``pop``. This does not alter the documents themselves, just the ordering. |
| 112 | + |
| 113 | +The next :agg:pipeline:`$group` groups the city-state documents by ``state`` (which is a field inside the nested ``_id`` object), saving the :agg:expression:`$first` and :agg:expression:`$last` documents' ``city`` and ``pop`` in the fields ``smallestcity``, ``smallestpop``, ``biggestcity``, and ``biggestpop``. Since the documents are in increasing order by population, these will be the smallest and largest cities. The documents now look like: |
| 114 | + |
| 115 | +.. code-block:: javascript |
| 116 | + |
| 117 | + { |
| 118 | + "_id" : "WA", |
| 119 | + "biggestcity" : "SEATTLE", |
| 120 | + "biggestpop" : 520096, |
| 121 | + "smallestcity" : "BENGE", |
| 122 | + "smallestpop" : 2 |
| 123 | + } |
| 124 | + |
| 125 | +Lastly, :agg:pipeline:`$project` renames ``_id`` to ``state`` and moves the information about the biggest and smallest cities from top-level fields into the subdocuments ``biggestCity`` and ``smallestCity``. The final results look like: |
| 126 | + |
| 127 | +.. code-block:: javascript |
| 128 | + |
| 129 | + { |
| 130 | + "state" : "RI", |
| 131 | + "biggestCity" : { |
| 132 | + "name" : "CRANSTON", |
| 133 | + "pop" : 176404 |
| 134 | + }, |
| 135 | + "smallestCity" : { |
| 136 | + "name" : "CLAYVILLE", |
| 137 | + "pop" : 45 |
| 138 | + } |
| 139 | + } |
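The sort, group, and project steps described above can be sketched in plain JavaScript on an in-memory array. This is only an illustration of the logic under assumed input. The city-level documents below are hypothetical stand-ins for the output of the first :agg:pipeline:`$group`, and the variable names are not part of any MongoDB API.

```javascript
// Hypothetical city-level documents, shaped like the output of the first
// $group stage. Names and populations are made up for illustration.
const cities = [
  { _id: { state: "AA", city: "MIDTOWN" }, pop: 5000 },
  { _id: { state: "AA", city: "BIGTOWN" }, pop: 90000 },
  { _id: { state: "AA", city: "SMALLTOWN" }, pop: 40 },
  { _id: { state: "BB", city: "ONLYTOWN" }, pop: 700 },
];

// Mimics { $sort : { pop : 1 } }: ascending by population.
// The documents themselves are not modified, only their order.
const sorted = [...cities].sort((a, b) => a.pop - b.pop);

// Mimics the second $group with $first and $last: because the input is
// sorted, the first document seen for a state is its smallest city and
// the last document seen is its largest.
const byState = new Map();
for (const doc of sorted) {
  const state = doc._id.state;
  if (!byState.has(state)) {
    byState.set(state, { smallest: doc, biggest: doc });
  } else {
    byState.get(state).biggest = doc;
  }
}

// Mimics $project: rename the grouping key to "state" and nest the
// city details in subdocuments.
const extremes = Array.from(byState, ([state, { smallest, biggest }]) => ({
  state,
  biggestCity: { name: biggest._id.city, pop: biggest.pop },
  smallestCity: { name: smallest._id.city, pop: smallest.pop },
}));

console.log(JSON.stringify(extremes, null, 2));
```

Note that a state with a single city, such as ``BB`` in this sample, reports the same city as both its biggest and its smallest.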
| 140 | + |
| 141 | +Average City Population by State |
| 142 | +-------------------------------- |
| 143 | + |
| 144 | +To find the average populations for cities in each state using the aggregation framework, run: |
| 145 | + |
| 146 | +.. code-block:: javascript |
| 147 | + |
| 148 | + db.zipcodes.aggregate( [ |
| 149 | + { $group : |
| 150 | + { _id : { state : "$state", city : "$city" }, |
| 151 | + pop : { $sum : "$pop" } } }, |
| 152 | + { $group : |
| 153 | + { _id : "$_id.state", |
| 154 | + avgCityPop : { $avg : "$pop" } } } |
| 155 | + ] ); |
| 156 | + |
| 157 | +The first :agg:pipeline:`$group` is exactly the same as the one in the Largest and Smallest Cities by State example and produces the same result. |
| 158 | + |
| 159 | +The latter :agg:pipeline:`$group` groups the city-state documents by ``state``, and uses :agg:expression:`$avg` to average the populations of all the city-state documents belonging to that ``state`` and store the result in ``avgCityPop``. The resulting documents look like: |
| 160 | + |
| 161 | +.. code-block:: javascript |
| 162 | + |
| 163 | + { |
| 164 | + "_id" : "MN", |
| 165 | + "avgCityPop" : 5335 |
| 166 | + }, |
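As a rough illustration of what the second :agg:pipeline:`$group` computes, the following plain JavaScript sketch averages city populations per state on an in-memory array. The input documents are hypothetical stand-ins for the output of the first :agg:pipeline:`$group`, and the names ``cityDocs``, ``sums``, and ``averages`` are chosen only for this example.

```javascript
// Hypothetical city-level documents, shaped like the output of the first
// $group stage. Names and populations are made up for illustration.
const cityDocs = [
  { _id: { state: "AA", city: "BIGTOWN" }, pop: 9000 },
  { _id: { state: "AA", city: "SMALLTOWN" }, pop: 1000 },
  { _id: { state: "BB", city: "ONLYTOWN" }, pop: 700 },
];

// Accumulate a running total and a count of cities for each state.
const sums = new Map();
for (const doc of cityDocs) {
  const entry = sums.get(doc._id.state) || { total: 0, count: 0 };
  entry.total += doc.pop;
  entry.count += 1;
  sums.set(doc._id.state, entry);
}

// Mimics { _id : "$_id.state", avgCityPop : { $avg : "$pop" } }:
// the average is taken over cities, not over individual zip codes.
const averages = Array.from(sums, ([state, { total, count }]) => ({
  _id: state,
  avgCityPop: total / count,
}));

console.log(averages);
```

Summing by city first matters: averaging ``pop`` directly over the zip code documents would give the average population per zip code instead of per city.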