Commit fb64712

write draft of aggro framework tutorial
1 parent 01f6433 commit fb64712

1 file changed: 166 additions, 0 deletions

==============================
Intro To Aggregation Framework
==============================

.. default-domain:: mongodb

Overview
--------

This document explores MongoDB's aggregation framework using a data set of zipcode information. To follow along, download `this data set <#>`_ and import it with :program:`mongoimport`. The examples below assume the documents are in the ``zipcodes`` collection.

Requirements
------------

#. :program:`mongod` and :program:`mongo` v. 2.1.X or later

#. The zipcode data set

Data Model
----------

The individual documents in this set look like:

.. code-block:: javascript

   {
     "city" : "ACMAR",
     "loc" : [
       -86.51557,
       33.584132
     ],
     "pop" : 6055,
     "state" : "AL",
     "_id" : "35004"
   }

- ``loc`` holds the location as a longitude, latitude pair

- ``pop`` holds the population

- ``_id`` holds the zipcode as a string

- ``city`` holds the city

- ``state`` holds the two-letter state abbreviation

Cities With Populations Over One Million
----------------------------------------

To return all cities with a population of at least one million, use the aggregation framework as follows:

.. code-block:: javascript

   db.zipcodes.aggregate( [
     // Sum the population of every zipcode that shares a city name.
     { $group :
       { _id : "$city",
         totalpop : { $sum : "$pop" } } },
     // Keep only the cities whose total reaches one million.
     { $match : { totalpop : { $gte : 1000000 } } }
   ] );

``aggregate()`` is called here with a single argument: the pipeline, an array of operations that are performed in order on the documents in the collection.

The first stage in the pipeline is a :agg:pipeline:`$group` that collects documents with the same ``city`` and collapses each group into one document, turning documents about zipcodes into documents about cities. Because the grouping key is ``city`` alone, cities in different states that share a name are combined into one document. ``totalpop`` is the only other field in the resulting documents; it is the :agg:expression:`$sum` of the ``pop`` fields of the documents in each group. After the :agg:pipeline:`$group` stage, the documents in the pipeline look like:

.. code-block:: javascript

   {
     "_id" : "HILLISBURG",
     "totalpop" : 20713
   }

The second and final stage is a :agg:pipeline:`$match` that keeps only the documents where ``totalpop`` is greater than or equal to one million. Because :agg:pipeline:`$match` does not alter the format of the documents in the pipeline, the final result contains the same documents as the output of :agg:pipeline:`$group`, minus the entries whose population is less than one million.

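
The two stages above can be imitated on an in-memory array in plain JavaScript. This is only a rough sketch of the semantics, not how the server implements them; the sample documents and the helper names ``groupByCity`` and ``matchAtLeast`` are made up for illustration.

```javascript
// Made-up sample documents in the shape of the zipcodes collection.
const zipcodes = [
  { _id: "00001", city: "ALPHA", state: "AA", pop: 600000 },
  { _id: "00002", city: "ALPHA", state: "AA", pop: 500000 },
  { _id: "00003", city: "BETA", state: "AA", pop: 20000 },
  { _id: "00004", city: "BETA", state: "BB", pop: 700 },
];

// Rough equivalent of
// { $group : { _id : "$city", totalpop : { $sum : "$pop" } } }.
function groupByCity(docs) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc.city, (totals.get(doc.city) || 0) + doc.pop);
  }
  return Array.from(totals, ([city, totalpop]) => ({ _id: city, totalpop }));
}

// Rough equivalent of { $match : { totalpop : { $gte : threshold } } }.
function matchAtLeast(docs, threshold) {
  return docs.filter((doc) => doc.totalpop >= threshold);
}

const grouped = groupByCity(zipcodes);
const result = matchAtLeast(grouped, 1000000);
console.log(result);
```

In this sample, the two ``BETA`` documents belong to different states but still end up in one group, because the grouping key is the city name only.
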
Largest and Smallest Cities by State
------------------------------------

To find each state's largest and smallest cities by population using the aggregation framework, use:

.. code-block:: javascript

   db.zipcodes.aggregate( [
     // Total the population of each (state, city) pair.
     { $group :
       { _id : { state : "$state", city : "$city" },
         pop : { $sum : "$pop" } } },
     // Order the cities from least to most populous.
     { $sort : { pop : 1 } },
     // Per state, take the first (smallest) and last (largest) city.
     { $group :
       { _id : "$_id.state",
         biggestcity : { $last : "$_id.city" },
         biggestpop : { $last : "$pop" },
         smallestcity : { $first : "$_id.city" },
         smallestpop : { $first : "$pop" } } },
     // Rename _id to state and nest the city details.
     { $project :
       { _id : 0,
         state : "$_id",
         biggestCity : { name : "$biggestcity", pop : "$biggestpop" },
         smallestCity : { name : "$smallestcity", pop : "$smallestpop" } } }
   ] );

The first :agg:pipeline:`$group` groups by both ``city`` and ``state`` by making ``_id`` an object that contains both fields. This preserves the ``state`` for later stages. The documents it creates have only one other field, ``pop``, which is the :agg:expression:`$sum` of the population fields. The documents now look like:

.. code-block:: javascript

   {
     "_id" : {
       "state" : "CO",
       "city" : "EDGEWATER"
     },
     "pop" : 13154
   }

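
Grouping on a compound ``_id`` can be pictured as grouping on the pair of values. The following plain JavaScript sketch imitates that on an in-memory array; the sample documents and the helper name ``groupByStateAndCity`` are invented for illustration, and the string key is just one convenient way to compare pairs by value in JavaScript.

```javascript
// Made-up sample documents; only the fields used by the stage are shown.
const zipcodes = [
  { city: "ALPHA", state: "AA", pop: 300 },
  { city: "ALPHA", state: "AA", pop: 200 },
  { city: "ALPHA", state: "BB", pop: 50 },
];

// Rough equivalent of
// { $group : { _id : { state : "$state", city : "$city" },
//              pop : { $sum : "$pop" } } }.
function groupByStateAndCity(docs) {
  const groups = new Map();
  for (const { state, city, pop } of docs) {
    // Serialize the pair so that equal (state, city) values share one key.
    const key = JSON.stringify([state, city]);
    if (!groups.has(key)) {
      groups.set(key, { _id: { state, city }, pop: 0 });
    }
    groups.get(key).pop += pop;
  }
  return Array.from(groups.values());
}

const cityTotals = groupByStateAndCity(zipcodes);
console.log(cityTotals);
```

Here the two ``ALPHA`` entries in state ``AA`` are summed, while the ``ALPHA`` in state ``BB`` stays a separate document.
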
:agg:pipeline:`$sort` arranges the documents in the stream in increasing order of ``pop``. This does not alter the documents themselves, only their order.

The next :agg:pipeline:`$group` groups the city-state documents by ``state`` (a field inside the nested ``_id`` object) and saves the ``city`` and ``pop`` of the :agg:expression:`$first` and :agg:expression:`$last` documents in each group in the fields ``smallestcity``, ``smallestpop``, ``biggestcity``, and ``biggestpop``. Because the documents arrive in increasing order of population, the first document in each group is the state's smallest city and the last is its largest. The documents now look like:

.. code-block:: javascript

   {
     "_id" : "WA",
     "biggestcity" : "SEATTLE",
     "biggestpop" : 520096,
     "smallestcity" : "BENGE",
     "smallestpop" : 2
   }

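
The interplay between the sort and the ``$first`` and ``$last`` accumulators can be sketched in plain JavaScript as follows. This is an in-memory imitation under assumed sample data, not the server's implementation; the helper name ``firstAndLastByState`` is hypothetical.

```javascript
// Made-up output of the first $group stage.
const cityTotals = [
  { _id: { state: "AA", city: "GAMMA" }, pop: 9000 },
  { _id: { state: "BB", city: "DELTA" }, pop: 40 },
  { _id: { state: "AA", city: "ALPHA" }, pop: 500 },
  { _id: { state: "AA", city: "BETA" }, pop: 12 },
];

// Rough equivalent of { $sort : { pop : 1 } }: ascending by population.
// The spread copy leaves the input array untouched.
const sorted = [...cityTotals].sort((a, b) => a.pop - b.pop);

// Rough equivalent of the second $group stage with $first and $last.
function firstAndLastByState(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const state = doc._id.state;
    if (!groups.has(state)) {
      // $first: only the first document seen for a state sets these fields.
      groups.set(state, {
        _id: state,
        smallestcity: doc._id.city,
        smallestpop: doc.pop,
      });
    }
    // $last: every later document overwrites these fields.
    const group = groups.get(state);
    group.biggestcity = doc._id.city;
    group.biggestpop = doc.pop;
  }
  return Array.from(groups.values());
}

const byState = firstAndLastByState(sorted);
console.log(byState);
```

A state with a single city, such as ``BB`` in this sample, reports that city as both its smallest and its largest.
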
Lastly, :agg:pipeline:`$project` renames ``_id`` to ``state`` and moves the information about the biggest and smallest cities from top-level fields into the subdocuments ``biggestCity`` and ``smallestCity``. The final results look like:

.. code-block:: javascript

   {
     "state" : "RI",
     "biggestCity" : {
       "name" : "CRANSTON",
       "pop" : 176404
     },
     "smallestCity" : {
       "name" : "CLAYVILLE",
       "pop" : 45
     }
   }

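
The reshaping done by the projection is a per-document transformation, so it can be pictured as a ``map`` over the documents. A minimal plain JavaScript sketch, assuming one made-up input document:

```javascript
// Made-up output of the second $group stage.
const byState = [
  {
    _id: "AA",
    biggestcity: "GAMMA",
    biggestpop: 9000,
    smallestcity: "BETA",
    smallestpop: 12,
  },
];

// Rough equivalent of the $project stage: drop _id, expose it as state,
// and nest the city fields into two subdocuments.
const projected = byState.map((doc) => ({
  state: doc._id,
  biggestCity: { name: doc.biggestcity, pop: doc.biggestpop },
  smallestCity: { name: doc.smallestcity, pop: doc.smallestpop },
}));

console.log(JSON.stringify(projected, null, 2));
```

Only the fields named in the new object survive, which mirrors how the projection in the pipeline lists every field it wants to keep.
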
Average City Population by State
--------------------------------

To find the average population of the cities in each state using the aggregation framework, run:

.. code-block:: javascript

   db.zipcodes.aggregate( [
     // Total the population of each (state, city) pair.
     { $group :
       { _id : { state : "$state", city : "$city" },
         pop : { $sum : "$pop" } } },
     // Average the city totals within each state.
     { $group :
       { _id : "$_id.state",
         avgCityPop : { $avg : "$pop" } } }
   ] );

The first :agg:pipeline:`$group` is exactly the same as in the Largest and Smallest Cities by State example and produces the same result.

The second :agg:pipeline:`$group` groups the city-state documents by ``state``, uses :agg:expression:`$avg` to average the populations of all the city-state documents belonging to that ``state``, and stores the result in ``avgCityPop``. The resulting documents look like:

.. code-block:: javascript

   {
     "_id" : "MN",
     "avgCityPop" : 5335
   }
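
The averaging stage can be pictured as keeping a running total and a count per state. The plain JavaScript sketch below imitates it on made-up city totals; the helper name ``averageCityPopByState`` is hypothetical and the code is not the server's implementation.

```javascript
// Made-up output of the first $group stage: one document per (state, city).
const cityTotals = [
  { _id: { state: "AA", city: "ALPHA" }, pop: 500 },
  { _id: { state: "AA", city: "BETA" }, pop: 100 },
  { _id: { state: "BB", city: "DELTA" }, pop: 40 },
];

// Rough equivalent of
// { $group : { _id : "$_id.state", avgCityPop : { $avg : "$pop" } } }.
function averageCityPopByState(docs) {
  const sums = new Map();
  for (const doc of docs) {
    const state = doc._id.state;
    const entry = sums.get(state) || { total: 0, count: 0 };
    entry.total += doc.pop;
    entry.count += 1;
    sums.set(state, entry);
  }
  return Array.from(sums, ([state, { total, count }]) => ({
    _id: state,
    avgCityPop: total / count,
  }));
}

const averages = averageCityPopByState(cityTotals);
console.log(averages);
```

Because the input holds one document per city rather than per zipcode, the result is an average over cities. Averaging ``pop`` directly on the zipcode documents would give the average population per zipcode instead.
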
