Commit c31d1a6

(DOCSP-4806) Create a new Sampling page (#239)
* Creating Sampling page, updated refs
* Sampling page
* Wrap at 72
* Updates from copy review
* Move Sampling page to bottom of ToC
1 parent 5a260a1 commit c31d1a6

6 files changed: +38 -68 lines changed

source/faq.txt

Lines changed: 3 additions & 66 deletions
@@ -24,12 +24,6 @@ Testing has shown that |compass| has minimal impact in prototype
 deployments, though additional performance testing and monitoring is in
 progress.
 
-For best results, use MongoDB 3.2 or higher, which includes the
-:manual:`$sample </reference/operator/aggregation/sample/>` operator for
-efficient sampling on a collection. On older versions of MongoDB,
-|compass| falls back on a
-:ref:`less efficient sampling method <compass_fallback_sampling>`.
-
 You should only execute queries that are indexed appropriately in the
 database to avoid scanning the entire collection.
 

@@ -64,63 +58,6 @@ Why am I seeing a warning about a non-genuine MongoDB server?
 
 .. include:: /includes/fact-non-genuine-warning.rst
 
-.. _compass-faq-sampling:
-
-What is sampling and why is it used?
-------------------------------------
-
-Sampling in |compass| is the selection a subset of data
-from a particular collection and analyzing the documents within the
-sample set.
-
-Sampling is a common technique in statistical analysis because analyzing
-a subset of the data gives similar results to analyzing all of it. In
-addition, sampling allows results to be generated quickly rather than
-performing a computationally-expensive collection scan.
-
-How does sampling work?
------------------------
-
-|compass| employs two distinct sampling mechanisms.
-
-In MongoDB 3.2, collections are sampled with the
-:manual:`$sample </reference/operator/aggregation/sample/>` operator via
-the :manual:`aggregation pipeline </core/aggregation-pipeline>`. This
-provides efficient random sampling without replacement over the entire
-collection, or over the subset of documents specified by a query.
-
-.. _compass_fallback_sampling:
-
-In MongoDB 3.0, collections are sampled via a
-backwards-compatible algorithm executed entirely within |compass|. It
-takes place in three stages:
-
-1. |compass| opens a :term:`cursor` on the desired collection, limited
-to at most 10,000 documents sorted in descending order of the ``_id``
-field.
-2. ``sampleSize`` documents are randomly selected from the stream. To
-do this efficiently, |compass| employs `reservoir sampling
-<http://en.wikipedia.org/wiki/Reservoir_sampling>`_.
-3. |compass| performs a query to select the chosen documents directly
-via ``_id``.
-
-``sampleSize`` is set to 1000 documents.
-
-.. note::
-The choice of sampling method is done transparently in the
-background, with no changes required by the user.
-
-Won't sampling miss documents?
-------------------------------
-
-Sampling is chosen for its efficiency: the amount of time required to
-perform a sample is minimal, on the order of a few seconds. Increasing
-the sample confidence will demand more processing power and time.
-Furthermore, sophisticated outlier detection requires an inspection of
-every document in a MongoDB deployment, which would be unfeasible for
-large data sets. The MongoDB team is in the process of conducting user
-tests on large data sets to find a reasonable balance.
-
 What happens to long running queries?
 -------------------------------------
 

@@ -133,9 +70,9 @@ Slow Sampling
 All queries that Compass sends to your MongoDB instance have a timeout
 flag set which automatically aborts a request if it takes longer than
 the specified timeout. This timeout is currently set to 10 seconds. If
-sampling on the database takes longer, Compass will notify you about
-the timeout and give you the options of (a) retrying with a longer
-timeout (60 seconds) or (b) running a different query.
+:ref:`sampling <sampling>` on the database takes longer, Compass will
+notify you about the timeout and give you the options of (a) retrying
+with a longer timeout (60 seconds) or (b) running a different query.
 
 .. note::
 
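
Note on the removed fallback text: the FAQ section deleted in this file described selecting ``sampleSize`` documents from a cursor of at most 10,000 documents using reservoir sampling. For readers unfamiliar with the technique, here is a minimal Python sketch of classic reservoir sampling (Algorithm R). It is an illustration only, not Compass's actual code: the function name ``reservoir_sample``, its parameters, and the use of a plain ``range`` as a stand-in for the stream of ``_id`` values are all assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Pick up to sample_size items uniformly at random in one pass.

    Hypothetical helper for illustration; the stream length does not
    need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Keep the new item with probability sample_size / (index + 1)
            # by drawing a slot in [0, index] and replacing only if the
            # slot falls inside the reservoir.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir


# Stand-in for the _id values read from a cursor limited to 10,000 documents.
candidate_ids = range(10_000)
chosen_ids = reservoir_sample(candidate_ids, sample_size=1000)
```

In the removed description, the chosen ``_id`` values would then be fetched with a follow-up query; that step is omitted here because it depends on a live deployment.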

source/includes/extracts-query-bar.yaml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ content: |
 shows a sampling of the results. Otherwise, Compass
 shows the entire result set.
 
-For details on sampling, see the :ref:`FAQ <compass-faq-sampling>`.
+For details on sampling, see :ref:`Sampling <sampling>`.
 ---
 ref: query-bar-type-schema
 content: |

source/includes/toc-manage-data.yaml

Lines changed: 5 additions & 0 deletions
@@ -34,4 +34,9 @@ file: /validation
 description: |
 Learn how to ensure that all documents in a collection
 follow a defined set of rules.
+---
+file: /sampling
+description: |
+Learn how |compass-short| samples documents to provide
+insights about a collection.
 ...

source/manage-data.txt

Lines changed: 1 addition & 0 deletions
@@ -19,3 +19,4 @@ Interact with Your Data
 /indexes
 /schema
 /validation
+/sampling

source/sampling.txt

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+.. _sampling:
+
+========
+Sampling
+========
+
+.. default-domain:: mongodb
+
+Sampling in |compass| is the selection of a subset of documents from a
+collection for analysis. Analyzing a sample set of data is a common
+statistical analysis technique; the results of analyzing a sample set
+tend to be similar to the results of analyzing an entire data set.
+
+|compass-short| uses sampling for efficiency. Generally, standard
+sample sets can be selected and analyzed in a few seconds. Analyzing
+large samples or entire collections demands significantly more time and
+processing power.
+
+Sampling Method
+---------------
+
+|compass| samples 1,000 documents from a collection using the
+:manual:`$sample </reference/operator/aggregation/sample/>`
+operator via the
+:manual:`aggregation pipeline </core/aggregation-pipeline>`. This
+provides efficient, random sampling without replacement over an entire
+collection, or over the subset of documents specified by a query.
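
Note on the new Sampling Method text: the page describes sampling 1,000 documents with ``$sample``, optionally over the subset matched by a query. A rough sketch of what such an aggregation pipeline could look like, written as plain Python data, follows. The exact pipeline Compass sends is not shown in this commit, so the helper name ``build_sample_pipeline`` and the choice to put a ``$match`` stage before ``$sample`` are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def build_sample_pipeline(
    query: Optional[Dict[str, Any]] = None,
    sample_size: int = 1000,
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that randomly samples documents.

    Hypothetical helper for illustration. When a query is given, a
    $match stage restricts sampling to the matching documents.
    """
    pipeline: List[Dict[str, Any]] = []
    if query:
        # Narrow the candidate set before sampling.
        pipeline.append({"$match": query})
    # $sample takes the number of documents to select as "size".
    pipeline.append({"$sample": {"size": sample_size}})
    return pipeline


# Sample across the whole collection.
whole_collection = build_sample_pipeline()

# Sample only documents matching a (made-up) query filter.
filtered = build_sample_pipeline({"status": "active"})
```

With a driver such as PyMongo, a pipeline like this would typically be passed to ``collection.aggregate(pipeline)``; the block above stops short of that so it runs without a database.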

source/schema.txt

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ The :guilabel:`Schema` tab provides an overview of the data type
 and shape of the fields in a particular collection. Databases
 and collections are visible in the left-side navigation.
 
-The overview is based on :ref:`sampling <compass-faq-sampling>`
+The overview is based on :ref:`sampling <sampling>`
 the documents in the collection. The schema overview may include
 additional data about the contents of the fields, such as the
 minimum and maximum values of dates and integers, the frequency of
