Commit adec581

DOCSP-16252 Add Compound Index analogy for partition attributes (#157)
* DOCSP-16252 Add Compound Index analogy for partition attributes
* DOCSP-16252 updates for copy review feedback
* DOCSP-16252 updates for feedback
1 parent 5cd8700 commit adec581

1 file changed

+71
-17
lines changed

source/admin/optimize-query-performance.txt

Lines changed: 71 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -27,27 +27,81 @@ following factors:
2727
Data Structure in |s3|
2828
----------------------
2929

30-
For easier management, make sure that your data is
31-
logically grouped into partitions. You can leverage partitions to
32-
improve {+data-lake-short+} performance by mapping them to partition
33-
attributes in your :doc:`configuration
34-
</reference/format/data-lake-configuration>`.
35-
36-
You can improve your {+data-lake-short+}\'s performance by ensuring that
37-
your partition structure maps to your query patterns and that it is
38-
defined in your :doc:`configuration
39-
</reference/format/data-lake-configuration>`. By mapping your *partition
40-
attributes* (the parts of your |s3| prefix that looks like a folder) to
41-
a query attribute, {+data-lake-short+} can selectively open the files
42-
that contain data related to your query. This both reduces the amount of
43-
time a query takes and decreases cost, since {+data-lake-short+} reads
44-
and downloads less files from |aws|.
30+
For easier management, ensure that your data is logically grouped
31+
into partitions. {+adl+} utilizes partitions you create with the field
32+
values that you specify in your :ref:`partition syntax
33+
<datalake-path-syntax>`. You can improve your {+dl+}\'s performance by
34+
ensuring that your partition structure maps to your query patterns and
35+
that the partition structure is defined in your
36+
:datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`. For
37+
the partition, choose fields that you query frequently and order them
38+
from the most frequently queried in the first position to the least
39+
frequently queried field in the last position.
40+
41+
The order of fields listed in the
42+
:datalakeconf:`databases.[n].collections.[n].dataSources.[n].path` is
43+
important in the same way as it is in :manual:`Compound Indexes
44+
</core/index-compound/>`. The specified path corresponds to data that
45+
is partitioned first by the value of the first field, and then by the
46+
value of the next field, and so on.
47+
48+
.. example::
49+
50+
Consider a collection with the ``software``, ``computer``, and
51+
``OS`` fields and partitions on the |s3| bucket named ``metrics``
52+
first for the ``software`` field, followed by the ``computer``
53+
field, and then the ``OS`` field.
54+
55+
.. code-block:: text
56+
:copyable: false
57+
58+
metrics
59+
|--software
60+
   |--computer
61+
      |--OS
62+
63+
{+adl+} uses the partitions for queries on these fields:
64+
65+
- the ``software`` field,
66+
- the ``software`` field and the ``computer`` field,
67+
- the ``software`` field, the ``computer`` field, and the
68+
``OS`` field.
69+
70+
{+adl+} can use the partitions to support a query on the
71+
``software`` and ``OS`` fields. However, in this case, {+adl+} is
72+
not as efficient for the query as it would be if the query were on
73+
the ``software`` and ``computer`` fields only. Partitions are parsed
74+
in order; if a query omits a particular partition, {+adl+} is less
75+
efficient in making use of any partitions that follow the omitted partition.
76+
Because a query on ``software`` and ``OS`` omits ``computer``,
77+
{+adl+} uses the ``software`` partition more efficiently than the
78+
``OS`` partition to support this query.
79+
80+
{+adl+} can't use the partitions to support queries on fields not
81+
specified in the
82+
:datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`.
83+
Also, {+adl+} can't use the partitions to support queries that
84+
include the following fields without the ``software`` field:
85+
86+
- the ``computer`` field,
87+
- the ``OS`` field, or
88+
- the ``computer`` and ``OS`` fields.
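
The ordering rules above can be sketched as a small prefix check. This is a
simplified model written for illustration, not {+adl+}\'s actual planner: the
helper name ``usable_partition_prefix`` is hypothetical, and it treats any
partition after an omitted one as unusable, whereas the text above says such
partitions are still used, only less efficiently.

```python
from typing import Iterable, List


def usable_partition_prefix(
    partition_fields: List[str], query_fields: Iterable[str]
) -> List[str]:
    """Return the leading partition fields that a query constrains.

    Hypothetical helper: walks the partition fields in path order and
    stops at the first field the query does not mention, mirroring the
    prefix behavior of a compound index.
    """
    queried = set(query_fields)
    prefix: List[str] = []
    for field in partition_fields:
        if field not in queried:
            break  # a gap ends the fully usable prefix
        prefix.append(field)
    return prefix


# Partition order taken from the example above.
PARTITION_ORDER = ["software", "computer", "OS"]

print(usable_partition_prefix(PARTITION_ORDER, ["software", "computer"]))
print(usable_partition_prefix(PARTITION_ORDER, ["software", "OS"]))
print(usable_partition_prefix(PARTITION_ORDER, ["computer", "OS"]))
```

In this model, a query on ``software`` and ``OS`` yields only the
``software`` partition as its fully usable prefix, and a query that omits
``software`` yields an empty prefix.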
89+
90+
You can use partitions to improve {+dl+} performance by mapping
91+
them to partition attributes in your :doc:`configuration
92+
</reference/format/data-lake-configuration>`. By mapping your
93+
*partition attributes* (the parts of your |s3| prefix that look like a
94+
folder) to a query attribute, {+adl+} can selectively open the files
95+
that contain data related to your query. This reduces the amount of
96+
time a query takes and decreases cost, because {+dl+} reads and
97+
downloads fewer files from |aws|.
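
To illustrate how mapping prefix segments to query attributes narrows the set
of files that need to be opened, here is a minimal sketch. Everything in it
is an assumption made for illustration: the attribute names ``year`` and
``region``, the object keys, and the helpers ``parse_partitions`` and
``select_files`` are hypothetical and do not reflect how {+adl+} is
implemented.

```python
from typing import Dict, List


def parse_partitions(key: str, attributes: List[str]) -> Dict[str, str]:
    """Map the folder-like segments of an object key to attribute names."""
    # Drop the final segment, which is the file name rather than a partition.
    segments = key.strip("/").split("/")[:-1]
    return dict(zip(attributes, segments))


def select_files(
    keys: List[str], attributes: List[str], query: Dict[str, str]
) -> List[str]:
    """Keep only the keys whose partition values match the query."""
    selected = []
    for key in keys:
        values = parse_partitions(key, attributes)
        # Only query fields that are partition attributes can prune files.
        if all(
            values.get(attr) == wanted
            for attr, wanted in query.items()
            if attr in attributes
        ):
            selected.append(key)
    return selected


# Hypothetical object keys laid out as <year>/<region>/<file>.
KEYS = ["2023/eu/a.json", "2023/us/b.json", "2024/eu/c.json"]
ATTRIBUTES = ["year", "region"]

print(select_files(KEYS, ATTRIBUTES, {"year": "2023", "region": "eu"}))
print(select_files(KEYS, ATTRIBUTES, {"status": "ok"}))
```

A query on a field that is not a partition attribute, such as the
hypothetical ``status`` field here, prunes nothing, so every file remains a
candidate.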
4598

4699
.. example::
47100

48101
Consider an |s3| bucket ``metrics`` with the following structure:
49102

50103
.. code-block:: text
104+
:copyable: false
51105

52106
metrics
53107
|--hardware
@@ -66,9 +120,9 @@ and downloads less files from |aws|.
66120
your configuration. If you issue a query that contains
67121
``{metric_type: software, software_type: computer}``,
68122
{+data-lake-short+} ignores files with the prefix ``/phone``.
69-
123+
70124
For more information on mapping partition attributes to a collection
71-
:datalakeconf:`~databases.[n].collections.[n].dataSources.[n].path`, see
125+
:datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`, see
72126
:ref:`datalake-path-syntax`.
73127

74128
Data File Size
