@@ -27,27 +27,81 @@ following factors:
  Data Structure in |s3|
  ----------------------

- For easier management, make sure that your data is
- logically grouped into partitions. You can leverage partitions to
- improve {+data-lake-short+} performance by mapping them to partition
- attributes in your :doc:`configuration
- </reference/format/data-lake-configuration>`.
-
- You can improve your {+data-lake-short+}\'s performance by ensuring that
- your partition structure maps to your query patterns and that it is
- defined in your :doc:`configuration
- </reference/format/data-lake-configuration>`. By mapping your *partition
- attributes* (the parts of your |s3| prefix that looks like a folder) to
- a query attribute, {+data-lake-short+} can selectively open the files
- that contain data related to your query. This both reduces the amount of
- time a query takes and decreases cost, since {+data-lake-short+} reads
- and downloads less files from |aws|.
+ For easier management, ensure that your data is logically grouped
+ into partitions. {+adl+} utilizes partitions you create with the field
+ values that you specify in your :ref:`partition syntax
+ <datalake-path-syntax>`. You can improve your {+dl+}\'s performance by
+ ensuring that your partition structure maps to your query patterns and
+ the partition structure is defined in your
+ :datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`. For
+ the partition, choose fields that you query frequently and order them
+ from the most frequently queried in the first position to the least
+ queried field in the last position.
+
+ The order of fields listed in the
+ :datalakeconf:`databases.[n].collections.[n].dataSources.[n].path` is
+ important in the same way as it is in :manual:`Compound Indexes
+ </core/index-compound/>`. The specified path corresponds to data that
+ is partitioned first by the value of the first field, and then by the
+ value of the next field, and so on.
+
+ .. example::
+
+    Consider a collection with the ``software``, ``computer``, and
+    ``OS`` fields and partitions on the |s3| bucket named ``metrics``
+    first for the ``software`` field, followed by the ``computer``
+    field, and then the ``OS`` field.
+
+    .. code-block:: text
+       :copyable: false
+
+       metrics
+       |--software
+          |--computer
+             |--OS
+
+    {+adl+} uses the partitions for queries on these fields:
+
+    - the ``software`` field,
+    - the ``software`` field and the ``computer`` field,
+    - the ``software`` field and the ``computer`` field and the
+      ``OS`` field.
+
+    {+adl+} can use the partitions to support a query on the
+    ``software`` and ``OS`` fields. However, in this case, {+adl+} is
+    not as efficient for the query as it would be if the query were on
+    the ``software`` and ``computer`` fields only. Partitions are parsed
+    in order; if a query omits a particular partition, {+adl+} is less
+    efficient in making use of any partitions that follow the omitted one.
+    Because a query on ``software`` and ``OS`` omits ``computer``,
+    {+adl+} uses the ``software`` partition more efficiently than the
+    ``OS`` partition to support this query.
+
+    {+adl+} can't use the partitions to support queries on fields not
+    specified in the
+    :datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`.
+    Also, {+adl+} can't use the partitions to support queries that
+    include the following fields without the ``software`` field:
+
+    - the ``computer`` field,
+    - the ``OS`` field, or
+    - the ``computer`` and ``OS`` fields.
+
+ You can use partitions to improve {+dl+} performance by mapping
+ them to partition attributes in your :doc:`configuration
+ </reference/format/data-lake-configuration>`. By mapping your
+ *partition attributes* (the parts of your |s3| prefix that look like a
+ folder) to a query attribute, {+adl+} can selectively open the files
+ that contain data related to your query. This reduces the amount of
+ time a query takes and decreases cost, because {+dl+} reads and
+ downloads fewer files from |aws|.

  .. example::

     Consider an |s3| bucket ``metrics`` with the following structure:

     .. code-block:: text
+       :copyable: false

        metrics
        |--hardware
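
Note on the hunk above: the partition-ordering rules in the added example (efficient use of leading fields, less efficient use of fields that follow a skipped one, and no use at all when the first field is missing from the query) can be sketched in Python. This is only a minimal model of the behavior the text describes, not how {+adl+} is implemented; the function name ``classify_partition_use`` and its return shape are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def classify_partition_use(
    path_fields: Sequence[str], query_fields: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Model which partition fields a query can use (hypothetical helper).

    Returns (efficient, less_efficient):
      - efficient: queried fields forming an unbroken leading run of the path
      - less_efficient: queried fields that come after a skipped path field
    Both lists are empty when the first path field is not queried, matching
    the statement that such queries cannot use the partitions.
    """
    queried = set(query_fields)
    if not path_fields or path_fields[0] not in queried:
        return [], []

    efficient: List[str] = []
    less_efficient: List[str] = []
    gap_seen = False
    for field in path_fields:  # partitions are parsed in path order
        if field not in queried:
            gap_seen = True
        elif gap_seen:
            less_efficient.append(field)
        else:
            efficient.append(field)
    return efficient, less_efficient


if __name__ == "__main__":
    path = ["software", "computer", "OS"]
    print(classify_partition_use(path, ["software", "OS"]))
    print(classify_partition_use(path, ["computer", "OS"]))
```

With the path order ``software``, ``computer``, ``OS``, a query on ``software`` and ``OS`` yields ``software`` as efficient and ``OS`` as less efficient, while a query on ``computer`` and ``OS`` yields two empty lists.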
@@ -66,9 +120,9 @@ and downloads less files from |aws|.
     your configuration. If you issue a query that contains
     ``{metric_type: software, software_type: computer}``,
     {+data-lake-short+} ignores files with the prefix ``/phone``.
-
+
  For more information on mapping partition attributes to a collection
- :datalakeconf:`~ databases.[n].collections.[n].dataSources.[n].path`, see
+ :datalakeconf:`databases.[n].collections.[n].dataSources.[n].path`, see
  :ref:`datalake-path-syntax`.

  Data File Size
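
Note on the second hunk: the file pruning described there (a query on ``metric_type`` and ``software_type`` lets {+data-lake-short+} skip files under ``/phone``) can be illustrated with a short Python sketch. It assumes partition attributes map positionally onto the folder-like segments of each |s3| key. The helper ``select_keys`` and the sample keys are hypothetical, and the real path syntax is the one described in :ref:`datalake-path-syntax`, which this sketch does not parse.

```python
from typing import Dict, List, Sequence


def select_keys(
    keys: Sequence[str], partition_attrs: Sequence[str], query: Dict[str, str]
) -> List[str]:
    """Keep only the keys whose prefix segments match the queried attributes.

    Hypothetical helper: partition_attrs[i] names the i-th folder-like
    segment of each key. Query fields that are not partition attributes
    cannot prune anything, so they are ignored here.
    """
    selected = []
    for key in keys:
        segments = key.strip("/").split("/")
        # zip stops at the shorter input, so the file name is never mapped
        values = dict(zip(partition_attrs, segments))
        if all(
            values.get(attr) == wanted
            for attr, wanted in query.items()
            if attr in partition_attrs
        ):
            selected.append(key)
    return selected


if __name__ == "__main__":
    # Sample keys are invented for illustration only.
    sample_keys = [
        "software/computer/part-0001.json",
        "software/phone/part-0002.json",
        "hardware/computer/part-0003.json",
    ]
    attrs = ["metric_type", "software_type"]
    query = {"metric_type": "software", "software_type": "computer"}
    print(select_keys(sample_keys, attrs, query))
```

For the sample data, only ``software/computer/part-0001.json`` is selected; the key under ``phone`` is never opened, which is the source of the time and cost savings the text mentions.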