240_Stopwords/40_Divide_and_conquer.asciidoc
Lines changed: 25 additions & 62 deletions
@@ -1,31 +1,19 @@
 [[common-terms]]
-=== Divide and Conquer
+=== Managing Terms Separately (Divide and Conquer)

-The terms in a query string can be divided into more-important (low-frequency)
-and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less
-important terms are probably of very little interest. Really, we want
-documents that match as many of the more important terms as possible.
+The terms in a query string can be divided into two classes: more-important (low-frequency) terms and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less-important terms are probably not very relevant. What we really want are documents that match as many of the more-important terms as possible.

-The `match` query accepts ((("cutoff_frequency parameter")))((("match query", "cutoff_frequency parameter")))a `cutoff_frequency` parameter, which allows it to
-divide the terms in the query string into a low-frequency and high-frequency
-group.((("term frequency", "cutoff_frequency parameter in match query"))) The low-frequency group (more-important terms) form the bulk of the
-query, while the high-frequency group (less-important terms) is used only for
-scoring, not for matching. By treating these two groups differently, we can
-gain a real boost of speed on previously slow queries.

-.Domain-Specific Stopwords
+The `match` query accepts a ((("cutoff_frequency parameter")))((("match query", "cutoff_frequency parameter")))`cutoff_frequency` parameter, which lets it divide the terms in the query string into a low-frequency group and a high-frequency group.((("term frequency", "cutoff_frequency parameter in match query"))) The low-frequency group (the more-important terms) forms the bulk of the query, while the high-frequency group (the less-important terms) is used only for scoring and takes no part in matching. By treating the two groups differently, we can gain a large speedup on previously slow queries.
+
+Domain-Specific Stopwords
 *********************************************

-One of the benefits of `cutoff_frequency` is that you get _domain-specific_
-stopwords for free.((("domain specific stopwords")))((("stopwords", "domain specific"))) For instance, a website about movies may use the words
-_movie_, _color_, _black_, and _white_ so often that they could be
-considered almost meaningless. With the `stop` token filter, these domain-specific terms would have to be added to the stopwords list manually. However,
-because the `cutoff_frequency` looks at the actual frequency of terms in the
-index, these words would be classified as _high frequency_ automatically.
@@ -68,32 +52,21 @@ low-frequency group (`quick`, `dead`) and a high-frequency group (`and`,
 }
 }
 ---------------------------------
-<1> At least one low-frequency/high-importance term _must_ match.
-<2> High-frequency/low-importance terms are entirely optional.
+<1> At least one low-frequency (more-important) term must match.
+<2> High-frequency (less-important) terms are optional.

-The `must` clause means that at least one of the low-frequency terms—`quick` or `dead`—_must_ be present for a document to be considered a
-match. All other documents are excluded. The `should` clause then looks for
-the high-frequency terms `and` and `the`, but only in the documents collected
-by the `must` clause. The sole job of the `should` clause is to score a
-document like ``Quick _and the_ dead'' higher than ``_The_ quick but
-dead''. This approach greatly reduces the number of documents that need to be
-The `minimum_should_match` parameter can be combined with `cutoff_frequency`
-but it applies to only the low-frequency terms.((("stopwords", "low and high frequency terms", "controlling precision")))((("minimum_should_match parameter", "controlling precision"))) This query:
+==== Controlling Precision
+The `minimum_should_match` parameter can be combined with `cutoff_frequency`, but it applies only to the low-frequency terms.((("stopwords", "low and high frequency terms", "controlling precision")))((("minimum_should_match parameter", "controlling precision"))) Take the following query:

 [source,json]
 ---------------------------------
@@ -107,7 +80,7 @@ but it applies to only the low-frequency terms.((("stopwords", "low and high fre
 }
 ---------------------------------

-would be rewritten as follows:
+will be rewritten as shown below:

 [source,json]
 ---------------------------------
@@ -133,18 +106,12 @@ would be rewritten as follows:
 }
 }
 ---------------------------------
-<1> Because there are only two terms, the original 75% is rounded down
-to `1`, that is: _one out of two low-terms must match_.
-<2> The high-frequency terms are still optional and used only for scoring.
+<1> Because there are only two terms, the original 75% is rounded down to `1`, meaning that one of the two low-frequency terms must match.
+<2> The high-frequency terms are still optional and are used only for scoring.

-==== Only High-Frequency Terms
+==== High-Frequency Terms

-An `or` query for high-frequency((("stopwords", "low and high frequency terms", "only high frequency terms"))) terms only—``To be, or not to be''—is
-the worst case for performance. It is pointless to score _all_ the
-documents that contain only one of these terms in order to return just the top
-10 matches. We are really interested only in documents in which the terms all occur
-together, so in the case where there are no low-frequency terms, the query is
-rewritten to make all high-frequency terms required:
+An `or` query made up only of high-frequency terms((("stopwords", "low and high frequency terms", "only high frequency terms"))), such as `To be, or not to be`, is the worst case for performance. It is pointless to score every document that contains just one of these terms only to return the top 10 matches. What we are really interested in are documents in which all the terms occur together, so when there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:

 [source,json]
 ---------------------------------
@@ -162,15 +129,11 @@ rewritten to make all high-frequency terms required:
 }
 ---------------------------------

-==== More Control with Common Terms
+==== Greater Control over Common Terms (More Control with Common Terms)

-While the high/low frequency functionality in the `match` query is useful,
-sometimes you want more control((("stopwords", "low and high frequency terms", "more control over common terms"))) over how the high- and low-frequency groups
-should be handled. The `match` query exposes a subset of the
-functionality available in the `common` terms query.((("common terms query")))
+Although the high/low frequency functionality in the `match` query is useful, sometimes we want more control((("stopwords", "low and high frequency terms", "more control over common terms"))) over how the high- and low-frequency groups are handled. The `match` query exposes only a subset of the functionality available in the ((("common terms query"))) `common` terms query.

-For instance, we could make all low-frequency terms required, and score only
-documents that have 75% of all high-frequency terms with a query like this:
+For instance, we could require all low-frequency terms to match, and score only documents that contain at least 75% of the high-frequency terms:

 [source,json]
 ---------------------------------
@@ -188,5 +151,5 @@ documents that have 75% of all high-frequency terms with a query like this:
 }
 ---------------------------------

-See the {ref}/query-dsl-common-terms-query.html[`common` terms query] reference page for more options.
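
As a reading aid for the rewrite rules described in the changed text, the following is a minimal Python sketch of the idea, not Elasticsearch's actual implementation. It assumes `cutoff_frequency` is a relative value (a fraction of documents), and the helper names `split_by_cutoff`, `required_matches`, and `rewrite_match`, as well as the document-frequency table passed in, are hypothetical and exist only for illustration.

```python
def split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency):
    """Split terms into (low_frequency, high_frequency) groups.

    Assumption: a term is high frequency when the fraction of documents
    containing it exceeds cutoff_frequency.
    """
    low, high = [], []
    for term in terms:
        fraction = doc_freq.get(term, 0) / total_docs
        (high if fraction > cutoff_frequency else low).append(term)
    return low, high


def required_matches(num_terms, percentage):
    """Number of terms that must match for a percentage, rounded down."""
    return num_terms * percentage // 100


def rewrite_match(field, terms, doc_freq, total_docs, cutoff_frequency):
    """Build a bool query in the shape the text describes (illustrative only)."""
    low, high = split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency)

    def clauses(group):
        return [{"term": {field: term}} for term in group]

    if not low:
        # Only high-frequency terms: every one of them becomes required.
        return {"bool": {"must": clauses(high)}}

    # At least one low-frequency term must match.
    query = {"bool": {"must": {"bool": {"should": clauses(low)}}}}
    if high:
        # High-frequency terms are optional and only influence scoring.
        query["bool"]["should"] = {"bool": {"should": clauses(high)}}
    return query


if __name__ == "__main__":
    # Hypothetical document frequencies in an index of 1000 documents.
    freq = {"quick": 5, "and": 900, "the": 950, "dead": 2}
    print(rewrite_match("text", ["quick", "and", "the", "dead"], freq, 1000, 0.01))
    print(required_matches(2, 75))
```

With the hypothetical frequencies above, `quick` and `dead` land in the low-frequency group and `and` and `the` in the high-frequency group, and `required_matches(2, 75)` mirrors the callout about 75% of two terms being rounded down.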