chapter240_part30:240_Stopwords/40_Divide_and_conquer.asciidoc #212

Merged: 10 commits, Jan 6, 2017

@blogsit blogsit commented Aug 23, 2016

1. The translated title needs to be redefined; feedback is welcome.

However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be
required, you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
将操作符参数设置成 `and` 会要求所有低频词都必须匹配,同时对包含所有高频词的文档给予更高评分。但是,在匹配文档时,并不要求文档必须包含所有高频词,如果希望文档包含所有的低频和高频词,我们应该使用一个 `bool` 来替代。正如我们在 `and` 操作符(and Operator)<<stopwords-and>> 中看到的,它的查询效率已经很高了。

“如果希望文档包含所有的低频和高频词,” => change the comma to a period?

Author

Either a comma or a period should be fine; the English original describes this as a single sentence.

Sorry, I marked the wrong place. I meant this sentence: “但是,在匹配文档时,并不要求文档必须包含所有高频词,” Should the end of this sentence be changed to a period?

@xuej
xuej commented Aug 24, 2016

LGTM

@medcl
Member

medcl commented Oct 22, 2016

The build failed.

@luotitan
luotitan commented Nov 29, 2016

The local build did not pass. Could you paste the exception? The software is too large on Mac.

@blogsit
Author

blogsit commented Dec 1, 2016

@luotitan Could you paste the exception? The software is too large on Mac.

dead''. This approach greatly reduces the number of documents that need to be
examined and scored.
`must` 意味着至少有一个低频词 &#x2014;`quick` 或者 `dead` &#x2014;须出现在被匹配文档中。所有其他的文档被排除在外。`should` 语句查找高频词 `and` 和 `the`,但也只是在`must` 语句查询的结果集文档中查询。
`should`语句的唯一的工作就是在对如``Quick _and the_ dead''和``_The_ quick but dead''语句进行评分时,前者得分比后者高。这种方式可以大大减少需要进行评分计算的文档数量。

Member

@pengqiuyuan pengqiuyuan Dec 4, 2016

must 意味着至少有一个低频词— quick 或者 dead —必须出现在被匹配文档中。所有其他的文档被排除在外。should 语句查找高频词 andthe,但也只是在 must 语句查询的结果集文档中查询。
should 语句的唯一的工作就是在对如 Quick _and the_ dead_The_ quick but dead 语句进行评分时,前者得分比后者高。这种方式可以大大减少需要进行评分计算的文档数量。

10 matches. We are really interested only in documents in which the terms all occur
together, so in the case where there are no low-frequency terms, the query is
rewritten to make all high-frequency terms required:
当使用 `or` 查询高频词条((("stopwords", "low and high frequency terms", "only high frequency terms"))),如(&#x2014;``To be, or not to be'')&#x2014;进行查询时性能最差。只是为了返回最匹配的前十个结果就对只是包含这些词的所有文档进行评分是盲目的。我们真正的意图是查询整个&#x2014;(``To be, or not to be'')&#x2014;词条出现的文档,所以在这种情况下,不存低频所言,这个查询需要重写为所有高频词条都必须:
Member

@pengqiuyuan pengqiuyuan Dec 4, 2016

当使用 or 查询高频词条((("stopwords", "low and high frequency terms", "only high frequency terms"))),如— To be, or not to be —进行查询时性能最差。只是为了返回最匹配的前十个结果就对只是包含这些词的所有文档进行评分是盲目的。我们真正的意图是查询整个词条出现的文档,所以在这种情况下,不存低频所言,这个查询需要重写为所有高频词条都必须:

and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less
important terms are probably of very little interest. Really, we want
documents that match as many of the more important terms as possible.
在查询字符串中的词项可以分为更重要 (低频词) 和次重要 (高频词) 这两类。((("stopwords", "low and high frequency terms"))) 只与次重要词项匹配的文档很有可能不太相关。实际上,我们想要文档能尽可能多的匹配那些更重要的词项。
Member

(低频词)(高频词): use English (half-width) parentheses.

considered almost meaningless. With the `stop` token filter, these domain-specific terms would have to be added to the stopwords list manually. However,
because the `cutoff_frequency` looks at the actual frequency of terms in the
index, these words would be classified as _high frequency_ automatically.
`cutoff_frequency` 配置的好处是,你在 _特定领域_ 使用停用词不受约束。((("domain specific stopwords")))((("stopwords", "domain specific")))例如,关于电影网站使用的词 _movie_ ,_color_ ,_black_ 和 _white_ ,这些词我们往往认为几乎没有任何意义。使用 `stop` 词汇单元过滤器,这些特定领域的词必须手动添加到停用词列表中。然而 `cutoff_frequency` 会查看索引里词项的具体频率,这些词会被自动归类为 _高频词汇_ 。
Member

例如,关于电影网站使用的词
Use an English (half-width) comma.

Member

关于电影网站使用的词 moviecolorblackwhite
change to
关于电影网站使用的词 moviecolorblackwhite


*********************************************

Take this query as an example:
以下面查询为例:
Member

以下面查询为例:
Use an English (half-width) colon.

<1> Any term that occurs in more than 1% of documents is considered to be high
frequency. The `cutoff_frequency` can be specified as a fraction (`0.01`)
or as an absolute number (`5`).
<1> 任何词项出现在文档中超过1%,被认为是高频词。`cutoff_frequency` 配置可以指定为一个分数(`0.01`)或者一个正整数(`5`)。
Member

Use English (half-width) parentheses.

This query uses the `cutoff_frequency` to first divide the query terms into a
low-frequency group (`quick`, `dead`) and a high-frequency group (`and`,
`the`). Then, the query is rewritten to produce the following `bool` query:
此查询通过`cutoff_frequency` 配置,将查询条件划分为低频组 (`quick`, `dead`)和高频组 (`and`,`the`)。然后,此查询会被重写为以下的`bool` 查询:
Member

Use English (half-width) parentheses.
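
For reviewers checking the translated passage against its meaning, the rewrite described in the hunk above can be sketched roughly as follows. This is an illustration only, not Elasticsearch's implementation: `split_by_cutoff`, `rewrite_as_bool`, the `text` field name, and the document-frequency numbers are all hypothetical, and the sketch assumes a `cutoff_frequency` below 1 is a fraction of all documents while a value of 1 or more is an absolute document count.

```python
def split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) groups.

    Assumption: a cutoff below 1 is a fraction of total_docs,
    otherwise it is an absolute document count.
    """
    if cutoff_frequency < 1:
        threshold = cutoff_frequency * total_docs
    else:
        threshold = cutoff_frequency
    low = [t for t in terms if doc_freq.get(t, 0) <= threshold]
    high = [t for t in terms if doc_freq.get(t, 0) > threshold]
    return low, high


def rewrite_as_bool(field, low, high):
    """Build a bool-query-shaped dict from the two term groups."""

    def any_of(terms):
        return {"bool": {"should": [{"term": {field: t}} for t in terms]}}

    if not low:
        # No low-frequency terms: every high-frequency term becomes required.
        return {"bool": {"must": [{"term": {field: t}} for t in high]}}

    # Low-frequency terms decide which documents match at all.
    query = {"bool": {"must": any_of(low)}}
    if high:
        # High-frequency terms are optional and only influence the score.
        query["bool"]["should"] = any_of(high)
    return query


# Hypothetical document frequencies in an index of 10,000 documents.
doc_freq = {"quick": 40, "and": 9000, "the": 9500, "dead": 25}
low, high = split_by_cutoff(["quick", "and", "the", "dead"], doc_freq, 10000, 0.01)
query = rewrite_as_bool("text", low, high)
```

With these made-up numbers, `quick` and `dead` land in the low-frequency group and `and` and `the` in the high-frequency group, which matches the grouping the hunk describes.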

document like ``Quick _and the_ dead'' higher than ``_The_ quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.
`must` 意味着至少有一个低频词&#x2014; `quick` 或者 `dead` &#x2014;必须出现在被匹配文档中。所有其他的文档被排除在外。`should` 语句查找高频词 `and` 和 `the`,但也只是在 `must` 语句查询的结果集文档中查询。
Member

Formatting lost; change to
排除在外。 should 语句

Member

Formatting lost; change to
高频词 andthe ,但也

Member

Formatting lost; change to
但也只是在 must 语句

However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be
required, you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
将操作符参数设置成 `and` 会要求所有低频词都必须匹配,同时对包含所有高频词的文档给予更高评分。但是,在匹配文档时,并不要求文档必须包含所有高频词。如果希望文档包含所有的低频和高频词,我们应该使用一个 `bool` 来替代。正如我们在 `and` 操作符(and Operator)<<stopwords-and>> 中看到的,它的查询效率已经很高了。
Member

正如我们在<>中看到的,它的查询效率已经很高了。

<1> Because there are only two terms, the original 75% is rounded down
to `1`, that is: _one out of two low-terms must match_.
<2> The high-frequency terms are still optional and used only for scoring.
<1> 因为只有两个词,原来的75%向下取整为`1`,意思是:必须匹配低频词的两者之一。
Member

Formatting lost; change to
因为只有两个词,原来的75%向下取整为 1
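
The rounding that callout <1> describes can be checked with a small calculation. The helper below is a hypothetical sketch, not part of any Elasticsearch client. It covers only the positive-percentage case mentioned in the hunk and assumes the result is rounded down.

```python
def required_matches(term_count: int, percent: int) -> int:
    """Number of terms that must match for a positive percentage,
    rounded down (integer arithmetic avoids float rounding surprises)."""
    return (term_count * percent) // 100


# Two low-frequency terms at 75%: 1.5 rounds down to 1.
two_terms = required_matches(2, 75)
# Four terms at 75%: exactly 3, so nothing is lost to rounding.
four_terms = required_matches(4, 75)
```

So with two low-frequency terms, 75% means one of the two must match, which is what the translated callout needs to convey.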

@medcl
Member

medcl commented Dec 30, 2016

@blogsit Please preview it locally yourself; there are still a few minor issues.

@blogsit
Author

blogsit commented Jan 5, 2017

@medcl Could you post a screenshot of the minor issues? I don't have a local environment. Thanks.

@medcl
Member

medcl commented Jan 5, 2017

[Screenshot: FireShot capture 24, the rendered "Divide and Conquer" page]

@medcl medcl merged commit 614a94e into elasticsearch-cn:cn Jan 6, 2017
5 participants