chapter240_part30:240_Stopwords/40_Divide_and_conquer.asciidoc #212
However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be
required, you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
将操作符参数设置成 `and` 会要求所有低频词都必须匹配,同时对包含所有高频词的文档给予更高评分。但是,在匹配文档时,并不要求文档必须包含所有高频词,如果希望文档包含所有的低频和高频词,我们应该使用一个 `bool` 来替代。正如我们在 `and` 操作符(and Operator)<<stopwords-and>> 中看到的,它的查询效率已经很高了。
“如果希望文档包含所有的低频和高频词,” => change the comma to a full stop?
I think either a comma or a full stop works; judging from the English original, this is written as a single sentence.
Sorry, I marked the wrong spot. I meant this clause: “但是,在匹配文档时,并不要求文档必须包含所有高频词,” Should the comma at its end become a full stop?
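LGTM
Build failed
The local build did not pass. Could you paste the error output? The software is too large to set up on my Mac.
@luotitan Could you paste the error output? The software is too large to set up on my Mac.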
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.
`must` 意味着至少有一个低频词 —`quick` 或者 `dead` —须出现在被匹配文档中。所有其他的文档被排除在外。`should` 语句查找高频词 `and` 和 `the`,但也只是在`must` 语句查询的结果集文档中查询。
`should`语句的唯一的工作就是在对如``Quick _and the_ dead''和``_The_ quick but dead''语句进行评分时,前者得分比后者高。这种方式可以大大减少需要进行评分计算的文档数量。
`must` 意味着至少有一个低频词— `quick` 或者 `dead` —必须出现在被匹配文档中。所有其他的文档被排除在外。`should` 语句查找高频词 `and` 和 `the` ,但也只是在 `must` 语句查询的结果集文档中查询。
`should` 语句的唯一的工作就是在对如 `Quick _and the_ dead` 和 `_The_ quick but dead` 语句进行评分时,前者得分比后者高。这种方式可以大大减少需要进行评分计算的文档数量。
10 matches. We are really interested only in documents in which the terms all occur
together, so in the case where there are no low-frequency terms, the query is
rewritten to make all high-frequency terms required:
当使用 `or` 查询高频词条((("stopwords", "low and high frequency terms", "only high frequency terms"))),如(—``To be, or not to be'')—进行查询时性能最差。只是为了返回最匹配的前十个结果就对只是包含这些词的所有文档进行评分是盲目的。我们真正的意图是查询整个—(``To be, or not to be'')—词条出现的文档,所以在这种情况下,不存低频所言,这个查询需要重写为所有高频词条都必须:
当使用 `or` 查询高频词条((("stopwords", "low and high frequency terms", "only high frequency terms"))),如— `To be, or not to be` —进行查询时性能最差。只是为了返回最匹配的前十个结果就对只是包含这些词的所有文档进行评分是盲目的。我们真正的意图是查询整个词条出现的文档,所以在这种情况下,不存低频所言,这个查询需要重写为所有高频词条都必须:
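For reference while reviewing this passage, the rewrite it describes (no low-frequency terms, so every high-frequency term becomes required) can be sketched in Python as follows. This is only an illustration under assumptions: the helper name `rewrite_all_high_frequency` and the field name `text` are hypothetical, and the whitespace split stands in for whatever analyzer the field really uses. It is not Elasticsearch's own implementation.

```python
def rewrite_all_high_frequency(field, terms):
    """Return a bool query dict in which every term is a required clause.

    Hypothetical helper: models the case where a query contains no
    low-frequency terms, so all high-frequency terms move into `must`.
    """
    return {"bool": {"must": [{"term": {field: t}} for t in terms]}}


# Naive whitespace tokenization, assumed for illustration only.
terms = "to be or not to be".split()
query = rewrite_all_high_frequency("text", terms)
```

With every term required, only documents containing all of the terms need to be scored, which is the point the passage makes.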
and less-important (high-frequency) terms.((("stopwords", "low and high frequency terms"))) Documents that match only the less
important terms are probably of very little interest. Really, we want
documents that match as many of the more important terms as possible.
在查询字符串中的词项可以分为更重要 (低频词) 和次重要 (高频词) 这两类。((("stopwords", "low and high frequency terms"))) 只与次重要词项匹配的文档很有可能不太相关。实际上,我们想要文档能尽可能多的匹配那些更重要的词项。
(低频词)(高频词): English parentheses.
considered almost meaningless. With the `stop` token filter, these domain-specific terms would have to be added to the stopwords list manually. However,
because the `cutoff_frequency` looks at the actual frequency of terms in the
index, these words would be classified as _high frequency_ automatically.
`cutoff_frequency` 配置的好处是,你在 _特定领域_ 使用停用词不受约束。((("domain specific stopwords")))((("stopwords", "domain specific")))例如,关于电影网站使用的词 _movie_ ,_color_ ,_black_ 和 _white_ ,这些词我们往往认为几乎没有任何意义。使用 `stop` 词汇单元过滤器,这些特定领域的词必须手动添加到停用词列表中。然而 `cutoff_frequency` 会查看索引里词项的具体频率,这些词会被自动归类为 _高频词汇_ 。
“例如,关于电影网站使用的词”: English comma.
关于电影网站使用的词 movie ,color ,black 和 white ,
Change to:
关于电影网站使用的词 movie 、 color 、 black 和 white ,
*********************************************

Take this query as an example:
以下面查询为例:
“以下面查询为例:”: English colon.
<1> Any term that occurs in more than 1% of documents is considered to be high
frequency. The `cutoff_frequency` can be specified as a fraction (`0.01`)
or as an absolute number (`5`).
<1> 任何词项出现在文档中超过1%,被认为是高频词。`cutoff_frequency` 配置可以指定为一个分数(`0.01`)或者一个正整数(`5`)。
English parentheses.
This query uses the `cutoff_frequency` to first divide the query terms into a
low-frequency group (`quick`, `dead`) and a high-frequency group (`and`,
`the`). Then, the query is rewritten to produce the following `bool` query:
此查询通过`cutoff_frequency` 配置,将查询条件划分为低频组 (`quick`, `dead`)和高频组 (`and`,`the`)。然后,此查询会被重写为以下的`bool` 查询:
English parentheses.
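As a cross-check for the translation of this hunk, the split-then-rewrite step it describes can be modeled roughly in Python as below. Everything here is an assumption made for illustration: the function names, the field name `text`, and the document frequencies are invented, and the code only mimics the behaviour described in the passage rather than reproducing Elasticsearch internals. A `cutoff_frequency` below 1 is treated as a fraction of all documents and a value of 1 or more as an absolute document count.

```python
def split_by_cutoff(terms, doc_freq, total_docs, cutoff_frequency):
    """Split terms into (low_frequency, high_frequency) groups.

    Hypothetical model: a value below 1 is a fraction of all documents,
    a value of 1 or more is an absolute document count.
    """
    if cutoff_frequency < 1:
        threshold = cutoff_frequency * total_docs
    else:
        threshold = cutoff_frequency
    # A term is high frequency when it occurs in MORE documents than the threshold.
    low = [t for t in terms if doc_freq.get(t, 0) <= threshold]
    high = [t for t in terms if doc_freq.get(t, 0) > threshold]
    return low, high


def rewrite_as_bool(field, low, high):
    """Low-frequency terms go under `must`, high-frequency terms under `should`."""
    return {
        "bool": {
            "must": {"bool": {"should": [{"term": {field: t}} for t in low]}},
            "should": {"bool": {"should": [{"term": {field: t}} for t in high]}},
        }
    }


# Made-up document frequencies for a 1,000-document index (illustration only).
doc_freq = {"quick": 4, "and": 800, "the": 900, "dead": 3}
low, high = split_by_cutoff(["quick", "and", "the", "dead"], doc_freq, 1000, 0.01)
query = rewrite_as_bool("text", low, high)
```

With these invented numbers the 1% cutoff works out to 10 documents, so `quick` and `dead` land in the low-frequency group and `and` and `the` in the high-frequency group, matching the grouping the hunk describes.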
document like ``Quick _and the_ dead'' higher than ``_The_ quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.
`must` 意味着至少有一个低频词— `quick` 或者 `dead` —必须出现在被匹配文档中。所有其他的文档被排除在外。`should` 语句查找高频词 `and` 和 `the`,但也只是在 `must` 语句查询的结果集文档中查询。
Formatting lost; change to:
排除在外。 `should` 语句
Formatting lost; change to:
高频词 `and` 和 `the` ,但也
Formatting lost; change to:
但也只是在 `must` 语句
However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be
required, you should use a `bool` query instead. As we saw in
<<stopwords-and>>, this is already an efficient query.
将操作符参数设置成 `and` 会要求所有低频词都必须匹配,同时对包含所有高频词的文档给予更高评分。但是,在匹配文档时,并不要求文档必须包含所有高频词。如果希望文档包含所有的低频和高频词,我们应该使用一个 `bool` 来替代。正如我们在 `and` 操作符(and Operator)<<stopwords-and>> 中看到的,它的查询效率已经很高了。
正如我们在<<stopwords-and>>中看到的,它的查询效率已经很高了。
<1> Because there are only two terms, the original 75% is rounded down
to `1`, that is: _one out of two low-terms must match_.
<2> The high-frequency terms are still optional and used only for scoring.
<1> 因为只有两个词,原来的75%向下取整为`1`,意思是:必须匹配低频词的两者之一。
Formatting lost; change to:
因为只有两个词,原来的75%向下取整为 `1` ,
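The rounding in callout <1> can be checked with a few lines of Python. This is a minimal sketch that assumes only the plain positive-percentage form discussed in the hunk; the function name `required_matches` is hypothetical, and other forms that `minimum_should_match` may accept are deliberately not modeled.

```python
def required_matches(term_count, percent):
    """Number of terms that must match for a positive percentage, rounded down.

    Hypothetical helper: integer arithmetic avoids floating-point surprises.
    """
    return (term_count * percent) // 100


low_terms = ["quick", "dead"]
needed = required_matches(len(low_terms), 75)  # 2 * 75 // 100
```

For the two low-frequency terms, 75% of 2 is 1.5, which rounds down to 1, so one of the two terms must match, as the callout says.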
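@blogsit Please preview it locally yourself; there are still a few minor issues.
@medcl Could you post screenshots of the minor issues? I don't have a local environment. Thanks.
1. The translated heading text needs to be redefined. Feedback is welcome.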