
Commit 8f7b52a

rustdoc-search: use set ops for ranking and filtering
This commit adds ranking and quick filtering to type-based search, improving performance and ordering results by their type signatures.

Motivation
----------

If I write a query like `str -> String`, a lot of functions come up. That's to be expected, but `String::from_str` should come up on top, and it doesn't right now. This is because the sorting algorithm is based on the function's name, and doesn't consider the type signature at all. `slice::join` even comes up above it!

To fix this, the sorting should take into account the function's signature, and the closer match should come up on top.

Guide-level description
-----------------------

When searching by type signature, types with a "closer" match will show up above types that match less precisely.

Reference-level explanation
---------------------------

Function signature search works in three major phases:

* A compact "fingerprint," based on the [bloom filter] technique, is used to check for matches and to estimate the distance. It sometimes has false positive matches, but it also operates on 128 bits of contiguous memory and requires no backtracking, so it performs a lot better than real unification.

  The fingerprint represents the set of items in the type signature, but it does not represent nesting, and it ignores when the same item appears more than once.

  The result is rejected if any query bits are absent in the function, or if the distance is higher than the current maximum and 200 results have already been found.

* The second step performs unification. This is where nesting and true bag semantics are taken into account, and it has no false positives. It uses a recursive, backtracking algorithm.

  The result is rejected if any query elements are absent in the function.

[bloom filter]: https://en.wikipedia.org/wiki/Bloom_filter

Drawbacks
---------

This makes the code bigger.

More than that, this design is a subtle trade-off. It makes the cases I've tested against measurably faster, but it's not clear how well this extends to other crates with potentially more functions and fewer types.

The more complex things get, the more important it is to gather a good set of data to test with (this is arguably more important than the actual benchmarking infrastructure right now).

Rationale and alternatives
--------------------------

Throwing a bloom filter in front makes it faster.

More than that, it tries to take a tactic where the system can not only check for potential matches, but also gets an accurate distance function without needing to do unification. That way it can skip unification even on items that have the needed elements, as long as they have more items than the currently found maximum.

If I didn't want to be able to cheaply do set operations on the fingerprint, a [cuckoo filter] is supposed to have better performance. But the nice bit-banging set intersection doesn't work AFAIK.

I also looked into [minhashing], but since it's actually an unbiased estimate of the similarity coefficient, I'm not sure how it could be used to skip unification (I wouldn't know if the estimate was too low or too high).

This function actually uses the number of distinct items as its "distance function." This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute. This is because:

* The function $F$ must be a superset of the query $Q$, so their union is just $F$ and their intersection is $Q$, and the distance can be reduced to $1-\frac{|Q|}{|F|}$.

* There are no magic thresholds. These values are only being used to compare against each other while sorting (and, if 200 results are found, to compare with the maximum match). This means we only care if one value is bigger than the other, not what its actual value is. Since $Q$ is the same for everything, it can be safely left out, reducing the formula to $1-\frac{1}{|F|} = \frac{|F|}{|F|}-\frac{1}{|F|} = \frac{|F|-1}{|F|}$. This grows monotonically with $|F|$, so, since the values are only being compared with each other, $|F|$ is fine.

Prior art
---------

This is significantly different from how Hoogle does it. It doesn't account for order, and it has no special account for nesting, though `Box<t>` is still two items, while `t` is only one.

This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.

Unresolved questions
--------------------

`[]` and `()`, the slice/array and tuple/unit operators, are ignored while building the signature for the query. This is because they match more than one thing, making them ambiguous. Unfortunately, this also makes them a performance cliff. Is this likely to be a problem?

Right now, the system just stashes the type distance into the same field that levenshtein distance normally goes in. This means exact query matches show up on top (for example, if you have a function like `fn nothing(a: Nothing, b: i32)`, then searching for `nothing` will show it on top even if there's another function with `fn bar(x: Nothing)` that's technically a closer match in type signature).

Future possibilities
--------------------

It should be possible to adopt more sorting criteria to act as a tie breaker, which could be determined during unification.

[cuckoo filter]: https://en.wikipedia.org/wiki/Cuckoo_filter
[minhashing]: https://en.wikipedia.org/wiki/MinHash
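
The ordering argument in the rationale can be checked numerically. The sketch below is an illustration only and is not code from this commit: `jaccardDistance`, the sample names, and the sample sets are all hypothetical, and it assumes every function set is a superset of the query set.

```javascript
// Hypothetical helper: Jaccard distance between two sets of type ids.
function jaccardDistance(a, b) {
    let intersection = 0;
    for (const x of a) {
        if (b.has(x)) {
            intersection += 1;
        }
    }
    const union = a.size + b.size - intersection;
    return 1 - intersection / union;
}

// Made-up query and function signatures, as sets of distinct type ids.
// Every function set contains the whole query set.
const query = new Set([1, 2]);
const functions = [
    { name: "f3", items: new Set([1, 2, 3, 4, 5]) },
    { name: "f1", items: new Set([1, 2]) },
    { name: "f2", items: new Set([1, 2, 3]) },
];

// Rank by the full Jaccard distance to the query.
const byJaccard = functions
    .slice()
    .sort((a, b) => jaccardDistance(a.items, query) - jaccardDistance(b.items, query))
    .map(f => f.name);

// Rank by the number of distinct items alone.
const bySize = functions
    .slice()
    .sort((a, b) => a.items.size - b.items.size)
    .map(f => f.name);

console.log(byJaccard.join(","), bySize.join(","));
```

Both rankings come out in the same order here, which is the property the commit relies on when it stores only the distinct-item count.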
1 parent 398ea33 commit 8f7b52a

9 files changed: +318 -73 lines changed

src/librustdoc/html/static/js/externs.js

+2 -1
@@ -14,7 +14,7 @@ function initSearch(searchIndex){}
  *     pathWithoutLast: Array<string>,
  *     pathLast: string,
  *     generics: Array<QueryElement>,
- *     bindings: Map<(string|integer), Array<QueryElement>>,
+ *     bindings: Map<integer, Array<QueryElement>>,
  * }}
  */
 let QueryElement;
@@ -42,6 +42,7 @@ let ParserState;
  *     totalElems: number,
  *     literalSearch: boolean,
  *     corrections: Array<{from: string, to: integer}>,
+ *     typeFingerprint: Uint32Array,
  * }}
  */
 let ParsedQuery;

src/librustdoc/html/static/js/search.js

+199 -57
@@ -238,6 +238,10 @@ function initSearch(rawSearchIndex) {
      * @type {Array<Row>}
      */
     let searchIndex;
+    /**
+     * @type {Uint32Array}
+     */
+    let functionTypeFingerprint;
     let currentResults;
     /**
      * Map from normalized type names to integers. Used to make type search
@@ -1043,6 +1047,8 @@ function initSearch(rawSearchIndex) {
             correction: null,
             proposeCorrectionFrom: null,
             proposeCorrectionTo: null,
+            // bloom filter built from type ids
+            typeFingerprint: new Uint32Array(4),
         };
     }
 
@@ -1138,7 +1144,6 @@ function initSearch(rawSearchIndex) {
             query.error = err;
             return query;
         }
-
         if (!query.literalSearch) {
             // If there is more than one element in the query, we switch to literalSearch in any
             // case.
@@ -1946,8 +1951,7 @@ function initSearch(rawSearchIndex) {
          * @param {integer} path_dist
          */
         function addIntoResults(results, fullId, id, index, dist, path_dist, maxEditDistance) {
-            const inBounds = dist <= maxEditDistance || index !== -1;
-            if (dist === 0 || (!parsedQuery.literalSearch && inBounds)) {
+            if (dist <= maxEditDistance || index !== -1) {
                 if (results.has(fullId)) {
                     const result = results.get(fullId);
                     if (result.dontValidate || result.dist <= dist) {
@@ -1995,17 +1999,37 @@ function initSearch(rawSearchIndex) {
             const fullId = row.id;
             const searchWord = searchWords[pos];
 
-            const in_args = row.type && row.type.inputs
-                && checkIfInList(row.type.inputs, elem, row.type.where_clause);
-            if (in_args) {
-                // path_dist is 0 because no parent path information is currently stored
-                // in the search index
-                addIntoResults(results_in_args, fullId, pos, -1, 0, 0, maxEditDistance);
-            }
-            const returned = row.type && row.type.output
-                && checkIfInList(row.type.output, elem, row.type.where_clause);
-            if (returned) {
-                addIntoResults(results_returned, fullId, pos, -1, 0, 0, maxEditDistance);
+            // tfpDist is a minimum possible type distance, where "type distance" is the number of
+            // atoms in the function not present in the query
+            const tfpDist = compareTypeFingerprints(
+                fullId,
+                parsedQuery.typeFingerprint
+            );
+            if (tfpDist !== null &&
+                !(results_in_args.size >= MAX_RESULTS && tfpDist > results_in_args.max_dist)
+            ) {
+                const in_args = row.type && row.type.inputs
+                    && checkIfInList(row.type.inputs, elem, row.type.where_clause);
+                if (in_args) {
+                    results_in_args.max_dist = Math.max(results_in_args.max_dist || 0, tfpDist);
+                    const maxDist = results_in_args.size < MAX_RESULTS ?
+                        (tfpDist + 1) :
+                        results_in_args.max_dist;
+                    addIntoResults(results_in_args, fullId, pos, -1, tfpDist, 0, maxDist);
+                }
+            }
+            if (tfpDist !== null &&
+                !(results_returned.size >= MAX_RESULTS && tfpDist > results_returned.max_dist)
+            ) {
+                const returned = row.type && row.type.output
+                    && checkIfInList(row.type.output, elem, row.type.where_clause);
+                if (returned) {
+                    results_returned.max_dist = Math.max(results_returned.max_dist || 0, tfpDist);
+                    const maxDist = results_returned.size < MAX_RESULTS ?
+                        (tfpDist + 1) :
+                        results_returned.max_dist;
+                    addIntoResults(results_returned, fullId, pos, -1, tfpDist, 0, maxDist);
+                }
             }
 
             if (!typePassesFilter(elem.typeFilter, row.ty)) {
@@ -2064,6 +2088,17 @@ function initSearch(rawSearchIndex) {
                 return;
             }
 
+            const tfpDist = compareTypeFingerprints(
+                row.id,
+                parsedQuery.typeFingerprint
+            );
+            if (tfpDist === null) {
+                return;
+            }
+            if (results.size >= MAX_RESULTS && tfpDist > results.max_dist) {
+                return;
+            }
+
             // If the result is too "bad", we return false and it ends this search.
             if (!unifyFunctionTypes(
                 row.type.inputs,
@@ -2082,11 +2117,12 @@ function initSearch(rawSearchIndex) {
                 return;
             }
 
-            addIntoResults(results, row.id, pos, 0, 0, 0, Number.MAX_VALUE);
+            results.max_dist = Math.max(results.max_dist || 0, tfpDist);
+            addIntoResults(results, row.id, pos, 0, tfpDist, 0, Number.MAX_VALUE);
         }
 
         function innerRunQuery() {
-            let elem, i, nSearchWords, in_returned, row;
+            let elem, i, nSearchWords;
 
             let queryLen = 0;
             for (const elem of parsedQuery.elems) {
@@ -2204,50 +2240,30 @@ function initSearch(rawSearchIndex) {
                 );
             }
 
+            const fps = new Set();
             for (const elem of parsedQuery.elems) {
                 convertNameToId(elem);
+                buildFunctionTypeFingerprint(elem, parsedQuery.typeFingerprint, fps);
             }
             for (const elem of parsedQuery.returned) {
                 convertNameToId(elem);
+                buildFunctionTypeFingerprint(elem, parsedQuery.typeFingerprint, fps);
             }
 
-            if (parsedQuery.foundElems === 1) {
-                if (parsedQuery.elems.length === 1) {
-                    elem = parsedQuery.elems[0];
-                    for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
-                        // It means we want to check for this element everywhere (in names, args and
-                        // returned).
-                        handleSingleArg(
-                            searchIndex[i],
-                            i,
-                            elem,
-                            results_others,
-                            results_in_args,
-                            results_returned,
-                            maxEditDistance
-                        );
-                    }
-                } else if (parsedQuery.returned.length === 1) {
-                    // We received one returned argument to check, so looking into returned values.
-                    elem = parsedQuery.returned[0];
-                    for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
-                        row = searchIndex[i];
-                        in_returned = row.type && unifyFunctionTypes(
-                            row.type.output,
-                            parsedQuery.returned,
-                            row.type.where_clause
-                        );
-                        if (in_returned) {
-                            addIntoResults(
-                                results_others,
-                                row.id,
-                                i,
-                                -1,
-                                0,
-                                Number.MAX_VALUE
-                            );
-                        }
-                    }
+            if (parsedQuery.foundElems === 1 && parsedQuery.returned.length === 0) {
+                elem = parsedQuery.elems[0];
+                for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
+                    // It means we want to check for this element everywhere (in names, args and
+                    // returned).
+                    handleSingleArg(
+                        searchIndex[i],
+                        i,
+                        elem,
+                        results_others,
+                        results_in_args,
+                        results_returned,
+                        maxEditDistance
+                    );
                 }
             } else if (parsedQuery.foundElems > 0) {
                 for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
@@ -2794,6 +2810,97 @@ ${item.displayPath}<span class="${type}">${name}</span>\
         };
     }
 
+    /**
+     * Type fingerprints allow fast, approximate matching of types.
+     *
+     * This algo creates a compact representation of the type set using a Bloom filter.
+     * This fingerprint is used three ways:
+     *
+     * - It accelerates the matching algorithm by checking the function fingerprint against the
+     *   query fingerprint. If any bits are set in the query but not in the function, it can't
+     *   match.
+     *
+     * - The fourth section has the number of distinct items in the set.
+     *   This is the distance function, used for filtering and for sorting.
+     *
+     * [^1]: Distance is the relatively naive metric of counting the number of distinct items in
+     * the function that are not present in the query.
+     *
+     * @param {FunctionType|QueryElement} type - a single type
+     * @param {Uint32Array} output - write the fingerprint to this data structure: uses 128 bits
+     * @param {Set<number>} fps - Set of distinct items
+     */
+    function buildFunctionTypeFingerprint(type, output, fps) {
+
+        let input = type.id;
+        // All forms of `[]` get collapsed down to one thing in the bloom filter.
+        // Differentiating between arrays and slices, if the user asks for it, is
+        // still done in the matching algorithm.
+        if (input === typeNameIdOfArray || input === typeNameIdOfSlice) {
+            input = typeNameIdOfArrayOrSlice;
+        }
+        if (input !== null) {
+            // https://docs.rs/rustc-hash/1.1.0/src/rustc_hash/lib.rs.html#60
+            // Rotate is skipped because we're only doing one cycle anyway.
+            const h0 = Math.imul(input, 0x9e3779b9);
+            const h1 = Math.imul(479001599 ^ input, 0x9e3779b9);
+            const h2 = Math.imul(433494437 ^ input, 0x9e3779b9);
+            output[0] |= 1 << (h0 % 32);
+            output[1] |= 1 << (h1 % 32);
+            output[2] |= 1 << (h2 % 32);
+            fps.add(input);
+        }
+        for (const g of type.generics) {
+            buildFunctionTypeFingerprint(g, output, fps);
+        }
+        const fb = {
+            id: null,
+            ty: 0,
+            generics: [],
+            bindings: new Map(),
+        };
+        for (const [k, v] of type.bindings.entries()) {
+            fb.id = k;
+            fb.generics = v;
+            buildFunctionTypeFingerprint(fb, output, fps);
+        }
+        output[3] = fps.size;
+    }
+
+    /**
+     * Compare the query fingerprint with the function fingerprint.
+     *
+     * @param {number} fullId - The function
+     * @param {Uint32Array} queryFingerprint - The query
+     * @returns {number|null} - Null if non-match, number if distance
+     *                          This function might return 0!
+     */
+    function compareTypeFingerprints(fullId, queryFingerprint) {
+
+        const fh0 = functionTypeFingerprint[fullId * 4];
+        const fh1 = functionTypeFingerprint[(fullId * 4) + 1];
+        const fh2 = functionTypeFingerprint[(fullId * 4) + 2];
+        const [qh0, qh1, qh2] = queryFingerprint;
+        // Approximate set intersection with bloom filters.
+        // This can be larger than reality, not smaller, because hashes have
+        // the property that if they've got the same value, they hash to the
+        // same thing. False positives exist, but not false negatives.
+        const [in0, in1, in2] = [fh0 & qh0, fh1 & qh1, fh2 & qh2];
+        // Approximate the set of items in the query but not the function.
+        // This might be smaller than reality, but cannot be bigger.
+        //
+        // | in_ | qh_ | XOR | Meaning                                          |
+        // | --- | --- | --- | ------------------------------------------------ |
+        // |  0  |  0  |  0  | Not present                                      |
+        // |  1  |  0  |  1  | IMPOSSIBLE because `in_` is `fh_ & qh_`          |
+        // |  1  |  1  |  0  | Present in both, though possibly a false positive |
+        // |  0  |  1  |  1  | Since in_ has no false negatives, must be real   |
+        if ((in0 ^ qh0) || (in1 ^ qh1) || (in2 ^ qh2)) {
+            return null;
+        }
+        return functionTypeFingerprint[(fullId * 4) + 3];
+    }
+
     function buildIndex(rawSearchIndex) {
         searchIndex = [];
         /**
@@ -2813,6 +2920,22 @@ ${item.displayPath}<span class="${type}">${name}</span>\
         typeNameIdOfSlice = buildTypeMapIndex("slice");
         typeNameIdOfArrayOrSlice = buildTypeMapIndex("[]");
 
+        // Function type fingerprints are 128-bit bloom filters that are used to
+        // estimate the distance between function and query.
+        // This loop counts the number of items to allocate a fingerprint for.
+        for (const crate in rawSearchIndex) {
+            if (!hasOwnPropertyRustdoc(rawSearchIndex, crate)) {
+                continue;
+            }
+            // Each item gets an entry in the fingerprint array, and the crate
+            // does, too
+            id += rawSearchIndex[crate].t.length + 1;
+        }
+        functionTypeFingerprint = new Uint32Array((id + 1) * 4);
+
+        // This loop actually generates the search item indexes, including
+        // normalized names, type signature objects and fingerprints, and aliases.
+        id = 0;
         for (const crate in rawSearchIndex) {
             if (!hasOwnPropertyRustdoc(rawSearchIndex, crate)) {
                 continue;
@@ -2962,17 +3085,36 @@ ${item.displayPath}<span class="${type}">${name}</span>\
                 }
                 searchWords.push(word);
                 const path = itemPaths.has(i) ? itemPaths.get(i) : lastPath;
+                let type = null;
+                if (itemFunctionSearchTypes[i] !== 0) {
+                    type = buildFunctionSearchType(
+                        itemFunctionSearchTypes[i],
+                        lowercasePaths
+                    );
+                    if (type) {
+                        const fp = functionTypeFingerprint.subarray(id * 4, (id + 1) * 4);
+                        const fps = new Set();
+                        for (const t of type.inputs) {
+                            buildFunctionTypeFingerprint(t, fp, fps);
+                        }
+                        for (const t of type.output) {
+                            buildFunctionTypeFingerprint(t, fp, fps);
+                        }
+                        for (const w of type.where_clause) {
+                            for (const t of w) {
+                                buildFunctionTypeFingerprint(t, fp, fps);
+                            }
+                        }
+                    }
+                }
                 const row = {
                     crate: crate,
                     ty: itemTypes.charCodeAt(i) - charA,
                     name: itemNames[i],
                     path: path,
                     desc: itemDescs[i],
                     parent: itemParentIdxs[i] > 0 ? paths[itemParentIdxs[i] - 1] : undefined,
-                    type: buildFunctionSearchType(
-                        itemFunctionSearchTypes[i],
-                        lowercasePaths
-                    ),
+                    type,
                     id: id,
                     normalizedName: word.indexOf("_") === -1 ? word : word.replace(/_/g, ""),
                     deprecated: deprecatedItems.has(i),
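
The fingerprint build and compare steps in this file can be tried in isolation. The following is a simplified, standalone sketch rather than the code from the diff: `addToFingerprint`, `fingerprintOf`, and `compare` are hypothetical names, type ids are plain integers with no generics or bindings, and each fingerprint lives in its own `Uint32Array(4)` instead of one shared array. The hash constants are the ones used in `buildFunctionTypeFingerprint`.

```javascript
// Set one bit per 32-bit word for a type id, and track the distinct ids.
function addToFingerprint(fp, distinct, id) {
    const h0 = Math.imul(id, 0x9e3779b9);
    const h1 = Math.imul(479001599 ^ id, 0x9e3779b9);
    const h2 = Math.imul(433494437 ^ id, 0x9e3779b9);
    // `<<` only uses the low five bits of its right operand, so a negative
    // remainder still selects a bit position in the range 0..31.
    fp[0] |= 1 << (h0 % 32);
    fp[1] |= 1 << (h1 % 32);
    fp[2] |= 1 << (h2 % 32);
    distinct.add(id);
    // The fourth word stores the number of distinct ids seen so far.
    fp[3] = distinct.size;
}

// Build a 128-bit fingerprint from a list of type ids.
function fingerprintOf(ids) {
    const fp = new Uint32Array(4);
    const distinct = new Set();
    for (const id of ids) {
        addToFingerprint(fp, distinct, id);
    }
    return fp;
}

// Returns null when some query bit is missing from the function fingerprint,
// otherwise the function's distinct-item count (the "distance").
function compare(functionFp, queryFp) {
    for (let i = 0; i < 3; i += 1) {
        // Bits set in the query but not in the function.
        if ((queryFp[i] & ~functionFp[i]) !== 0) {
            return null;
        }
    }
    return functionFp[3];
}

const fnFp = fingerprintOf([10, 20, 30]);
const queryFp = fingerprintOf([10, 30]);
console.log(compare(fnFp, queryFp)); // 3
console.log(compare(fingerprintOf([]), queryFp)); // null
console.log(fingerprintOf([10, 10, 20])[3]); // 2
```

A query whose ids are a subset of the function's ids always passes the bit check, which is why the filter has no false negatives. A query with unrelated ids may also pass if its bits happen to collide, which is why unification still runs afterwards.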

tests/rustdoc-js/assoc-type.js

+3 -3
@@ -7,16 +7,16 @@ const EXPECTED = [
         'query': 'iterator<something> -> u32',
         'correction': null,
         'others': [
-            { 'path': 'assoc_type', 'name': 'my_fn' },
             { 'path': 'assoc_type::my', 'name': 'other_fn' },
+            { 'path': 'assoc_type', 'name': 'my_fn' },
         ],
     },
     {
         'query': 'iterator<something>',
         'correction': null,
         'in_args': [
-            { 'path': 'assoc_type', 'name': 'my_fn' },
             { 'path': 'assoc_type::my', 'name': 'other_fn' },
+            { 'path': 'assoc_type', 'name': 'my_fn' },
         ],
     },
     {
@@ -26,8 +26,8 @@ const EXPECTED = [
             { 'path': 'assoc_type', 'name': 'Something' },
         ],
         'in_args': [
-            { 'path': 'assoc_type', 'name': 'my_fn' },
             { 'path': 'assoc_type::my', 'name': 'other_fn' },
+            { 'path': 'assoc_type', 'name': 'my_fn' },
         ],
     },
     // if I write an explicit binding, only it shows up
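
The reordered expectations above reflect results now being ranked by type distance. As a rough sketch of how the distance drives both the cutoff and the final order, here is a standalone example. Everything in it is hypothetical: the names and distances are made up (the real distances for `my_fn` and `other_fn` are not shown in this diff), `collect` is not a function in `search.js`, and `MAX_RESULTS` is shrunk from the 200 mentioned in the commit message so the cutoff is visible.

```javascript
// Shrunk for illustration; the commit message describes a cutoff of 200.
const MAX_RESULTS = 2;

// Keep candidates unless enough results exist already and this one is
// strictly farther away than the worst result kept so far.
function collect(candidates) {
    const results = new Map();
    let maxDist = 0;
    for (const { name, dist } of candidates) {
        if (results.size >= MAX_RESULTS && dist > maxDist) {
            // In the real search this is where unification would be skipped.
            continue;
        }
        maxDist = Math.max(maxDist, dist);
        results.set(name, dist);
    }
    // Closest type signatures first.
    return [...results.entries()]
        .sort((a, b) => a[1] - b[1])
        .map(([name]) => name);
}

const ranked = collect([
    { name: "wide_fn", dist: 4 },
    { name: "exact_fn", dist: 2 },
    { name: "huge_fn", dist: 9 },
    { name: "close_fn", dist: 3 },
]);
console.log(ranked.join(",")); // exact_fn,close_fn,wide_fn
```

As with the check in `handleArgs`, this only prunes candidates that are strictly worse than everything kept so far, so the collection can still grow past the cutoff.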
