Add client-side search for offline support #98
I can only comment for Lunr, but there is a plugin (and docs) that allows for search in different languages, so there is at least a start towards localisation. I'm also happy to help if there are any Lunr-specific questions you need answered; just let me know.
Hey Oliver. Thanks for chiming in! Will be sure to reach out to you if we decide to go with Lunr and hit any snags. 😄
Iterated on this a bit more today. Created one plug-in that integrates with Lunr and one that integrates with js-worker-search. Both approaches work, but the resulting index is quite large, even if I stem all words and filter duplicates ahead of time (which has the downside of breaking TF-IDF ranking). I'm starting to think this may not be feasible.
I think I'm going to give up on this one for now. Making client-side search as useful as remote search just requires too much static data to be loaded in the browser, at least for a site as large as reactjs.org. Without that robust data, I can't implement useful features like intelligent ranking or the kind of rich drop-down typeahead results Algolia shows.

I can reduce the amount of data needed, but that cuts into the features mentioned above. If anyone has ideas or wants to chat more about this, please let me know!
@bvaughn ah, that's annoying! It sounds like you've already spent a bunch of time trying to get this to work; is it possible to share the current state in a branch or something? Also, just out of interest, was it the serialised indexes that were too large? What size did they end up being? How many documents were part of the index? I'm not sure there is much I'll be able to do to reduce the size (though I'm up for a challenge!), but at the very least it would be an interesting dataset to use when testing Lunr for index size and speed.
I could push my works in progress I guess, if you're really interested 😁 They were a bit dirty, since they were just proofs-of-concept.
In both cases, yes, it was the index size. As for what I included: I indexed all markdown content. For lunr, I built an index of title + body text (with the HTML tags removed) and then dumped it to JSON (via index.toJSON()).

For js-worker-search I tried to optimize more by pre-processing (during the Gatsby build step). Basically I tokenized the searchable text, filtered out stop words, stemmed, and then removed duplicates. (By removing duplicates I was giving up TF-IDF ranking, but it made a big impact on size.) Then I created a condensed index file that used a custom text format (rather than JSON) to save on size and improve parsing time. This got the index size down to 605 KB, which is still too big considering it lacked intelligent ranking.

Both solutions also lacked the metadata to display drop-down typeahead results the way Algolia does.
Here's the guts of my implementations:

**lunr**

```js
const {writeFileSync} = require('fs');
const lunr = require('lunr');
const {join} = require('path');
const sanitize = require('sanitize-html');

exports.createPages = async ({graphql, boundActionCreators}) => {
  const result = await graphql(query);
  if (result.errors) {
    throw new Error(result.errors.join(`, `));
  }

  const pages = [];
  result.data.allMarkdownRemark.edges.forEach(edge => {
    const html = edge.node.html;
    const slug = edge.node.fields.slug;
    const title = edge.node.frontmatter.title;

    // Strip all HTML markup from searchable content
    const text = sanitize(html, {
      allowedTags: false,
      allowedAttributes: false,
    });

    pages.push({
      id: slug,
      text,
      title,
    });
  });

  // Pre-generate Lunr search index
  const index = lunr(function() {
    this.field('text');
    this.field('title');

    pages.forEach(page => {
      this.add(page);
    });
  });

  // Serialize the index so the browser can load it at runtime
  const path = join(__dirname, '../../public/search.index');
  const data = JSON.stringify(index.toJSON());
  writeFileSync(path, data);
};

const query = `
  {
    allMarkdownRemark {
      edges {
        node {
          html
          frontmatter {
            title
          }
          fields {
            slug
          }
        }
      }
    }
  }
`;
```

**js-worker-search**

```js
const {writeFileSync} = require('fs');
const {join} = require('path');
const sanitize = require('sanitize-html');
const stemmer = require('stemmer');
const StopWords = require('./stop-words');

const TOKENIZER_REGEX = /[^a-zа-яё0-9\-\.']+/i;

function tokenize(text) {
  const uniqueWords = {};

  return text
    .split(TOKENIZER_REGEX) // Split words at boundaries
    .filter(word => {
      // Remove empty tokens and stop-words
      return word !== '' && StopWords[word] === undefined;
    })
    .map(word => {
      // Stem and lower case (eg "Considerations" -> "consider")
      return stemmer(word.toLocaleLowerCase());
    })
    .filter(word => {
      // Remove duplicates so the serialized format is smaller.
      // This means we can't later use TF-IDF ranking, but maybe that's ok?
      // If we decide later to use it, let's pre-generate its metadata also.
      if (uniqueWords[word] === undefined) {
        uniqueWords[word] = true;
        return true;
      }
      return false;
    });
}

exports.createPages = async ({graphql, boundActionCreators}) => {
  const {createNode} = boundActionCreators;

  const result = await graphql(query);
  if (result.errors) {
    throw new Error(result.errors.join(`, `));
  }

  const searchData = [];
  result.data.allMarkdownRemark.edges.forEach(edge => {
    const html = edge.node.html;
    const slug = edge.node.fields.slug;
    const title = edge.node.frontmatter.title;

    // Strip all HTML markup from searchable content
    const text = sanitize(html, {
      allowedTags: false,
      allowedAttributes: false,
    });

    // One tab-delimited line per page: slug, title, pre-tokenized searchable text
    const index = tokenize(`${text} ${title}`).join(' ');
    searchData.push(`${slug}\t${title}\t${index}`);
  });

  const path = join(__dirname, '../../public/search.index');
  const data = searchData.join('\n');
  writeFileSync(path, data);
};

const query = `
  {
    allMarkdownRemark {
      edges {
        node {
          html
          frontmatter {
            title
          }
          fields {
            slug
          }
        }
      }
    }
  }
`;
```
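For completeness, here's a rough sketch of how the js-worker-search index produced above might be consumed in the browser. This wasn't part of the proofs-of-concept; it simply assumes the tab-delimited format written by the build step, and the `/search.index` URL is an assumption based on the output path.

```js
import SearchApi from 'js-worker-search';

const searchApi = new SearchApi();
const titles = {};

// Load the pre-built index written by the Gatsby build step above and
// feed each page into the worker-backed search index.
const indexReady = fetch('/search.index')
  .then(response => response.text())
  .then(data => {
    data.split('\n').forEach(line => {
      // Each line is: slug \t title \t pre-tokenized searchable text
      const [slug, title, text] = line.split('\t');
      titles[slug] = title;
      searchApi.indexDocument(slug, `${title} ${text}`);
    });
  });

// search() resolves to an array of matching document uids (slugs in this case).
// Note: since the indexed text was stemmed at build time, the query would ideally
// be run through the same stemmer for best results.
function search(query) {
  return indexReady
    .then(() => searchApi.search(query))
    .then(slugs => slugs.map(slug => ({slug, title: titles[slug]})));
}
```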
As for next steps, I found myself thinking about the types of information I would need to add back into the index in order to support a more complex drop-down results display, and how I could maybe offset the increased size by dividing the indexes up into chunks and runtime-loading them based on the current search string. It was really starting to feel over-engineered at that point. Still a fun project, but not a compelling drop-in replacement for a solution that already works well enough. Would love to know your thoughts though, @olivernn!
@bvaughn what size would you like to see the index be, at most?
Well... it's hard to say. If I could add some half-decent ranking into my current data, then I could chunk the indexes up and JIT-load them as a user typed. I think that's doable without a ton of effort, but I'm not sure whether people would really miss the inline context snippets Algolia shows for matching text. They are pretty useful.
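To make the chunking idea concrete, here's a rough sketch of what JIT-loading index chunks could look like. The file layout (one JSON chunk per leading character) and the `.search-input` selector are purely hypothetical.

```js
// Hypothetical layout: the build step writes one chunk per leading character,
// eg /search-index/a.json, /search-index/b.json, ...
const loadedChunks = new Map();

function loadChunkFor(query) {
  const key = query.trim().charAt(0).toLowerCase();
  if (!key) {
    return Promise.resolve(null);
  }
  if (!loadedChunks.has(key)) {
    // Cache the in-flight promise so each chunk is only fetched once.
    loadedChunks.set(
      key,
      fetch(`/search-index/${key}.json`).then(response => response.json())
    );
  }
  return loadedChunks.get(key);
}

// Fetch chunks just-in-time as the user types.
const searchInput = document.querySelector('.search-input'); // hypothetical selector
searchInput.addEventListener('input', event => {
  loadChunkFor(event.target.value).then(chunk => {
    if (chunk) {
      // ...merge the chunk into the in-memory index and re-run the query...
    }
  });
});
```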
The code you have for Lunr looks idiomatic. I can't speak for the js-worker-search code, but in both cases I doubt that how you build the index is going to have any dramatic impact on the index size.

Was this after compression? The structure of the serialised index probably lends itself reasonably well to some savings from gzip. That said, looking at the reactjs.org site I see that all assets are less than 500 kB, so even with compression I think including the index is going to have a significant impact on the page weight.

One of the things I've been thinking of for a while is a more compact serialised form of the index, probably some custom binary format. I don't have anything concrete, but I think getting the serialised index size down would be a huge win for a number of use cases. I only have a finite amount of time though, so I haven't got far with an implementation.
It should be possible to get something close with Lunr using term positions; a basic example of this is shown in this demo. However, it involves stuffing more data into the index, which isn't going to help with the size issue :( It looks like sticking with Algolia is the right choice for now, but thanks for taking the time to try out Lunr; this kind of feedback is always invaluable.
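For reference, here's roughly what the term-position approach looks like with Lunr 2.x. The sample page data and the snippet helper are made up for illustration, and (as noted above) the extra position metadata makes the serialized index even bigger.

```js
const lunr = require('lunr');

// Sample documents; in practice these would come from the Gatsby build step above.
const pages = [
  {
    id: '/docs/hello-world.html',
    title: 'Hello World',
    text: 'The easiest way to get started with React is ...',
  },
];

const index = lunr(function() {
  this.ref('id');
  this.field('title');
  this.field('text');
  // Whitelist term positions so matches can be located in the original text later.
  this.metadataWhitelist = ['position'];

  pages.forEach(page => this.add(page));
});

// Build a crude context snippet around the first match in the `text` field.
function snippetFor(query) {
  const result = index.search(query)[0];
  if (!result) {
    return null;
  }

  const page = pages.find(page => page.id === result.ref);
  for (const term of Object.keys(result.matchData.metadata)) {
    const positions = (result.matchData.metadata[term].text || {}).position || [];
    if (positions.length > 0) {
      const [start, length] = positions[0];
      return page.text.slice(Math.max(0, start - 40), start + length + 40);
    }
  }
  return null;
}

console.log(snippetFor('started')); // eg "The easiest way to get started with React is ..."
```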
No, the sizes I mentioned above weren't compressed; they were just size-on-disk. Gzip compresses the Lunr index down to 721 kB and the js-worker-search index down to 209 kB. (Edit: I was stripping HTML tags incorrectly. It looks like the js-worker-search index is actually 131 kB.)
Yeah, it's possible with both search libs, but it requires adding significantly more metadata to the index, which would in turn require splitting the index into chunks, which would add complexity, and so on. 😦 Thanks a bunch for taking the time to talk this through and share feedback though. Greatly appreciated!
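As a side note, the raw-vs-gzipped sizes discussed above are easy to reproduce at build time with Node's built-in zlib; a small sketch, where the file path simply mirrors the build scripts above:

```js
const {readFileSync} = require('fs');
const {join} = require('path');
const {gzipSync} = require('zlib');

// Compare raw vs gzipped size of the generated index file.
const path = join(__dirname, '../../public/search.index');
const raw = readFileSync(path);
const gzipped = gzipSync(raw);

console.log(
  `raw: ${(raw.length / 1024).toFixed(1)} kB, gzipped: ${(gzipped.length / 1024).toFixed(1)} kB`
);
```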
This issue is being opened for discussion. We may do it if we deem it worth the effort, although we've currently disabled service-workers for reactjs.org so we don't have offline mode anyway.
There are several decent JavaScript search libraries (eg lunr, js-worker-search, js-search) that could provide site search in the browser without requiring a round-trip to Algolia.
One of these could also be turned into a Gatsby plug-in to enable the search index to be pre-built. I believe lunr supports this specifically, so it might be worth giving it extra attention when considering options. The basic idea is that interesting content (eg title, body) could be fed into a search index during the Gatsby build and then serialized to a format that's quick to load in the browser at runtime.
If we do this, we should also defer loading the search library (and the serialized search index) until needed (eg when a user clicks on the search bar).
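A sketch of what that deferral might look like, assuming lunr and the pre-built index from the build step above; the `.search-bar` selector is a placeholder:

```js
let searchReadyPromise = null;

// Only pull in the search library and the serialized index the first time
// the user actually interacts with the search bar.
function ensureSearchLoaded() {
  if (searchReadyPromise === null) {
    searchReadyPromise = Promise.all([
      import('lunr'), // code-split by the bundler
      fetch('/search.index').then(response => response.json()),
    ]).then(([lunrModule, serializedIndex]) => {
      const lunr = lunrModule.default || lunrModule;
      return lunr.Index.load(serializedIndex);
    });
  }
  return searchReadyPromise;
}

const searchBar = document.querySelector('.search-bar'); // placeholder selector
searchBar.addEventListener('focus', () => {
  ensureSearchLoaded();
});
```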
Note that thought should also be given to localization, although I'm not sure exactly how that will work yet. Perhaps we should generate one search index per locale (during build time) and then JIT-load the index that matches the user's current locale.
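For example (the file naming and locale detection below are assumptions, not decisions):

```js
// One serialized index per locale is assumed to be emitted at build time,
// eg /search-index.en.json, /search-index.ja.json, ...
function loadSearchIndexForLocale(locale) {
  return fetch(`/search-index.${locale}.json`).then(response => response.json());
}

// Exactly where the locale comes from is still an open question; the <html lang>
// attribute is just a stand-in here.
const locale = document.documentElement.lang || 'en';
loadSearchIndexForLocale(locale).then(serializedIndex => {
  // ...hand the serialized index to whichever search library is chosen...
});
```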