Add client-side search for offline support #98


Closed
bvaughn opened this issue Oct 9, 2017 · 12 comments

bvaughn commented Oct 9, 2017

This issue is being opened for discussion. We may do it if we deem it worth the effort, although we've currently disabled service workers for reactjs.org, so we don't have offline mode anyway.

There are several decent JavaScript search libraries (e.g. lunr, js-worker-search, js-search) that could provide site search in the browser without requiring a round-trip to Algolia.

One of these could also be turned into a Gatsby plug-in so the search index can be pre-built. I believe lunr supports this specifically, so it might be worth extra attention when weighing options. The basic idea is that interesting content (e.g. title, body) would be fed into a search index during the Gatsby build and then serialized to a format that's quick to load in the browser at runtime.

If we do this, we should also defer loading the search library (and the serialized search index) until it's needed (e.g. when a user clicks on the search bar).
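A deferred loader along those lines might look like this sketch. The `loadLibrary` and `fetchIndex` hooks are placeholders for a dynamic import and an index fetch (not real APIs), and the `lib.Index.load` shape mirrors lunr's:

```js
// Hypothetical deferred search loader: nothing is fetched until the first
// call (e.g. on search-bar focus), and concurrent calls share a single
// in-flight promise so the library and index are only loaded once.
function createLazySearch(loadLibrary, fetchIndex) {
  let promise = null;
  return function getSearch() {
    if (!promise) {
      promise = Promise.all([loadLibrary(), fetchIndex()]).then(
        ([lib, serialized]) => lib.Index.load(serialized)
      );
    }
    return promise;
  };
}
```

The cached promise doubles as the "already loading" flag, so a user who clicks the search bar twice doesn't trigger a second download.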

Note that thought should also be given to localization, although I'm not sure exactly how that will work yet. Perhaps we should generate one search index per locale (during build time) and then JIT-load the index that matches the user's current locale.
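For illustration, per-locale selection could be as simple as the following sketch (the file-naming scheme here is entirely made up):

```js
// Pick the build-time index file matching the user's locale, falling back
// to the base language, then to English. `available` would be the list of
// locales indexed during the Gatsby build.
function indexForLocale(locale, available) {
  if (available.includes(locale)) return `search.${locale}.index`;
  const lang = locale.split('-')[0]; // e.g. "pt-BR" -> "pt"
  if (available.includes(lang)) return `search.${lang}.index`;
  return 'search.en.index';
}
```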

@olivernn

> Note that thought should also be given to localization although that seems to be a much harder task, potentially.

I can only comment for Lunr, but there is a plugin (and docs) that allows for search in different languages, so there is at least a start towards localisation.

I'm also happy to help if there are any Lunr specific questions that you need answering, just let me know.


bvaughn commented Oct 11, 2017

Hey Oliver. Thanks for chiming in! Will be sure to reach out to you if we decide to go with Lunr and hit any snags. 😄

bvaughn self-assigned this Oct 22, 2017

bvaughn commented Oct 22, 2017

Iterated on this a bit more today. Created one plug-in that integrates with Lunr and one that integrates with js-worker-search. Both approaches work but the resulting index is quite large, even if I stem all words and filter duplicates ahead of time (which has the downside of breaking TF-IDF ranking). I'm starting to think this may not be feasible. ☹️


bvaughn commented Oct 22, 2017

I think I'm going to give up on this one for now. Making client-side search as useful as remote search just requires too much static data to be loaded in the browser, at least for a site as large as reactjs.org. Without the robust data, I can't implement useful features like:

  • TF-IDF ranked results (extra important since we're only showing a couple of results)
  • Inline results text-snippets with matching terms highlighted
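For context, TF-IDF ranking needs per-document term frequencies, which is exactly what deduplicating tokens at build time throws away. A toy scorer (illustrative only, not code from the prototypes) shows the dependency:

```js
// Score one term against a set of tokenized documents: term frequency (tf)
// rewards documents that repeat the term, inverse document frequency (idf)
// discounts terms that appear everywhere. With duplicates removed, every
// tf collapses to at most 1/length and ranking quality degrades.
function tfIdfScores(term, docs) {
  const containing = docs.filter((d) => d.includes(term)).length;
  const idf = Math.log(docs.length / (1 + containing));
  return docs.map((tokens) => {
    const tf = tokens.filter((t) => t === term).length / tokens.length;
    return tf * idf;
  });
}
```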

I can reduce the amount of data needed, but doing so cuts into the features mentioned above.

If anyone has ideas or wants to chat more about this, please let me know!

bvaughn closed this as completed Oct 22, 2017
@olivernn

@bvaughn ah, that's annoying!

It sounds like you've already spent a bunch of time trying to get this to work. Is it possible to share the current state in a branch or something?

Also, just for interest, was it the serialised indexes that were too large? What size did they end up being? How many documents were part of the index?

I'm not sure there is much I'll be able to do to reduce the size (though I'm up for a challenge!), but at the very least it would be an interesting dataset for testing Lunr's index size and speed.


bvaughn commented Oct 23, 2017

> is it possible to share the current state in a branch or something?

I could push my works-in-progress, I guess, if you're really interested 😁 They're a bit dirty, since they were just proofs of concept.

> Also, just for interest, was it the serialised indexes that were too large? What size did they end up being? How many documents were part of the index?

In both cases, yes, it was the index size. As for what I included: I indexed all markdown content.

For lunr, I built an index of title + body text (with the HTML tags removed) and then dumped it to JSON (via `JSON.stringify(index.toJSON())`). Then I hydrated the index at runtime via `lunr.Index.load(...)`. Unfortunately the resulting index size was 5.4 MB.

For js-worker-search I tried to optimize more by pre-processing (during the Gatsby build step). Basically I tokenized searchable text, filtered out stop words, stemmed, and then removed duplicates. (By removing duplicates I was giving up TF-IDF ranking, but it made a big impact on size.) Then I created a condensed index file that used a custom text format (rather than JSON) to save on size and improve parsing time. This got the index size down to 605 KB, which is still too big considering it lacked intelligent ranking.
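For the curious, reading that condensed tab-separated format back on the client could look roughly like this (the function name and record shape are my own guesses based on the description above):

```js
// Parse a condensed index where each line is `slug\ttitle\ttokens`,
// with tokens separated by single spaces. Returns one record per page.
function parseSearchIndex(data) {
  return data
    .split('\n')
    .filter(Boolean) // skip blank lines
    .map((line) => {
      const [slug, title, tokens] = line.split('\t');
      return { slug, title, tokens: tokens.split(' ') };
    });
}
```

A flat line-per-record format like this can be parsed incrementally, which is part of why it beats JSON on load time.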

Both solutions also lacked the metadata to display drop-down typeahead results in the way Algolia does.


bvaughn commented Oct 23, 2017

Here's the guts of my implementations (not including the require.ensure runtime loading or anything, just the Gatsby plugin bits). Keep in mind it's not an even comparison, since the lunr index supports result ranking.

lunr

```js
const {writeFileSync} = require('fs');
const lunr = require('lunr');
const {join} = require('path');
const sanitize = require('sanitize-html');

exports.createPages = async ({graphql, boundActionCreators}) => {
  const result = await graphql(query);

  if (result.errors) {
    throw new Error(result.errors.join(`, `));
  }

  const pages = [];

  result.data.allMarkdownRemark.edges.forEach(edge => {
    const html = edge.node.html;
    const slug = edge.node.fields.slug;
    const title = edge.node.frontmatter.title;

    // Strip all HTML markup from searchable content
    const text = sanitize(html, {
      allowedTags: false,
      allowedAttributes: false,
    });

    pages.push({
      id: slug,
      text,
      title,
    });
  });

  // Pre-generate Lunr search index
  const index = lunr(function() {
    this.field('text');
    this.field('title');

    pages.forEach(page => {
      this.add(page);
    });
  });

  const path = join(__dirname, '../../public/search.index');
  const data = JSON.stringify(index.toJSON());

  writeFileSync(path, data);
};

const query = `
  {
    allMarkdownRemark {
      edges {
        node {
          html
          frontmatter {
            title
          }
          fields {
            slug
          }
        }
      }
    }
  }
`;
```

js-worker-search

```js
const {writeFileSync} = require('fs');
const {join} = require('path');
const sanitize = require('sanitize-html');
const stemmer = require('stemmer');
const StopWords = require('./stop-words');

const TOKENIZER_REGEX = /[^a-zа-яё0-9\-\.']+/i;

function tokenize(text) {
  const uniqueWords = {};

  return text
    .split(TOKENIZER_REGEX) // Split words at boundaries
    .filter(word => {
      // Remove empty tokens and stop-words
      return word !== '' && StopWords[word] === undefined;
    })
    .map(word => {
      // Stem and lower case (eg "Considerations" -> "consider")
      return stemmer(word.toLocaleLowerCase());
    })
    .filter(word => {
      // Remove duplicates so serialized format is smaller
      // This means we can't later use TF-IDF ranking but maybe that's ok?
      // If we decide later to use it let's pre-generate its metadata also.
      if (uniqueWords[word] === undefined) {
        uniqueWords[word] = true;
        return true;
      }
      return false;
    });
}

exports.createPages = async ({graphql, boundActionCreators}) => {
  const result = await graphql(query);

  if (result.errors) {
    throw new Error(result.errors.join(`, `));
  }

  const searchData = [];

  result.data.allMarkdownRemark.edges.forEach(edge => {
    const html = edge.node.html;
    const slug = edge.node.fields.slug;
    const title = edge.node.frontmatter.title;

    // Strip all HTML markup from searchable content
    const text = sanitize(html, {
      allowedTags: false,
      allowedAttributes: false,
    });

    const index = tokenize(`${text} ${title}`).join(' ');

    searchData.push(`${slug}\t${title}\t${index}`);
  });

  const path = join(__dirname, '../../public/search.index');
  const data = searchData.join('\n');

  writeFileSync(path, data);
};

const query = `
  {
    allMarkdownRemark {
      edges {
        node {
          html
          frontmatter {
            title
          }
          fields {
            slug
          }
        }
      }
    }
  }
`;
```


bvaughn commented Oct 23, 2017

As for next steps, I found myself thinking about the types of information I would need to add back into the index in order to support more complex drop-down results display, and how I could maybe offset the increased size by dividing the indexes up into chunks and runtime-loading them based on the current search string. It was really starting to feel over-engineered at this point. Still a fun project but not a compelling drop-in replacement for a solution that works well enough already.
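As a sketch of what that chunking might have looked like (the scheme and names are hypothetical): bucket index entries by the first letter of each token, so the client only fetches the chunks matching what the user has typed so far.

```js
// Split a flat index into prefix buckets. Each entry is assumed to be
// { slug, tokens } as produced at build time; each bucket maps a one-letter
// prefix to the set of slugs containing a token with that prefix.
function chunkIndex(entries) {
  const chunks = {};
  for (const entry of entries) {
    for (const token of entry.tokens) {
      const key = token[0];
      (chunks[key] = chunks[key] || new Set()).add(entry.slug);
    }
  }
  return chunks;
}
```

Each bucket would then be serialized to its own file and fetched on demand as the search string changes.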

Would love to know your thoughts though, @olivernn!

@arwyatt

arwyatt commented Oct 24, 2017

@bvaughn what size would you like the index to be, at most?


bvaughn commented Oct 24, 2017

Well...it's hard to say.

If I could add some half-decent ranking to my current data, then I could chunk the indexes up and JIT-load them as a user typed. I think that's doable without a ton of effort, but I'm not sure whether people would really miss the inline context snippets Algolia shows for matching text. They are pretty useful.

@olivernn

The code you have for Lunr looks idiomatic. I can't speak for the js-worker-search code, but in either case I doubt that how you build the index is going to have any dramatic impact on its size.

> Unfortunately the resulting index size was 5.4 MB.

Was this after compression? The structure of the serialised index probably lends itself reasonably well to savings from gzip. Looking at the reactjs.org site I see that all assets are under 500 kB, so even with compression I think including the index is going to have a significant impact on the page weight.

One of the things I've been thinking about for a while is a more compact serialised form of the index, probably some custom binary format. I don't have anything concrete, but I think getting the serialised index size down would be a huge win for a number of use cases. I only have a finite amount of time, though, so I haven't got far with an implementation.

> I'm not sure if people will really miss the inline context snippets Algolia shows for matching text.

It should be possible to get something close with Lunr using the term positions; a basic example of this is shown in this demo. However, it involves stuffing more data into the index, which isn't going to help with the size issue :(
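A minimal version of position-based highlighting might look like this, assuming `[start, length]` pairs in ascending order (roughly the shape lunr can record for term positions; the function itself is just a sketch):

```js
// Wrap each matched span of `text` in <mark> tags, given an ordered list
// of [start, length] term positions.
function highlight(text, positions) {
  let result = '';
  let cursor = 0;
  for (const [start, length] of positions) {
    result += text.slice(cursor, start) +
      '<mark>' + text.slice(start, start + length) + '</mark>';
    cursor = start + length;
  }
  return result + text.slice(cursor);
}
```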

It looks like sticking with Algolia is the right choice for now, but thanks for taking the time to try out Lunr; this kind of feedback is always invaluable.


bvaughn commented Oct 24, 2017

> Was this after compression?

No. Both of the sizes I mentioned above were just size on disk. Gzip compresses the Lunr index down to 721 kB and the js-worker-search index down to 209 kB. (Edit: I was stripping HTML tags incorrectly. The js-worker-search index is actually 131 kB.)

> It should be possible to get something close with Lunr using the term positions

Yeah, it's possible with both search libs, but it requires adding significantly more metadata to the index, which would in turn require splitting the index into chunks, which would add complexity, etc. 😦

Thanks a bunch for taking the time to talk this through though and share feedback. Greatly appreciated!

jhonmike pushed a commit to jhonmike/reactjs.org that referenced this issue Jul 1, 2020