
Commit 091b432

docs: update
1 parent fc9b266 commit 091b432

6 files changed: 87 additions, 42 deletions

docs/plugins/search/meilisearch.md

Lines changed: 32 additions & 13 deletions
@@ -112,7 +112,7 @@ Then, create a **correct configuration file** for the scraper. Here, we provide
 ```
 
 - `index_uid` should be a unique name for your index, which will be used to search.
-- `start_urls` and `sitemap_urls` (optional) shall be customized according to the website to be scraped.
+- `start_urls` and `sitemap_urls` (optional) shall be customized according to the website to be scraped. We recommend using it together with the [`@vuepress/plugin-sitemap`](../seo/sitemap/README.md) plugin and providing the corresponding `sitemap.xml` URL.
 - `selectors` field can be customized according to third-party theme DOM structure.
 - You can add new fields to `custom_settings` according to your needs.
 
@@ -144,23 +144,35 @@ Here:
 
 When the scraper completes, MeiliSearch will update the existing index with latest document content.
 
-Each time the scraper deletes and recreates the index. During this process, all the documents will be deleted and re-added. This might be slow for too many documents. However, when we only need to update part of the document content, we can use `only_urls` to tell the scraper to update only the specified urls instead of crawling all of them once.
+Each time it runs, the scraper deletes and recreates the index, so all documents are deleted and re-added. This can be slow for a large number of documents. Therefore, our `jqiue/docs-scraper` allows you to provide `only_urls` to scrape only the changed documents.
 
-```json
-{
-  "only_urls": ["https://<YOUR_WEBSITE_URL>/specifies/"]
-}
-```
+```sh
+Usage: vp-meilisearch-crawler [options] <source> [scraper-path]
+
+Generate crawler config for meilisearch
+
+Arguments:
+  source                 Source directory of VuePress project
+  scraper-path           Scraper config file path (default: .vuepress/meilisearch-config.json relative to source folder)
+
+Options:
+  -c, --config [config]  Set path to config file
+  --cache [cache]        Set the directory of the cache files
+  --temp [temp]          Set the directory of the temporary files
+  --clean-cache          Clean the cache files before generation
+  --clean-temp           Clean the temporary files before generation
+  -V, --version          output the version number
+  -h, --help             display help for command
+```
 
-Using `npx gous <docsDir> <replaceUrl> <scraperPath>` in your project can automatically generate `only_urls` for your scraper configuration file.
+You can use `vp-meilisearch-scrapper <docsDir> <scraperPath>` in CI or Git Hooks to automatically generate `only_urls` for your scraper configuration file.
 
-::: tip description
+::: note
 
-If your project is not managed using Git or the os does not have Git installed, it cannot run.
-
-- `docsDir` The parent directory of `.vuepress`. For example, if your directory is `docs/.vuepress`, then this value is `docs`
-- `replaceUrl` The URL of your document.
-- `scraperPath` The path of the scraper configuration file
+- `vp-meilisearch-scrapper` needs to be run in a Git project.
+- `scraper-path` must correctly point to your scraper configuration file, which should be properly set up with all necessary fields except `only_urls`.
 
 :::
 
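To make the effect of `only_urls` concrete, here is a hedged TypeScript sketch (not part of this commit; the field values are placeholders) of attaching a generated `only_urls` list to a scraper config object:

```typescript
// Hypothetical sketch: attach a generated `only_urls` list to a scraper
// config. Field names besides `only_urls` follow the config fields discussed
// above; the concrete values are placeholders.
interface ScraperConfig {
  index_uid: string
  start_urls: string[]
  only_urls?: string[]
  [key: string]: unknown
}

const withOnlyUrls = (
  config: ScraperConfig,
  changedUrls: string[],
): ScraperConfig => ({
  // keep every existing field, add (or overwrite) only_urls
  ...config,
  only_urls: changedUrls,
})

const updated = withOnlyUrls(
  { index_uid: 'docs', start_urls: ['https://example.com/'] },
  ['https://example.com/guide/page.html'],
)
```

With `only_urls` present, the scraper updates just the listed pages instead of recrawling the whole site.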
@@ -258,6 +270,13 @@ jobs:
     steps:
       - name: Checkout
         uses: actions/checkout@v4
+        with:
+          # This is required for the helper to compare the current and previous commits
+          fetch-depth: 2
+
+      - name: Generate Only URLs
+        # You may need to cd to the directory where `@vuepress/plugin-meilisearch` is installed first
+        run: pnpm vp-meilisearch-scrapper <docsDir> <path/to/your/scraper/config.json>
 
       - name: Run scraper
         env:

docs/zh/plugins/search/meilisearch.md

Lines changed: 34 additions & 17 deletions
@@ -112,7 +112,7 @@ docker pull jqiue/docs-scraper:latest
 ```
 
 - `index_uid` should assign a unique name to your index, which is used for searching.
-- `start_urls` and `sitemap_urls` (optional) should be customized according to the website to be scraped.
+- `start_urls` and `sitemap_urls` (optional) should be customized according to the website to be scraped. We recommend using it together with the [`@vuepress/plugin-sitemap`](../seo/sitemap/README.md) plugin and providing the corresponding `sitemap.xml` URL.
 - The `selectors` field can be customized according to the third-party theme's DOM structure.
 - You can add new fields to `custom_settings` as needed.
 
@@ -144,23 +144,33 @@ docker run -t --rm \
 
 After the scraper completes, MeiliSearch will update the existing index with the latest document content.
 
-Each time it runs, the scraper deletes and recreates the index, so all documents are deleted and re-added, which can be slow for a large number of documents. But when we only need to update part of the documents, we can use `only_urls` to tell the scraper to update only the specified URLs instead of scraping everything again.
+Each time it runs, the scraper deletes and recreates the index, so all documents are deleted and re-added. This can be slow for a large number of documents. Therefore, our `jqiue/docs-scraper` allows you to provide `only_urls` to scrape only the changed documents.
 
-```json
-{
-  "only_urls": ["https://<YOUR_WEBSITE_URL>/specifies/"]
-}
-```
+You can use `vp-meilisearch-scrapper <docsDir> <scraperPath>` in CI or Git Hooks to automatically generate `only_urls` for your scraper configuration file.
 
-Using `npx gous <docsDir> <replaceUrl> <scraperPath>` in your project can automatically generate `only_urls` for your scraper configuration file.
-
-::: tip Note
+```sh
+Usage: vp-meilisearch-crawler [options] <source> [scraper-path]
+
+Generate crawler config for meilisearch
+
+Arguments:
+  source                 Source directory of the VuePress project
+  scraper-path           Scraper config file path (default: .vuepress/meilisearch-config.json relative to source folder)
+
+Options:
+  -c, --config [config]  Set path to config file
+  --cache [cache]        Set the directory of the cache files
+  --temp [temp]          Set the directory of the temporary files
+  --clean-cache          Clean the cache files before generation
+  --clean-temp           Clean the temporary files before generation
+  -V, --version          output the version number
+  -h, --help             display help for command
+```
 
-If your project is not managed with Git, or Git is not installed on the system, it cannot run.
+::: note
 
-- `docsDir` The parent directory of `.vuepress`. For example, if your directory is `docs/.vuepress`, this value is `docs`
-- `replaceUrl` The URL of your website
-- `scraperPath` The path to the scraper configuration file
+- `vp-meilisearch-scrapper` needs to be run in a Git project.
+- `scraper-path` must correctly point to your scraper configuration file, which should be properly set up with all necessary fields except `only_urls`.
 
 :::
 
@@ -256,11 +266,18 @@ jobs:
     runs-on: ubuntu-latest
     name: Rescrape MeiliSearch docs
     steps:
-      - 名称:Checkout
+      - name: Checkout
         uses: actions/checkout@v4
+        with:
+          # This is required to compare the current and the previous commit
+          fetch-depth: 2
+
+      - name: Generate Only URLs
+        # You may need to cd to the directory where `@vuepress/plugin-meilisearch` is installed first
+        run: pnpm vp-meilisearch-scrapper <docsDir> <path/to/your/scraper/config.json>
 
-      - 名称:运行抓取器
-        env
+      - name: Run scraper
+        env:
           # Replace with your own MeiliSearch host URL
           HOST_URL: <YOUR_MEILISEARCH_HOST_URL>
           API_KEY: ${{ secrets.MEILISEARCH_MASTER_KEY }}

plugins/search/plugin-meilisearch/package.json

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
   "style": "sass src:lib --embed-sources --style=compressed --pkg-importer=node"
 },
 "bin": {
-  "vp-meilisearch-crawler": "./lib/cli/index.js"
+  "vp-meilisearch-scrapper": "./lib/cli/index.js"
 },
 "dependencies": {
   "@vuepress/helper": "workspace:*",

plugins/search/plugin-meilisearch/src/cli/generateScraperConfig.ts

Lines changed: 13 additions & 6 deletions
@@ -41,8 +41,9 @@ const generateOnlyUrls = (
   },
   {},
 )
+
 const siteDestLocation =
-  new URL(scraperConfig.start_urls[0]).hostname + app.options.base
+  new URL(scraperConfig.start_urls[0]).origin + app.options.base.slice(0, -1)
 
 return changedMarkdownFilesPathRelative.map(
   (markdownFilePathRelative) =>
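The `hostname` to `origin` change above matters because `hostname` drops the scheme, so the destination prefix was not a valid URL. A small illustrative sketch (the URL and base values here are assumed, not taken from a real config):

```typescript
// Why `origin` instead of `hostname`: `hostname` has no protocol, so the
// destination prefix built from it cannot be used as an absolute URL.
const startUrl = 'https://example.com/guide/index.html'
const base = '/docs/' // VuePress `base` starts and ends with '/'

const broken = new URL(startUrl).hostname + base // no scheme
const fixed = new URL(startUrl).origin + base.slice(0, -1) // trailing '/' trimmed
```

`broken` evaluates to `'example.com/docs/'` while `fixed` evaluates to `'https://example.com/docs'`, which can safely be prefixed to page routes.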
@@ -102,21 +103,19 @@ export const generateScraperConfig = async (
   await fs.remove(app.dir.cache())
 }
 
-const outputPath = output
+const scraperPath = output
   ? path.join(process.cwd(), output)
   : path.join(app.dir.source(), '.vuepress', 'meilisearch-config.json')
 
 if (!fs.existsSync(source)) {
   throw new Error(`Source directory ${source} does not exist!`)
 }
 
-const scraperPath = path.resolve(output)
-
-if (!fs.existsSync(outputPath)) {
+if (!fs.existsSync(scraperPath)) {
   throw new Error(`Scraper file not found at ${scraperPath}`)
 }
 
-const scraperConfig = fs.readJSONSync(outputPath, 'utf-8') as ScraperConfig
+const scraperConfig = fs.readJSONSync(scraperPath, 'utf-8') as ScraperConfig
 
 const sourceRelativePath = getGitRelativePath(app.dir.source())
 
@@ -132,6 +131,14 @@ export const generateScraperConfig = async (
   return
 }
 
+// initialize vuepress app to get pages
+logger.info('Initializing VuePress and preparing data...')
+
+await app.init()
+
+logger.info('Generating only_urls...')
+
 const onlyUrls = generateOnlyUrls(
   app,
   changedMarkdownFilesPathRelative,
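Conceptually, `generateOnlyUrls` turns each changed Markdown path into the URL of the page it builds, prefixed with the site destination computed above. The sketch below is a simplified, hypothetical version of that mapping (real VuePress routing handles more cases, such as permalinks and clean URLs):

```typescript
// Hypothetical sketch of mapping a changed Markdown source path to a page
// URL. The destination prefix and routing rules are simplified assumptions.
const siteDestLocation = 'https://example.com/docs'

const markdownPathToUrl = (relativePath: string): string => {
  const route = relativePath
    .replace(/README\.md$/, 'index.html') // README.md renders as index.html
    .replace(/\.md$/, '.html') // other pages render as <name>.html
  return `${siteDestLocation}/${route}`
}
```

Feeding each changed file through such a mapping yields the `only_urls` array written back into the scraper config.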

plugins/search/plugin-meilisearch/src/cli/index.ts

Lines changed: 6 additions & 4 deletions
@@ -2,8 +2,9 @@
 
 import { createCommand } from 'commander'
 
+import { logger } from 'vuepress/utils'
 import pkg from '../../package.json' with { type: 'json' }
-import { generateScraperConfig } from './generateScraperConfig'
+import { generateScraperConfig } from './generateScraperConfig.js'
 
 interface MeiliSearchCommandOptions {
   config?: string
@@ -25,8 +26,8 @@ program
   .option('--clean-temp', 'Clean the temporary files before generation')
   .argument('<source>', 'Source directory of VuePress project')
   .argument(
-    '[output]',
-    'Output folder (default: .vuepress/meilisearch-config.json relative to source folder)',
+    '[scraper-path]',
+    'Scraper config file path (default: .vuepress/meilisearch-config.json relative to source folder)',
   )
   .action(
     async (
@@ -37,7 +38,8 @@ program
       try {
         await generateScraperConfig(sourceDir, output, commandOptions)
       } catch (error) {
-        program.error(`Command execution error: ${(error as Error).message}`)
+        logger.error(error)
+        program.error(`Command execution error.`)
       }
     },
   )
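The change above logs the full error object before exiting with a short, stable message. A minimal sketch of that pattern, with a `fail` callback standing in for commander's `program.error` (which also terminates the process):

```typescript
// Sketch of the error-handling pattern: print the full error for debugging,
// then fail with a short, stable message. `fail` is a stand-in for
// commander's `program.error`.
const run = (task: () => void, fail: (msg: string) => void): void => {
  try {
    task()
  } catch (error) {
    console.error(error) // full stack for debugging (the CLI uses vuepress's logger)
    fail('Command execution error.')
  }
}

let message = ''
run(
  () => {
    throw new Error('boom')
  },
  (msg) => {
    message = msg
  },
)
```

Keeping the user-facing message constant while logging the underlying error separately makes CI failures easier to grep for.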

plugins/search/plugin-meilisearch/src/cli/shouldRescrape.ts

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 import type { SpawnSyncReturns } from 'node:child_process'
 import { spawnSync } from 'node:child_process'
 import { logger } from 'vuepress/utils'
-import { getWorkspaceStatus } from './utils'
+import { getWorkspaceStatus } from './utils.js'
 
 /**
  * Checks if a full rescrape is needed by examining the most recent commit
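`shouldRescrape` relies on Git to inspect the most recent commit, which is why the workflow checks out with `fetch-depth: 2`. A hedged sketch of that kind of check (the function names here are illustrative, not the plugin's actual API):

```typescript
// Illustrative sketch: list the files touched by the latest commit, then
// decide whether any Markdown source changed. Requires a Git repository
// with at least two commits available (hence `fetch-depth: 2` in CI).
import { spawnSync } from 'node:child_process'

const getChangedFiles = (cwd = process.cwd()): string[] => {
  const result = spawnSync('git', ['diff', '--name-only', 'HEAD~1', 'HEAD'], {
    cwd,
    encoding: 'utf-8',
  })
  // non-zero status covers missing git, shallow history, or no parent commit
  if (result.status !== 0) return []
  return result.stdout.split('\n').filter(Boolean)
}

const hasMarkdownChanges = (files: string[]): boolean =>
  files.some((file) => file.endsWith('.md'))
```

If no Markdown source changed, a rescrape can be skipped entirely; otherwise the changed paths feed the `only_urls` generation.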
