Documents separator groups for recursive chunking strategy (#3848)
This PR adds documentation for the `plaintext` and `markdown` separator
group options available in the `recursive` chunking strategy.
## Changes
- Added documentation section for predefined separator groups
(`plaintext` and `markdown`)
- Added regex pattern details into collapsible dropdowns
- Refined wording and structure for clarity
Related issue: #3015
---------
Co-authored-by: Benjamin Ironside Goldstein <[email protected]>
explore-analyze/elastic-inference/inference-api.md
Lines changed: 60 additions & 6 deletions
@@ -161,11 +161,64 @@ PUT _inference/sparse_embedding/word_chunks
 stack: ga 9.1
 ```
 
-The `recursive` strategy splits the input text based on a configurable list of separator patterns (for example, newlines or Markdown headers). The chunker applies these separators in order, recursively splitting any chunk that exceeds the `max_chunk_size` word limit. If no separator produces a small enough chunk, the strategy falls back to sentence-level splitting.
+The `recursive` strategy splits the input text based on a configurable list of separator patterns, such as paragraph boundaries or Markdown structural elements like headings and horizontal rules. The chunker applies these separators in order, recursively splitting any chunk that exceeds the `max_chunk_size` word limit. If no separator produces a small enough chunk, the strategy falls back to [sentence-level splitting](#sentence).
 
-##### Markdown separator group
+You can configure the `recursive` strategy using either:
+- [Predefined separator groups](#separator-groups): [`plaintext`](#plaintext) or [`markdown`](#markdown)
+- [Custom separators](#custom-separators): Define your own regular expression patterns
 
-The following example creates an {{infer}} endpoint with the `elasticsearch` service that deploys the ELSER model and configures chunking with the `recursive` strategy using the markdown separator group and a maximum of 200 words per chunk.
+##### Predefined separator groups [separator-groups]
+
+Predefined separator groups provide optimized patterns for common text formats: [`plaintext`](#plaintext) works for simple line-structured text without markup, and [`markdown`](#markdown) works for Markdown-formatted content.
+
+###### `plaintext`
+
+The `plaintext` separator group splits text at paragraph boundaries, first attempting to split on double newlines (paragraph breaks), then falling back to single newlines when chunks are still too large.
+
+:::{dropdown} Regular expression patterns for the `plaintext` separator group
+
+1. `(?<!\\n)\\n\\n(?!\\n)`: Splits on consecutive newlines that indicate paragraph breaks.
+2. `(?<!\\n)\\n(?!\\n)`: Splits on single newlines when double newlines don't produce small enough chunks.
+
+:::
+
+The following example configures chunking with the `recursive` strategy using the `plaintext` separator group and a maximum of 200 words per chunk.
+
+```console
+PUT _inference/sparse_embedding/recursive_plaintext_chunks
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "model_id": ".elser_model_2",
+    "num_allocations": 1,
+    "num_threads": 1
+  },
+  "chunking_settings": {
+    "strategy": "recursive",
+    "max_chunk_size": 200,
+    "separator_group": "plaintext"
+  }
+}
+```
+
+###### `markdown`
+
+The `markdown` separator group splits text based on Markdown structural elements, processing separators hierarchically from highest to lowest level: H1 through H6 headings, then horizontal rules.
+
+:::{dropdown} Regular expression patterns for the `markdown` separator group
+
+1. `\n# `: Splits on level 1 headings (H1).
+2. `\n## `: Splits on level 2 headings (H2).
+3. `\n### `: Splits on level 3 headings (H3).
+4. `\n#### `: Splits on level 4 headings (H4).
+5. `\n##### `: Splits on level 5 headings (H5).
+6. `\n###### `: Splits on level 6 headings (H6).
+7. `\n^(?!\\s*$).*\\n-{1,}\\n`: Splits on horizontal rules created with hyphens.
+8. `\n^(?!\\s*$).*\\n={1,}\\n`: Splits on horizontal rules created with equals signs.
+
+:::
+
+The following example configures chunking with the `recursive` strategy using the `markdown` separator group and a maximum of 200 words per chunk.
 
 ```console
 PUT _inference/sparse_embedding/recursive_markdown_chunks
@@ -184,10 +237,9 @@ PUT _inference/sparse_embedding/recursive_markdown_chunks
 }
 ```
 
-##### Custom separator group
-
-The following example creates an {{infer}} endpoint with the `elasticsearch` service that deploys the ELSER model and configures chunking with the `recursive` strategy. It uses a custom list of separators to split plaintext into chunks of up to 180 words.
+##### Custom separators
 
+If the [predefined separator groups](#separator-groups) don't meet your needs, you can define custom separators using regular expressions. The following example configures chunking with the `recursive` strategy using a custom list of separators to split text into chunks of up to 180 words.
 
 ```console
 PUT _inference/sparse_embedding/recursive_custom_chunks
@@ -236,3 +288,5 @@ PUT _inference/sparse_embedding/none_chunking
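For readers of this change, the behavior the new text describes (apply separators in order, re-split any chunk over the `max_chunk_size` word limit, fall back to sentence splitting) can be approximated with a short Python sketch. This is an illustration only, not the Elasticsearch implementation. It assumes that the doubled backslashes in the documented `plaintext` patterns are string escaping, and the helper names `word_count`, `split_sentences`, and `recursive_chunk` are hypothetical.

```python
import re

# Assumption: the documented patterns are string-escaped, so the effective
# regular expressions for the `plaintext` group are the two below.
PLAINTEXT_SEPARATORS = [
    r"(?<!\n)\n\n(?!\n)",  # exactly two newlines: a paragraph break
    r"(?<!\n)\n(?!\n)",    # exactly one newline: a line break
]


def word_count(text):
    """Count whitespace-delimited words (a stand-in for the real word counter)."""
    return len(text.split())


def split_sentences(text):
    """Naive sentence splitter used as the final fallback (hypothetical)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def recursive_chunk(text, separators, max_chunk_size):
    """Split text with each separator in order until chunks fit the word limit."""
    if word_count(text) <= max_chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to sentence-level splitting.
        return split_sentences(text)
    first, remaining = separators[0], separators[1:]
    chunks = []
    for piece in re.split(first, text):
        # Pieces that are still too large are split with the next separator.
        chunks.extend(recursive_chunk(piece, remaining, max_chunk_size))
    return chunks


text = "one two three\n\nfour five\nsix seven eight"
print(recursive_chunk(text, PLAINTEXT_SEPARATORS, max_chunk_size=3))
# prints ['one two three', 'four five', 'six seven eight']
```

The sketch discards the separator text and never merges small neighboring pieces. The real chunker may handle both differently, so treat the output as a guide to the splitting order rather than to exact chunk boundaries.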