Release 2025-05-19 #6821

Merged
merged 60 commits on May 19, 2025
Commits
7dfa3cd
Edits to PEM 10 - first set
ebgitelman Apr 25, 2025
bd8e5fa
First pass at second batch
ebgitelman Apr 25, 2025
9301470
Edits to PEM10 - second round
ebgitelman Apr 28, 2025
ffa64a1
Update 01_cluster_properties.mdx
LawrenceMan May 9, 2025
d839f18
Merge branch 'develop' into patch-1
LawrenceMan May 10, 2025
f08293f
Update index.mdx
LawrenceMan May 11, 2025
6bc2b4e
Merge branch 'develop' into patch-1
LawrenceMan May 12, 2025
306501c
Merge branch 'develop' into patch-2
LawrenceMan May 12, 2025
ce838b8
Merge branch 'develop' into patch-1
LawrenceMan May 13, 2025
0781669
Merge branch 'develop' into patch-2
LawrenceMan May 13, 2025
75af114
Release Notes for 4.1.0 stubbed
djw-m May 14, 2025
abed1b5
fix typo
noahbaculi May 15, 2025
0340e4a
update primitive examples with output
noahbaculi May 15, 2025
9bd74cc
fix typo
noahbaculi May 15, 2025
2f4d31b
update preparer usage with better column name for unnested chunk
noahbaculi May 15, 2025
23feea1
add output to chunk auto processing ex
noahbaculi May 15, 2025
8439f24
add output to chunk text ex
noahbaculi May 15, 2025
f008c1b
add output to parse html ex
noahbaculi May 15, 2025
57a6d91
add output to parse pdf ex
noahbaculi May 15, 2025
56f1f9a
add output to ocr ex
noahbaculi May 15, 2025
22f9cf9
add output to summarize ex
noahbaculi May 15, 2025
7fc3009
add tips to reference new Unnesting concept
noahbaculi May 15, 2025
334ad98
update source_key_column reference description to include uniqueness …
noahbaculi May 15, 2025
7470073
init chained preparers ex
noahbaculi May 15, 2025
d19c476
init rel notes
noahbaculi May 15, 2025
f726fe8
update generated release notes
github-actions[bot] May 15, 2025
8bca94c
refine rel note
noahbaculi May 15, 2025
cef7ae0
update generated release notes
github-actions[bot] May 15, 2025
a8aa0f4
Release Notes for 4.1.0 stubbed
djw-m May 14, 2025
e809ecf
Remove New from front page
djw-m May 15, 2025
3d41b9f
Merge branch 'DOCS-1552--aidb-aidb-4-1-0-release-train' into aidb-pre…
djw-m May 15, 2025
75e8176
Release Notes for 4.1.0 stubbed
djw-m May 14, 2025
5159c17
Remove New from front page
djw-m May 15, 2025
3540c52
Notes about model batch processing
timwaizenegger May 15, 2025
2d3d867
performance tuning guide
timwaizenegger May 15, 2025
5ed14e5
note on index type
timwaizenegger May 15, 2025
49150a7
note on index type
timwaizenegger May 15, 2025
ae429d4
Merge pull request #6762 from LawrenceMan/patch-1
nidhibhammar May 15, 2025
fe1ca3c
Merge pull request #6776 from LawrenceMan/patch-2
nidhibhammar May 15, 2025
88c1293
Merge pull request #6727 from EnterpriseDB/docs/edits_to_pem10_group1
nidhibhammar May 19, 2025
9b5df46
Merge pull request #6730 from EnterpriseDB/docs/edits_to_pem10_group2
nidhibhammar May 19, 2025
47edd9f
Add documentation for Google Cloud Storage
mildbyte May 19, 2025
0b3a128
document PGFS non-https usage
timwaizenegger May 19, 2025
0bf9b99
Merge branch 'develop' into DOCS-1552--aidb-aidb-4-1-0-release-train
timwaizenegger May 19, 2025
2cd765c
Merge branch 'DOCS-1552--aidb-aidb-4-1-0-release-train' into aidb-pre…
timwaizenegger May 19, 2025
19ac85e
update generated release notes
github-actions[bot] May 19, 2025
253ee9d
Edit of two docs in this group
ebgitelman Apr 28, 2025
fb5527d
Edits to PEM 10 - group 3
ebgitelman Apr 29, 2025
867d7a3
removed ha_using_efm file
nidhibhammar May 19, 2025
15be8b1
Merge pull request #6735 from EnterpriseDB/docs/edits_to_pem10_group3
nidhibhammar May 19, 2025
376ad8d
Release Notes for 4.1.0 stubbed
djw-m May 14, 2025
70cec1f
Remove New from front page
djw-m May 15, 2025
a30f06c
Merge branch 'DOCS-1552--aidb-aidb-4-1-0-release-train' into aidb-pre…
djw-m May 19, 2025
715608d
Merge pull request #6812 from EnterpriseDB/aidb-preparer-unnested
djw-m May 19, 2025
e6fb7cf
update generated release notes
github-actions[bot] May 19, 2025
41a7e99
Merge pull request #6820 from EnterpriseDB/feature/pgfs-gcp-support
djw-m May 19, 2025
4b78f4b
Fix bad link
djw-m May 19, 2025
198fbab
Merge branch 'develop' into DOCS-1552--aidb-aidb-4-1-0-release-train
djw-m May 19, 2025
fa88614
fix rel notes typo
djw-m May 19, 2025
9539567
Merge pull request #6809 from EnterpriseDB/DOCS-1552--aidb-aidb-4-1-0…
djw-m May 19, 2025
@@ -123,7 +123,11 @@ As well as for existing pipelines:
- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)

## Batch processing
In Background and Disabled modes, (auto) processing happens in batches of configurable size.
All records within a batch are processed in parallel wherever possible: pipeline steps such as data retrieval, embeddings computation, and storing embeddings run as parallel operations.
For example, with a table data source, a batch of input records is retrieved with a single query; with a volume source, a batch of records is retrieved with concurrent requests.

Our [knowledge base pipeline performance tuning guide](../knowledge_base/performance_tuning) explains how the batch size can be tuned for optimal throughput.
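As a minimal sketch, the batch size can be set when creating a knowledge base or adjusted later. The call shapes below follow the examples in the performance tuning guide; the knowledge base, model, and table names are placeholders, and the `'Background'` mode string is an assumption based on the modes described above:

```sql
-- Set the batch size when creating a knowledge base (names are illustrative):
SELECT aidb.create_table_knowledge_base(
    name => 'my_kb',
    model_name => 'my_model',
    source_table => 'my_source_table',
    source_data_column => 'content',
    source_data_format => 'Text',
    batch_size => 500
);

-- Or adjust it later on an existing pipeline:
SELECT aidb.set_auto_knowledge_base('my_kb', 'Background', batch_size => 500);
```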

## Change detection
AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only
@@ -0,0 +1,121 @@
---
title: "Pipelines knowledge base performance tuning"
navTitle: "Performance tuning"
deepToC: true
description: "How to tune the performance of knowledge base pipelines."
---


## Background
The performance of a knowledge base pipeline (that is, the throughput of embeddings per second) can be optimized by changing pipeline and model settings.
This guide explains the relevant settings and shows how to tune them.

Knowledge base pipelines process collections of individual records (rows in a table or objects in a volume). Rather than processing each record individually and sequentially, or processing all records at once,
AIDB offers batch processing. The batches are processed sequentially, one after the other. Within each batch, records are processed concurrently wherever possible.

- [Pipeline `batch_size`](../capabilities/auto-processing) determines how many records each batch contains.
- Some model providers have configurable internal batch/parallel processing. We recommend leaving these settings at their default values and using the pipeline batch size to control execution.

!!! Note
Vector indexing also has an impact on pipeline performance. You can disable the vector index with `index_type => 'disabled'` to exclude it from your measurements.
!!!

## Testing and tuning performance
We will first set up test data and a knowledge base pipeline, then measure and tune the batch size.

### 1) Create a table and insert test data
The length of the data content has some impact on model performance. You can use longer text to test this.
```sql
CREATE TABLE test_data_10k (id INT PRIMARY KEY, msg TEXT NOT NULL);

INSERT INTO test_data_10k (id, msg) SELECT generate_series(1, 10000) AS id, 'hello world';
```
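To gauge the impact of content length, a variant with longer messages can be generated using standard Postgres functions (the table name here is illustrative):

```sql
-- Roughly 1.2 kB of text per row instead of a short greeting:
CREATE TABLE test_data_10k_long (id INT PRIMARY KEY, msg TEXT NOT NULL);

INSERT INTO test_data_10k_long (id, msg)
SELECT generate_series(1, 10000) AS id, repeat('hello world ', 100);
```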


### 2) Create a knowledge base pipeline
The optimal batch size can vary considerably between models. Measure and tune the batch size for each model you want to use.
```sql
SELECT aidb.create_table_knowledge_base(
name => 'perf_test_b',
model_name => 'dummy', -- use the model you want to optimize for
source_table => 'test_data_10k',
source_data_column => 'msg',
source_data_format => 'Text',
index_type => 'disabled', -- optionally disable vector indexing to include/exclude it from the measurement
auto_processing => 'Disabled', -- we want to manually run the pipeline to measure the runtime
batch_size => 100 -- this is the parameter we will tune during this test
);
__OUTPUT__
INFO: using vector table: public.perf_test_vector
NOTICE: index "vdx_perf_test_vector" does not exist, skipping
NOTICE: auto-processing is set to "Disabled". Manually run "SELECT aidb.bulk_embedding('perf_test');" to compute embeddings.
create_table_knowledge_base
-----------------------------
perf_test
(1 row)
```

### 3) Run the pipeline and measure the performance
This test uses `psql`; the `\timing on` command is a psql feature. If you use a different client, check how it displays timing information.

```sql
\timing on
__OUTPUT__
Timing is on.
```

Now run the pipeline:
```sql
SELECT aidb.bulk_embedding('perf_test');
__OUTPUT__
INFO: perf_test: (re)setting state table to process all data...
INFO: perf_test: Starting... Batch size 100, unprocessed rows: 10000, count(source records): 10000, count(embeddings): 0
INFO: perf_test: Batch iteration finished, unprocessed rows: 9900, count(source records): 10000, count(embeddings): 100
INFO: perf_test: Batch iteration finished, unprocessed rows: 9800, count(source records): 10000, count(embeddings): 200
...
INFO: perf_test: Batch iteration finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
INFO: perf_test: finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
bulk_embedding
----------------

(1 row)

Time: 207161,174 ms (03:27,161)
```



### 4) Tune the batch size
Use this call to adjust the batch size of the pipeline. Here, we increase it tenfold, to 1000 records:
```sql
SELECT aidb.set_auto_knowledge_base('perf_test', 'Disabled', batch_size=>1000);
```

Run the pipeline again.

!!! Note
When using a Postgres table as the source with auto-processing disabled, AIDB has no means of detecting changes in the source data, so each `aidb.bulk_embedding()` call has to re-process everything.

This is convenient for performance testing.

If you want to measure performance with a volume source, delete and re-create the knowledge base between tests. AIDB can detect changes on volumes even with auto-processing disabled.

!!!
```sql
SELECT aidb.bulk_embedding('perf_test');
__OUTPUT__
INFO: perf_test: (re)setting state table to process all data...
INFO: perf_test: Starting... Batch size 1000, unprocessed rows: 10000, count(source records): 10000, count(embeddings): 10000
...
INFO: perf_test: finished, unprocessed rows: 0, count(source records): 10000, count(embeddings): 10000
bulk_embedding
----------------

(1 row)

Time: 154276,486 ms (02:34,276)
```


## Conclusion
In this test, the pipeline took 02:34 min with batch size 1000, compared to 03:27 min with batch size 100. You can continue testing larger batch sizes until performance stops improving or starts to decline.
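
For example, a next iteration of the test might look like this (batch size 2000 is just an illustrative next step):

```sql
-- Increase the batch size again and re-run the measurement:
SELECT aidb.set_auto_knowledge_base('perf_test', 'Disabled', batch_size => 2000);
SELECT aidb.bulk_embedding('perf_test');
```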
@@ -42,7 +42,7 @@ Based on the name of the model, the model provider sets defaults accordingly:
## Creating the default with OpenAI model

```sql
SELECT aidb.create_model('my_openai_embeddings',
'openai_embeddings',
credentials=>'{"api_key": "sk-abc123xyz456def789ghi012jkl345mn"'::JSONB);
```
@@ -58,7 +58,7 @@ SELECT aidb.create_model(
'my_openai_model',
'openai_embeddings',
'{"model": "text-embedding-3-small"}'::JSONB,
'{"api_key": "sk-abc123xyz456def789ghi012jkl345mn"}'::JSONB
'{"api_key": "sk-abc123xyz456def789ghi012jkl345mn"}'::JSONB
);
```

@@ -69,12 +69,35 @@ Because this example is passing the configuration options and the credentials, u
The following configuration settings are available for OpenAI models:

* `model` — The OpenAI model to use.
* `url` — The URL of the model to use. This value is optional and can be used to specify a custom model URL.
* If `openai_completions` (or `completions`) is the `model`, `url` defaults to `https://api.openai.com/v1/chat/completions`.
* If `nim_completions` is the `model`, `url` defaults to `https://integrate.api.nvidia.com/v1/chat/completions`.
* `max_concurrent_requests` — The maximum number of concurrent requests to make to the OpenAI model. The default is `25`.

* `max_batch_size` &mdash; The maximum number of records to send to the model in a single request. The default is `50,000`.

### Batch and parallel processing
The model providers for `embeddings`, `openai_embeddings`, and `nim_embeddings` support sending batch requests as well as concurrent requests.
The two settings `max_concurrent_requests` and `max_batch_size` control this behavior. When a model provider receives a set of records (for example, from a knowledge base pipeline),
the following happens (a configuration sketch follows the list):
* Assume the knowledge base pipeline is configured with a batch size of 10,000.
* And the model provider is configured with `max_batch_size=1000` and `max_concurrent_requests=5`.
* The provider collects up to 1000 records and sends them in a single request to the model.
* It sends 5 such large requests concurrently, until no more input records are left.
* So in this example, the provider needs to send and receive 10 batches in total.
* After sending the first 5 requests, it waits for the responses to return.
* Once a response is received, another request can be sent.
* This means the provider doesn't wait for all 5 requests to return before sending the next 5. Instead, it always keeps up to 5 requests in flight.
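
As a sketch, these options could be set explicitly when creating a model, assuming they belong in the configuration JSONB alongside `model` and `url`. The model name is illustrative, and leaving the defaults in place is usually preferable, per the note below:

```sql
-- Illustrative only: explicitly setting the batch/concurrency options described above.
SELECT aidb.create_model(
    'my_tuned_openai_model',
    'openai_embeddings',
    '{"model": "text-embedding-3-small", "max_batch_size": 1000, "max_concurrent_requests": 5}'::JSONB,
    '{"api_key": "sk-abc123xyz456def789ghi012jkl345mn"}'::JSONB
);
```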

!!! Note
The settings `max_concurrent_requests` and `max_batch_size` can have a significant impact on model performance, but the optimal values depend heavily on
the hardware and infrastructure.

We recommend leaving the defaults in place and [tuning the performance via the knowledge base pipeline batch size](../../knowledge_base/performance_tuning).
The default `max_batch_size` of 50,000 is intentionally high to allow the pipeline to control the actual size of the batches.
!!!


### Model credentials
The following credentials may be required by the service providing these models. Note: `api_key` and `basic_auth` are mutually exclusive; only one of these two options can be used.

* `api_key` &mdash; The API key to use for Bearer Token authentication. The api_key will be sent in a header field as `Authorization: Bearer <api_key>`.
@@ -0,0 +1,45 @@
---
title: "Pipelines PGFS with Google Cloud Storage"
navTitle: "Google Cloud storage"
description: "PGFS options and credentials with Google Cloud Storage."
---


## Overview: Google Cloud Storage
PGFS uses the `gs:` prefix to indicate a Google Cloud Storage bucket.

The general syntax for using GCS is this:
```sql
select pgfs.create_storage_location(
'storage_location_name',
'gs://bucket_name',
credentials => '{}'::JSONB
);
```

### The `credentials` argument in JSON format offers the following settings:
| Option | Description |
|------------------------------------|------------------------------------------|
| `google_application_credentials` | Path to the application credentials file |
| `google_service_account_key_file` | Path to the service account key file |

See the [Google Cloud documentation](https://cloud.google.com/iam/docs/keys-create-delete#creating) for more information on how to manage service account keys.

These options can also be set up via the equivalent environment variables to facilitate authentication in managed environments such as Google Kubernetes Engine.

## Example: private GCS bucket

```sql
SELECT pgfs.create_storage_location('edb_ai_example_images', 'gs://my-company-ai-images',
credentials => '{"google_service_account_key_file": "/var/run/gcs.json"}'
);
```

## Example: authentication in GKE

Ensure that the `GOOGLE_APPLICATION_CREDENTIALS` or the `GOOGLE_SERVICE_ACCOUNT_KEY_FILE` environment variable
is set on your PostgreSQL pod. PGFS then picks it up automatically:

```sql
SELECT pgfs.create_storage_location('edb_ai_example_images', 'gs://my-company-ai-images');
```
@@ -25,6 +25,7 @@ select pgfs.create_storage_location(
| `skip_signature` | Disable HMAC authentication (set this to "true" when you're not providing access_key_id/secret_access_key in the credentials). |
| `region` | The region of the S3-compatible storage system. If the region is not specified, the client will attempt auto-discovery. |
| `endpoint` | The endpoint of the S3-compatible storage system. |
| `allow_http` | Whether the endpoint uses plain HTTP (rather than HTTPS/TLS). Set this to `true` if your endpoint starts with `http://`. |

### The `credentials` argument in JSON format offers the following settings:
| Option | Description |
@@ -53,7 +54,7 @@ SELECT pgfs.create_storage_location('internal_ai_project', 's3://my-company-ai-i
);
```

## Example: non-AWS S3 / S3-compatible with HTTPS
This is an example of using an S3-compatible system such as MinIO. The `endpoint` must be provided in this case; it can be omitted only when using AWS S3.

```sql
@@ -63,4 +64,16 @@ SELECT pgfs.create_storage_location('ai_images_local_minio', 's3://my-ai-images'
);
```

## Example: non-AWS S3 / S3-compatible with HTTP
This is an example of using an S3-compatible system such as MinIO. The `endpoint` must be provided in this case; it can be omitted only when using AWS S3.

In this case, the server doesn't use TLS encryption, so we configure a plain HTTP connection.

```sql
SELECT pgfs.create_storage_location('ai_images_local_minio', 's3://my-ai-images',
options => '{"endpoint": "http://minio-api.apps.local", "allow_http":"true"}',
credentials => '{"access_key_id": "my_username", "secret_access_key":"my_password"}'
);
```


@@ -34,6 +34,12 @@ Bulk data preparation performs a preparer's associated operation for all of the
Bulk data preparation does not delete existing destination data unless it conflicts with newly generated data. It is recommended to configure separate destination tables for each preparer.
!!!

## Unnesting

Some preparer [Primitives](./primitives) transform the shape of the data they are given. For example, `ChunkText` receives one text block and produces one or more text blocks. Rather than returning nested collections of results, these Primitives automatically unnest (or "explode") their output, using a new `part_id` column to track the additional dimension.
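
Conceptually, the result is similar to Postgres's `unnest ... WITH ORDINALITY`. The following is a hypothetical illustration of the unnested output shape only, not the preparer API:

```sql
-- One input row becomes several output rows; part_id tracks the extra dimension.
-- (Whether part_id is zero- or one-based here is incidental to the illustration.)
SELECT d.id, parts.part_id, parts.chunk
FROM (VALUES (1, ARRAY['first chunk', 'second chunk'])) AS d(id, chunks),
     LATERAL unnest(d.chunks) WITH ORDINALITY AS parts(chunk, part_id);
```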

You can see this in action in [Primitives](./primitives) and in the applicable [examples](./examples).

## Consistency with source data

To ensure correct and consistent data, the prepared destination data must be in sync with the source data. In the case of the table data source, you can enable preparer auto processing to inform the preparer pipeline about changes to the source data.