
Commit b9fff73

fix: revert sources targets and sidebar to previous states
1 parent ef87b68 commit b9fff73

File tree

6 files changed: +358 −0 lines changed


docs/docs/sources/amazons3.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
---
title: AmazonS3
toc_max_heading_level: 4
description: CocoIndex AmazonS3 Built-in Sources
---

The `AmazonS3` source imports files from an Amazon S3 bucket.

### Setup for Amazon S3

#### Set up AWS accounts

You need to set up AWS accounts to own and access Amazon S3. In particular:

* Set up an AWS account on the [AWS homepage](https://aws.amazon.com/), or log in with an existing account.
* AWS recommends that all programmatic access to AWS be done using [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) instead of the root account. You can create an IAM user in the [AWS IAM Console](https://console.aws.amazon.com/iam/home).
* Make sure your IAM user has at least the following permissions in the IAM console:
  * Attach the permission policy `AmazonS3ReadOnlyAccess` for read-only access to Amazon S3.
  * (optional) Attach the permission policy `AmazonSQSFullAccess` to receive notifications from Amazon SQS, if you want to enable change event notifications.
    Note that `AmazonSQSReadOnlyAccess` is not enough, as we need to be able to delete messages from the queue after they're processed.

#### Set up Credentials for the AWS SDK

The AWS SDK needs credentials to access Amazon S3.
The easiest way to set them up is to run:

```sh
aws configure
```

This creates a credentials file at `~/.aws/credentials` and a config file at `~/.aws/config`.

See the following documents if you need more control:

* [`aws configure`](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Globally configuring AWS SDKs and tools](https://docs.aws.amazon.com/sdkref/latest/guide/creds-config-files.html)

#### Create Amazon S3 buckets

You can create an Amazon S3 bucket in the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home) and upload your files to it.

You can also do this with the AWS CLI, using `aws s3 mb` (to create buckets) and `aws s3 cp` (to upload files), as sketched below.
When doing so, make sure your current user also has the permission policy `AmazonS3FullAccess`.
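For example, a quick sketch with the AWS CLI (the bucket name and local path below are placeholders):

```sh
# Create a bucket (bucket names are globally unique; replace with your own)
aws s3 mb s3://my-cocoindex-bucket

# Upload a local directory of files into the bucket
aws s3 cp ./docs s3://my-cocoindex-bucket/docs --recursive
```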

#### (Optional) Set up SQS queue for event notifications

You can set up an Amazon Simple Queue Service (Amazon SQS) queue to receive change event notifications from Amazon S3.
It provides a change capture mechanism for your AmazonS3 data source, triggering reprocessing of your Amazon S3 files on any creation, update, or deletion. Please use a dedicated SQS queue for each of your S3 data sources.

Here's how to set it up:

* Create an SQS queue with a proper access policy.
  * In the [Amazon SQS Console](https://console.aws.amazon.com/sqs/home), create a queue.
  * Add an access policy statement to make sure Amazon S3 can send messages to the queue:

    ```json
    {
      ...
      "Statement": [
        ...
        {
          "Sid": "__publish_statement",
          "Effect": "Allow",
          "Principal": {
            "Service": "s3.amazonaws.com"
          },
          "Resource": "${SQS_QUEUE_ARN}",
          "Action": "SQS:SendMessage",
          "Condition": {
            "ArnLike": {
              "aws:SourceArn": "${S3_BUCKET_ARN}"
            }
          }
        }
      ]
    }
    ```

    Here, you need to replace `${SQS_QUEUE_ARN}` and `${S3_BUCKET_ARN}` with the actual ARNs of your SQS queue and S3 bucket.
    You can find the ARN of your SQS queue in the existing policy statement (it starts with `arn:aws:sqs:`), and the ARN of your S3 bucket in the S3 console (it starts with `arn:aws:s3:`).

* In the [Amazon S3 Console](https://s3.console.aws.amazon.com/s3/home), open your S3 bucket. Under the *Properties* tab, click *Create event notification*.
  * Fill in an arbitrary event name, e.g. `S3ChangeNotifications`.
  * If you want your AmazonS3 data source to expose a subset of files sharing a prefix, set the same prefix here. Otherwise, leave it empty.
  * Select the following event types: *All object create events*, *All object removal events*.
  * Select *SQS queue* as the destination, and specify the SQS queue you created above.

AWS's [guide on configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html#step1-create-sqs-queue-for-notification) provides more details.

### Spec

The spec takes the following fields:

* `bucket_name` (`str`): Amazon S3 bucket name.
* `prefix` (`str`, optional): if provided, only files whose paths start with this prefix will be imported.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
  Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
  If not specified, no files will be excluded.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for details.

:::

* `sqs_queue_url` (`str`, optional): if provided, the source will receive change event notifications from Amazon S3 via this SQS queue.

:::info

We will delete messages from the queue after they're processed.
If there are unrelated messages in the queue (e.g. test messages that SQS sends automatically on queue creation, messages for a different bucket, or messages for non-included files), we delete them upon receiving them, to avoid repeatedly receiving irrelevant messages after they're redelivered.

:::
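As an illustration, here is a minimal sketch of declaring this source in a flow, assuming the standard CocoIndex `flow_def` / `add_source` pattern; the bucket name, prefix, and SQS queue URL below are placeholders:

```python
import cocoindex

@cocoindex.flow_def(name="AmazonS3TextFiles")
def amazon_s3_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Import Markdown files under the "docs/" prefix as text.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AmazonS3(
            bucket_name="my-cocoindex-bucket",  # placeholder
            prefix="docs/",
            binary=False,
            included_patterns=["*.md"],
            # Optional: receive change notifications via a dedicated SQS queue.
            sqs_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
        )
    )
```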
### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.

docs/docs/sources/azureblob.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
---
title: AzureBlob
toc_max_heading_level: 4
description: CocoIndex AzureBlob Built-in Sources
---

The `AzureBlob` source imports files from Azure Blob Storage.

### Setup for Azure Blob Storage

#### Get Started

If you don't have experience with Azure Blob Storage, you can refer to the [quickstart](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal).
These are the actions you need to take:

* Create a storage account in the [Azure Portal](https://portal.azure.com/).
* Create a container in the storage account.
* Upload your files to the container.
* Grant the user / identity / service principal (depending on your authentication method; see below) access to the storage account. At minimum, a **Storage Blob Data Reader** role is needed. See [this doc](https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-data-operations-portal) for reference.

#### Authentication

We support the following authentication methods:

* Shared access signature (SAS) tokens.
  You can generate one from the Azure Portal in the settings for a specific container.
  You need to grant at least *List* and *Read* permissions when generating the SAS token.
  It's a query string of the form
  `sp=rl&st=2025-07-20T09:33:00Z&se=2025-07-19T09:48:53Z&sv=2024-11-04&sr=c&sig=i3FDjsadfklj3%23adsfkk`.

* Storage account access key. You can find it in the Azure Portal in the settings for a specific storage account.

* Default credential. When none of the above is provided, the default credential is used.

  This allows you to connect to Azure services without putting any secrets in the code or flow spec.
  It automatically chooses the best authentication method based on your environment:

  * On your local machine: uses your Azure CLI login (`az login`) or environment variables.

    ```sh
    az login
    # Optional: set a default subscription if you have more than one
    az account set --subscription "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
    ```

  * In Azure (VM, App Service, AKS, etc.): uses the resource’s Managed Identity.
  * In automated environments: supports service principals via environment variables, set as shown below:
    * `AZURE_CLIENT_ID`
    * `AZURE_TENANT_ID`
    * `AZURE_CLIENT_SECRET`
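
    For example, a sketch of setting these in a CI/automation environment (placeholder values):

    ```sh
    export AZURE_CLIENT_ID="<app-registration-client-id>"
    export AZURE_TENANT_ID="<tenant-id>"
    export AZURE_CLIENT_SECRET="<client-secret>"
    ```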

You can refer to [this doc](https://learn.microsoft.com/en-us/azure/developer/python/sdk/authentication/overview) for more details.

### Spec

The spec takes the following fields:

* `account_name` (`str`): the name of the storage account.
* `container_name` (`str`): the name of the container.
* `prefix` (`str`, optional): if provided, only files whose paths start with this prefix will be imported.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["*.tmp", "**/*.log"]`.
  Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
  If not specified, no files will be excluded.
* `sas_token` (`cocoindex.TransientAuthEntryReference[str]`, optional): a SAS token for authentication.
* `account_access_key` (`cocoindex.TransientAuthEntryReference[str]`, optional): an account access key for authentication.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for details.

:::
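For illustration, a minimal sketch of declaring this source in a flow, assuming the standard `flow_def` / `add_source` pattern; the account name, container name, and SAS token value below are placeholders:

```python
import cocoindex

@cocoindex.flow_def(name="AzureBlobTextFiles")
def azure_blob_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.AzureBlob(
            account_name="mystorageaccount",  # placeholder
            container_name="docs",            # placeholder
            included_patterns=["*.md"],
            # Optional: authenticate with a SAS token; omit to fall back to the default credential.
            sas_token=cocoindex.add_transient_auth_entry("sp=rl&st=...&se=...&sv=...&sr=c&sig=..."),
        )
    )
```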
### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.

docs/docs/sources/googledrive.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
---
title: GoogleDrive
toc_max_heading_level: 4
description: CocoIndex GoogleDrive Built-in Sources
---

The `GoogleDrive` source imports files from Google Drive.

### Setup for Google Drive

To access files in Google Drive, the `GoogleDrive` source needs to authenticate with a service account.

1. Register or log in to **Google Cloud**.
2. In the [**Google Cloud Console**](https://console.cloud.google.com/), search for *Service Accounts* to enter the *IAM & Admin / Service Accounts* page.
   - **Create a new service account**: Click *+ Create Service Account*. Follow the instructions to finish service account creation.
   - **Add a key and download the credential**: Under "Actions" for this new service account, click *Manage keys* → *Add key* → *Create new key* → *JSON*.
     Download the key file to a safe place.
3. In the **Google Cloud Console**, search for *Google Drive API*. Enable this API.
4. In **Google Drive**, share the folders containing the files that need to be imported through your source with the service account's email address.
   **Viewer permission** is sufficient.
   - The email address can be found on the *IAM & Admin / Service Accounts* page (from Step 2), in the format `{service-account-id}@{gcp-project-id}.iam.gserviceaccount.com`.
   - Copy the folder ID. The folder ID can be found in the last part of the folder's URL, e.g. `https://drive.google.com/drive/u/0/folders/{folder-id}` or `https://drive.google.com/drive/folders/{folder-id}?usp=drive_link`.

### Spec

The spec takes the following fields:

* `service_account_credential_path` (`str`): full path to the service account credential file in JSON format.
* `root_folder_ids` (`list[str]`): a list of Google Drive folder IDs to import files from.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `recent_changes_poll_interval` (`datetime.timedelta`, optional): when set, this source provides a change capture mechanism by periodically polling Google Drive for recently modified files.

:::info

Since polling only retrieves metadata for files modified since the previous poll,
it's typically cheaper than a full refresh driven by the [refresh interval](/docs/core/flow_def#refresh-interval), especially when the folder contains a large number of files.
So you can usually set it to a smaller value than the `refresh_interval`.

On the other hand, polling only detects changes for files that still exist.
If a file is deleted (or the current account no longer has access to it), this change will not be detected by this change stream.

So when a `GoogleDrive` source has `recent_changes_poll_interval` enabled, it's still recommended to also set a `refresh_interval`, with a larger value.
That way, most changes are covered by polling recent changes (with low latency, e.g. 10 seconds), while the remaining changes (files that no longer exist or are no longer accessible) are still covered (with higher latency, e.g. 5 minutes; it should be larger if you have a huge number of files, e.g. 1M).
In practice, configure them based on your requirements: how fresh does the target index need to be?

:::
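As an illustration, a minimal sketch combining both mechanisms, assuming the standard `flow_def` / `add_source` pattern and that `refresh_interval` is passed as an `add_source()` option (see the refresh-interval doc linked above); the credential path and folder ID below are placeholders:

```python
import datetime

import cocoindex

@cocoindex.flow_def(name="GoogleDriveTextFiles")
def google_drive_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path="/path/to/service_account.json",  # placeholder
            root_folder_ids=["<folder-id>"],                                  # placeholder
            # Low-latency change capture for files that still exist.
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        # Slower full refresh to catch deletions and lost access.
        refresh_interval=datetime.timedelta(minutes=5),
    )
```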
### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `file_id` (*Str*, key): the ID of the file in Google Drive.
* `filename` (*Str*): the filename of the file, without the path, e.g. `"file1.md"`.
* `mime_type` (*Str*): the MIME type of the file.
* `content` (*Str* if `binary` is `False`, otherwise *Bytes*): the content of the file.

docs/docs/sources/localfile.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
title: LocalFile
toc_max_heading_level: 4
description: CocoIndex LocalFile Built-in Sources
---

The `LocalFile` source imports files from a local file system.

### Spec

The spec takes the following fields:

* `path` (`str`): full path of the root directory to import files from.
* `binary` (`bool`, optional): whether to read files as binary (instead of text).
* `included_patterns` (`list[str]`, optional): a list of glob patterns to include files, e.g. `["*.txt", "docs/**/*.md"]`.
  If not specified, all files will be included.
* `excluded_patterns` (`list[str]`, optional): a list of glob patterns to exclude files, e.g. `["tmp", "**/node_modules"]`.
  Any file or directory matching these patterns will be excluded even if they match `included_patterns`.
  If not specified, no files will be excluded.

:::info

`included_patterns` and `excluded_patterns` use Unix-style glob syntax. See [globset syntax](https://docs.rs/globset/latest/globset/index.html#syntax) for details.

:::
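For example, a minimal sketch of declaring this source in a flow, assuming the standard `flow_def` / `add_source` pattern; the path and patterns below are placeholders:

```python
import cocoindex

@cocoindex.flow_def(name="LocalTextFiles")
def local_file_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Import Markdown files under ./docs, skipping dependency directories.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="docs",  # placeholder
            included_patterns=["*.md"],
            excluded_patterns=["**/node_modules"],
        )
    )
```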

### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with the following sub fields:

* `filename` (*Str*, key): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`.
* `content` (*Str* if `binary` is `False`, *Bytes* otherwise): the content of the file.

docs/docs/sources/postgres.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
---
title: Postgres
toc_max_heading_level: 4
description: CocoIndex Postgres Built-in Sources
---

The `Postgres` source imports rows from a PostgreSQL table.

### Setup for PostgreSQL

* Ensure the table exists and has a primary key. Tables without a primary key are not supported.
* Grant the connecting user read permissions on the target table (e.g. `SELECT`).
* Provide a database connection. You can:
  * Use CocoIndex's default database connection, or
  * Provide an explicit connection via a transient auth entry referencing a `DatabaseConnectionSpec` with a `url`, for example:

    ```python
    cocoindex.add_transient_auth_entry(
        cocoindex.sources.DatabaseConnectionSpec(
            url="postgres://user:password@host:5432/dbname?sslmode=require",
        )
    )
    ```

### Spec

The spec takes the following fields:

* `table_name` (`str`): the PostgreSQL table to read from.
* `database` (`cocoindex.TransientAuthEntryReference[DatabaseConnectionSpec]`, optional): database connection reference. If not provided, the default CocoIndex database is used.
* `included_columns` (`list[str]`, optional): non-primary-key columns to include. If not specified, all non-PK columns are included.
* `ordinal_column` (`str`, optional): a non-primary-key column used for change tracking and ordering, e.g. a modified timestamp or a monotonic version number. Supported types are integer-like (`bigint`/`integer`) and timestamps (`timestamp`, `timestamptz`).
  `ordinal_column` must not be a primary key column.
* `notification` (`cocoindex.sources.PostgresNotification`, optional): when present, enables change capture based on Postgres LISTEN/NOTIFY. It has the following fields:
  * `channel_name` (`str`, optional): the Postgres notification channel to listen on. CocoIndex will automatically create the channel with the given name. If omitted, CocoIndex uses `{flow_name}__{source_name}__cocoindex`.

:::info

If `notification` is provided, CocoIndex listens for row changes using Postgres LISTEN/NOTIFY and creates the required database objects on demand when the flow starts listening:

- Function to create the notification message: `{channel_name}_n`.
- Trigger to react to table changes: `{channel_name}_t` on the specified `table_name`.

Creation is automatic when listening begins.

Currently CocoIndex doesn't automatically clean up these objects when the flow is dropped (unlike targets).
It's usually OK to leave them as they are, but if you want to clean them up, you can run the following SQL statements to manually drop them:

```sql
DROP TRIGGER IF EXISTS {channel_name}_t ON "{table_name}";
DROP FUNCTION IF EXISTS {channel_name}_n();
```

:::
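Putting the pieces together, a minimal sketch, assuming the standard `flow_def` / `add_source` pattern; the table name, connection URL, and ordinal column below are placeholders:

```python
import cocoindex

@cocoindex.flow_def(name="PostgresProducts")
def postgres_source_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["products"] = flow_builder.add_source(
        cocoindex.sources.Postgres(
            table_name="products",  # placeholder
            # Omit `database` to use CocoIndex's default database connection.
            database=cocoindex.add_transient_auth_entry(
                cocoindex.sources.DatabaseConnectionSpec(
                    url="postgres://user:password@host:5432/dbname?sslmode=require",
                )
            ),
            ordinal_column="modified_at",  # placeholder
            notification=cocoindex.sources.PostgresNotification(),
        )
    )
```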
### Schema

The output is a [*KTable*](/docs/core/data_types#ktable) with a straightforward one-to-one mapping from Postgres table columns to CocoIndex table fields:

* Key fields: all primary key columns in the Postgres table are automatically included as key fields.
* Value fields: all non-primary-key columns in the Postgres table (those listed in `included_columns`, or all of them when not specified) appear as value fields.

### Example

You can find an end-to-end example using the Postgres source at:

* [examples/postgres_source](https://github.com/cocoindex-io/cocoindex/tree/main/examples/postgres_source)

docs/sidebars.ts

Lines changed: 5 additions & 0 deletions
@@ -48,6 +48,11 @@ const sidebars: SidebarsConfig = {
      link: { type: 'doc', id: 'sources/index' },
      collapsed: true,
      items: [
+       'sources/amazons3',
+       'sources/azureblob',
+       'sources/googledrive',
+       'sources/localfile',
+       'sources/postgres',
      ],
    },
    {
