
Issue 4130 similaritySearch throwing Exception due to schema name fix #4166


Open
wants to merge 1 commit into
base: main
from

Conversation

wilocu
Contributor

@wilocu wilocu commented Aug 15, 2025

#4130

Problem
When PgVectorStore is configured with a schema name containing hyphens (e.g., demo-1998), similaritySearch() throws a SQL grammar exception because the generated SQL uses unquoted PostgreSQL identifiers.

Solution

  • Added proper PostgreSQL identifier quoting in getFullyQualifiedTableName() to generate "schema"."table" format
  • Updated schema creation DDL to use quoted schema names
  • Enhanced schema validator to accept hyphenated names while maintaining security against SQL injection
  • Updated error messages to show proper quoted syntax in help text

Changes

  • PgVectorStore.java: Fixed SQL generation to use quoted identifiers
  • PgVectorSchemaValidator.java: Updated validation logic and error messages
  • PgVectorStoreSchemaQuotingTest.java: Added test coverage for hyphenated schema names

Example

Before (❌ SQL error):
SELECT * FROM demo-1998.vector_store -- Syntax error

After (✅ Valid PostgreSQL):
SELECT * FROM "demo-1998"."vector_store" -- Works correctly
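
As a rough illustration of how the quoted form above can be produced, here is a minimal sketch. The class and method names (QuotedTableName, quote, fullyQualified) are hypothetical and are not taken from PgVectorStore; the sketch only assumes the standard PostgreSQL rule that a quoted identifier is wrapped in double quotes and any embedded double quote is doubled.

```java
// Hypothetical helper for illustration only; not the actual PgVectorStore code.
final class QuotedTableName {

    private QuotedTableName() {
    }

    // Wraps an identifier in double quotes and doubles any embedded double quote.
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    // Builds the "schema"."table" form from raw schema and table names.
    static String fullyQualified(String schema, String table) {
        return quote(schema) + "." + quote(table);
    }
}
```

Under these assumptions, QuotedTableName.fullyQualified("demo-1998", "vector_store") yields "demo-1998"."vector_store", which matches the "After" query.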

Testing

  • All existing tests pass
  • New test verifies proper quoting for hyphenated schema names
  • Verified SQL generation produces valid PostgreSQL syntax

Signed-off-by: Mattia Pasetto [email protected]

I am still pretty new to contributions, please double check and let me know of any errors.

@markpollack
Member

Thanks! There is certainly a bug to fix.

I am not an expert on these database best practices, so I did some investigation with ChatGPT.

Here is the reply. I would like to know your thoughts, @wilocu.

This is a real bug, and the core fix (always quoting identifiers) is right. The piece to adjust is the validator.


Quoting identifiers in getFullyQualifiedTableName() and DDL is the correct fix for the hyphenated-schema bug. A couple of tweaks will align this with PostgreSQL rules and avoid false rejections:

  1. Identifier rules
    • PostgreSQL limits identifiers to 63 bytes (UTF-8), not 64 chars.
    • Numeric names, reserved words, and symbols (e.g., -, .) are legal when quoted.
    • Double quotes inside names must be escaped as "".
  2. Validator
    • Drop bans on numeric-only, keywords, ;, --, etc. If we always quote+escape, those are safe.
    • Keep: non-null, non-empty, ≤63 bytes, and no NUL/control chars.
    • Existence check: use SELECT to_regclass(quote_ident($1)||'.'||quote_ident($2)) IS NOT NULL to respect case/quoting instead of information_schema.
  3. Quoting helper
    Ensure the FQN builder escapes quotes:
    private static String q(String s) { return "\"" + s.replace("\"", "\"\"") + "\""; }
    // usage: q(schema) + "." + q(table)
  4. Tests
    Add cases for "demo-1998", "123", "select", a"b, non-ASCII, and 63-byte boundary.

With those adjustments, the quoting fix resolves the bug without over-restricting valid PostgreSQL identifiers.
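
The relaxed validator rules suggested in the reply (non-null, non-empty, at most 63 UTF-8 bytes, no NUL or control characters) could look roughly like the sketch below. The class name IdentifierCheck and the method isValid are hypothetical and do not reflect the actual PgVectorSchemaValidator API; the sketch also assumes identifiers are always quoted and escaped before they reach SQL, so it does not reject keywords, digits, or punctuation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical validator sketch; names are illustrative, not PgVectorSchemaValidator's API.
final class IdentifierCheck {

    // PostgreSQL's default identifier limit in bytes (NAMEDATALEN - 1).
    static final int MAX_IDENTIFIER_BYTES = 63;

    private IdentifierCheck() {
    }

    // Returns true when the name is safe to use once it has been quoted and escaped.
    static boolean isValid(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        // The limit applies to encoded bytes, so multi-byte characters count more than once.
        if (name.getBytes(StandardCharsets.UTF_8).length > MAX_IDENTIFIER_BYTES) {
            return false;
        }
        // Reject NUL and other ISO control characters.
        return name.codePoints().noneMatch(Character::isISOControl);
    }
}
```

With this sketch, "demo-1998", "123", "select", and a"b are all accepted, a 63-character ASCII name passes while a 64-character one fails, and a name of 32 two-byte characters (64 bytes) is rejected even though it is only 32 characters long.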
