@jcflack jcflack commented Jan 24, 2022

As a work-in-progress pull request, this is not expected to be imminently merged, but is here to document the objectives and progress of the ongoing work.

Why needed

A great advantage promised by a PL based on the JVM is the large ecosystem of languages other than Java that can be supported on the same infrastructure, whether through the Java Scripting (JSR 223) API, or through the polyglot facilities of GraalVM, or simply via separate compilation to the class file format and loading as jars.

However, PL/Java, with its origins in 2004 predating most of those developments, has architectural limitations that stand in the way.

JDBC

One of the limitations is the centrality of the JDBC API. To be sure, it is a standard in the Java world for access to a database, and for PL/Java to conform to ISO SQL/JRT, the JDBC API must be available. But it is not necessarily a preferred or natural database API for other JVM or GraalVM languages, and its design goal is to abstract away from the specifics of an underlying database, which ends up complicating or even preventing access to advanced PostgreSQL capabilities that could be prime drivers for running server-side code in the first place.

The problem is not that JDBC is an available API in PL/Java, but that it is the fundamental API in PL/Java, with its tentacles reaching right into the native C language portion of PL/Java's implementation. That has made alternative interface options impractical, and multiplied the maintenance burden of even simple tasks like adding support for new datatype mappings or fixing simple bugs. There are significant portions of JDBC 4 that remain unimplemented in PL/Java.

Experience building an implementation of ISO SQL/XML XMLQUERY showed that certain requirements of the spec were simply unsatisfiable atop JDBC, either because of inherent JDBC limitations or limits in PL/Java's implementation of it. An example of each kind:

  • The INTERVAL data type cannot be mapped as SQL/XML requires, because the only ResultSetMetaData methods JDBC defines for access to a type modifier are precision and scale, which apply to numeric values; the API defines no standard way to learn what the modifier of an INTERVAL says about whether months or days are present.
  • The DECIMAL type cannot be mapped as SQL/XML requires; for that case, the fault is not with JDBC (which defines the precision and scale methods), but with their incomplete implementation in PL/Java.

Those cases also illustrate that mapping some PostgreSQL data types to those of another language can be complex. An arbitrary PostgreSQL INTERVAL is representable as neither a java.time.Period nor a java.time.Duration alone (though a pair of the two can be used, a type that PGJDBC-NG offers). One or the other can suffice if the type modifier is known and limits the fields present. A PostgreSQL NUMERIC value has not-a-number and signed infinity values that some candidate language-library type might not, and an internal precision that its text representation does not reveal, which might need to be preserved for a mathematically demanding task. The details of converting it to another language's similar type need to be knowable or controllable by an application.
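The Period-plus-Duration pairing mentioned above can be sketched in plain java.time terms. This PgInterval holder is purely illustrative (it is neither the PGJDBC-NG type nor anything in this PR); it only shows why the three independent INTERVAL fields need both classes.

```java
import java.time.Duration;
import java.time.Period;

// Sketch: a PostgreSQL INTERVAL carries months, days, and microseconds
// as independent fields; neither Period nor Duration alone can hold all
// three, but a pair of them can. PgInterval here is purely illustrative.
public final class PgInterval {
    private final Period period;     // months + days (calendar-relative part)
    private final Duration duration; // microseconds (exact-time part)

    public PgInterval(int months, int days, long micros) {
        this.period = Period.of(0, months, days).normalized();
        this.duration = Duration.ofSeconds(micros / 1_000_000L,
                                           (micros % 1_000_000L) * 1_000L);
    }

    public Period period()     { return period; }
    public Duration duration() { return duration; }

    public static void main(String[] args) {
        PgInterval iv = new PgInterval(14, 3, 90_000_000L); // 1y 2m 3d 90s
        System.out.println(iv.period() + " " + iv.duration());
    }
}
```

Period carries the calendar-relative months and days; Duration carries the exact microseconds; collapsing either into the other would change the value's meaning under daylight-saving shifts or varying month lengths.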

It is a goal of this work to give PL/Java an API that does not obscure or abstract from PostgreSQL details, but makes them accessible in a natural Java idiom, and that such a "natural PostgreSQL" API should be adequate to allow building a JDBC layer in pure Java above it. (The work of building such a JDBC layer is not in the scope of this pull request.)

Parameter and return-value mapping

PL/Java uses a simple, Java-centric approach where a Java method is declared naturally, giving ordinary Java types for its parameters and return, and the mappings from these to the PostgreSQL parameter and return types are chosen by PL/Java and applied transparently (and much of that happens deep in PL/Java's C code).

While convenient, that approach isn't easily adapted to other JVM languages that may offer other selections of types. Even for Java, it stands in the way of doing certain things possible in PostgreSQL, like declaring VARIADIC "any" functions.

In a modernized API, it needs to be possible to declare a function whose parameter represents the PostgreSQL FunctionCallInfo, so that the parameters and their types can be examined and converted in Java. That will make it possible to write language handlers in Java, whether for other JVM languages or for the existing PL/Java calling conventions that at present are tangled in C.

Elements of new API

Identification of data types

A PostgreSQL-specific API must be able to refer unambiguously to any type known to the database, so it cannot rely on any fixed set of generic types such as JDBCType. To interoperate with a JDBC layer, though, the identifier for types should implement JDBC's SQLType interface.

The API should support retrieving enough metadata about the type for a JDBC layer implemented above it to be able to report complete ResultSetMetaData information.

The new class serving this purpose is RegType.

As RegType implements the java.sql.SQLType interface, an aliasing issue arises for a JDBC layer. Such a layer should accept JDBCType.VARCHAR as an alias for RegType.VARCHAR, for example. JDBC itself has no methods that return an SQLType instance, so the question of whether it should return the generic JDBC type or the true RegType does not arise. A PL/Java-specific API is needed for retrieving the type identifier in any case.
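A JDBC layer's aliasing check might look like the following sketch. No real RegType appears here, only a hand-rolled SQLType stand-in, and the name-based aliasing rule is an assumption for illustration, not the rule the eventual JDBC layer will use.

```java
import java.sql.JDBCType;
import java.sql.SQLType;

// Sketch of the aliasing question: a JDBC layer over this API might
// accept the generic JDBCType.VARCHAR wherever the true RegType for
// varchar is expected. VARCHAR_REGTYPE is a stand-in, not the real class.
public final class TypeAlias {
    // A minimal SQLType implementation standing in for RegType.VARCHAR.
    public static final SQLType VARCHAR_REGTYPE = new SQLType() {
        public String getName()              { return "varchar"; }
        public String getVendor()            { return "org.postgresql"; }
        public Integer getVendorTypeNumber() { return 1043; } // varchar's pg_type OID
    };

    // One plausible aliasing rule: match on the standard JDBC type name.
    public static boolean aliases(SQLType generic, SQLType pgType) {
        return generic.getName().equalsIgnoreCase(pgType.getName());
    }

    public static void main(String[] args) {
        System.out.println(aliases(JDBCType.VARCHAR, VARCHAR_REGTYPE)); // true
    }
}
```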

The details of which JDBC types are considered aliases of which RegTypes will naturally belong in a JDBC API layer. At the level of this underlying API, a RegType is what identifies a PostgreSQL type.

While RegType includes convenience final fields for a number of common types, those by no means limit the RegTypes available. There is a RegType that can be obtained for every type known to the database, whether built in, extension-supplied, or user-defined.

Other PostgreSQL catalog objects and key abstractions

RegType is one among the types of PostgreSQL catalog objects modeled in the org.postgresql.pljava.model package.

Along with a number of catalog object types, the package also contains:

  • TupleDescriptor and TupleTableSlot, the key abstractions for fetching and storing database values. TupleTableSlot in PostgreSQL is already a useful abstraction over a few different representations; in PL/Java it is further abstracted, and can present with the same API other collections of typed, possibly named, items, such as arrays, the arguments in a function call, etc.
  • MemoryContext and ResourceOwner, both subtypes of Lifespan, usable to guard Java objects that have native state whose validity is bounded in time
  • CharsetEncoding

Mapping PostgreSQL data types to what a PL supports

The Adapter class

A mapping between a PostgreSQL data type and a suitable PL data type is an instance of the Adapter class, and more specifically of the reference-returning Adapter.As<T,U> or one of the primitive-returning Adapter.AsInt<U>, Adapter.AsFloat<U>, and so on (one for each Java primitive type). The Java type produced is T for the As case, and implicit in the class name for the AsFoo cases.

The basic method for fetching a value from a TupleTableSlot is get(Attribute att, Adapter adp), naturally overloaded and generic so that get with an As<T,?> adapter returns a T, get with an AsInt<?> adapter returns an int, and so on. (A later comment in this thread describes a better API than this item-at-a-time approach.) (The U type parameter of an adapter plays a role when adapters are combined by composition, as discussed below, and is otherwise usually uninteresting to client code, which may wildcard it, as seen above.)
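The overload pattern can be mocked in a few lines of plain Java. These stand-in As/AsInt/Slot classes (no Attribute parameter, no real datums) are illustrative only; they show how the compiler selects the reference-returning or primitive-returning get.

```java
// Minimal mock of the overload pattern described above: the real API's
// get(Attribute, Adapter) is overloaded and generic so that a reference
// adapter returns T while a primitive adapter returns int with no boxing.
public final class GetOverloads {
    public static abstract class As<T, U> { public abstract T adapt(Object datum); }
    public static abstract class AsInt<U> { public abstract int adapt(Object datum); }

    public static final As<String, Void> TEXT = new As<String, Void>() {
        public String adapt(Object d) { return d.toString(); }
    };
    public static final AsInt<Void> INT4 = new AsInt<Void>() {
        public int adapt(Object d) { return ((Number) d).intValue(); }
    };

    public static class Slot {
        private final Object datum;
        public Slot(Object datum) { this.datum = datum; }
        public <T> T get(As<T, ?> adp) { return adp.adapt(datum); } // returns T
        public int   get(AsInt<?> adp) { return adp.adapt(datum); } // returns int
    }

    public static void main(String[] args) {
        Slot s = new Slot(42);
        String t = s.get(TEXT); // compile-time type String
        int    i = s.get(INT4); // compile-time type int, no boxing
        System.out.println(t + " " + i);
    }
}
```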

A manager class for adapters

Natural use of this idiom presumes there will be some adapter-manager API that allows client code to request an adapter for some PostgreSQL type by specifying a Java witness class Class<T> or some form of super type token, and returns the adapter with the expected compile-time parameterized type.

That manager hasn't been built yet, but the requirements are straightforward and no thorny bits are foreseen. (Within the org.postgresql.pljava.internal module itself, things are simpler; no manager is needed, and code refers directly to static final INSTANCE fields of existing adapters.)

Extensibility

PL/Java has historically supported user-defined types implemented in Java, a special class of data types whose Java representations must implement a certain JDBC interface and import and export values through a matching JDBC API. In contrast, PL/Java's first-class PostgreSQL data type support—the mappings it supplies between PostgreSQL and ordinary Java types that don't involve the specialized JDBC user-defined type APIs—has been hardcoded in C using Java Native Interface (JNI) calls, and not straightforward to extend. That's a pain point for several situations:

  • A mapping for another PostgreSQL data type (either a type newly added to PostgreSQL, or simply one that PL/Java does not yet have a mapping for) is not easily added for an application that needs it, but generally must be added in PL/Java's C/JNI internals and made available in a new PL/Java build.
  • A mapping of an existing PostgreSQL data type to a new or different Java type—same story. When Java 8 introduced the java.time package, developers wishing to have PL/Java map PostgreSQL's date and time types to the improved Java types instead of the older java.sql ones had to open issues requesting that ability and wait for a PL/Java release to include it.
  • Not every PostgreSQL data type has a single best PL type to be mapped to. One application using the geometric types might want them mapped to the Java types in the PGJDBC library, while another might prefer the 2D classes supplied by some Java geometry library. One application might want a PostgreSQL array mapped to a flat Java List, another to a multi-dimensioned Java array, another to a matrix class from a scientific computation library. The choices multiply when considering the data types not only of Java but of other JVM languages. C coding and rebuilding of PL/Java should not be needed to tailor these mappings.

Adapters implementable in pure Java

With this PR, code external to PL/Java's implementation can supply adapters, built against the service-provider API exposed in org.postgresql.pljava.adt.spi.

Leaf adapters

A "leaf" adapter is one that directly knows the PostgreSQL datum format of its data type, and maps that to a suitable PL type. Only a leaf adapter gets access to PostgreSQL datums, which it should not leak to other code. Code that defines leaf adapters must be granted a permission in pljava.policy.

Composing adapters

A composing, or non-leaf, adapter is one meant to be composed over another adapter. An example would be an adapter that composes over an adapter returning type T (possibly null) to form an adapter returning Optional<T>. With a selection of common composing adapters (there aren't any in this pull request, yet), it isn't necessary to provide leaf adapters covering all the ways application code might want data to be presented. No special permission is needed to create a composing adapter.
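A minimal sketch of the composing idea, with the adapter reduced to a plain function from the underlying value to the mapped type; the real Adapter.As also carries and tracks the parameterized types, and nothing here touches a PostgreSQL datum.

```java
import java.util.Optional;

public final class Composing {
    // Reduced adapter shape for illustration only.
    public interface As<T, U> { T adapt(U under); }

    // A composing adapter: wraps any adapter producing T (possibly null)
    // into one producing Optional<T>. It never sees a PostgreSQL datum.
    public static <T, U> As<Optional<T>, U> optional(As<T, U> over) {
        return u -> Optional.ofNullable(over.adapt(u));
    }

    // A pretend underlying adapter that can produce null.
    public static final As<String, Object> TEXTISH =
        d -> d == null ? null : d.toString();

    public static void main(String[] args) {
        As<Optional<String>, Object> opt = optional(TEXTISH);
        System.out.println(opt.adapt("x"));  // Optional[x]
        System.out.println(opt.adapt(null)); // Optional.empty
    }
}
```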

Java's generic types are erased to raw types at runtime, but the Java compiler records the parameterizations in the class file, where they remain accessible through reflection. As adapters are composed, the Adapter class tracks the type relationships so that, for example, an Adapter<Optional<T>,T> composed over an Adapter<String,Void> is known to produce Optional<String>.

It is that information that will allow an adapter manager to satisfy a request to map a given PostgreSQL type to some PL type, by finding and composing available adapters.

Contract-based adapters

For a PostgreSQL data type that doesn't have one obvious best mapping to a PL type (perhaps because there are multiple choices with different advantages, or because there is no suitable type in the PL's base library, and any application will want the type mapped to something in a chosen third-party library), a contract-based adapter may be best. An Adapter.Contract is a functional interface with parameters that define the semantically-important components of the PostgreSQL type, and a generic return type, so an implementation can return any desired representation for the type.

A contract-based adapter is a leaf adapter class with a constructor that accepts a Contract, producing an adapter between the PostgreSQL type and whatever PL type the contract maps it to. The adapter encapsulates the internal details of how a PostgreSQL datum encodes the value, and the contract exposes the semantic details needed to faithfully map the type. Contracts for many existing PostgreSQL types are provided in the org.postgresql.pljava.adt package.
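The division of labor can be sketched with an invented "point" type: the contract fixes the semantic components (x and y), the pretend leaf adapter knows the (here invented) datum layout, and a lambda chooses the final representation. Nothing in this sketch is the real API.

```java
// Sketch of the contract idiom: the adapter knows how a (here,
// imaginary) datum encodes the value; the contract fixes what the
// semantic components are; a lambda chooses the representation.
public final class ContractSketch {
    @FunctionalInterface
    public interface PointContract<T> { T construct(double x, double y); }

    // A pretend "leaf adapter": decodes a datum invented here as a
    // two-element double array, then hands the components to the contract.
    public static <T> T adaptPoint(double[] datum, PointContract<T> contract) {
        return contract.construct(datum[0], datum[1]);
    }

    public static void main(String[] args) {
        double[] datum = { 3.0, 4.0 };
        // Same datum, two representations, chosen by the contract lambda:
        double[] asArray = adaptPoint(datum, (x, y) -> new double[] { x, y });
        String   asText  = adaptPoint(datum, (x, y) -> x + "," + y);
        System.out.println(asText);
    }
}
```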

ArrayAdapter

The one supplied ArrayAdapter is contract-based. While a Contract.Array has a single abstract method, and therefore could serve as a functional interface, in practice it is not directly implementable by a lambda; there must be a subclass or subinterface (possibly anonymous) whose type parameterization the Java compiler can record. (A lambda may then be used to instantiate that.) An instance of ArrayAdapter is constructed by supplying an adapter for the array's element type along with an array contract targeting some kind of collection of the mapped type. As with a composing adapter, the Adapter class substitutes the element adapter's target Java type through the type parameters of the array contract, to arrive at the actual parameterized type of the resulting array or collection.
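The lambda limitation is ordinary Java behavior and can be demonstrated standalone: the compiler records an anonymous subclass's type parameterization in the class file, while a lambda's class reports only the raw interface. This demo uses a plain generic interface, not the real Contract.Array, but the mechanism is the same.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

public final class TypeTokenDemo {
    public interface Contract<T> { T make(); }

    // Returns the recorded type argument, or null if none was recorded.
    public static Type recordedArgument(Contract<?> c) {
        Type t = c.getClass().getGenericInterfaces()[0];
        if (t instanceof ParameterizedType)
            return ((ParameterizedType) t).getActualTypeArguments()[0];
        return null; // a lambda's class reports only the raw interface
    }

    public static void main(String[] args) {
        Contract<String> anon = new Contract<String>() { // anonymous subclass
            public String make() { return ""; }
        };
        Contract<String> lambda = () -> "";
        System.out.println(recordedArgument(anon));   // java.lang.String
        System.out.println(recordedArgument(lambda)); // null
    }
}
```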

PostgreSQL arrays can be multidimensional, and are regular (not "jagged"; all sub-arrays at a given dimension match in size). They can have null elements, which are tracked in a bitmap, offering a simple way to save some space for arrays that are sparse; there are no other, more specialized sparse-array provisions.

Array indices need not be 0- or 1-based; the base index as well as the index range can be given independently for each dimension. PostgreSQL creates 1-based arrays by default. This information is stored with the array value, not with the array type, so a column declared with an array type could conceivably have values of different cardinalities or even dimensionalities.
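Per-dimension sizes and lower bounds make flat addressing a small exercise. This helper, along the lines of the indexing function contemplated in the open items below, computes a 0-based row-major index; the shape of any eventual API is an assumption here.

```java
public final class ArrayIndexing {
    // Computes a 0-based row-major flat index from per-dimension indices,
    // given each dimension's size and lower bound (as PostgreSQL stores
    // them with the array value).
    public static int flatIndex(int[] sizes, int[] lowerBounds, int[] idx) {
        int flat = 0;
        for (int d = 0; d < sizes.length; d++) {
            int i = idx[d] - lowerBounds[d];     // shift to 0-based
            if (i < 0 || i >= sizes[d])
                throw new IndexOutOfBoundsException("dimension " + d);
            flat = flat * sizes[d] + i;          // row-major accumulation
        }
        return flat;
    }

    public static void main(String[] args) {
        // A 2x3 array with PostgreSQL's default 1-based indices:
        int[] sizes = { 2, 3 }, lb = { 1, 1 };
        System.out.println(flatIndex(sizes, lb, new int[] { 2, 3 })); // 5
    }
}
```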

The adapter is contract-based because there are many ways application code could want a PostgreSQL array to be presented: as a List or single Java array (flattening multiple dimensions, if present, to one, and disregarding the base index), as a Java array-of-arrays, as a JDBC Array object (which does not officially contemplate more than one array dimension, but PostgreSQL's JDBC drivers have used it to represent multidimensioned arrays), as the matrix type offered by some scientific computation library, and so on.

For now, one predefined contract is supplied, AsFlatList, and a static method, nullsIncludedCopy, that can be used (via method reference) as one implementation of that contract.
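What a nulls-included flat copy amounts to can be sketched over a plain Java array-of-arrays. This stands in for AsFlatList and nullsIncludedCopy, whose real input is the adapter-mapped element sequence rather than a Java array.

```java
import java.util.ArrayList;
import java.util.List;

public final class FlatListSketch {
    // All elements, in storage order, flattened to one dimension with
    // nulls preserved and per-dimension bounds disregarded.
    public static <T> List<T> nullsIncludedCopy(T[][] twoDim) {
        List<T> flat = new ArrayList<>();
        for (T[] row : twoDim)
            for (T elem : row)
                flat.add(elem); // nulls preserved, dimensions dropped
        return flat;
    }

    public static void main(String[] args) {
        Integer[][] v = { { 1, null }, { 3, 4 } };
        System.out.println(nullsIncludedCopy(v)); // [1, null, 3, 4]
    }
}
```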

Java array-of-arrays

While perhaps not an extremely efficient way to represent multidimensional arrays, the Java array-of-arrays approach is familiar, and benefits from a bit of dedicated support in Adapter. If you have an Adapter a that renders a PostgreSQL type Foo as Java type Bar, you can use, for example, a.a2().build() to obtain an Adapter from the PostgreSQL array type Foo[] to the Java type Bar[][]. The resulting adapter requires the PostgreSQL array to have two dimensions, allows each value to have different sizes along those dimensions, but disregards the PostgreSQL array's start indices (all Java arrays start at 0).

Because PostgreSQL stores the dimension information with each value and does not enforce it for a column as a whole, a column of array values can include values with differing numbers of dimensions, which an adapter constructed this way will reject. On the other hand, PostgreSQL also allows the sizes along each dimension to vary from one value to the next, and this adapter accommodates that, as long as the number of dimensions doesn't change.

The existing contract-based ArrayAdapter is used behind the scenes, but build() takes care of generating the contract. Examples are provided.

Adapter maintainability

Providing pure-Java adapters that know the internal layouts of PostgreSQL data types, without relying on JNI calls and the PostgreSQL native support routines, entails a parallel-implementation maintenance responsibility roughly comparable to that of PostgreSQL client drivers that support binary send and receive. (The risk is slightly higher because the backend internal layouts are less committed than the send/receive representations. Because they are used for data on disk, though, historically they have not changed often or capriciously.)

The engineering judgment is that the resulting burden will be manageable, and the benefits in clarity and maintainability of the pure-Java implementations, compared to the brittle legacy Java+C+JNI approach, will predominate. The process of developing clear contracts for PostgreSQL types already has led to discovery of one bug (#390) that could be fixed in the legacy conversions.

For the adapters supplied in the org.postgresql.pljava.internal module, it is possible to use ModelConstants.java/ModelConstants.c to ensure that key constants (offsets, flags, etc.) stay synchronized with their counterparts in the PostgreSQL C code.

Adapter is a class in the API module, with the express intent that other adapters can be developed, and found by the adapter manager through a ServiceLoader API, without being internal to PL/Java. Those might not have the same opportunity for build-time checking against PostgreSQL header files, and will have to rely more heavily on regression tests for key data values, much as binary-supporting client drivers must. The same can be true even for PL/Java internal adapters for a few PostgreSQL data types whose C implementations are so strongly encapsulated (numeric comes to mind) that necessary layouts and constants do not appear in .h files.

Known open items

In no well-defined order ....

  • The to-PostgreSQL direction for Adapter, TupleTableSlot, and Datum.Accessor. These all have API and implementation for getting PostgreSQL values and presenting them in Java. Now the other direction is needed.
  • Provide API and implementation for a unified list-of-slots representation for a variety of list-of-tuple representations used in PostgreSQL, by:
    • factoring out the list-of-TupleTableSlot classes currently found as preliminary scaffolding in TupleTableSlot.java
    • providing such a representation for SPITupleTable ...
    • CatCList ...
    • Tuplestore? ...
    • ...?
  • Implement some form of offset memoization so fetching attributes from a heap TupleTableSlot stays subquadratic
  • Finish the unimplemented grants methods of RegRole and the unimplemented unary one of CatalogObject.AccessControlled. (Needs the CatCList support, for pg_auth_members searches.)
  • A NullableDatum flavor of TupleTableSlot. One of the last prerequisites to enable pure-Java language-handler implementations, to which the function arguments will appear as a TupleTableSlot.
  • Complete the implementation of isSubtype with the rules from Java Language Specification 4.10. (At present it is a stub that only checks erased subtyping, enough to get things initially going.)
  • The adapter manager described above. (Requires isSubtype.)
  • Adapters for PostgreSQL types that don't have them yet (starting, perhaps, with the ones that already have contracts defined in org.postgresql.pljava.adt).
  • TextAdapter does not yet support the type modifiers for CHAR and VARCHAR. It needs a contract-based flavor that does.
  • ArrayAdapter (or Contract.Array) should supply at least one convenience method, taking a dimsAndBounds array parameter and generating an indexing function (a MethodHandle?) that has nDims integer parameters and returns an integer flat index. Other related operations? An index enumerator, etc.?
  • A useful initial set of composing adapters, such as:
    • one of the form As<Optional<T>,T>
      • implement in an example class
      • integrate into PL/Java proper
    • one extending As<T,T> that returns null for null and values unchanged
      • why? because with adapter autoboxing, it can be composed over any primitive-returning adapter to enable it to handle null, by returning its boxed form
      • implement in an example class
      • integrate into PL/Java proper
    • a set composing over primitive adapters to use a specified value in the primitive's value space to represent null.
      • implement in an example class
      • complete the set and integrate into PL/Java proper
  • More work on CatalogObject invalidation. RegClass and RegType are already invalidated selectively; probably RegProcedure should be also. PostgreSQL has a limited number of callback slots, so it would be antisocial to grab them for all the supported classes: less critical ones just depend on the global switchpoint; come up with a good story for invalidating those. Also for how TupleDescriptor should behave upon invalidation of its RegClass. See commit comments for 5adf2c8.
  • Better define and implement the DualState behavior of TupleTableSlot.
  • Reduce the C-centricity of VarlenaWrapper. Goal: DatumUtils.mapVarlena doing more in Java, less in C.
    • more of VarlenaWrapper's functionality moved to DatumImpl
    • client code no longer casting Datum.Input to VarlenaWrapper to use it.
  • Adapter should have control over the park/fetch/decompress/lifespan decisions for VarlenaWrapper; currently the behavior is hardcoded for top-transaction lifespan, lazy detoasting, appropriate for SQLXML, which was the first VarlenaWrapper client.
  • Add MBeans with statistics for the new caches

And then

  • Choose some interesting JVM language foo and implement a simple PL/foo in pure Java, using these facilities.
  • Reimplement PL/Java's own language handler the same way.

Tweak invocation.c so the stack-allocated space provided by the caller
is used to save the prior state rather than to construct the new state.
This way, the current state can have a fixed address (currentInvocation
is a constant pointer) and can be covered by a single static
ByteBuffer that Invocation.java can read/write through without relying
on JNI methods.

As Invocation isn't a JDBC-specific concept or class, it has never
made much sense to have it in the .jdbc package. Move it to .internal.

Both values have just been stashed by stashCallContext.
Both will be restored 14 lines later by _closeIteration.
And nothing in those 14 lines cares about them.

After surveying the code for where function return values can
be constructed, add one switchToUpperContext() around the construction
of non-composite SRF return values, where it was missing, so such values
can be returned correctly after SPI_finish(), and so the former,
very hacky, cross-invocation retention of SPI contexts can be sent
to pasture.

For the record, these are the notes from that survey of the code:

Function results, non-set-returning:
 Type_invoke:
  the inherited _Type_invoke calls ->coerceObject, within sTUC.
  sub"class"es that override it:
   Boolean,Byte,Double,Float,Integer,Long,Short,Void:
   - overridden in order to use appropriately-typed JNI invoke method
   - Double,Float,Long have _asDatum that does sTUC;
     . historical artifact; those types were !byval before PG 8.4
   - the rest do not sTUC; should be ok, all byval
   Coerce: does sTUC
   Composite: does sTUC around _getTupleAndClear
 Arrays:
  createArrayType (extern, in Array.c) does sTUC. So far so good.
  What about !byval elements stored into the array?
   the non-primitive/any types don't override _Array_coerceObject,
   which is where Type_coerceObject on each element, and construct_md_array
   are called. With no sTUC. Around construct_md_array is really where it's
   needed.
   But then, _Array_coerceObject is still being called within sTUC
   of _Type_invoke. All good.
   Hmm: !byval elements of values[] are leaked when pfree(values) happens.
   They should be pfree'd unconditionally; construct_md_array copies them.
 What about UDTs?
  They don't override _Type_invoke.
  So they inherit the one that calls ->coerceObject, within sTUC.
  That ought to be enough. UDT.c's coerceScalarObject itself also sTUCs,
  inconsistently, for fixed-length and varlena types but not NUL-terminated.
  That should be ok, and merely redundant. In coerceTupleObject, no sTUC
  appears. Again, by inheritance of coerceObject, that should be ok.
  Absent that, sTUC around the SQLOutputToTuple_getTuple should be adequate;
  only if that could produce a tuple with TOAST pointers would it also be
  necessary around the HeapTupleGetDatum.


Function results, set-returning:
 _datumFromSRF is applied to each row result
 The inherited _datumFromSRF calls Type_coerceObject, NOT within sTUC
  XXX this, at least, definitely needs a sTUC added.
 sub"class"es that override it:
  only Composite: calls _getTupleAndClear, NOT within sTUC. But it
  works out, just because TupleDesc.java's native _formTuple method uses
  JavaMemoryContext. Spooky action at a distance?


Results from triggers:
 Function.c's invokeTrigger does sTUC around the getTriggerReturnTuple.

In passing, fix a long-standing thinko in Invocation_popInvocation:
the memory context that was current on entry is stored in upperContext
of *this* Invocation, but popInvocation was 'restoring' the one that was
saved in the *previous* Invocation.

Also in passing, move the cleanEnqueuedInstances step later in the
pop sequence, improving its chance of seeing instances that could become
unreachable through the release of SPI contexts or the JNI local frame.
This can reveal issues with the nesting of SPI 'connections' or
management of their associated memory contexts.

Without the special treatment, the instance of the Java class
Invocation, if any, that corresponds to the C Invocation, has its
lifetime simply bounded to that of the C Invocation, rather than
artificially extended across a sequence of SRF value-per-call
invocations. It is simpler, does not break any existing tests, and
is less likely to be violating PostgreSQL assumptions on correct
behavior.

The commits merged here into this branch simplify PL/Java's management
of the PostgreSQL-to-PL/Java-function invocation stack, and especially
simplify the handling of SPI (PostgreSQL's Server Programming Interface)
and set-returning functions.

SPI includes "connect" and "finish" operations normally used in a simple
pattern: connect before using SPI functions, finish when done and before
returning to the caller, and if anything allocated while "connected" is
to be returned to the caller, be sure to allocate that in the "upper
executor" memory context (that is, the context that was current before
SPI_connect).

PL/Java has long diverged from that approach, especially for the case
of set-returning functions using the value-per-call protocol (the only
one PL/Java currently supports). If SPI was connected during one call
in the sequence, PL/Java has sought to save and reuse that connection
and its memory contexts over later calls (where a simpler, "by the book"
implementation would simply SPI_connect and SPI_finish within the
individual calls as needed).

It never seemed altogether clear that was a good idea, but at the same
time there weren't field reports of failure. It turns out, though, that
it is not hard to construct tests showing the apparent success was all luck.

It has not been much trouble to reorganize that code so that SPI is used
in the much simpler, by-the-book fashion. b2094ba fixes one place where
a needed switchToUpperContext was missing but the error was masked
by the former SPI juggling, and with that fixed, all the tests in
the CI script promptly passed, with SPI used in the purely nested way
that it expects.

One other piece of complexity that has been removed was the handling of
Java Invocation objects during set-returning functions. Although
the stack-allocated C invocation struct naturally lasts only through one
actual call, PL/Java's SRF code took pains to keep its Java counterpart
alive, as if the one instance represented the entire sequence of actual
calls while returning a set. Eliminating that behavior has simplified
the code and shown no adverse effect in the available tests.

As these are changes of some significance that might possibly alter
some behavior not tested here, they have not been made in the 1.6 or
1.5 branches. But the simplification seems to make a less brittle base
for the development going forward on this branch.

CacheMap is a generic class useful for (possibly weak or soft)
canonicalizing caches of things that are identified by one or more
primitive values. (Writing the key values into a ByteBuffer avoids
the allocation involved in boxing them; however, the API as it
currently stands might be exceeding that cost with instantiation
of lambdas. It should eventually be profiled, and possibly revised
into a less tidy, but more efficient, form.)
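The ByteBuffer-keying idea can be shown with an ordinary HashMap; the real CacheMap adds weak/soft canonicalization on top, and this mock ignores the lambda-instantiation cost caveat noted above.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: primitive identifiers are written into a ByteBuffer, which
// serves directly as the map key (ByteBuffer defines equals/hashCode
// over its remaining content), avoiding one boxed object per primitive.
public final class PrimitiveKeyedCache<V> {
    private final Map<ByteBuffer, V> map = new HashMap<>();

    private static ByteBuffer key(int classId, int objectId, int subId) {
        ByteBuffer b = ByteBuffer.allocate(12)
            .putInt(classId).putInt(objectId).putInt(subId);
        b.flip(); // ready for comparison/hashing over the written content
        return b;
    }

    public V computeIfAbsent(int classId, int objectId, int subId,
                             Supplier<V> maker) {
        return map.computeIfAbsent(key(classId, objectId, subId),
                                   k -> maker.get());
    }

    public static void main(String[] args) {
        PrimitiveKeyedCache<Object> cache = new PrimitiveKeyedCache<>();
        Object a = cache.computeIfAbsent(1259, 16384, 0, Object::new);
        Object b = cache.computeIfAbsent(1259, 16384, 0, Object::new);
        System.out.println(a == b); // same canonical instance: true
    }
}
```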

SwitchPointCache is intended for lazily caching numerous values
of diverse types, groups of which can be associated with a single
SwitchPoint for purposes of invalidation.

As currently structured, the SwitchPoints (and their dependent
GuardWithTest nodes) do not get stored in static final fields;
this may limit HotSpot's ability to optimize them as fully as
it could if they did.

Adapter is the abstract ancestor of all classes that implement
PostgreSQL datatypes for PL/Java, and the adt.spi package contains
classes that will be of use to datatype-implementing code:
in particular, Datum. PostgreSQL datums are only exposed
to Adapters, and the Adapter's job is to reliably convert between
the PostgreSQL type and some appropriate Java representation.

For some datatypes, there is a single or obvious appropriate Java
representation, and an Adapter may be provided that simply produces
that. For other datatypes, there may be no single obvious choice
of Java representation, either because there is no good match or
because there are several; an application might want to map types
to specialized classes available in some domain-specific library.
To serve those cases, Adapters can be defined in terms of
Adapter.Contract subinterfaces, which are simply functional interfaces
that document and expose the semantic components of the PostgreSQL
type. For example, a contract for PostgreSQL INTERVAL would expose
a 64-bit microseconds component, a 32-bit day count, and a 32-bit
month count. The division of responsibility is that the Adapter
encapsulates how to extract those components given a PostgreSQL
datum, but the contract fixes the semantics of what the components
are. It is then simple to use the Adapter, with any lambda that
conforms to the contract, to produce any desired Java representation
of the type.

Dummy versions of Attribute, RegClass, RegType, TupleDescriptor,
and TupleTableSlot break ground here on the model package, which
will consist of a set of classes modeling key PostgreSQL abstractions
and a useful subset of the PostgreSQL system catalogs.

RegType also implements java.sql.SQLType, making it usable in
(a suitable implementation of) JDBC to specify PostgreSQL types
precisely.

adt.spi.AbstractType needs the specialization() method that was
earlier added to internal.Function in anticipation of needing it
someday.

The org.postgresql.pljava.adt package contains 'contracts'
(subinterfaces of Adapter.Contract.Scalar or Adapter.Contract.Array),
which are functional interfaces that document and expose the exact
semantic components of PostgreSQL data types.

Adapters are responsible for the internal details of PostgreSQL's
representation that aren't semantically important, and code that
simply needs to construct some semantically faithful representation
of the type only needs to be concerned with the contract.
CharsetEncoding is not really a catalog object (the available
encodings in PostgreSQL are hardcoded) but is exposed here as
a similar kind of object with useful operations, including
encoding and decoding using the corresponding Java codec when
known.

CatalogObject is, of course, the superinterface of all things
that really are catalog objects (identified by a classId, an objectId,
and rarely a subId). This commit brings in RegNamespace and RegRole
as needed for CatalogObject.Namespaced and CatalogObject.Owned.
RolePrincipal is a bridge between a RegRole and Java's Principal
interface.

CatalogObject.Factory is a service interface 'used' by the API
module, and will be 'provided' by the internals module to supply
the implementations of these things.
And convert other code to use CharsetEncoding.SERVER_ENCODING
where earlier hacks were used, like the implServerCharset()
added to Session in 1.5.1.

In passing, fix a bit of overlooked java7ification in SQLXMLImpl.

The new CharsetEncodings example provides two functions:

SELECT * FROM javatest.charsets();

returns a table of the available PostgreSQL encodings, and what Java
encodings they could be matched up with.

SELECT * FROM javatest.java_charsets(try_aliases);

returns the table of all available Java charsets and the PostgreSQL ones
they could be matched up with, where the boolean try_aliases indicates
whether to try Java's known aliases for a charset when nothing in
PostgreSQL matched its canonical name. False matches happen when
try_aliases is true, so that's not a great idea.
These PostgreSQL notions will have to be available to Java code
for two reasons.

First, even code that has no business poking at them can still need
to know which one is current, to set an appropriate lifetime on
a Java object that corresponds to something in PostgreSQL allocated
in that context or registered to that owner. For that purpose, they
both will be exposed as subtypes of Lifespan, and the existing
PL/Java DualState class will be reworked to accept any Lifespan to
bound the validity of the native state.

Second, Adapter code could very well need to poke at such objects
(MemoryContexts, anyway): either to make a selected one current for
when allocating some object, or even to create and manage one.
Methods for that will not be exposed on MemoryContext or ResourceOwner
proper, but could be protected methods of Adapter, so that only
an Adapter can use them.
In addition to MemoryContextImpl and ResourceOwnerImpl proper, this step
will require reworking DualState so state lives are bounded by Lifespan
instances instead of arbitrary pointer values. Invocation will be made
into yet another subtype of Lifespan, appropriate for the life of an
object passed by PostgreSQL in a call and presumed good while the call
is in progress.

The DualState change will have to be rototilled through all of its
clients. That will take the next several commits.

The DualState.Key requirement that was introduced in 1.5.1 as a way to
force DualState-guarded objects to be constructed only in upcalls from C
(as a hedge against Java code inadvertently doing it on the wrong
thread) will go away. We *want* Adapters to be able to easily construct
things without leaving Java. Just don't do it on the wrong thread.
Though never very well publicized upstream, the examples of plpgsql,
plperl, and plpython show that a caller of BeginInternalSubTransaction
is expected to follow a certain pattern of saving and restoring the
memory context and resource owner, which PL/Java has not been doing.

Now it is easy to implement that.

https://www.postgresql.org/message-id/619EA06D.9070806%40anastigmatix.net
The current invocation can be the right Lifespan to specify for
a DualState that's guarding some object PostgreSQL passed in to
the call, which is expected to be good for as long as the call
is in progress.

In other, but related, news, Invocation can now return the
"upper executor" memory context: that is, whatever context was
current at entry, even if a later use of SPI changes the context
that is current.

It can appear tempting to eliminate the special treatment of PgSavepoint
in Invocation, and simply make it another DualState client, but because
of the strict nesting imposed on savepoints, keeping just the one
reference to the first one set suffices, and is more efficient.
Simplify these: their C callers were passing unconditional null
as the ResourceOwner before, which their Java constructors passed
along unchanged. Now just have the Java constructor pass null
as the Lifespan.
These DualState clients were previously passing the address of
the current invocation struct as their "resource owner", again from
the C code, passed along by the Java constructor. Again simplify
to call Invocation.current() right in the Java constructor and use
that as the Lifespan.

On a side note, the legacy Relation class included here (and its
legacy Tuple and TupleDesc) will naturally be among the first
candidates for retirement when this new model API is ready.
This legacy Portal class is called from C and passed the address
of the PostgreSQL ResourceOwner associated with the Portal itself.
This is only an intermediate refactoring of VarlenaWrapper.
Construction of one is still set in motion from C. Ultimately,
it should implement Datum and be something that a Datum.Accessor
can construct with a minimum of fuss.
The DualState.Key cookie was originally a hedge against coding mistakes
during the introduction of DualState for 1.5.1 (which had to support
Java < 9). It is less necessary now that the internals are behind JPMS
encapsulation, and the former checks for the cookie can be replaced
with assertions that the action is happening on the right thread. The
CI tests run with
The commits grouped under this merge add API to expose in Java
the PostgreSQL notions of MemoryContext and ResourceOwner, and then
rework PL/Java's DualState class (which manages objects that combine
some Java state and some native state, and may need specified actions
to occur if the Java state becomes unreachable or explicitly released
or if a lifespan bounding the native state expires). A DualState now
accepts a Lifespan, of which MemoryContext and ResourceOwner are both
subtypes. So is Invocation, an obvious lifespan for things PostgreSQL
passes in that are expected to be valid for the duration of the call.

The remaining commits in this group propagate the changes through
the affected legacy code.
Fitting it into the new scheme is not entirely completed here;
for example, newReadable takes a Datum.Input parameter, but still
casts it internally to VarlenaWrapper.Input. Making it interoperate
with any Datum.Input may be a bit more work.

Likewise, newReadable with synthetic=true still encapsulates all
the knowledge of what datatypes there is synthetic-XML coverage
for and selecting the right VarlenaXMLRenderer for it (there's
that varlena-specificity again!). More of that should be moved
out of here and into an Adapter.

In passing, fix a couple typos in toString() methods, and add
a serviceable, if brute-force, getString() method to Synthetic.
It would be better for SyntheticXMLReader to gain the ability to
produce character-stream output efficiently, but until that
happens, there needs to be something for those moments when you
just want a string to look at and shouldn't have to fuss to get it.

For now, VarlenaWrapper.Input and .Stream still extend, and add small
features like toString(Object) to, DatumImpl. Later work can probably
migrate those bits so VarlenaWrapper will only contain logic specific
to varlenas.

An adt.spi interface Verifier is added, though Datum doesn't yet
expose any way to use it; in this commit, only one method accepting
Verifier.OfStream is added in DatumImpl.Input.Stream, the minimal
change needed to get things working.
As before, JNI methods for this 'model' framework continue to
be grouped together in ModelUtils.c; their total number and
complexity are expected to be low enough for that to be practical,
and then they can all be seen in one place.

RegClassImpl and RegTypeImpl acquire m_tupDescHolder arrays in
this commit, without much explanation; that will come a few commits
later.
There are two flavors so far, Deformed and Heap. Deformed works
with whatever a real PostgreSQL TupleTableSlot can work with,
relying on the PostgreSQL implementation to 'deform' it into
separate datum and isnull arrays. (That doesn't have to be a
PostgreSQL 'virtual' TupleTableSlot; it can do the deforming
independently of the type of slot. When the time comes to
implement the reverse direction and produce tuples, a virtual
slot will be the way to go for that, using the PostgreSQL C code
to 'form' it once populated.)

The Heap flavor knows enough about that PostgreSQL tuple format
to 'deform' it in Java without the JNI calls (except where some
out-of-line value has to be mapped, or for varlena values until
VarlenaWrapper sheds more of its remaining JNI-centricity). The
Heap implementation does not yet do anything clever to memoize
the offsets into the tuple, which makes the retrieval of all
the tuple's values an O(n^2) proposition; there is a
low-hanging-fruit optimization opportunity there. For now, it gets
the job done.

It might be interesting to see how the two flavors compare on
typical heap tuples: Deformed, making more JNI calls but relying
on PostgreSQL's fast native deforming, or Heap, which can avoid
more JNI calls, and also avoids deforming something into a fresh
native memory allocation if the only thing it will be used for is
to immediately construct some Java object.

The Heap flavor can do one thing the Deformed flavor definitely
cannot: it can operate on heap-tuple-formatted contents of an
arbitrary Java byte buffer, which in theory might not even be
backed by native memory. (Again, for now, this is slightly science
fiction where varlena values are concerned, because VarlenaWrapper
retains a lot of its native dependencies. A ByteBuffer "heap tuple"
with varlenas in it will have to be native-backed for now.) The
selection of the DualState guard by heapTupleGetLightSlot() is
currently more hardcoded than that would suggest; it assumes the
buffer is mapping memory that can be heap_free_tuple'd.

The 'light' in heapTupleGetLightSlot really means that there isn't
an underlying PostgreSQL TupleTableSlot constructed.

The whole business of how to apply and use DualState guards on these
things still needs more attention.

There is also Heap.Indexed, which is the thing needed for arrays.
When the element type is fixed-length, it achieves O(1) access
(plus null-bitmap processing if there are nulls). It uses a "count
preceding null bits ahead of time" strategy that could also easily
be adopted in Heap.
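The "count preceding null bits" strategy can be sketched in isolation. Nothing here is PL/Java's actual code; it only shows how a PostgreSQL-style null bitmap (a set bit meaning the value is present) lets the offset of a fixed-length element be computed after a cheap popcount pass.

```java
// Sketch only: how "count preceding null bits ahead of time" gives O(1)
// element offsets for fixed-length types; not PL/Java's actual code.
public final class NullBitmap {
    private final byte[] bits; // PostgreSQL convention: bit i set => element i not null

    public NullBitmap(byte[] bits) { this.bits = bits; }

    public boolean isNull(int i) {
        return (bits[i >> 3] & (1 << (i & 7))) == 0;
    }

    /** Number of null elements with index strictly less than i. */
    public int nullsBefore(int i) {
        int nulls = 0;
        int fullBytes = i >> 3;
        for (int b = 0; b < fullBytes; b++)
            nulls += 8 - Integer.bitCount(bits[b] & 0xFF);
        int rem = i & 7;
        if (rem != 0)
            nulls += rem - Integer.bitCount(bits[fullBytes] & ((1 << rem) - 1));
        return nulls;
    }

    /** Byte offset of fixed-length element i (nulls occupy no storage). */
    public int offsetOf(int i, int elemLength) {
        return (i - nullsBefore(i)) * elemLength;
    }

    public static void main(String[] args) {
        // 0xF5 = 1111_0101: elements 1 and 3 are null
        NullBitmap bm = new NullBitmap(new byte[] { (byte) 0xF5 });
        System.out.println(bm.offsetOf(4, 8)); // 16: two non-null elements precede index 4
    }
}
```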

A NullableDatum flavor is also needed, which would be the thing for
mapping (as one prominent example) function-call arguments.

The HeapTuples8 and HeapTuples4 classes at the end are scaffolding
and ought to be factored out into something with a decent API, as
hinted at in the comment preceding them.

A Heap instance still inherits the values/nulls array fields used
in the deformed case, without (at present) making any use of them.
It is possible some use could be made (as, again, an underlying PG
TupleTableSlot could be used in deforming a heap tuple), but it's
also possible that won't ever be needed, and the class could be
refactored to a simpler form.
Here's how this is going to work.

The "exists because mentioned" aspect of a CatalogObject is
a lightweight operation, just caching/returning a singleton with
the mentioned values of classId/objId/(subId?).

For a bare CatalogObject (objId unaccompanied by classId), that's
all there is. But for any CatalogObject.Addressed subtype, the
classId and objId together identify a tuple in a particular system
catalog (or, that is, identify a tuple that could exist in that
catalog). And the methods on the Java class that return information
about the object get the information by fetching attributes from
that tuple, then constructing whatever the Java representation
will be.

Not to duplicate the work of fetching (the tuple itself, and then
an attribute from the tuple) and constructing the Java result, an
instance will have an array of SwitchPointCache-managed "slots"
that will cache, lazily, the constructed results. Five of those
slots have their indices standardized right here in CatalogObjectImpl,
to account for the name, namespace, owner, and ACL of objects that
have those things. Slot 0 is for the tuple itself.

When an uncached value is requested, the "computation method" set up
for that slot will execute (always on the PG thread, so it can
interact with PostgreSQL with no extra ceremony). Most computation
methods will begin by calling cacheTuple() to obtain the tuple
itself from slot 0, and then will fetch the wanted attribute from it
and construct the result. The computation method for cacheTuple(),
in turn, will obtain the tuple if that hasn't happened yet, usually
from the PostgreSQL syscache. We copy it to a long-lived memory
context where we can keep it until its invalidation.

The most common way the cacheTuple is fetched is by a one-argument
syscache search by the object's Oid. When that is all that is needed,
the Java class need only implement cacheId() to return the number
of the PostgreSQL syscache to search in. For exceptional cases
(attributes, for example, require a two-argument syscache search),
a class should just provide its own cacheTuple computation method.

The slots for an object are associated with a Java SwitchPoint,
and the mapping from the object to its associated SwitchPoint
is a function supplied to the SwitchPointCache.Builder. Some
classes, such as RegClass and RegType, will allocate a SwitchPoint
per object, and can be selectively invalidated. Otherwise, by
default, the s_globalPoint declared here can be used, which will
invalidate all values of all slots depending on it.
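The JDK mechanism underneath SwitchPointCache can be seen in miniature (the cache itself builds far more machinery around it): a handle guarded by a SwitchPoint keeps invoking the cached path until the point is invalidated, after which every call falls through to the recomputation path. The class and method names below are invented for illustration.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class GuardDemo {
    static String cachedValue() { return "cached"; }
    static String recompute()   { return "recomputed"; }

    static String[] run() throws Throwable {
        MethodHandles.Lookup l = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(String.class);
        MethodHandle cached = l.findStatic(GuardDemo.class, "cachedValue", mt);
        MethodHandle fresh  = l.findStatic(GuardDemo.class, "recompute", mt);

        SwitchPoint sp = new SwitchPoint();
        // The guarded handle behaves as 'cached' until sp is invalidated.
        MethodHandle slot = sp.guardWithTest(cached, fresh);

        String before = (String) slot.invokeExact();
        SwitchPoint.invalidateAll(new SwitchPoint[] { sp });
        String after = (String) slot.invokeExact();
        return new String[] { before, after };
    }

    public static void main(String[] args) throws Throwable {
        String[] r = run();
        System.out.println(r[0] + " -> " + r[1]); // cached -> recomputed
    }
}
```

Invalidation is one-way: a spent SwitchPoint never guards again, which is why a cache must install a fresh SwitchPoint when it wants values recomputed lazily after an invalidation.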
RegClass and RegType are the two CatalogObjects with tupleDescriptor() methods.

You can get strictly more tuple descriptors by asking RegType;
a RegType.Blessed can give you a tuple descriptor that has been
interned in the PostgreSQL typcache and corresponds to nothing
in the system catalogs. But whenever a RegType t is an ordinary
cataloged composite type or the row type of a cataloged relation,
then there is a RegClass c such that c == t.relation() and
t == c.type(), and you will get the same tuple descriptor from
the tupleDescriptor() method of either c or t.

In all but one such case, c delegates to c.type().tupleDescriptor()
and lets the RegType do the work, obtaining the descriptor from
the PG typcache.

The one exception is when the tuple descriptor for pg_class itself
is wanted, in which case the RegClass does the work, obtaining the
descriptor from the PG relcache, and RegType delegates to it for
that one exceptional case. The reason is that RegClass will see
the first request for the pg_class tuple descriptor, and before that
is available, c.type() can't be evaluated.

In either case, whichever class looked it up, a cataloged tuple
descriptor is always stored on the RegClass instance, and RegClass
will be responsible for its invalidation if the relation is altered.
(A RegType.Blessed has its own field for its tuple descriptor,
because there is no corresponding RegClass for one of those.)

Because of this close connection between RegClass and RegType,
the methods RegClass.type() and RegType.relation() use a handshake
protocol to ensure that, whenever either method is called, not only
does it cache the result, but its counterpart for that result instance
caches the reverse result, so the connection can later be traversed
in either direction with no need for a lookup by oid.

In the static initializer pattern introduced here, the handful of
SwitchPointCache slots that are predefined in CatalogObject.Addressed
are added to, by starting an int index at Addressed.NSLOTS,
incrementing it to initialize additional slot index constants, then
using its final value to define a new NSLOTS that shadows the original.
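A minimal sketch of that initializer pattern, with invented class and slot names (the real classes hold much more):

```java
// Invented names; illustrates only the index-shadowing pattern described above.
class Addressed {
    static final int SLOT_TUPLE = 0, SLOT_NAME = 1, SLOT_NAMESPACE = 2,
                     SLOT_OWNER = 3, SLOT_ACL = 4;
    static final int NSLOTS = 5;
}

class SubCatalogImpl extends Addressed {
    static final int SLOT_TUPDESC;
    static final int SLOT_LENGTH;
    static final int NSLOTS; // shadows Addressed.NSLOTS for this class

    static {
        int i = Addressed.NSLOTS; // start where the superclass left off
        SLOT_TUPDESC = i++;
        SLOT_LENGTH  = i++;
        NSLOTS       = i;         // this class's slot array is sized NSLOTS
    }

    public static void main(String[] args) {
        System.out.println(SubCatalogImpl.NSLOTS); // 7
    }
}
```

A further subclass can repeat the trick, starting its own index at SubCatalogImpl.NSLOTS.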
An Attribute is most often obtained from a TupleDescriptor
(in this API, that's how it's done), and the TupleDescriptor
can supply a version of Attribute's tuple directly; no need
to look it up anywhere else. That copy, however, cuts off
at ATTRIBUTE_FIXED_PART_SIZE bytes. The most commonly needed
attributes of Attribute are found there, but for others beyond
that cutoff, the full tuple has to be fetched from the syscache.

So AttributeImpl has the normal SLOT_TUPLE slot, used for the
rarely-needed full tuple, and also its own SLOT_PARTIALTUPLE,
for the truncated version obtained from the containing tuple
descriptor. Most computation methods will fetch from the partial
one, with the full one referred to only by the ones that need it.

It doesn't end there. A few critical Attribute properties, byValue,
alignment, length, and type/typmod, are needed to successfully fetch
values from a TupleTableSlotImpl.Heap. So Attribute cannot use that
API to fetch those values. For those, it must hardcode their actual
offsets and sizes in the raw ByteBuffer that the containing tuple
descriptor supplies, and fetch them directly. So there is also
a SLOT_RAWBUFFER.

This may sound more costly in space than it is. The raw buffer,
of course, is just a ByteBuffer sliced off and sharing the larger
one in the TupleDescriptor, and the partial tuple is just a
TupleTableSlot instance built over that. The full tuple is another
complete copy, but only fetched when those less-commonly-needed
attributes are requested.

With those key values obtained from the raw buffer, the Attribute's
name does not require any such contortions, and can be fetched using
the civilized TupleTableSlot API, except it can't be done by name,
so the attribute number is used for that one.

An AttributeImpl.Transient holds a direct reference to
the TupleDescriptor it came from, which its containingTupleDescriptor()
method returns. An AttributeImpl.Cataloged does not, and instead holds
a reference to the RegClass for which it is defined in the system
catalogs, and containingTupleDescriptor() delegates to tupleDescriptor()
on that. If the relation has been altered, that could return an updated
new tuple descriptor.
RegClass is an easy choice, because those invalidations are also
the invalidations of TupleDescriptors, and because it has a nice
API; we are passed the oid of the relation to invalidate, so we
acquire the target in O(1).

(Note in passing: AttributeImpl is built on SwitchPointCache in
the pattern that's emerged for CatalogObjects in general, and an
AttributeImpl.Cataloged uses the SwitchPoint of the RegClass, so
it's clear that all the attributes of the associated tuple
descriptor will do the right thing upon invalidation. In contrast,
TupleDescriptorImpl itself isn't quite built that way, and the
question of just how a TupleDescriptor itself should act after
invalidation hasn't been fully nailed down yet.)

RegType is probably also worth invalidating selectively, as is
probably RegProcedure (procedures are mainly what we're about
in PL/Java, right?), though only RegType is done here.

That API is less convenient; we are passed not the oid but a hash
of the oid, and not the hash that Java uses. The solution here is
brute force, to get an initial working implementation. There are
plenty of opportunities for optimization.

One idea would be to use a subclass of SwitchPoint that would set
a flag, or invoke a Runnable, the first time its guardWithTest
method is called. If that hasn't happened, there is nothing to
invalidate. The Runnable could add the containing object into some
data structure more easily searched by the supplied hash. Transitions
of the data structure between empty and not-empty could be propagated
to a boolean in native memory, where the C callback code could avoid
the Java upcall entirely if there is nothing to do. This commit
contains none of those optimizations.
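The "subclass of SwitchPoint" idea could look roughly like this; none of it is in the commit, and all the names are invented:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.SwitchPoint;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: a SwitchPoint that runs a Runnable the first time it
// actually guards anything, so invalidation work can be skipped for
// points on which nothing yet depends.
public class NotingSwitchPoint extends SwitchPoint {
    private final AtomicBoolean used = new AtomicBoolean();
    private final Runnable onFirstUse;

    public NotingSwitchPoint(Runnable onFirstUse) { this.onFirstUse = onFirstUse; }

    @Override
    public MethodHandle guardWithTest(MethodHandle target, MethodHandle fallback) {
        if (used.compareAndSet(false, true))
            onFirstUse.run(); // e.g. index this point in a hash-searchable structure
        return super.guardWithTest(target, fallback);
    }

    /** If false, a C-side callback could skip the Java upcall entirely. */
    public boolean anythingToInvalidate() { return used.get(); }

    public static void main(String[] args) {
        NotingSwitchPoint sp =
            new NotingSwitchPoint(() -> System.out.println("first use"));
        System.out.println(sp.anythingToInvalidate()); // false
        sp.guardWithTest(MethodHandles.constant(int.class, 1),
                         MethodHandles.constant(int.class, 2));
        System.out.println(sp.anythingToInvalidate()); // true
    }
}
```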

Factory.invalidateType might be misnamed; it could be syscacheInvalidate
and take the syscache id as another parameter, and then dispatch to
invalidating a RegType or RegProcedure or what have you, as the case
may be.

At least, that would be a more concise implementation than providing
separate Java methods and having the C callback decide which to call.
But if some later optimization is tracking anything-to-invalidate?
separately for them, then the C code might be the efficient place
for the check to be done.

PostgreSQL has a limited number of slots for invalidation callbacks,
and requires a separate registration (using another slot) for each
syscache id for which callbacks are wanted (even though you get
the affected syscache id in the callback?!). It would be antisocial
to grab one for every sort of CatalogObject supported here, so we
will have many relying on CatalogObject.Addressed.s_globalPoint
and some strategy for zapping that every so often. That is not
included in this commit. (The globalPoint exists, but there is
not yet anything that ever zaps it.)

Some imperfect strategy that isn't guaranteed conservative might
be necessary, and might be tolerable (PL/Java has existed for years
with less attention to invalidation). An early idea was to zap the
globalPoint on every transaction or subtransaction boundary, or when
the command counter has been incremented; those are times when
PostgreSQL processes invalidations. However, invalidations are also
processed any time locks are acquired, and that doesn't sound as if
it would be practical to intercept (or as if the resulting behavior
would be practical, even if it could be done).

Another solution approach would just be to expose a zapGlobalPoint
knob as API; if some code wants to be sure it is not seeing something
stale (in any CatalogObject we aren't doing selective invalidation for),
it can just say so before fetching it.
jcflack added 2 commits April 25, 2025 18:01
This DualState subclass used to free the associated tuple table in the
javaStateUnreachable lifespan event; now, only at javaStateReleased.
It turns out that SPI_freetuptable, since postgres/postgres@3d13623, has
contained code to raise a warning if the tuple table being freed does not
belong to the current SPI connection. With the earlier javaStateUnreachable
handling, that warning could be triggered on rare and irksome occasions
when Java's GC happened to find, during a nested invocation of some Java
function, that a tuple table from an outer invocation had become
unreachable.

It would be conceivable to have javaStateUnreachable try to determine
if the current nest level matches that of the tuple table's creation,
and free it if so at least, otherwise leaking it to the exit of the outer
call. But for now it's also conceivable to just do nothing and let
the context reset at invocation exit mop things up.

jcflack commented Apr 26, 2025

A PL/Java-based language can handle columns/expressions of concrete type anyarray

The PostgreSQL type ANYARRAY, normally a polymorphic type that would only be seen in a routine's inputsTemplate or outputsTemplate prior to resolution at an actual call site, can in very particular circumstances be seen even after resolution, in a routine's inputsDescriptor or outputsDescriptor. It is not normally possible to declare uses of ANYARRAY as a concrete type, but certain columns in PostgreSQL-supplied statistics-related catalog tables are declared that way, and the type will be seen for those columns or expressions involving them.

Such a column will always hold an array, but different rows may hold arrays of different element types. A method on Adapter.Array, elementType(), will supply an Adapter that produces the element type of an ANYARRAY-typed array. Once the element type is known, a suitable Adapter for that type can be chosen, and used to construct an array adapter for access to the array's content.

Dispatcher now supports languages implementing TRANSFORM FOR TYPE

If a PL's implementing class does not implement the UsingTransforms interface, the dispatcher will automatically reject routine declarations (at validation time, or at call time in case such a routine got created while validation couldn't happen) that include TRANSFORM FOR TYPE. The PL implementation does not have to concern itself with that, and this avoids the case where PostgreSQL allows routine declarations with TRANSFORM FOR TYPE for transform-unaware PLs where they will have no effect.

If a PL does implement UsingTransforms, the dispatcher will make sure that any Transform mentioned in a routine declaration for that language satisfies the language's essentialTransformChecks method. Because the fromSQL and toSQL functions of a transform have similar signatures in SQL to other functions that aren't transform functions at all, and PostgreSQL does not prevent CREATE TRANSFORM naming inappropriate functions, the essentialTransformChecks method should make a diligent effort to ensure that any proposed Transform has fromSQL/toSQL functions the PL will be able to use.

Implementing that UsingTransforms method is only the start of the PL handler's job. The handler is also responsible for the entirety of whatever that PL will do to accomplish the results of the transforms. It will probably begin, in its prepare method, by consulting the memo's transforms() method to learn what transforms, if any, should be applied.

The Glot64 example language handler now contains example code involving transforms.

jcflack added 5 commits May 9, 2025 16:03
An implementation of PLJavaBasedLanguage may also implement
ReturningSets. If it does, its prepareSRF method, not the usual prepare,
will be used when the target RegProcedure's returnsSet() is true.

The prepareSRF method must return an SRFTemplate. Further, what it
returns must also implement one of SRFTemplate's member subinterfaces,
ValuePerCall or Materialize (PostgreSQL might add additional options in
the future).

The base interface, SRFTemplate, has an abstract 'negotiate' method, to
be passed a list of the subinterfaces the caller is prepared to accept,
and return the index of one from the list that the routine will use.
Each subinterface has a default implementation that will find itself in
the caller's list. A class that implements more than one of the
subinterfaces will inherit conflicting defaults, and therefore have to
provide its own 'negotiate' implementation.

The list of interfaces acceptable to the caller is ordered so as to
reflect the caller's preference. A simple negotiate method could return
the index of the first interface in the list that this SRFTemplate
happens to implement. A more sophisticated one might take properties of
the prepared routine into account.
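A rough shape for that negotiation, with all signatures invented for illustration (the commit's actual SRFTemplate API may differ):

```java
import java.util.List;

// Invented signatures; illustrates only the negotiation scheme described above.
interface SRFTemplate {
    int negotiate(List<Class<? extends SRFTemplate>> acceptable);

    interface ValuePerCall extends SRFTemplate {
        @Override
        default int negotiate(List<Class<? extends SRFTemplate>> acceptable) {
            return acceptable.indexOf(ValuePerCall.class); // finds itself in the list
        }
    }

    interface Materialize extends SRFTemplate {
        @Override
        default int negotiate(List<Class<? extends SRFTemplate>> acceptable) {
            return acceptable.indexOf(Materialize.class);
        }
    }
}

// Implements both subinterfaces, so the conflicting defaults force an
// override: here, pick the first (most caller-preferred) interface that
// this template implements.
class FlexibleTemplate implements SRFTemplate.ValuePerCall, SRFTemplate.Materialize {
    @Override
    public int negotiate(List<Class<? extends SRFTemplate>> acceptable) {
        for (int i = 0; i < acceptable.size(); i++)
            if (acceptable.get(i).isInstance(this))
                return i;
        return -1;
    }

    public static void main(String[] args) {
        FlexibleTemplate t = new FlexibleTemplate();
        System.out.println(t.negotiate(
            List.of(SRFTemplate.Materialize.class, SRFTemplate.ValuePerCall.class))); // 0
    }
}
```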

The ValuePerCall interface specifies a specializeValuePerCall method
that is expected to return an SRFFirst. SRFFirst has a firstCall method
that should return an SRFNext instance. SRFNext has a nextResult method
that will be called as many times as necessary. It should use
fcinfo->result / fcinfo->isNull to store result values for one row, like
any non-set-returning function, but return SINGLE, MULTIPLE, or END to
indicate whether another call is expected. SRFNext implements
AutoCloseable, and its close() method will be called after the last call
made to nextResult (which may happen before nextResult returns END, if
PostgreSQL does not need all the results). The case where nextResult
returns SINGLE is an exception, treated as returning only that one row,
and close() will not be called.
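A toy driver for that call sequence: the names SRFNext, Result, and nextResult follow the text, but everything else, including a List standing in for fcinfo->result, is invented.

```java
import java.util.ArrayList;
import java.util.List;

public class ValuePerCallDemo {
    enum Result { SINGLE, MULTIPLE, END }

    interface SRFNext extends AutoCloseable {
        Result nextResult(List<Integer> out); // stores at most one row per call
        @Override void close();               // called after the last nextResult call
    }

    /** Produces rows 0..n-1, one per call, then END with no row stored. */
    static SRFNext countTo(int n) {
        return new SRFNext() {
            int next = 0;
            public Result nextResult(List<Integer> out) {
                if (next >= n)
                    return Result.END;
                out.add(next++);
                return Result.MULTIPLE; // another call is expected
            }
            public void close() { /* release per-call state */ }
        };
    }

    /** The caller's side: call nextResult until not MULTIPLE, then close(). */
    static List<Integer> collect(SRFNext srf) {
        List<Integer> rows = new ArrayList<>();
        try (srf) {
            while (srf.nextResult(rows) == Result.MULTIPLE)
                ; // keep calling
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(collect(countTo(3))); // [0, 1, 2]
    }
}
```

The SINGLE case (one row, no close()) is not exercised here; a real dispatcher would need to distinguish it from MULTIPLE before entering the loop.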

The Materialize interface will likewise specify a specializeMaterialize
method, but the details are TBD, so at this stage of the API the
Materialize interface is a stub and does not specify any usable behavior
yet.
CleanupTracker is only used when assertions are enabled, and checks
that entries and exits of 'cleanup' loops (cleanEnqueuedInstances for
Java-released or unreachable instances, nativeRelease for expired
native lifespans) are properly paired. It originally prohibited
any more than one cleanup loop being in progress at any time, and
was relaxed in 95d4133 to allow one of each type, assuming there
would be no good reason for such a loop to be reentered.

The generalization in 5a9cdf9 from "resource owner" to Lifespan, with
a possibly-extensible set of Lifespan objects each heading its own list
of dependents, introduced a realistic possibility that some class that
is itself bounded by a Lifespan could also be a Lifespan for other
classes.

A practical application arises in modeling the PostgreSQL ExprContext.
The chief (perhaps only) use of ExprContext in PL/Java is, via its
callback, to signal when no more output is needed from a ValuePerCall
set-returning function that may not have been read to the end. This is
a use of ExprContext as a Lifespan to bound an SRFRoutine.

PostgreSQL's ExprContext, however, does not invoke its callbacks if
error cleanup is afoot, which (ironically enough) would leave its Java
mirror un-cleaned-up in error cases. That can be addressed by also modeling
that ExprContext itself has a lifespan bounded by its per-query memory
context. So ExprContext both has a Lifespan, and is a Lifespan, and in
error cases where the lifespanRelease of the memory context cascades to
lifespanRelease of the ExprContext, the test made here by CleanupTracker
was too restrictive.

The pattern of a Lifespan with a Lifespan does not seem likely to become
so widespread as to cause frequent or deep reentry of lifespanRelease, and
the present example seems a legitimate and reasonable case, so relax the
CleanupTracker assertion to allow it.

In a related change, don't let exceptions abort lifespanRelease:
cleanEnqueuedInstances was already swallowing exceptions (citing JDK 9
Cleaner as precedent) so they would not prevent processing of later
queue entries, but lifespanRelease did not do the same.  Now it does.

It might be better one day to collect exceptions (perhaps as a suppressed
list) to report after the loop.
ExprContextImpl is in the same o.p.p.pg package as implementations of
API-exposed interfaces in o.p.p.model, but there may be no need for any
such API interface, so this is purely for use in the internal module
for now.

For PL/Java's purposes, the chief use of ExprContext is, via its
callback, to be usable as the Lifespan of a DualState instance
associated with the row-collecting activity of a set-returning function
in ValuePerCall mode, whose nativeStateReleased event can trigger
calling the close() method (and resetting internal dispatcher state)
when PostgreSQL collects fewer rows than the function intends. This use
is all internal to the dispatcher.

A PostgreSQL ExprContext carries other information of interest, such as
the per-query memory context, but there is no need to fastidiously
follow PostgreSQL by having the accessor method for that on ExprContext.
As that memory context is needed before the very first call on
a set-returning function, before it is even known whether a Java mirror
of the ExprContext will need to be constructed, it will be better to
fetch that memory context eagerly and put an accessor for it on
ReturnSetInfoImpl instead.

With that in mind, this model class provides no accessor methods at all,
and simply exists to be used as a Lifespan.
Implementation of Materialize mode has to wait, as the API interface
SRFTemplate.Materialize is still only a stub with details TBD.

As contemplated in the implementation of ExprContextImpl, the per-query
memory context is here made available with an accessor method directly
on ReturnSetInfoImpl. It is passed eagerly up by the C dispatch code
through the Java entry points to be readily available, as it will
certainly be needed when any set-returning function is to be called.
As long as the store direction of TupleTableSlot remains unimplemented,
null / void / zero are still the only values any Glot64 function can
return. But now it can return sets of them!

For a Glot64 set-returning function, there are stricter limits on the
source string. It must be a base64 string that, when 'compiled' (i.e.,
decoded), is the string representation of a decimal integer.  The
function so defined will return that many rows (of null / void / zero),
ignoring any arguments.

If the source string 'compiles' to a negative integer, a single row is
returned, exercising the SRFNext.Result.SINGLE case.
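The source-string rule above can be sketched in plain Java. The helper rowsFor below is hypothetical, not part of the Glot64 handler; it just applies the same decode-then-parse interpretation described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Glot64SrfSource
{
	/*
	 * Hypothetical helper mirroring the rule described above: a Glot64
	 * SRF source string must be base64 that 'compiles' (decodes) to the
	 * string form of a decimal integer, the number of rows to return.
	 * A negative count exercises the SRFNext.Result.SINGLE case.
	 */
	static int rowsFor(String source)
	{
		String decoded = new String(
			Base64.getDecoder().decode(source), StandardCharsets.US_ASCII);
		return Integer.parseInt(decoded.trim());
	}

	public static void main(String[] args)
	{
		System.out.println(rowsFor("Mw=="));  // base64 of "3": three rows
		System.out.println(rowsFor("LTE=")); // base64 of "-1": SINGLE case
	}
}
```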
@jcflack
Contributor Author

jcflack commented May 9, 2025

How a PL/Java-based language supports set-returning functions

An implementation of PLJavaBasedLanguage may also implement ReturningSets. If it does, its prepareSRF method, not the usual prepare, will be used when the target RegProcedure's returnsSet() is true. (If a PL/Java-based language does not implement ReturningSets, PL/Java's dispatcher will reject any such RegProcedure at validation and at dispatch time, so a language that does not intend to support set return does not have to concern itself with those details.)

The prepareSRF method must return an SRFTemplate. Further, what it returns must also implement one or more of SRFTemplate's member subinterfaces, ValuePerCall or Materialize (PostgreSQL might add additional options in the future).

The base interface, SRFTemplate, has an abstract negotiate method, to be passed a list of the subinterfaces the caller is prepared to accept, and to return the index of the one from the list that the routine will use. Each subinterface has a default implementation that finds its own interface in the caller's list. A class that implements more than one of the subinterfaces inherits conflicting defaults, and therefore must provide its own negotiate implementation.

The list of interfaces acceptable to the caller is ordered so as to reflect the caller's preference. A simple negotiate method could return the index of the first interface in the list that this SRFTemplate happens to implement. A more sophisticated one might take properties of the prepared routine into account.
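A minimal sketch of that negotiation, using stand-in interfaces; the real signatures in the PL/Java API may differ, so everything here is an assumption for illustration only.

```java
import java.util.List;

public class NegotiateSketch
{
	// Stand-ins for the PL/Java API types described above; the
	// negotiate signature here is an assumption, not the real API.
	interface SRFTemplate
	{
		int negotiate(List<Class<? extends SRFTemplate>> accepted);
	}
	interface ValuePerCall extends SRFTemplate { }
	interface Materialize  extends SRFTemplate { }

	/*
	 * A template implementing both subinterfaces inherits conflicting
	 * default negotiate methods, so it supplies its own: the simple
	 * strategy of returning the index of the first interface in the
	 * caller's (preference-ordered) list that it implements.
	 */
	static class BothModes implements ValuePerCall, Materialize
	{
		public int negotiate(List<Class<? extends SRFTemplate>> accepted)
		{
			for ( int i = 0 ; i < accepted.size() ; ++ i )
				if ( accepted.get(i).isInstance(this) )
					return i;
			return -1; // nothing acceptable to both parties
		}
	}

	public static void main(String[] args)
	{
		SRFTemplate t = new BothModes();
		// caller prefers Materialize over ValuePerCall
		System.out.println(
			t.negotiate(List.of(Materialize.class, ValuePerCall.class)));
	}
}
```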

The ValuePerCall interface specifies a specializeValuePerCall method that is expected to return an SRFFirst. SRFFirst has a firstCall method that should return an SRFNext instance. SRFNext has a nextResult method that will be called as many times as necessary. It should use fcinfo.result / fcinfo.isNull to store result values for one row, like any non-set-returning function, but return SINGLE, MULTIPLE, or END to indicate whether another call is expected. SRFNext implements AutoCloseable, and its close() method will be called after the last call made to nextResult (which may happen before nextResult returns END, if PostgreSQL does not need all the results). The case where nextResult returns SINGLE is an exception, treated as returning only that one row, and close() will not be called.
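The ValuePerCall flow can be modeled in miniature. The real nextResult stores each row through fcinfo.result / fcinfo.isNull; this sketch substitutes a plain list as the row sink, and the Result enum and SRFNext shape are assumptions patterned on the description above, not the actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ValuePerCallSketch
{
	// Hypothetical mini-model of the protocol described above.
	enum Result { SINGLE, MULTIPLE, END }

	interface SRFNext extends AutoCloseable
	{
		Result nextResult();    // store one row, say whether more follow
		@Override void close(); // called after the last nextResult call
	}

	/*
	 * An SRFNext that yields rows from an iterator: MULTIPLE while rows
	 * remain, END (storing no row) once exhausted. The 'sink' list
	 * stands in for fcinfo.result in this self-contained sketch.
	 */
	static SRFNext over(Iterator<String> rows, List<String> sink)
	{
		return new SRFNext()
		{
			public Result nextResult()
			{
				if ( ! rows.hasNext() )
					return Result.END;  // no row stored this call
				sink.add(rows.next());  // stand-in for fcinfo.result
				return Result.MULTIPLE; // expect another call
			}
			public void close() { /* release per-call resources here */ }
		};
	}

	public static void main(String[] args)
	{
		List<String> out = new ArrayList<>();
		// a dispatcher-like loop; try-with-resources supplies close()
		try ( SRFNext next = over(List.of("a", "b").iterator(), out) )
		{
			while ( Result.MULTIPLE == next.nextResult() )
				continue;
		}
		System.out.println(out); // [a, b]
	}
}
```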

The Materialize interface will likewise specify a specializeMaterialize method, but the details are TBD, so at this stage of the API the Materialize interface is a stub and does not specify any usable behavior yet.

The Glot64 example language handler now has example code for set-returning functions. For as long as the store direction of TupleTableSlot remains unimplemented, null / void / zero are still the only values any Glot64 function can return. But now it can return sets of them!

@beargiles

Adding this for the record since I didn't see it mentioned above.

The existing implementation uses JNI. It has many benefits but is a real PITA to work with.

That led to the creation of 'Java Native Access' (JNA). It is much easier to use, since you only need to provide (loosely) the API interface and the location of the shared library. It can't do everything that JNI can, but it can do a lot.

This refactoring may want to look at how much of libpq can be implemented using JNA instead of JNI, and whether there would be a performance impact when doing so.

The main benefit of this approach would be 1) reducing the amount of work required by this project and 2) letting the PostgreSQL project maintain this functionality.

Finally, it's possible that anything that still requires JNI might be handled in an updated libpq or other library.

See java-native-access/jna for a project that has wrapped a ton of C libraries, including many specific to operating systems or CPU architectures.

@jcflack
Contributor Author

jcflack commented May 16, 2025

The existing implementation uses JNI. It has many benefits but is a real PITA to work with.

That led to the creation of 'Java Native Access' (JNA). It is much easier to use since you only need to provide (loosely) the API interface and location of the shared file.

Opportunities surely exist for replacing some JNI with Java's own more recent Foreign Function and Memory API, preserving PL/Java's lack of third-party runtime dependencies. Some of the newer Java code (Datum.Accessor is an example IIRC) is already type-parametrized in order to support a future FFM implementation.

That said, the opportunities for replacing JNI with FFM are limited by the typical need for things to happen at the boundaries (transformation of Java exceptions into ereports and vice versa, for example), already taken care of in Thomas's JNICalls.c. For that reason, a lot of the existing uses of JNI are in no danger of going away.

Those considerations are a bit orthogonal to achieving the goals and correcting the deficiencies described at the top of this PR.

libpq, being a library for processing of the on-the-wire communication between a PostgreSQL backend and a connected client, isn't used in PL/Java and hasn't many facilities that would be of use here.

jcflack added 15 commits May 23, 2025 12:49
Also move the SQLAction creating the pljavahandler language onto
the package declaration; no need of a dummy class for those.
PostgreSQL can sometimes pass an array so empty it has no dimensions.
The type List<String> assigned back in 7786fbf already had a comment
that it probably wasn't appropriate, as each String in the list really
represents a key=value pair and that should be made explicit.

On inspection of the PostgreSQL parser and transformRelOptions function,
it's clear the key is an SQL simple identifier, duplicates forbidden,
and the value is a String and never null, so a null-hostile, unmodifiable
Map<Identifier.Simple,String> fits the bill.

An assumption has to be made that the key is never a delimited identifier
with an '=' in it, so the first '=' in the stored value is unambiguously
the delimiter between the key and the value. That's the same assumption
made by PostgreSQL's untransformRelOptions (though, for the time being[0],
PostgreSQL does not reject a key with '=' in it when accepting unknown
custom options such as for foreign data wrappers).

The same Map<Simple,String> type is appropriate also for
Attribute.options() and Attribute.fdwoptions(), and for similar accessors
on foreign-data-related catalogs when those are implemented.

[0] https://www.postgresql.org/message-id/6830EB30.8090904%40acm.org
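A sketch of that parsing rule, with a plain String key standing in for Identifier.Simple; the helper below is hypothetical, not PL/Java code. Split each stored option at the first '=', forbid duplicate keys, and hand back an unmodifiable, null-hostile map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RelOptionsSketch
{
	/*
	 * Hypothetical helper mirroring the rule above: each stored option
	 * is "key=value", split at the FIRST '=' (keys are assumed never to
	 * contain one), duplicates forbidden. Map.copyOf returns a map that
	 * is both unmodifiable and null-hostile.
	 */
	static Map<String,String> parse(List<String> stored)
	{
		Map<String,String> m = new LinkedHashMap<>();
		for ( String kv : stored )
		{
			int eq = kv.indexOf('=');
			if ( eq < 0 )
				throw new IllegalArgumentException("no '=' in " + kv);
			if ( null != m.put(kv.substring(0, eq), kv.substring(eq + 1)) )
				throw new IllegalArgumentException("duplicate key in " + kv);
		}
		return Map.copyOf(m);
	}

	public static void main(String[] args)
	{
		// first '=' is the delimiter; later ones belong to the value
		System.out.println(parse(List.of("k=a=b")).get("k")); // a=b
		System.out.println(parse(List.of("fillfactor=70")).get("fillfactor"));
	}
}
```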
There are still others yet to be implemented, but these are easy enough.
Getting the descriptor from the typcache is handy when a relation has
an associated type, but not all kinds of relation have one. An index
or TOAST table doesn't, for example.

There was already one case that had to be handled by going to
the relcache, and that was to get the descriptor for pg_class itself,
which can't be expected to find its associated type before it knows
what its columns are. So that code path just needs to be used also
for the relation kinds that don't have an associated type.
RegClass should have accessors for AccessMethod and Tablespace
(Database has a Tablespace also).

RegClass indirectly reaches ForeignDataWrapper and ForeignServer.
Opting not to make ForeignTable a CatalogObject in its own right:
it barely qualifies. It is identified by the oid of the RegClass,
and functions more as an extension of that. Its options and
ForeignServer can just be given accessors on RegClass.

Accessors returning these things to be added in a later commit
(along with whitespace-only tidying of lines added here).
This commit includes whitespace-only tidying of lines added
in the previous commit.
ForeignTable is simply represented by two accessors added to RegClass.
When foreign-table info is wanted, a little class, which RegClass holds in
a single slot, gets instantiated and constructs both values.

The slot's invalidation still uses the RegClass switch point, rather
than also hooking invalidation for pg_foreign_table.

In passing, catch up with two pg_database attributes that changed
in PG 15 from name to text.
The merged work includes PR #533, which does away with the old Ptr2Long
union in favor of new PointerGetJLong and JlongGet(... conversions.
In merging, also convert the uses of Ptr2Long that were added on this
branch.
This continues the work started in the REL1_6_STABLE branch of fixing
javadoc errors that prevent a successful javadoc run with maximal
coverage. This fixes such errors that have been introduced in
the org.postgresql.pljava.internal module in this branch.
The only well-known collations pinned with compile-time symbols
remaining now are DEFAULT and C (postgres/postgres@51edc4c).
In PG 18, there is now a CompactAttribute struct found in
tuple descriptors (postgres/postgres@5983a4c) that contains
a field of like purpose, so the one in pg_attribute is gone
(postgres/postgres@02a8d0c). PL/Java deforming wasn't making
any use of it yet anyway.

Where a TupleDesc used to have an attrs offset that was exactly
where a sequence of Form_pg_attribute began, it now has a
compact_attrs offset where a sequence of CompactAttribute starts.
That is still followed by a sequence of Form_pg_attribute, so now
the Java code looking for those has to take the compact_attrs offset
and add natts * sizeof (CompactAttribute).
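The offset arithmetic amounts to the following sketch; the values passed in main are made-up placeholders, not the real PostgreSQL struct sizes.

```java
public class AttrOffsetSketch
{
	/*
	 * Hypothetical illustration of the PG 18 layout change described
	 * above: the sequence of Form_pg_attribute now begins after natts
	 * CompactAttribute structs, which start at the compact_attrs offset.
	 */
	static long formPgAttributeBase(
		long compactAttrsOffset, int natts, long sizeofCompactAttribute)
	{
		return compactAttrsOffset + (long)natts * sizeofCompactAttribute;
	}

	public static void main(String[] args)
	{
		// placeholder numbers only: offset 64, 3 attributes, 16 bytes each
		System.out.println(formPgAttributeBase(64, 3, 16)); // 64 + 3*16 = 112
	}
}
```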

The CompactAttribute structs were added upstream as a performance
optimization, and could perhaps be made use of here to good effect,
but for now just compute the offset and use the Form_pg_attribute
in the accustomed way. Most promising, perhaps, would be to have
TupleDescriptor make use of the attcacheoff member of the new struct.
The earlier work fixing run-busting errors in javadoc comments permits
the javadoc coverage for the o.p.p.internal module to be expanded to
cover package-private types and members.

In passing, add enough missing class-level javadoc comments to make
the resulting package listings somewhat presentable.
Interesting that Mac OS clang was the only compiler to spot it.
This doc comment was overlooked in 5d9836c.
@beargiles

My recent PR has the ability to build custom docker images containing the backend jar(s) and then run tests using standard java (spring boot). It is targeted towards end users who want to test their implementations in a "real world" setting.

(Not everyone uses Spring Boot, but enough people do that it's a good place to start since it's then easy to add ORMs etc.)

It should be easy to tweak it to support testing changes to pljava.jar as well. The custom docker image is based on the official postgresql repo, but it looks like it would be trivial to change the Dockerfile so that it includes a few additional files and the database then uses the local build.

This won't incorporate all of the latest pgxn goodies but it should be great for regression testing since the tests can more closely mimic the real world with ORMs etc.
