feat(rpc): compile-time op metadata & RPC graph validation #13167

thevilledev · 2025-04-29T03:50:20Z

Motivation:

When the RPC server receives a compute graph from a client and deserializes it, there's a possibility that the graph structure is malformed. Specifically, nodes might be missing required input operands (src[0], src[1], etc.) for their specified operation (ggml_op). Passing such an incomplete graph directly to ggml_backend_graph_compute could lead to crashes or undefined behavior within the backend implementation.

Changes:

The ggml_op enum and the GGML_OP_METADATA array (which stores metadata like n_src - the number of source operands) are now automatically generated from GGML_OP_LIST by using X-Macros. This ensures they are always synchronized.
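As a rough illustration of the X-Macro technique described above, here is a minimal sketch. The `DEMO_*` names and the three-entry list are hypothetical stand-ins, not the PR's actual `GGML_OP_LIST`. One list macro is expanded twice, once into the enum and once into the metadata table, so the two cannot drift apart:

```c
#include <assert.h>

/* Hypothetical three-op list; each entry is X(enum name, number of required sources). */
#define DEMO_OP_LIST(X) \
    X(DEMO_OP_NONE, 0)  \
    X(DEMO_OP_DUP,  1)  \
    X(DEMO_OP_ADD,  2)

/* Expansion 1: the enum, with a trailing count entry. */
#define DEMO_ENUM_ENTRY(name, n_src) name,
enum demo_op {
    DEMO_OP_LIST(DEMO_ENUM_ENTRY)
    DEMO_OP_COUNT
};

/* Expansion 2: the metadata table, emitted in the same order as the enum. */
typedef struct {
    int n_src; /* number of required source operands */
} demo_op_metadata_t;

#define DEMO_META_ENTRY(name, n_src) { n_src },
static const demo_op_metadata_t DEMO_OP_METADATA[] = {
    DEMO_OP_LIST(DEMO_META_ENTRY)
};

/* The array is left unsized, so this assertion genuinely compares the two expansions. */
static_assert(sizeof(DEMO_OP_METADATA) / sizeof(DEMO_OP_METADATA[0]) == DEMO_OP_COUNT,
              "metadata table and enum must have the same number of entries");
```

Adding an op then means adding one `X(...)` line; both the enum value and its table entry follow from it.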

A compile-time check function ggml_op_metadata_check() is added and called during ggml_init(). This function uses a switch statement over all ggml_op enum values. If an operation is added to the enum but not to GGML_OP_LIST (and thus not to the metadata), compilers configured to treat unhandled enum cases in a switch as an error (e.g., via -Werror=switch-enum) will flag this at compile time.
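The switch-based check could look roughly like the sketch below. The `demo_*` names are hypothetical, and the body is an assumption about the general shape rather than the PR's actual `ggml_op_metadata_check()`. The switch has no `default` label, so GCC and Clang can report an enum value without a `case` through `-Wswitch` (part of `-Wall`); `-Wswitch-enum` reports it even when a `default` label is present.

```c
#include <assert.h>

enum demo_op { DEMO_OP_NONE, DEMO_OP_DUP, DEMO_OP_ADD, DEMO_OP_COUNT };

/* Returns the number of required sources, or -1 for DEMO_OP_COUNT and unknown values.
 * Adding a new enumerator without a matching case triggers a switch warning. */
static int demo_op_n_src(enum demo_op op) {
    switch (op) {
        case DEMO_OP_NONE:  return 0;
        case DEMO_OP_DUP:   return 1;
        case DEMO_OP_ADD:   return 2;
        case DEMO_OP_COUNT: break;
    }
    return -1;
}

/* Runtime counterpart, meant to be called once at startup:
 * every real op must map to a non-negative operand count. */
static void demo_op_metadata_check(void) {
    for (int i = 0; i < DEMO_OP_COUNT; i++) {
        assert(demo_op_n_src((enum demo_op) i) >= 0);
    }
}
```

The warning only becomes a build failure when the project promotes it with `-Werror` or `-Werror=switch-enum`; otherwise the runtime loop is the only enforcement.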

For runtime checks this PR introduces a new static helper function validate_graph_operands within ggml-rpc.cpp. This function iterates through the nodes of the deserialized graph before computation begins.

- It checks that each node pointer itself is not null.
- Based on the ggml_op of each node, it verifies that all required src[i] operand pointers are non-null.

validate_graph_operands is called from rpc_server::graph_compute immediately after graph deserialization and before ggml_backend_graph_compute.
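The two checks can be sketched as follows. This is a simplified illustration under stated assumptions: `demo_tensor`, `DEMO_N_SRC`, and `demo_validate_graph_operands` are hypothetical stand-ins and do not reproduce the real `ggml_tensor` layout or the PR's function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_MAX_SRC 4 /* hypothetical upper bound on operands per node */

enum demo_op { DEMO_OP_NONE, DEMO_OP_DUP, DEMO_OP_ADD, DEMO_OP_COUNT };

/* Required operand count per op, in enum order. */
static const int DEMO_N_SRC[] = { 0, 1, 2 };
static_assert(sizeof(DEMO_N_SRC) / sizeof(DEMO_N_SRC[0]) == DEMO_OP_COUNT,
              "one n_src entry per op");

struct demo_tensor {
    enum demo_op         op;
    struct demo_tensor * src[DEMO_MAX_SRC];
};

/* Single O(N) pass over the nodes; returns false on the first malformed node. */
static bool demo_validate_graph_operands(struct demo_tensor * const * nodes, int n_nodes) {
    for (int i = 0; i < n_nodes; i++) {
        const struct demo_tensor * node = nodes[i];
        if (node == NULL) {
            fprintf(stderr, "Graph node %d is null.\n", i);
            return false;
        }
        if ((int) node->op < 0 || (int) node->op >= DEMO_OP_COUNT) {
            fprintf(stderr, "Graph node %d has an unknown op.\n", i);
            return false;
        }
        for (int s = 0; s < DEMO_N_SRC[node->op]; s++) {
            if (node->src[s] == NULL) {
                fprintf(stderr, "Graph node %d missing required input src[%d].\n", i, s);
                return false;
            }
        }
    }
    return true;
}
```

A caller would run this right after deserialization and skip the compute call when it returns false.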

If validation fails, the computation request is rejected early, preventing the invalid graph from reaching the compute backend. The server will output the following:

Accepted client connection, free_mem=17179869184, total_mem=17179869184
[operator()] Graph node 0 (op ADD, name 'malformed') missing required input src[1].
Client connection closed

Performance:

This adds a quick O(N) validation step before the main computation. This preprocessing overhead is expected to be negligible compared to the actual graph computation time but significantly improves the server's robustness against malformed client requests.

The validation could also be done within compute kernels. I rejected the idea because:

- It scatters validation logic across potentially many different operation implementations, making maintenance harder.
- Adding checks (like if (src == nullptr)) directly within performance-critical compute kernels could introduce branching overhead and hinder optimizations, potentially hurting overall computation performance.
- The current pre-validation approach provides a clear separation of concerns (RPC request validation vs. core computation) and adds minimal overhead outside the critical compute path.

ggerganov · 2025-04-29T06:28:33Z

We have to find a better solution - this is difficult to maintain because nothing guarantees that changes to the srcs of an op would be reflected here. There should be some compile-time guarantee for that.

thevilledev · 2025-04-29T09:44:00Z

Good idea 👍 My initial thoughts revolved around utilising code generation, but how about this:

Create a metadata structure:

         typedef struct {
            int n_src; // Number of required source operands
            // Potentially add other useful metadata later if needed
         } ggml_op_metadata_t;

Declare a static const array indexed by ggml_op. The intent is for the compiler to complain if an enum value is missed. Something like this, where n_src is the number of source operands required for the forward pass:

        static const ggml_op_metadata_t GGML_OP_METADATA[GGML_OP_COUNT] = {
            [GGML_OP_NONE]                  = {.n_src = 0},
            [GGML_OP_DUP]                   = {.n_src = 1},
            [GGML_OP_ADD]                   = {.n_src = 2},
            [GGML_OP_ADD1]                  = {.n_src = 2},
            ...

Compile-time checks using static_assert, or alternatively a runtime check in ggml_init:

        static_assert(sizeof(GGML_OP_METADATA) / sizeof(GGML_OP_METADATA[0]) == GGML_OP_COUNT,
                      "GGML_OP_METADATA array size mismatch with GGML_OP_COUNT");

Accessor:

        static inline int ggml_op_get_n_src(enum ggml_op op) {
            if (op >= 0 && op < GGML_OP_COUNT) {
                return GGML_OP_METADATA[op].n_src;
            }
            // Handle invalid op case if necessary, e.g., return -1
            return -1; // Or assert/abort
        }
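One caveat with the designated-initializer table, shown in a hedged sketch with hypothetical `demo_*` names: in C, an index left out of the initializer list is silently zero-initialized, and a size assertion on an explicitly sized array always holds. A possible workaround, assumed here rather than taken from the PR, is a `defined` marker that a startup check can scan for:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_op { DEMO_OP_NONE, DEMO_OP_DUP, DEMO_OP_ADD, DEMO_OP_COUNT };

typedef struct {
    bool defined; /* hypothetical marker: true only for explicitly listed ops */
    int  n_src;   /* number of required source operands */
} demo_op_metadata_t;

static const demo_op_metadata_t DEMO_OP_METADATA[DEMO_OP_COUNT] = {
    [DEMO_OP_NONE] = { .defined = true, .n_src = 0 },
    [DEMO_OP_DUP]  = { .defined = true, .n_src = 1 },
    /* DEMO_OP_ADD is deliberately omitted: this still compiles, and the slot is zero-filled. */
};

/* Holds regardless of how many entries are listed, because the array size is explicit. */
static_assert(sizeof(DEMO_OP_METADATA) / sizeof(DEMO_OP_METADATA[0]) == DEMO_OP_COUNT,
              "array size mismatch");

/* Returns the first op without an explicit entry, or DEMO_OP_COUNT if every op is covered. */
static int demo_first_missing_op(void) {
    for (int i = 0; i < DEMO_OP_COUNT; i++) {
        if (!DEMO_OP_METADATA[i].defined) {
            return i;
        }
    }
    return DEMO_OP_COUNT;
}
```

In this sketch the omission is only caught at runtime, which is one reason to generate the enum and the table from a single list instead.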

I'll park this as a draft in the meanwhile.

Refactor `ggml_op` and `GGML_OP_METADATA` using X-Macros. This ensures compile-time synchronization between the enum and metadata. `ggml_op_metadata_check()` verifies this at compile time during `ggml_init`.

This enables robust graph validation in the RPC server. Previously, malformed graphs (e.g., ADD with NULL src[1]) could cause crashes. `validate_graph_operands` now uses the X-Macro-generated metadata (`ggml_op_get_n_src`) to check for required non-null source operands before execution. Invalid graphs are rejected early.

Adds `test_op_metadata_counts` to verify the metadata system.

Signed-off-by: Ville Vesilehto <[email protected]>

thevilledev · 2025-05-19T04:17:33Z

Added compile-time checks through X-macros. Looks like the build is failing on some platforms. I think the struct initialiser can be modified to fix this though.

Early feedback on the approach would be much appreciated!

ggerganov · 2025-05-19T06:07:40Z

This seems like a lot of complexity. The simpler approach from a security perspective is to not allow unauthorized clients to submit graphs freely to the server. Then this validation would not be needed.

thevilledev · 2025-05-19T06:35:41Z

I think it's not only about malicious input but incomplete graphs due to programmatic errors, which then end up crashing the RPC server in the process.

ggerganov · 2025-05-19T06:50:10Z

Programmatically, we should not allow incomplete graphs to be created. If we instead assumed that incomplete graphs can occur, all backends would have to validate the graphs, which is not practical.

ggml_tensor currently does expose the srcs publicly, which is not great and technically allows developers to do many wrong things if they fiddle with these tensors directly from user code. But this is not something that we should guard against at the backend level. Instead, we have to rework the ggml / ggml_tensor interface to make it impossible to do this.

thevilledev · 2025-05-19T07:08:27Z

Sure, I think that's a good improvement for anyone utilising the C++ interface. But you could still send incomplete input with valid wire format over the RPC interface, intentionally or not, without server-side validation present. Sounds like that's an accepted risk so I'll close this PR.

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 29, 2025

thevilledev marked this pull request as draft April 29, 2025 09:44

thevilledev force-pushed the fix/graph-op-validate branch from 86182fb to 1b7d972 Compare May 19, 2025 03:59

github-actions bot added the testing Everything test related label May 19, 2025

thevilledev changed the title ~~fix(rpc): validate graph operands~~ feat(rpc): compile-time op metadata & RPC graph validation May 19, 2025

thevilledev closed this May 19, 2025

thevilledev deleted the fix/graph-op-validate branch May 19, 2025 07:12
