Skip to content

feat(rpc): compile-time op metadata & RPC graph validation #13167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

thevilledev
Copy link
Contributor

@thevilledev thevilledev commented Apr 29, 2025

Motivation:

When the RPC server receives a compute graph from a client and deserializes it, there's a possibility that the graph structure is malformed. Specifically, nodes might be missing required input operands (src[0], src[1], etc.) for their specified operation (ggml_op). Passing such an incomplete graph directly to ggml_backend_graph_compute could lead to crashes or undefined behavior within the backend implementation.

Changes:

The ggml_op enum and the GGML_OP_METADATA array (which stores metadata like n_src - the number of source operands) are now automatically generated from GGML_OP_LIST by using X-Macros. This ensures they are always synchronized.

A compile-time check function ggml_op_metadata_check() is added and called during ggml_init(). This function uses a switch statement over all ggml_op enum values. If an operation is added to the enum but not to GGML_OP_LIST (and thus not to the metadata), compilers configured to treat unhandled enum cases in a switch as an error (e.g., via -Werror=switch-enum) will flag this at compile time.

For runtime checks this PR introduces a new static helper function validate_graph_operands within ggml-rpc.cpp. This function iterates through the nodes of the deserialized graph before computation begins.

  • It checks that each node pointer itself is not null.
  • Based on the ggml_op of each node, it verifies that all required src[i] operand pointers are non-null.
  • This validation logic is encapsulated in the validate_graph_operands function, which is called from rpc_server::graph_compute immediately after graph deserialization and before calling ggml_backend_graph_compute.

If validation fails, the computation request is rejected early, preventing the invalid graph from reaching the compute backend. The server will output the following:

Accepted client connection, free_mem=17179869184, total_mem=17179869184
[operator()] Graph node 0 (op ADD, name 'malformed') missing required input src[1].
Client connection closed

Performance:

This adds a quick O(N) validation step before the main computation. This preprocessing overhead is expected to be negligible compared to the actual graph computation time but significantly improves the server's robustness against malformed client requests.

The validation could also be done within compute kernels. I rejected the idea because:

  • It scatters validation logic across potentially many different operation implementations, making maintenance harder.
  • Adding checks (like if (src == nullptr)) directly within performance-critical compute kernels could introduce branching overhead and potentially hinder optimizations. Potentially impacting overall computation performance.
  • The current pre-validation approach provides a clear separation of concerns (RPC request validation vs. core computation) and adds minimal overhead outside the critical compute path.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 29, 2025
@ggerganov
Copy link
Member

We have to find a better solution - this is difficult to maintain because nothing guarantees that changes to the srcs of an op would be reflected here. There should be some compile-time guarantee for that.

@thevilledev
Copy link
Contributor Author

Good idea 👍 My initial thoughts revolved around utilising code generation, but how about this:

  • Create a metadata structure:
         typedef struct {
            int n_src; // Number of required source operands
            // Potentially add other useful metadata later if needed
         } ggml_op_metadata_t;
  • Declare a static const array indexed by ggml_op. Compiler complains if an enum value is missed. Something like this where n_src represents the number of required computational sources to forward pass:
        static const ggml_op_metadata_t GGML_OP_METADATA[GGML_OP_COUNT] = {
            [GGML_OP_NONE]                  = {.n_src = 0},
            [GGML_OP_DUP]                   = {.n_src = 1},
            [GGML_OP_ADD]                   = {.n_src = 2},
            [GGML_OP_ADD1]                  = {.n_src = 2},
            ...
  • Compile-time checks using static_assert, or alternatively a runtime check in ggml_init:
static_assert(sizeof(GGML_OP_METADATA) / sizeof(GGML_OP_METADATA[0]) == GGML_OP_COUNT,
                     "GGML_OP_METADATA array size mismatch with GGML_OP_COUNT");
  • Accessor:
        static inline int ggml_op_get_n_src(enum ggml_op op) {
            if (op >= 0 && op < GGML_OP_COUNT) {
                return GGML_OP_METADATA[op].n_src;
            }
            // Handle invalid op case if necessary, e.g., return -1
            return -1; // Or assert/abort
        }

I'll park this as a draft in the meanwhile.

@thevilledev thevilledev marked this pull request as draft April 29, 2025 09:44
Refactor `ggml_op` and `GGML_OP_METADATA` using X-Macros.
This ensures compile-time synchronization between the enum
and metadata. `ggml_op_metadata_check()` verifies this
at compile time during `ggml_init`.

This enables robust graph validation in the RPC server.
Previously, malformed graphs (e.g., ADD with NULL src[1])
could cause crashes. `validate_graph_operands` now uses
the X-Macro-generated metadata (`ggml_op_get_n_src`) to check
for required non-null source operands before execution.
Invalid graphs are rejected early.

Adds `test_op_metadata_counts` to verify the metadata system.

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev force-pushed the fix/graph-op-validate branch from 86182fb to 1b7d972 Compare May 19, 2025 03:59
@github-actions github-actions bot added the testing Everything test related label May 19, 2025
@thevilledev thevilledev changed the title fix(rpc): validate graph operands feat(rpc): compile-time op metadata & RPC graph validation May 19, 2025
@thevilledev
Copy link
Contributor Author

Added compile-time checks through X-macros. Looks like the build is failing on some platforms. I think the struct initialiser can be modified to fix this though.

Early feedback on the approach would be much appreciated!

@ggerganov
Copy link
Member

This seems like a lot of complexity. The simpler approach from security perspective is to not have a unauthorized clients to submit graphs freely to the server. Then this validation would not be needed.

@thevilledev
Copy link
Contributor Author

I think it's not only about malicious input but incomplete graphs due to programmatic errors, which then end up crashing the RPC server in the process.

@ggerganov
Copy link
Member

ggerganov commented May 19, 2025

Programmatically we should not allow to create graphs that are incomplete. If instead this was a correct assumption, it would require all backends to validate the graphs which is not practical.

ggml_tensor currently does expose the srcs publicly, which is not great and technicallly allows developers to do many wrong things if they fiddle with these tensors directly from the user code. But this is not something that we should guard at the backend level. Instead, we have to rework the ggml / ggml_tensor interface to make it impossible to do this.

@thevilledev
Copy link
Contributor Author

Sure, I think that's a good improvement for anyone utilising the C++ interface. But you could still send incomplete input with valid wire format over the RPC interface, intentionally or not, without server-side validation present. Sounds like that's an accepted risk so I'll close this PR.

@thevilledev thevilledev deleted the fix/graph-op-validate branch May 19, 2025 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants