Skip to content

CNS API contracts for NUMA-Aware Pods #3825

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pjohnst5
Copy link
Contributor

@pjohnst5 pjohnst5 commented Jul 14, 2025

CNS IBDevice API Contracts

Overview

Two new APIs for managing InfiniBand (IB) devices in Azure Container Network Service (CNS):

  1. POST IB Devices for Pod - POST IB devices for a pod
  2. GET IB Device Info - GET IB device status

API 1: POST IB Devices for Pod

Endpoint

POST /ibdevices/pod

Request Body

{
  "podID": "my-pod-my-namespace",
  "macAddresses": ["60:45:bd:a4:b5:7a", "7c:1e:52:07:11:36"]
}

Success Response

{
  "errorCode": 0,
  "message": "Successfully assigned 2 devices to pod my-pod-my-namespace"
}

Error Response

{
  "errorCode": 2,
  "message": "Device 60:45:bd:a4:b5:7a is already assigned to another pod"
}

See errorCodes.go in this PR for full list of error codes.


API 2: GET IB Device Information

Endpoint

GET /ibdevices/{mac-address-of-device}

Request

No request body (MAC address provided in URL path)

Success Response

{
  "macAddress": "60:45:bd:a4:b5:7a",
  "podID": "my-pod-my-namespace", 
  "status": "programmed",
  "errorCode": 0,
  "message": ""
}

Device Not Found Response

{
  "deviceID": "60:45:bd:a4:b5:7a",
  "podID": "",
  "status": "",
  "errorCode": 3,
  "message": "Device not found"
}

See status.go in this PR for list of all statuses of an IB device


Request body

See api.go for the structs

  • AssignIBDevicesToPodRequest
  • AssignIBDevicesToPodResponse
  • GetIBDeviceInfoRequest
  • GetIBDeviceInfoResponse

@pjohnst5 pjohnst5 force-pushed the pjohnst5/cns-api-contracts branch 2 times, most recently from 71fa67a to 0564b6d Compare July 15, 2025 00:25
@pjohnst5 pjohnst5 force-pushed the pjohnst5/cns-api-contracts branch 2 times, most recently from 7b6d80b to 82b11b0 Compare July 23, 2025 16:47
@pjohnst5 pjohnst5 force-pushed the pjohnst5/cns-api-contracts branch from c9a2fde to 37cf8f4 Compare July 30, 2025 16:16
@pjohnst5 pjohnst5 marked this pull request as ready for review July 30, 2025 16:18
@Copilot Copilot AI review requested due to automatic review settings July 30, 2025 16:18
@pjohnst5 pjohnst5 requested a review from a team as a code owner July 30, 2025 16:18
@pjohnst5 pjohnst5 requested a review from neaggarwMS July 30, 2025 16:18
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces two new REST API endpoints for managing InfiniBand (IB) devices in Azure Container Network Service (CNS), enabling NUMA-aware pod device assignment.

  • Adds PUT endpoint to assign specific IB devices to pods via MAC addresses
  • Adds GET endpoint to query the status and assignment of individual IB devices
  • Defines error codes and device status enums for the IB device management system

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
cns/types/infiniband/status.go Defines device status constants for IB device lifecycle states
cns/types/infiniband/errorcodes.go Defines error codes for IB device operations
cns/swagger.yaml Adds OpenAPI specification for the two new IB device endpoints
cns/api.go Defines Go structs for API request/response contracts
Comments suppressed due to low confidence (1)

cns/swagger.yaml:186

  • The schema name 'GetIBDeviceStatusResponse' is inconsistent with the endpoint naming convention. Based on the API path '/ibdevices/{macaddress}', it should be 'GetIBDeviceInfoResponse' to match the PR description and maintain consistency.
                $ref: "#/components/schemas/GetIBDeviceStatusResponse"

@pjohnst5 pjohnst5 requested review from a team as code owners July 30, 2025 16:20
@pjohnst5 pjohnst5 requested a review from QxBytes July 30, 2025 16:20
@pjohnst5
Copy link
Contributor Author

Waiting for #3876 to merge first before merging this

@pjohnst5 pjohnst5 force-pushed the pjohnst5/cns-api-contracts branch from d193495 to c70fc2c Compare July 30, 2025 18:45
@pjohnst5 pjohnst5 force-pushed the pjohnst5/cns-api-contracts branch from dd872cf to ac97074 Compare July 30, 2025 20:18
Success ErrorCode = 0
PodNotFound ErrorCode = 1
DeviceUnavailable ErrorCode = 2
DeviceNotFound ErrorCode = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would we use DeviceNotFound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if DNC goes to try and program the device, but, in pubsub, it's not there (like a mismatch between the physical devices on the VM and what's in pubsub), DNC-RC could write "DeviceNotFound" in the CRD status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we comment above on which error codes would be returned for POST pod api & which ones to be used for IBDevice status, i.e. first list the pod api errorcodes followed by ib device status, since they are being returned/used for different purpose
`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, here's a rough draft of which code would be used for what:

const (
	Success                  ErrorCode = 0 // POST and GET
	PodNotFound              ErrorCode = 1 // POST
	DeviceUnavailable        ErrorCode = 2 // POST
	DeviceNotFound           ErrorCode = 3 // GET
	AnnotationNotFound       ErrorCode = 4 // POST
	PodAlreadyAllocated      ErrorCode = 5 // POST
	InternalProgrammingError ErrorCode = 6 // GET
)

Do you want me to simply move shuffle the lines of the error codes around? Or make separate error code enumerations for POST errors vs GET?

DeviceUnavailable ErrorCode = 2
DeviceNotFound ErrorCode = 3
AnnotationNotFound ErrorCode = 4
PodAlreadyAllocated ErrorCode = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we determine this? should this be idempotent, or if we want to return that pod-request is already in-progress let's return that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if NRM tries to assign the same pod twice, but with different devicd IDs the 2nd time, we want to return this

This basically means, "hey for this pod, we've already allocated devices, and the current devices you are trying to assign are different than we've already programmed"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if they send request for the same pod-id & with same devices - do we want to ignore an idempotent request or what will our response be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say our response would be 200 so yeah idempotent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to name this "ConflictingPodInfo" or something like that? (or "ConflictingDeviceInfo"?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try and omit error codes entirely (because you already have the HTTP ones). PodAlreadyAllocated could be covered by returning HTTP/409 with an error message, DeviceNotFound is simply a 404, etc.
aside, PodAlreadyAllocated is the wrong naming because the Pod is not allocated, the IBDev is.

Copy link
Contributor Author

@pjohnst5 pjohnst5 Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see your point about trying to get rid of error codes entirely
However, how would we distinguish between PodNotFound vs DeviceNotFound?

the Pod is not allocated, the IBDev is

The case I'm trying to cover here is more like "Pod A has already been assigned devices B and C, but you're now trying to assign it devices D and E"

Copy link
Contributor

@kmurudi kmurudi Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not Conflicting, maybe something that says pod is being allocated or in-progress, yeah whether devices given are same or different - if the given pod-id is in-progress we should return that & not take a new request for it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see your point about trying to get rid of error codes entirely However, how would we distinguish between PodNotFound vs DeviceNotFound?

the Pod is not allocated, the IBDev is

The case I'm trying to cover here is more like "Pod A has already been assigned devices B and C, but you're now trying to assign it devices D and E"

I think this smells because the api structure isn't mapping to the objects conceptually right, left a comment on the swagger with more thoughts.

// AssignIBDevicesToPodRequest represents the request to assign InfiniBand devices to a pod
// POST /ibdevices/pod
type AssignIBDevicesToPodRequest struct {
PodID string `json:"podID"` // podname-podnamespace format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just check if the format is usable everywhere in your methods later, i.e. easy to decipher from here & then use it for checking with api-server if pod exists etc. we could also change this later if you discover it doesn't work & is efficient in other format like body/struct or separate fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, in controller-runtime, I see they do "podnamespace/podname"
image

But seeing as this will be part of the API path on the GET, and also that this API is per pod, I'd rather lead with the podname (and make it "-")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the accepted convention is namespaced name, you should be adapting your implementation to that convention. Please don't do your own different quirky thing just because it's convenient, that's how we (already) end up with a dozen different Pod "IDs" in different parts CNS state that are irreconcilable.
It's generally more parseable and makes more sense when lexicographically sorted if you order by least -> most specificity. Pod is more specific than namespace, so it should go after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay haha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rbtr , what would you say is best at this point in CNS, to start moving towards a single consolidated PodInfo way of identifying a pod?
I see in

There's a PodInfo struct, should I use that here as well?

So

PodID PodInfo `json:podID"`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your question on using PodInfo, do you currently have a problem if a Pod is deleted and a new one immediately created with same name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, here's how we've designed it:

  1. Pod A is created
  2. MTPNC is also created for it with a finalizer
  3. Devices are programed, pod runs etc.
  4. Pod is deleted
  5. MTPNC is only deleted, after all devices for that pod were unprogrammed (this will be enforced in our other repo)

So now I'm wondering, is the pod also kept alive, while the MTPNC for it is?
If yes:

  • We're good, since the new pod can't be created until the old one is fully deleted
    If no:
  • Then yeah we'd have a problem haha

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it just a finalizer or an ownerref?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finalizer
finalizers.acn.azure.com/dnc-operations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes, the pod may be deleted, but the mtpnc for it may still be around
And then, if a new pod with the exact same name in the same namespace were to be created right after, then we'd have a problem of the old mtpnc still existing

What was your point then regarding the pod info?

// GET /ibdevices/{mac-address-of-device}
// GetIBDeviceStatusResponse represents the response containing InfiniBand device programming status
type GetIBDeviceStatusResponse struct {
MACAddress string `json:"macAddress"` // MAC address of the device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MACAddress string `json:"macAddress"` // MAC address of the device
PodID string `json:"podID"` // Pod that the device is assigned to
Status infiniband.Status `json:"status"` // Device status (e.g., "pendingProgramming", "error", "programmed", "pendingDeletion", "available")
ErrorCode infiniband.ErrorCode `json:"errorCode"` // Error code if applicable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the value of an errorCode instead of an error string? only adds an extra step to debugging imo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the partner team wanted us to give back error codes, so that when issues arose (let's say PodNotFound vs DeviceNotFound, they could try some extra steps on their end automatically to fix it, instead of involving DRIs

@@ -120,6 +120,69 @@ paths:
schema:
$ref: "#/components/schemas/UnpublishNetworkContainerResponse"

/ibdevices/pod:
Copy link
Contributor

@rbtr rbtr Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stared at this for a long time trying to figure out what bothered me about it and why it didn't seem like good REST - I thought it was because we must POST here but GET at /{mac}?... and I want to be able to GET the asset I have just created/updated.
But now I think it's simply that pod is not an ibdevice. The collection doesn't make sense - a bucket of ibdevices where I can get by mac but post by pod?
https://github.com/microsoft/api-guidelines/blob/vNext/graph/Guidelines-deprecated.md#93-collection-url-patterns

instead maybe:
/ibdevices/{mac} is the collection and items, can GET
/ibdevices/{mac}/pod is the mapping of ibdev:pod, can GET or POST to set associated Pod

or inverted (but a collection of pods seems odd since we don't control those and it's not all of them)
/pods/{pod} collection of pods, can GET
/pods/{pod}/ibdevices mapped IB devices, can GET or POST

or maybe both, as long as they are bidirectionally linked and traversable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, interesting, okay yeah, here's the main thing we are trying to allow

  1. Let's say, IB device A and B are assigned to Pod A
  2. Let's say everything is programmed and Pod A is running
  3. Pod A gets deleted
  4. DNC successfully unprograms IB Device A, but not B
  5. IB Device B is stuck completely due to some bug

In that case, we want to free up IB Device A, so that a new pod can use it while we try to fix IB Device B (or even if unprogramming device B is taking a long time)

But now I'm thinking about what you said

Copy link
Contributor Author

@pjohnst5 pjohnst5 Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's kind of weird, because in this case, a pod can have multiple IBs, but we want to treat each IB device as independent in a way, in case something goes wrong (or is slow), to use the free IB devices as soon possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's imagine for a minute we did the first way you said:
/ibdevices/{mac}
/ibdevices/{mac}/pod

In this case, how would the partner team, assign multiple IB devices to a single pod?

And then, let's imagine the other way:
/pods/{pod}
/pods/{pod}/ibdevices
This way I see how we could assign multiple IB devices to a single pod
But, like you said, we don't own all the pods nor control them, so it's just a bit odd
And then back to that situation where one IB device is slower to unprogram than another, we want to be able to get that info back to the partner team quickly, so in this case, the partner team would GET on /pod/{pod}/ibdevice, but have to remember the pod I guess, which would mean they would have to keep track of which pod they assigned it to or something like that 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, basically, when discussing this, they said:
"Oh yeah, we'll want to know the status of individual devices, to be able to use them for other pods easily, so let's make it GET on mac"
Just trying to relay that info here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is all manageable with query params on an ibdevices collection
GET /ibdevices?state=assigned to find all the assigned ibdevices
GET /ibdevices?state=unassigned unassigned
GET /ibdevices?by-pod=podnamespacedname
POST /ibdevices:assign?ibdevs=mac1,mac2&pod=podnamespacedname

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, one thing I should mention, is what the client of these APIs is doing

The client is going to scan for the MACs on the L1VH VM on startup, and choose devices that belong to the same NUMA nodes (within the L1VH VM)

CNS will not know which devices are part of the same NUMA nodes, the partner team's client will know that

So we weren't planning on having CNS scan for all devices on startup as well, since the client would be doing that

That being said, I do see a potential problem, where the client could give CNS garbage (like a non-existent device id), and CNS would just pass that along to DNC

DNC would be the one that discovers the device does not exist, and have to trickle that knowledge back to CNS

So I think with what you just shared, you're assuming CNS will know all the devices on the L1VH at any given point in time, is that correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants