-
Notifications
You must be signed in to change notification settings - Fork 250
CNS API contracts for NUMA-Aware Pods #3825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
71fa67a
to
0564b6d
Compare
7b6d80b
to
82b11b0
Compare
c9a2fde
to
37cf8f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces two new REST API endpoints for managing InfiniBand (IB) devices in Azure Container Network Service (CNS), enabling NUMA-aware pod device assignment.
- Adds PUT endpoint to assign specific IB devices to pods via MAC addresses
- Adds GET endpoint to query the status and assignment of individual IB devices
- Defines error codes and device status enums for the IB device management system
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
cns/types/infiniband/status.go | Defines device status constants for IB device lifecycle states |
cns/types/infiniband/errorcodes.go | Defines error codes for IB device operations |
cns/swagger.yaml | Adds OpenAPI specification for the two new IB device endpoints |
cns/api.go | Defines Go structs for API request/response contracts |
Comments suppressed due to low confidence (1)
cns/swagger.yaml:186
- The schema name 'GetIBDeviceStatusResponse' is inconsistent with the endpoint naming convention. Based on the API path '/ibdevices/{macaddress}', it should be 'GetIBDeviceInfoResponse' to match the PR description and maintain consistency.
$ref: "#/components/schemas/GetIBDeviceStatusResponse"
Waiting for #3876 to merge first before merging this |
d193495
to
c70fc2c
Compare
dd872cf
to
ac97074
Compare
Success ErrorCode = 0 | ||
PodNotFound ErrorCode = 1 | ||
DeviceUnavailable ErrorCode = 2 | ||
DeviceNotFound ErrorCode = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would we use DeviceNotFound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, if DNC goes to try and program the device, but, in pubsub, it's not there (like a mismatch between the physical devices on the VM and what's in pubsub), DNC-RC could write "DeviceNotFound" in the CRD status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we comment above on which error codes would be returned for POST pod api & which ones to be used for IBDevice status, i.e. first list the pod api errorcodes followed by ib device status, since they are being returned/used for different purpose
`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, here's a rough draft of which code would be used for what:
const (
Success ErrorCode = 0 // POST and GET
PodNotFound ErrorCode = 1 // POST
DeviceUnavailable ErrorCode = 2 // POST
DeviceNotFound ErrorCode = 3 // GET
AnnotationNotFound ErrorCode = 4 // POST
PodAlreadyAllocated ErrorCode = 5 // POST
InternalProgrammingError ErrorCode = 6 // GET
)
Do you want me to simply move shuffle the lines of the error codes around? Or make separate error code enumerations for POST errors vs GET?
DeviceUnavailable ErrorCode = 2 | ||
DeviceNotFound ErrorCode = 3 | ||
AnnotationNotFound ErrorCode = 4 | ||
PodAlreadyAllocated ErrorCode = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would we determine this? should this be idempotent, or if we want to return that pod-request is already in-progress let's return that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, if NRM tries to assign the same pod twice, but with different devicd IDs the 2nd time, we want to return this
This basically means, "hey for this pod, we've already allocated devices, and the current devices you are trying to assign are different than we've already programmed"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if they send request for the same pod-id & with same devices - do we want to ignore an idempotent request or what will our response be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say our response would be 200
so yeah idempotent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to name this "ConflictingPodInfo" or something like that? (or "ConflictingDeviceInfo"?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try and omit error codes entirely (because you already have the HTTP ones). PodAlreadyAllocated
could be covered by returning HTTP/409 with an error message, DeviceNotFound
is simply a 404, etc.
aside, PodAlreadyAllocated is the wrong naming because the Pod is not allocated, the IBDev is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I see your point about trying to get rid of error codes entirely
However, how would we distinguish between PodNotFound
vs DeviceNotFound
?
the Pod is not allocated, the IBDev is
The case I'm trying to cover here is more like "Pod A has already been assigned devices B and C, but you're now trying to assign it devices D and E"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not Conflicting, maybe something that says pod is being allocated or in-progress, yeah whether devices given are same or different - if the given pod-id is in-progress we should return that & not take a new request for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I see your point about trying to get rid of error codes entirely However, how would we distinguish between
PodNotFound
vsDeviceNotFound
?the Pod is not allocated, the IBDev is
The case I'm trying to cover here is more like "Pod A has already been assigned devices B and C, but you're now trying to assign it devices D and E"
I think this smells because the api structure isn't mapping to the objects conceptually right, left a comment on the swagger with more thoughts.
// AssignIBDevicesToPodRequest represents the request to assign InfiniBand devices to a pod | ||
// POST /ibdevices/pod | ||
type AssignIBDevicesToPodRequest struct { | ||
PodID string `json:"podID"` // podname-podnamespace format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just check if the format is usable everywhere in your methods later, i.e. easy to decipher from here & then use it for checking with api-server if pod exists etc. we could also change this later if you discover it doesn't work & is efficient in other format like body/struct or separate fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the accepted convention is namespaced name, you should be adapting your implementation to that convention. Please don't do your own different quirky thing just because it's convenient, that's how we (already) end up with a dozen different Pod "IDs" in different parts CNS state that are irreconcilable.
It's generally more parseable and makes more sense when lexicographically sorted if you order by least -> most specificity. Pod is more specific than namespace, so it should go after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rbtr , what would you say is best at this point in CNS, to start moving towards a single consolidated PodInfo
way of identifying a pod?
I see in
type PodInfo interface { |
There's a PodInfo
struct, should I use that here as well?
So
PodID PodInfo `json:podID"`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To your question on using PodInfo, do you currently have a problem if a Pod is deleted and a new one immediately created with same name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question, here's how we've designed it:
- Pod A is created
- MTPNC is also created for it with a finalizer
- Devices are programed, pod runs etc.
- Pod is deleted
- MTPNC is only deleted, after all devices for that pod were unprogrammed (this will be enforced in our other repo)
So now I'm wondering, is the pod also kept alive, while the MTPNC for it is?
If yes:
- We're good, since the new pod can't be created until the old one is fully deleted
If no: - Then yeah we'd have a problem haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just a finalizer or an ownerref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finalizer
finalizers.acn.azure.com/dnc-operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yes, the pod may be deleted, but the mtpnc for it may still be around
And then, if a new pod with the exact same name in the same namespace were to be created right after, then we'd have a problem of the old mtpnc still existing
What was your point then regarding the pod info?
// GET /ibdevices/{mac-address-of-device} | ||
// GetIBDeviceStatusResponse represents the response containing InfiniBand device programming status | ||
type GetIBDeviceStatusResponse struct { | ||
MACAddress string `json:"macAddress"` // MAC address of the device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MACAddress string `json:"macAddress"` // MAC address of the device | ||
PodID string `json:"podID"` // Pod that the device is assigned to | ||
Status infiniband.Status `json:"status"` // Device status (e.g., "pendingProgramming", "error", "programmed", "pendingDeletion", "available") | ||
ErrorCode infiniband.ErrorCode `json:"errorCode"` // Error code if applicable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the value of an errorCode instead of an error string? only adds an extra step to debugging imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so the partner team wanted us to give back error codes, so that when issues arose (let's say PodNotFound
vs DeviceNotFound
, they could try some extra steps on their end automatically to fix it, instead of involving DRIs
@@ -120,6 +120,69 @@ paths: | |||
schema: | |||
$ref: "#/components/schemas/UnpublishNetworkContainerResponse" | |||
|
|||
/ibdevices/pod: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stared at this for a long time trying to figure out what bothered me about it and why it didn't seem like good REST - I thought it was because we must POST here but GET at /{mac}?... and I want to be able to GET the asset I have just created/updated.
But now I think it's simply that pod
is not an ibdevice
. The collection doesn't make sense - a bucket of ibdevices where I can get by mac but post by pod?
https://github.com/microsoft/api-guidelines/blob/vNext/graph/Guidelines-deprecated.md#93-collection-url-patterns
instead maybe:
/ibdevices/{mac}
is the collection and items, can GET
/ibdevices/{mac}/pod
is the mapping of ibdev:pod, can GET or POST to set associated Pod
or inverted (but a collection of pods seems odd since we don't control those and it's not all of them)
/pods/{pod}
collection of pods, can GET
/pods/{pod}/ibdevices
mapped IB devices, can GET or POST
or maybe both, as long as they are bidirectionally linked and traversable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, interesting, okay yeah, here's the main thing we are trying to allow
- Let's say, IB device A and B are assigned to Pod A
- Let's say everything is programmed and Pod A is running
- Pod A gets deleted
- DNC successfully unprograms IB Device A, but not B
- IB Device B is stuck completely due to some bug
In that case, we want to free up IB Device A, so that a new pod can use it while we try to fix IB Device B (or even if unprogramming device B is taking a long time)
But now I'm thinking about what you said
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's kind of weird, because in this case, a pod can have multiple IBs, but we want to treat each IB device as independent in a way, in case something goes wrong (or is slow), to use the free IB devices as soon possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let's imagine for a minute we did the first way you said:
/ibdevices/{mac}
/ibdevices/{mac}/pod
In this case, how would the partner team, assign multiple IB devices to a single pod?
And then, let's imagine the other way:
/pods/{pod}
/pods/{pod}/ibdevices
This way I see how we could assign multiple IB devices to a single pod
But, like you said, we don't own all the pods nor control them, so it's just a bit odd
And then back to that situation where one IB device is slower to unprogram than another, we want to be able to get that info back to the partner team quickly, so in this case, the partner team would GET on /pod/{pod}/ibdevice
, but have to remember the pod I guess, which would mean they would have to keep track of which pod they assigned it to or something like that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, basically, when discussing this, they said:
"Oh yeah, we'll want to know the status of individual devices, to be able to use them for other pods easily, so let's make it GET on mac"
Just trying to relay that info here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is all manageable with query params on an ibdevices collection
GET /ibdevices?state=assigned
to find all the assigned ibdevices
GET /ibdevices?state=unassigned
unassigned
GET /ibdevices?by-pod=podnamespacedname
POST /ibdevices:assign?ibdevs=mac1,mac2&pod=podnamespacedname
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, one thing I should mention, is what the client of these APIs is doing
The client is going to scan for the MACs on the L1VH VM on startup, and choose devices that belong to the same NUMA nodes (within the L1VH VM)
CNS will not know which devices are part of the same NUMA nodes, the partner team's client will know that
So we weren't planning on having CNS scan for all devices on startup as well, since the client would be doing that
That being said, I do see a potential problem, where the client could give CNS garbage (like a non-existent device id), and CNS would just pass that along to DNC
DNC would be the one that discovers the device does not exist, and have to trickle that knowledge back to CNS
So I think with what you just shared, you're assuming CNS will know all the devices on the L1VH at any given point in time, is that correct?
CNS IBDevice API Contracts
Overview
Two new APIs for managing InfiniBand (IB) devices in Azure Container Network Service (CNS):
API 1: POST IB Devices for Pod
Endpoint
Request Body
Success Response
Error Response
See
errorCodes.go
in this PR for full list of error codes.API 2: GET IB Device Information
Endpoint
Request
No request body (MAC address provided in URL path)
Success Response
Device Not Found Response
See
status.go
in this PR for list of all statuses of an IB deviceRequest body
See
api.go
for the structsAssignIBDevicesToPodRequest
AssignIBDevicesToPodResponse
GetIBDeviceInfoRequest
GetIBDeviceInfoResponse