ofi/common: fix code that broke sessions #12870

Merged: 1 commit, Oct 23, 2024

opal/mca/common/ofi/common_ofi.c: 30 changes (22 additions, 8 deletions)
@@ -324,10 +324,11 @@ int opal_common_ofi_providers_subset_of_list(struct fi_info *provider_list, char
 
 int opal_common_ofi_mca_register(const mca_base_component_t *component)
 {
-    static int include_index = -1;
-    static int exclude_index = -1;
-    static int verbose_index = -1;
-    static int accelerator_rank_index = -1;
+    int include_index;
+    int exclude_index;
+    int verbose_index;
+    int accelerator_rank_index;
+    int param;
     int ret;
 
     if (fi_version() < FI_VERSION(1, 0)) {
@@ -336,7 +337,8 @@ int opal_common_ofi_mca_register(const mca_base_component_t *component)
 
     OPAL_THREAD_LOCK(&opal_common_ofi_mutex);
 
-    if (0 > include_index) {
+    param = mca_base_var_find("opal", "opal_common", "ofi", "provider_include");
+    if (0 > param) {
         /*
          * this monkey business is needed because of the way the MCA VARs stuff tries to handle
          * pointers to strings when when destructing the MCA var database. If you don't do
@@ -359,9 +361,12 @@ int opal_common_ofi_mca_register(const mca_base_component_t *component)
             ret = include_index;
             goto err;
         }
+    } else {
+        include_index = param;
     }
 
-    if (0 > exclude_index) {
+    param = mca_base_var_find("opal", "opal_common", "ofi", "provider_exclude");
+    if (0 > param) {
         if (NULL == opal_common_ofi.prov_exclude) {
             opal_common_ofi.prov_exclude = (char **) malloc(sizeof(char *));
             assert(NULL != opal_common_ofi.prov_exclude);
@@ -378,9 +383,12 @@ int opal_common_ofi_mca_register(const mca_base_component_t *component)
             ret = exclude_index;
             goto err;
         }
+    } else {
+        exclude_index = param;
     }
 
-    if (0 > verbose_index) {
+    param = mca_base_var_find("opal", "opal_common", "ofi", "verbose");
+    if (0 > param) {
         verbose_index = mca_base_var_register("opal", "opal_common", "ofi", "verbose",
                                               "Verbose level of the OFI components",
                                               MCA_BASE_VAR_TYPE_INT, NULL, 0,
@@ -391,9 +399,13 @@ int opal_common_ofi_mca_register(const mca_base_component_t *component)
             ret = verbose_index;
             goto err;
         }
+    } else {
+        verbose_index = param;
     }
 
-    if (0 > accelerator_rank_index) {
+
+    param = mca_base_var_find("opal", "opal_common", "ofi", "accelerator_rank");

Review comment (Member), on the line above:
I think this looks good. A case could be made, though, that this should be an MCA parameter of the accelerator framework rather than an OFI parameter (for a later PR).

Reply (Member Author):
Ah, good point. I was just keeping things the way they were. I have no objection to eventually moving it to the accelerator framework.

+    if (0 > param) {
         accelerator_rank_index
             = mca_base_var_register("opal", "opal_common", "ofi", "accelerator_rank",
                                     "Process rank(non-negative) on the selected accelerator device",
@@ -404,6 +416,8 @@ int opal_common_ofi_mca_register(const mca_base_component_t *component)
             ret = accelerator_rank_index;
             goto err;
         }
+    } else {
+        accelerator_rank_index = param;
     }
 
     if (component) {