-
Notifications
You must be signed in to change notification settings - Fork 15.5k
Description
Summary
In the example below, the PTX assembly generated by Clang declares the kernel as .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_. The same example compiled with NVCC generates .weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_968400945Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_. In both cases, external weak linkage is used, which is not necessary since it is coming from inside a anonymous namespace. With NVCC, this is not a problem because the anonymous namespace is mangled to a unique name (60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_96840094). With Clang however, the name is not unique (mangled to 12_GLOBAL__N_1). This is a problem when passing the resulting object files to nvlink, which will report nvlink fatal error: Internal error: duplicate parameter bank data not same size or nvlink error: Duplicate weak parameter bank for ... depending on the CUDA version. To me, it seems like internal linkage (.entry) instead of weak linkage (.weak .entry) should be used in this case.
Versions
I reproduced the bug with multiple Clang versions between 12.0.0 and 14.0.0. Before abd8cd9 by @yxsamliu, Clang would generate .visible .entry instead of .weak .entry, which isn't any better and actually causes the example below to fail earlier on Multiple definition of '_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv'().
Potential workaround
It seems like the Clang option -funique-internal-linkage-names should be usable as a workaround that forces the symbol names to be unique, however across all Clang versions this just gives me various internal compiler errors. But that's a different issue.
Working example
This example uses Thrust, which makes the symbol names very lengthy, but I am pretty sure the exact same behavior can also be observed by replacing thrust::transform with a hand-written kernel. The lambda capture seems to be important as just putting a kernel into an anonymous namespace is not sufficient to trigger the problem.
a.cu:
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
namespace {
struct Stuff {
long c[2];
};
struct Thing {
void calc(int *data, int n, Stuff s) {
auto f = [s] __device__ (int i) -> int {
return 2*i;
};
auto first = thrust::counting_iterator<int>(0);
auto last = first + n;
thrust::transform(thrust::device, first, last, data, f);
}
};
}
void runA(int * data, int n) {
Thing t;
Stuff s({0, 0});
t.calc(data, n, s);
}b.cu:
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
namespace {
struct Stuff {
long c[4];
};
struct Thing {
void calc(int *data, int n, Stuff s) {
auto f = [s] __device__ (int i) -> int {
return 2*i;
};
auto first = thrust::counting_iterator<int>(0);
auto last = first + n;
thrust::transform(thrust::device, first, last, data, f);
}
};
}
void runB(int * data, int n) {
Thing t;
Stuff s({0, 0, 0, 0});
t.calc(data, n, s);
}Compile and link these as follows:
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c a.cu -fgpu-rdc -o a.o
clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 -fPIC -c b.cu -fgpu-rdc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
clang++ -std=c++14 -fPIC -O3 -shared -o ab.so a.o b.o device.oThe second-to-last command will fail with the nvlink error given in the summary.
For comparison, compile and link with NVCC:
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c a.cu -dc -o a.o
nvcc -O3 -std=c++14 --expt-extended-lambda -gencode=arch=compute_70,code=[sm_70,compute_70] -Xcompiler -fPIC -c b.cu -dc -o b.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -dlink a.o b.o -o device.o
nvcc -O3 -Xcompiler -fPIC -gencode=arch=compute_70,code=[sm_70,compute_70] -shared a.o b.o device.o -o ab.soThis will succeed.
LLVM IR
Unfortunately, Thrust is currently broken on clang trunk on godbolt.org, so I cannot post a link to the LLVM IR. Running clang locally, I can see that the IR uses define weak_odr void @_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_(%.... My guess is that it should probably use some variation of private or internal instead of weak_odr.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status