
Commit bf40f80

Remove the need for most locking in memory.c.
Using thread-local storage for tracking memory allocations means that threads no longer have to lock at all when doing memory allocations / frees. This particularly helps the gemm driver, since it does an allocation per invocation. Even without threading at all this helps, since even taking a lock with no contention has a cost.

Before this change, no threading:

```
----------------------------------------------------
Benchmark            Time           CPU   Iterations
----------------------------------------------------
BM_SGEMM/4         102 ns        102 ns     13504412
BM_SGEMM/6         175 ns        175 ns      7997580
BM_SGEMM/8         205 ns        205 ns      6842073
BM_SGEMM/10        266 ns        266 ns      5294919
BM_SGEMM/16        478 ns        478 ns      2963441
BM_SGEMM/20        690 ns        690 ns      2144755
BM_SGEMM/32       1906 ns       1906 ns       716981
BM_SGEMM/40       2983 ns       2983 ns       473218
BM_SGEMM/64       9421 ns       9422 ns       148450
BM_SGEMM/72      12630 ns      12631 ns       112105
BM_SGEMM/80      15845 ns      15846 ns        89118
BM_SGEMM/90      25675 ns      25676 ns        54332
BM_SGEMM/100     29864 ns      29865 ns        47120
BM_SGEMM/112     37841 ns      37842 ns        36717
BM_SGEMM/128     56531 ns      56532 ns        25361
BM_SGEMM/140     75886 ns      75888 ns        18143
BM_SGEMM/150     98493 ns      98496 ns        14299
BM_SGEMM/160    102620 ns     102622 ns        13381
BM_SGEMM/170    135169 ns     135173 ns        10231
BM_SGEMM/180    146170 ns     146172 ns         9535
BM_SGEMM/189    190226 ns     190231 ns         7397
BM_SGEMM/200    194513 ns     194519 ns         7210
BM_SGEMM/256    396561 ns     396573 ns         3531
```

With this change:

```
----------------------------------------------------
Benchmark            Time           CPU   Iterations
----------------------------------------------------
BM_SGEMM/4          95 ns         95 ns     14500387
BM_SGEMM/6         166 ns        166 ns      8381763
BM_SGEMM/8         196 ns        196 ns      7277044
BM_SGEMM/10        256 ns        256 ns      5515721
BM_SGEMM/16        463 ns        463 ns      3025197
BM_SGEMM/20        636 ns        636 ns      2070213
BM_SGEMM/32       1885 ns       1885 ns       739444
BM_SGEMM/40       2969 ns       2969 ns       472152
BM_SGEMM/64       9371 ns       9372 ns       148932
BM_SGEMM/72      12431 ns      12431 ns       112919
BM_SGEMM/80      15615 ns      15616 ns        89978
BM_SGEMM/90      25397 ns      25398 ns        55041
BM_SGEMM/100     29445 ns      29446 ns        47540
BM_SGEMM/112     37530 ns      37531 ns        37286
BM_SGEMM/128     55373 ns      55375 ns        25277
BM_SGEMM/140     76241 ns      76241 ns        18259
BM_SGEMM/150    102196 ns     102200 ns        13736
BM_SGEMM/160    101521 ns     101525 ns        13556
BM_SGEMM/170    136182 ns     136184 ns        10567
BM_SGEMM/180    146861 ns     146864 ns         9035
BM_SGEMM/189    192632 ns     192632 ns         7231
BM_SGEMM/200    198547 ns     198555 ns         6995
BM_SGEMM/256    392316 ns     392330 ns         3539
```

Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost of small matrix operations was overshadowed by thread locking (look at sizes smaller than 32) even when not explicitly spawning threads:

```
----------------------------------------------------
Benchmark            Time           CPU   Iterations
----------------------------------------------------
BM_SGEMM/4         328 ns        328 ns      4170562
BM_SGEMM/6         396 ns        396 ns      3536400
BM_SGEMM/8         418 ns        418 ns      3330102
BM_SGEMM/10        491 ns        491 ns      2863047
BM_SGEMM/16        710 ns        710 ns      2028314
BM_SGEMM/20        871 ns        871 ns      1581546
BM_SGEMM/32       2132 ns       2132 ns       657089
BM_SGEMM/40       3197 ns       3196 ns       437969
BM_SGEMM/64       9645 ns       9645 ns       144987
BM_SGEMM/72      35064 ns      32881 ns        50264
BM_SGEMM/80      37661 ns      35787 ns        42080
BM_SGEMM/90      36507 ns      36077 ns        40091
BM_SGEMM/100     32513 ns      31850 ns        48607
BM_SGEMM/112     41742 ns      41207 ns        37273
BM_SGEMM/128     67211 ns      65095 ns        21933
BM_SGEMM/140     68263 ns      67943 ns        19245
BM_SGEMM/150    121854 ns     115439 ns        10660
BM_SGEMM/160    116826 ns     115539 ns        10000
BM_SGEMM/170    126566 ns     122798 ns        11960
BM_SGEMM/180    130088 ns     127292 ns        11503
BM_SGEMM/189    120309 ns     116634 ns        13162
BM_SGEMM/200    114559 ns     110993 ns        10000
BM_SGEMM/256    217063 ns     207806 ns         6417
```

And after, that overhead is gone (note this includes my other change, which reduces calls to num_cpu_avail):

```
----------------------------------------------------
Benchmark            Time           CPU   Iterations
----------------------------------------------------
BM_SGEMM/4          95 ns         95 ns     12347650
BM_SGEMM/6         166 ns        166 ns      8259683
BM_SGEMM/8         193 ns        193 ns      7162210
BM_SGEMM/10        258 ns        258 ns      5415657
BM_SGEMM/16        471 ns        471 ns      2981009
BM_SGEMM/20        666 ns        666 ns      2148002
BM_SGEMM/32       1903 ns       1903 ns       738245
BM_SGEMM/40       2969 ns       2969 ns       473239
BM_SGEMM/64       9440 ns       9440 ns       148442
BM_SGEMM/72      37239 ns      33330 ns        46813
BM_SGEMM/80      57350 ns      55949 ns        32251
BM_SGEMM/90      36275 ns      36249 ns        42259
BM_SGEMM/100     31111 ns      31008 ns        45270
BM_SGEMM/112     43782 ns      40912 ns        34749
BM_SGEMM/128     67375 ns      64406 ns        22443
BM_SGEMM/140     76389 ns      67003 ns        21430
BM_SGEMM/150     72952 ns      71830 ns        19793
BM_SGEMM/160     97039 ns      96858 ns        11498
BM_SGEMM/170    123272 ns     122007 ns        11855
BM_SGEMM/180    126828 ns     126505 ns        11567
BM_SGEMM/189    115179 ns     114665 ns        11044
BM_SGEMM/200     89289 ns      87259 ns        16147
BM_SGEMM/256    226252 ns     222677 ns         7375
```

I've also tested this with ThreadSanitizer and found no data races during execution. I'm not sure why 200 is always faster than its neighbors; we must be hitting some optimal cache size or something.
1 parent ed682a4 commit bf40f80

File tree

1 file changed (+43, -156 lines)

driver/others/memory.c

Lines changed: 43 additions & 156 deletions
```diff
@@ -13,9 +13,9 @@ modification, are permitted provided that the following conditions are
     notice, this list of conditions and the following disclaimer in
     the documentation and/or other materials provided with the
     distribution.
-  3. Neither the name of the OpenBLAS project nor the names of
-     its contributors may be used to endorse or promote products
-     derived from this software without specific prior written
+  3. Neither the name of the OpenBLAS project nor the names of
+     its contributors may be used to endorse or promote products
+     derived from this software without specific prior written
     permission.
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
@@ -139,6 +139,14 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 #define FIXED_PAGESIZE 4096
 #endif
 
+#ifndef BUFFERS_PER_THREAD
+#ifdef USE_OPENMP
+#define BUFFERS_PER_THREAD (MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER)
+#else
+#define BUFFERS_PER_THREAD NUM_BUFFERS
+#endif
+#endif
+
 #define BITMASK(a, b, c) ((((a) >> (b)) & (c)))
 
 #if defined(_MSC_VER) && !defined(__clang__)
@@ -213,7 +221,7 @@ int i,n;
   ret = sched_getaffinity(0,size,cpusetp);
   if (ret!=0) return nums;
   ret = CPU_COUNT_S(size,cpusetp);
-  if (ret > 0 && ret < nums) nums = ret;
+  if (ret > 0 && ret < nums) nums = ret;
   CPU_FREE(cpusetp);
   return nums;
 #endif
@@ -415,8 +423,15 @@ struct release_t {
 
 int hugetlb_allocated = 0;
 
-static struct release_t release_info[NUM_BUFFERS];
-static int release_pos = 0;
+#if defined(OS_WINDOWS)
+#define THREAD_LOCAL __declspec(thread)
+#define UNLIKELY_TO_BE_ZERO(x) (x)
+#else
+#define THREAD_LOCAL __thread
+#define UNLIKELY_TO_BE_ZERO(x) (__builtin_expect(x, 0))
+#endif
+static struct release_t THREAD_LOCAL release_info[BUFFERS_PER_THREAD];
+static int THREAD_LOCAL release_pos = 0;
 
 #if defined(OS_LINUX) && !defined(NO_WARMUP)
 static int hot_alloc = 0;
```
```diff
@@ -459,15 +474,9 @@ static void *alloc_mmap(void *address){
   }
 
   if (map_address != (void *)-1) {
-#if defined(SMP) && !defined(USE_OPENMP)
-    LOCK_COMMAND(&alloc_lock);
-#endif
     release_info[release_pos].address = map_address;
     release_info[release_pos].func = alloc_mmap_free;
     release_pos ++;
-#if defined(SMP) && !defined(USE_OPENMP)
-    UNLOCK_COMMAND(&alloc_lock);
-#endif
   }
 
 #ifdef OS_LINUX
@@ -611,15 +620,9 @@ static void *alloc_mmap(void *address){
 #endif
 
   if (map_address != (void *)-1) {
-#if defined(SMP) && !defined(USE_OPENMP)
-    LOCK_COMMAND(&alloc_lock);
-#endif
     release_info[release_pos].address = map_address;
     release_info[release_pos].func = alloc_mmap_free;
     release_pos ++;
-#if defined(SMP) && !defined(USE_OPENMP)
-    UNLOCK_COMMAND(&alloc_lock);
-#endif
   }
 
   return map_address;
@@ -872,7 +875,7 @@ static void *alloc_hugetlb(void *address){
 
     tp.PrivilegeCount = 1;
     tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
-
+
     if (LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &tp.Privileges[0].Luid) != TRUE) {
       CloseHandle(hToken);
       return (void*)-1;
@@ -961,20 +964,17 @@ static BLASULONG base_address = 0UL;
 static BLASULONG base_address = BASE_ADDRESS;
 #endif
 
-static volatile struct {
-  BLASULONG lock;
+struct memory_t {
   void *addr;
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-  int pos;
-#endif
   int used;
 #ifndef __64BIT__
   char dummy[48];
 #else
   char dummy[40];
 #endif
+};
 
-} memory[NUM_BUFFERS];
+static struct memory_t THREAD_LOCAL memory[BUFFERS_PER_THREAD];
 
 static int memory_initialized = 0;
 
@@ -987,9 +987,6 @@ static int memory_initialized = 0;
 void *blas_memory_alloc(int procpos){
 
   int position;
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-  int mypos;
-#endif
 
   void *map_address;
 
@@ -1020,102 +1017,48 @@ void *blas_memory_alloc(int procpos){
   };
   void *(**func)(void *address);
 
-#if defined(USE_OPENMP)
-  if (!memory_initialized) {
-#endif
-
-  LOCK_COMMAND(&alloc_lock);
+  if (UNLIKELY_TO_BE_ZERO(memory_initialized)) {
 
-  if (!memory_initialized) {
+    /* Only allow a single thread to initialize memory system */
+    LOCK_COMMAND(&alloc_lock);
 
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-    for (position = 0; position < NUM_BUFFERS; position ++){
-      memory[position].addr = (void *)0;
-      memory[position].pos = -1;
-      memory[position].used = 0;
-      memory[position].lock = 0;
-    }
-#endif
+    if (!memory_initialized) {
 
 #ifdef DYNAMIC_ARCH
-    gotoblas_dynamic_init();
+      gotoblas_dynamic_init();
 #endif
 
 #if defined(SMP) && defined(OS_LINUX) && !defined(NO_AFFINITY)
-    gotoblas_affinity_init();
+      gotoblas_affinity_init();
 #endif
 
 #ifdef SMP
-    if (!blas_num_threads) blas_cpu_number = blas_get_cpu_number();
+      if (!blas_num_threads) blas_cpu_number = blas_get_cpu_number();
 #endif
 
 #if defined(ARCH_X86) || defined(ARCH_X86_64) || defined(ARCH_IA64) || defined(ARCH_MIPS64) || defined(ARCH_ARM64)
 #ifndef DYNAMIC_ARCH
-    blas_set_parameter();
+      blas_set_parameter();
 #endif
 #endif
 
-    memory_initialized = 1;
+      memory_initialized = 1;
 
+    }
+    UNLOCK_COMMAND(&alloc_lock);
   }
-  UNLOCK_COMMAND(&alloc_lock);
-#if defined(USE_OPENMP)
-  }
-#endif
 
 #ifdef DEBUG
   printf("Alloc Start ...\n");
-#endif
-
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-
-  mypos = WhereAmI();
-
-  position = mypos;
-  while (position >= NUM_BUFFERS) position >>= 1;
-
-  do {
-    if (!memory[position].used && (memory[position].pos == mypos)) {
-#if defined(SMP) && !defined(USE_OPENMP)
-      LOCK_COMMAND(&alloc_lock);
-#else
-      blas_lock(&memory[position].lock);
-#endif
-      if (!memory[position].used) goto allocation;
-#if defined(SMP) && !defined(USE_OPENMP)
-      UNLOCK_COMMAND(&alloc_lock);
-#else
-      blas_unlock(&memory[position].lock);
-#endif
-    }
-
-    position ++;
-
-  } while (position < NUM_BUFFERS);
-
 #endif
 
   position = 0;
 
   do {
-#if defined(SMP) && !defined(USE_OPENMP)
-    LOCK_COMMAND(&alloc_lock);
-#else
-    if (!memory[position].used) {
-      blas_lock(&memory[position].lock);
-#endif
     if (!memory[position].used) goto allocation;
-#if defined(SMP) && !defined(USE_OPENMP)
-    UNLOCK_COMMAND(&alloc_lock);
-#else
-      blas_unlock(&memory[position].lock);
-    }
-#endif
 
     position ++;
 
-  } while (position < NUM_BUFFERS);
+  } while (position < BUFFERS_PER_THREAD);
 
   goto error;
```
```diff
@@ -1126,11 +1069,6 @@ void *blas_memory_alloc(int procpos){
 #endif
 
   memory[position].used = 1;
-#if defined(SMP) && !defined(USE_OPENMP)
-  UNLOCK_COMMAND(&alloc_lock);
-#else
-  blas_unlock(&memory[position].lock);
-#endif
 
   if (!memory[position].addr) {
     do {
@@ -1148,14 +1086,14 @@ void *blas_memory_alloc(int procpos){
 
 #ifdef ALLOC_DEVICEDRIVER
     if ((*func == alloc_devicedirver) && (map_address == (void *)-1)) {
-      fprintf(stderr, "OpenBLAS Warning ... Physically contigous allocation was failed.\n");
+      fprintf(stderr, "OpenBLAS Warning ... Physically contiguous allocation failed.\n");
     }
 #endif
 
 #ifdef ALLOC_HUGETLBFILE
     if ((*func == alloc_hugetlbfile) && (map_address == (void *)-1)) {
 #ifndef OS_WINDOWS
-      fprintf(stderr, "OpenBLAS Warning ... HugeTLB(File) allocation was failed.\n");
+      fprintf(stderr, "OpenBLAS Warning ... HugeTLB(File) allocation failed.\n");
 #endif
     }
 #endif
@@ -1176,44 +1114,13 @@ void *blas_memory_alloc(int procpos){
 
     } while ((BLASLONG)map_address == -1);
 
-#if defined(SMP) && !defined(USE_OPENMP)
-    LOCK_COMMAND(&alloc_lock);
-#endif
     memory[position].addr = map_address;
-#if defined(SMP) && !defined(USE_OPENMP)
-    UNLOCK_COMMAND(&alloc_lock);
-#endif
 
 #ifdef DEBUG
     printf(" Mapping Succeeded. %p(%d)\n", (void *)memory[position].addr, position);
 #endif
   }
 
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-
-  if (memory[position].pos == -1) memory[position].pos = mypos;
-
-#endif
-
-#ifdef DYNAMIC_ARCH
-
-  if (memory_initialized == 1) {
-
-    LOCK_COMMAND(&alloc_lock);
-
-    if (memory_initialized == 1) {
-
-      if (!gotoblas) gotoblas_dynamic_init();
-
-      memory_initialized = 2;
-    }
-
-    UNLOCK_COMMAND(&alloc_lock);
-
-  }
-#endif
-
 #ifdef DEBUG
   printf("Mapped : %p %3d\n\n",
          (void *)memory[position].addr, position);
@@ -1222,7 +1129,7 @@ void *blas_memory_alloc(int procpos){
   return (void *)memory[position].addr;
 
  error:
-  printf("BLAS : Program is Terminated. Because you tried to allocate too many memory regions.\n");
+  printf("OpenBLAS : Program will terminate because you tried to allocate too many memory regions.\n");
 
   return NULL;
 }
@@ -1236,10 +1143,7 @@ void blas_memory_free(void *free_area){
 #endif
 
   position = 0;
-#if defined(SMP) && !defined(USE_OPENMP)
-  LOCK_COMMAND(&alloc_lock);
-#endif
-  while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
+  while ((position < BUFFERS_PER_THREAD) && (memory[position].addr != free_area))
     position++;
 
   if (memory[position].addr != free_area) goto error;
@@ -1248,13 +1152,7 @@ void blas_memory_free(void *free_area){
   printf(" Position : %d\n", position);
 #endif
 
-  // arm: ensure all writes are finished before other thread takes this memory
-  WMB;
-
   memory[position].used = 0;
-#if defined(SMP) && !defined(USE_OPENMP)
-  UNLOCK_COMMAND(&alloc_lock);
-#endif
 
 #ifdef DEBUG
   printf("Unmap Succeeded.\n\n");
@@ -1266,11 +1164,8 @@ void blas_memory_free(void *free_area){
   printf("BLAS : Bad memory unallocation! : %4d %p\n", position, free_area);
 
 #ifdef DEBUG
-  for (position = 0; position < NUM_BUFFERS; position++)
+  for (position = 0; position < BUFFERS_PER_THREAD; position++)
     printf("%4ld %p : %d\n", position, memory[position].addr, memory[position].used);
-#endif
-#if defined(SMP) && !defined(USE_OPENMP)
-  UNLOCK_COMMAND(&alloc_lock);
 #endif
   return;
 }
@@ -1293,8 +1188,6 @@ void blas_shutdown(void){
   BLASFUNC(blas_thread_shutdown)();
 #endif
 
-  LOCK_COMMAND(&alloc_lock);
-
   for (pos = 0; pos < release_pos; pos ++) {
     release_info[pos].func(&release_info[pos]);
   }
@@ -1305,17 +1198,11 @@ void blas_shutdown(void){
   base_address = BASE_ADDRESS;
 #endif
 
-  for (pos = 0; pos < NUM_BUFFERS; pos ++){
+  for (pos = 0; pos < BUFFERS_PER_THREAD; pos ++){
     memory[pos].addr = (void *)0;
     memory[pos].used = 0;
-#if defined(WHEREAMI) && !defined(USE_OPENMP)
-    memory[pos].pos = -1;
-#endif
-    memory[pos].lock = 0;
   }
 
-  UNLOCK_COMMAND(&alloc_lock);
-
   return;
 }
```