affinity/cpp-20/d0796r2.md
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions. These include:

* How to represent, identify and navigate the topology of execution and memory resources available within a heterogeneous or distributed system.
* How to query and measure the relative affinity between execution and memory resources within a system.
* How to bind execution to particular execution resource(s).
* How to bind allocation to particular memory resource(s).
* What kind and level of interface(s) should be provided by C++ for affinity.
Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems.
### Querying and representing the system topology

The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources* and *memory resources*.
The capability of querying the underlying *execution resources* of a given *system* is particularly important for supporting affinity control in C++. The current proposal for executors [[22]][p0443r4] leaves the *execution resource* largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define them. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In that paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete, and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.

Two important considerations when defining a unified interface for querying the *resource topology* of a *system* are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources* and *memory resources*. As both the level of abstraction of resources and the granularity at which they are described will vary greatly from one implementation to another, it's important for the interface to be generic enough to support any level of abstraction. To achieve this we propose generic hierarchical structures for *execution resources* and *memory resources*, where each *resource* is composed of other *resources* recursively.

Each *execution resource* within its hierarchy can be used to place execution (i.e. bind an execution to an *execution resource*).

Each *memory resource* within its hierarchy can be used to place memory (i.e., allocate memory within the *memory resource*).

For example, a NUMA system will likely have a hierarchy of nodes, each capable of placing memory and placing execution. A CPU + GPU system may have GPU local memory regions capable of placing memory, but not capable of placing execution.

Nowadays, there are various APIs and libraries that enable this functionality. One of the most commonly used is [Portable Hardware Locality (hwloc)][hwloc]. Hwloc presents the execution and memory hardware as a single tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath; placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [[21]][lstopo] shows more interesting examples of topologies that appear on today's systems.

*(figure: lstopo output for a 2-socket Xeon E5300 server)*

The `thread_execution_resource_t` interface proposed in the execution context proposal [[23]][p0737r0] takes a hierarchical approach, where there is a root resource and each resource has a number of child resources.

In some heterogeneous systems, execution and memory resources are not naturally represented by a single tree [[24]][exposing-locality]. The HSA standard solves this problem by allowing a node in the topology to have multiple parent nodes [19].

The interface for querying the *resource topology* of a *system* must be flexible enough to allow querying the *execution resources* and *memory resources* available to the entire system, querying the affinity between an *execution resource* and a *memory resource*, querying the *execution resource* associated with an *execution context*, and constructing an *execution context* for a particular *execution resource*. This is important, as many standards such as OpenCL [[6]][opencl-2-2] and HSA [[7]][hsa] require the ability to query the *resource topology* available in a *system* before constructing an *execution context* for executing work.
> For example, an implementation may provide an execution context for a particular execution resource such as a static thread pool or a GPU context for a particular GPU device, or an implementation may provide a more generic execution context which can be constructed from a number of CPU and GPU devices queryable through the system resource topology.

### Dynamic resource discovery & fault tolerance: currently out of scope

In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
### Querying the relative affinity of partitions

In order to make decisions about where to place execution or allocate memory in a given *system's resource topology*, it is important to understand the concept of affinity between an *execution resource* and a *memory resource*. This is usually expressed in terms of latency between these resources. Distance does not need to be symmetric in all architectures.

The relative position of two components in the topology does not necessarily indicate their affinity. For example, two cores from two different CPU sockets may have the same latency to access the same NUMA memory node.
traditional *context of execution* usage that refers
to the state of a single executing callable; *e.g.*,
program counter, registers, stack frame. *--end note*]

The **concurrency** of an execution resource is an upper bound on the
number of execution agents that could make concurrent forward progress
on that execution resource.
It is guaranteed that no more than **concurrency** execution agents
could make concurrent forward progress; it is not guaranteed that
**concurrency** execution agents will ever make concurrent forward progress.

### Execution resources
### Querying relative affinity

The `affinity_query` class template provides an abstraction for a relative affinity value between an execution resource and a memory resource, derived from a particular `affinity_operation` and `affinity_metric`. The `affinity_query` is templated by `affinity_operation` and `affinity_metric` and is constructed from an execution resource and a memory resource. An `affinity_query` is not meaningful on its own; instead, users compare two queries with comparison operators to obtain a relative magnitude of affinity. If necessary, the value of an `affinity_query` can also be retrieved through `native_affinity`, though its return value is implementation defined.

Below *(listing 3)* is an example of how to query the relative affinity between an execution resource and a memory resource.