affinity/cpp-20/d0796r2.md
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions. These include:

* How to represent, identify and navigate the topology of execution and memory resources available within a heterogeneous or distributed system.
* How to query and measure the relative affinity between execution and memory resources within a system.
* How to bind execution to particular execution resource(s).
* How to bind allocation to particular memory resource(s).
* What kind and level of interface(s) should be provided by C++ for affinity.
Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems.
### Querying and representing the system topology

The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources* and *memory resources*.
The capability of querying the underlying *execution resources* of a given *system* is particularly important for supporting affinity control in C++. The current proposal for executors [[22]][p0443r4] leaves the *execution resource* largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define them. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In that paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete, and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.

Two important considerations when defining a unified interface for querying the *resource topology* of a *system* are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources* and *memory resources*. As both the level of abstraction of resources and the granularity at which they are described will vary greatly from one implementation to another, it's important for the interface to be generic enough to support any level of abstraction. To achieve this we propose generic hierarchical structures for *execution resources* and *memory resources*, where each *resource* is composed of other *resources* recursively.

Each *execution resource* within its hierarchy can be used to place execution (i.e. bind an execution to an *execution resource*).

Each *memory resource* within its hierarchy can be used to place memory (i.e., allocate memory within the *memory resource*).

For example, a NUMA system will likely have a hierarchy of nodes, each capable of placing memory and placing execution. A CPU + GPU system may have GPU local memory regions capable of placing memory, but not capable of placing execution.

Nowadays, there are various APIs and libraries that enable this functionality. One of the most commonly used is [Portable Hardware Locality (hwloc)][hwloc]. Hwloc presents the execution and memory hardware as a single tree, where the root node represents the whole machine and subsequent levels represent different partitions depending on different hardware characteristics. The picture below shows the output of the hwloc visualization tool (lstopo) on a 2-socket Xeon E5300 server. Note that each socket is represented by a package in the graph. Each socket contains its own cache memories, but both share the same NUMA memory region. Note also that different I/O units are visible underneath; placement of these units with respect to memory and threads can be critical to performance. The ability to place threads and/or allocate memory appropriately on the different components of this system is an important part of the process of application development, especially as hardware architectures get more complex. The documentation of lstopo [[21]][lstopo] shows more interesting examples of topologies that appear on today's systems.

*(figure: lstopo output for a 2-socket Xeon E5300 server)*

The `thread_execution_resource_t` interface proposed in the execution context proposal [[23]][p0737r0] takes a hierarchical approach, where there is a root resource and each resource has a number of child resources.

In some heterogeneous systems, execution and memory resources are not naturally represented by a single tree [[24]][exposing-locality]. The HSA standard solves this problem by allowing a node in the topology to have multiple parent nodes [19].

The interface for querying the *resource topology* of a *system* must be flexible enough to allow querying the *execution resources* and *memory resources* available to the entire system, querying the affinity between an *execution resource* and a *memory resource*, querying the *execution resource* associated with an *execution context*, and constructing an *execution context* for a particular *execution resource*. This is important, as many standards such as OpenCL [[6]][opencl-2-2] and HSA [[7]][hsa] require the ability to query the *resource topology* available in a *system* before constructing an *execution context* for executing work.
> For example, an implementation may provide an execution context for a particular execution resource such as a static thread pool or a GPU context for a particular GPU device, or an implementation may provide a more generic execution context which can be constructed from a number of CPU and GPU devices queryable through the system resource topology.

### Dynamic resource discovery & fault tolerance: currently out of scope

In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
### Querying the relative affinity of partitions

In order to make decisions about where to place execution or allocate memory in a given *system's resource topology*, it is important to understand the concept of affinity between an *execution resource* and a *memory resource*. This is usually expressed in terms of latency between these resources. Distance does not need to be symmetric in all architectures.

The relative position of two components in the topology does not necessarily indicate their affinity. For example, two cores from two different CPU sockets may have the same latency to access the same NUMA memory node.
traditional *context of execution* usage that refers
to the state of a single executing callable; *e.g.*,
program counter, registers, stack frame. *--end note*]

The **concurrency** of an execution resource is an upper bound on the
number of execution agents that could make concurrent forward progress
on that execution resource.
It is guaranteed that no more than **concurrency** execution agents
could make concurrent forward progress; it is not guaranteed that
**concurrency** execution agents will ever make concurrent forward progress.

### Execution resources
### Querying relative affinity

The `affinity_query` class template provides an abstraction for a relative affinity value between an execution resource and a memory resource, derived from a particular `affinity_operation` and `affinity_metric`. The `affinity_query` is templated by `affinity_operation` and `affinity_metric` and is constructed from an execution resource and a memory resource. An `affinity_query` is not meaningful on its own; instead, users compare two queries with comparison operators to obtain a relative magnitude of affinity. If necessary, the value of an `affinity_query` can also be retrieved through `native_affinity`, though its return value is implementation defined.

Below *(listing 3)* is an example of how to query the relative affinity between an execution resource and a memory resource.