
Commit 0b3feb3

front-end mirroring -> device mapping

1 parent e9ed210 commit 0b3feb3

File tree

4 files changed: +28 -23 lines


rfcs/20200624-pluggable-device-for-tensorflow.md

Lines changed: 28 additions & 23 deletions
@@ -50,7 +50,7 @@ With the RFC, existing TensorFlow GPU programs can run on a plugged device witho
This section describes the user scenarios that are supported/unsupported for PluggableDevice.

* **Supported scenario**: Single PluggableDevice registered as "GPU" device type

  In the case of installing one plugin that registers its PluggableDevice as the "GPU" device type, the default GPUDevice is overridden by the PluggableDevice when the plugin is loaded. When the user specifies the "GPU" device for ops under `with tf.device("gpu:0")`, the registered PluggableDevice will be selected to run those ops.

<div align="center">
<img src=20200624-pluggable-device-for-tensorflow/scenario1.png>
</div>
@@ -68,43 +68,49 @@ This section describes the user scenarios that are supported/unsupported for Plu
</div>

* **Non-Supported scenario**: Multiple PluggableDevices registered as the same device type

  In the case of installing multiple plugins that register their PluggableDevice as the same device type, e.g. more than one plugin registering its PluggableDevice as the "GPU" device type, these plugins' initialization will fail due to a registration conflict. Users then need to manually select which platform they want to use (either unload the conflicting plugin or reconfigure the plugin with the Python API).

<div align="center">
<img src=20200624-pluggable-device-for-tensorflow/scenario4.png>
</div>
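The two scenarios above can be sketched as a simple registry: the first plugin to claim a device type wins, and a second registration for the same type fails at initialization. This is a toy Python model for illustration only, not the actual TensorFlow plugin loader; `DeviceRegistry` and `RegistrationConflictError` are invented names.

```python
class RegistrationConflictError(Exception):
    """Raised when two plugins claim the same device type."""

class DeviceRegistry:
    """Toy model of plugin device-type registration (not the real TF loader)."""

    def __init__(self):
        self._plugins = {}  # device type (e.g. "GPU") -> owning plugin name

    def register(self, device_type, plugin_name):
        if device_type in self._plugins:
            # Non-supported scenario: a second plugin registers the same type,
            # so its initialization fails with a conflict.
            raise RegistrationConflictError(
                f"{device_type!r} already registered by "
                f"{self._plugins[device_type]!r}")
        self._plugins[device_type] = plugin_name

    def plugin_for(self, device_type):
        return self._plugins.get(device_type)

registry = DeviceRegistry()
registry.register("GPU", "x_gpu_plugin")  # supported: one plugin as "GPU"
try:
    registry.register("GPU", "other_gpu_plugin")  # conflict: init fails
except RegistrationConflictError as e:
    conflict = str(e)
```

Resolving the conflict then falls to the user, mirroring the manual unload/reconfigure step described above.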

### Device mapping mechanism

This section describes the device mapping mechanism for Python users, covering the previous user scenarios.

* **Device type && Subdevice type**

  Device type is user visible. Users can specify the device type for ops, e.g. "gpu", "xpu", "cpu". Subdevice type is also user visible; users can specify which subdevice to use for the device type (device mapping), e.g. "NVIDIA_GPU", "INTEL_GPU", "AMD_GPU".
```
>> with tf.device("/gpu:0"):
     ...
>> with tf.device("/xpu:0"):
     ...
```
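As a rough illustration of how a `tf.device` string resolves to a device type and index, here is a hypothetical parser; `parse_device` is an invented helper, and TensorFlow's real placement logic also handles full names such as `/job:localhost/replica:0/task:0/device:GPU:0`.

```python
def parse_device(spec):
    """Split a short device string like "/gpu:0" or "/xpu:1" into
    (DEVICE_TYPE, index). Illustrative sketch only."""
    name = spec.strip("/")
    dev_type, _, index = name.rpartition(":")
    return dev_type.upper(), int(index)

print(parse_device("/gpu:0"))  # -> ('GPU', 0)
print(parse_device("/xpu:1"))  # -> ('XPU', 1)
```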
* **Device mapping**

  In the case of two GPUs in the same system, e.g. an NVIDIA GPU plus an X GPU, with the X GPU plugin installed:

  * **Option 1**: override the CUDA device; only the plugged GPU device is visible

    Only the plugged GPU device is visible: the PluggableDevice (X GPU) overrides the default GPUDevice (CUDA GPU). If users want to use the CUDA GPU, they need to manually uninstall the plugin.

```
>> gpu_device = tf.config.experimental.list_physical_devices('GPU')
>> print(gpu_device)
[PhysicalDevice(name='physical_device:GPU:0', device_type='GPU', subdevice_type='X_GPU')]
>> with tf.device("/gpu:0"):
>>   ..  // place ops on PluggableDevice(X GPU)
```
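Option 1 can be modeled as the plugin's device simply replacing the default entry in the visible-device list. This is an illustrative sketch under assumed names: `apply_plugin_override` and the dict layout are invented, not part of any TensorFlow API.

```python
def apply_plugin_override(visible_gpus, plugged):
    """Toy model of Option 1: if a plugin is installed, its PluggableDevice
    overrides the default GPUDevice, so only the plugged device is visible."""
    if plugged is None:
        return list(visible_gpus)  # no plugin installed: defaults stay visible
    return [plugged]               # plugin installed: it is the only "GPU"

default = {"name": "physical_device:GPU:0", "device_type": "GPU",
           "subdevice_type": "NVIDIA_GPU"}
plugged = {"name": "physical_device:GPU:0", "device_type": "GPU",
           "subdevice_type": "X_GPU"}

with_plugin = apply_plugin_override([default], plugged)
without_plugin = apply_plugin_override([default], None)
```

Uninstalling the plugin (the `None` case) is the only way back to the CUDA device under this option, which motivates Option 2 below.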
  * **Option 2**: both visible, but the plugged GPU is the default; the user can set the device mapping

    Both the plugged GPU device and the default GPU device are visible, but only one GPU can work at a time. The plugged GPU device is enabled by default (it has higher priority); if users want to use the NVIDIA GPU, they need to call the device mapping API (`tf.config.set_subdevice_mapping()`) to switch to the CUDA device.

```
>> gpu_device = tf.config.experimental.list_physical_devices('GPU')
>> print(gpu_device)
[PhysicalDevice(name='physical_device:GPU:0', device_type='GPU', subdevice_type='X_GPU', enabled),
 PhysicalDevice(name='physical_device:GPU:0', device_type='GPU', subdevice_type='NVIDIA_GPU', disabled)]

>> tf.config.set_subdevice_mapping("NVIDIA_GPU")
>> gpu_device = tf.config.experimental.list_physical_devices('GPU')
>> print(gpu_device)
[PhysicalDevice(name='physical_device:GPU:0', device_type='GPU', subdevice_type='X_GPU', disabled),
 PhysicalDevice(name='physical_device:GPU:0', device_type='GPU', subdevice_type='NVIDIA_GPU', enabled)]

>> with tf.device("/gpu:0"):
>>   ..  // place ops on GPUDevice(NVIDIA GPU)
```
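The Option 2 behavior above can be sketched as a small mapping table in which exactly one subdevice per device type is enabled at a time. This is a toy model of the proposed `set_subdevice_mapping` semantics; the `SubdeviceMapping` class and its method names are invented for illustration.

```python
class SubdeviceMapping:
    """Toy model of Option 2: several subdevices back one device type,
    but only one is enabled at any moment."""

    def __init__(self, device_type, subdevices, default):
        self.device_type = device_type
        self.subdevices = list(subdevices)
        self.enabled = default  # plugged GPU enabled by default (higher priority)

    def set_subdevice_mapping(self, subdevice_type):
        if subdevice_type not in self.subdevices:
            raise ValueError(f"unknown subdevice {subdevice_type!r}")
        self.enabled = subdevice_type

    def list_physical_devices(self):
        # One entry per subdevice; the status flag mirrors the RFC's
        # enabled/disabled column.
        return [(self.device_type, sub,
                 "enabled" if sub == self.enabled else "disabled")
                for sub in self.subdevices]

gpu = SubdeviceMapping("GPU", ["X_GPU", "NVIDIA_GPU"], default="X_GPU")
before = gpu.list_physical_devices()
gpu.set_subdevice_mapping("NVIDIA_GPU")  # switch to the CUDA device
after = gpu.list_physical_devices()
```

Switching flips the enabled flag rather than removing a device, matching the "both visible, one active" design above.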
* **Physical device name**

  The physical device name is user visible. Users can query the physical device name (e.g. "TITAN V") for a specified device instance through [tf.config.experimental.get_device_details()](https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_device_details).
@@ -114,18 +120,17 @@ This section describes the front-end mirroring mechanism for python users, point

```
>> print(details.get('device_name'))
"TITAN_V, XXX"
```
### Device Discovery

Upon initialization of TensorFlow, it uses the platform-independent `LoadLibrary()` to load the dynamic library. The plugin library should be installed to the default plugin directory "…python_dir.../site-packages/tensorflow-plugins". The modular TensorFlow [RFC](https://github.com/tensorflow/community/pull/77) describes the process of loading plugins.

During the plugin library's initialization, TensorFlow proper calls the `SE_InitializePlugin` API (part of the StreamExecutor C API) to retrieve the necessary information from the plugin, instantiates a StreamExecutor platform ([se::platform](https://github.com/tensorflow/tensorflow/blob/cb32cf0f0160d1f582787119d0480de3ba8b9b53/tensorflow/stream_executor/platform.h#L93) class), and registers the platform with the global [se::MultiPlatformManager](https://github.com/tensorflow/tensorflow/blob/cb32cf0f0160d1f582787119d0480de3ba8b9b53/tensorflow/stream_executor/multi_platform_manager.h#L82) object. TensorFlow proper gets a device type and a subdevice type from the plugin through `SE_InitializePlugin` and then registers the `PluggableDeviceFactory` with the registered device type. The device type string will be used to access the PluggableDevice with `tf.device()` in the Python layer. The subdevice type is used for low-level specialization of the GPU device (kernel, StreamExecutor, common runtime, grappler, placer, ...). If users care whether they are running on an NVIDIA GPU or an X GPU (Intel GPU, AMD GPU, ...), they can call a Python API (such as `tf.config.list_physical_devices`) to get the subdevice type for identification. Users can also use `tf.config.get_device_details` to get the real device name (e.g. "TITAN V") for the specified device.
Plugin authors need to implement `SE_InitializePlugin` and provide the necessary information:

```cpp
void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  int32_t visible_device_count = get_plugin_device_count();

  std::string name = "X_GPU";  // StreamExecutor platform name && subdevice type
  std::string type = "GPU";    // device type

  params->params.id = plugin_id_value;
  ...
```
@@ -277,20 +282,20 @@ Plugin authors need to provide those C functions implementation defined in Strea
### PluggableDevice kernel registration

This section shows an example of kernel registration for PluggableDevice. The kernel registration and implementation API is addressed in a separate [RFC](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md).

To avoid kernel registration conflicts with existing GPU (CUDA) kernels, plugin authors need to provide a device type (such as "GPU") as well as a subdevice type (such as "X_GPU") to TensorFlow proper for kernel registration and dispatch. The device type indicates the device the kernel runs on; the subdevice type is for low-level specialization of the device.

```cpp
void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  ...
  std::string type = "GPU";  // front-end visible device type
  params->params.type = type.c_str();
  std::string name = "X_GPU";  // low-level specialization device type
  params->params.name = name.c_str();
  ...
}

void InitKernelPlugin() {
  TF_KernelBuilder* builder = TF_NewKernelBuilder(/*op_name*/"Convolution", "GPU",  // "GPU" is the device type
      "X_GPU", &Conv_Create, &Conv_Compute, &Conv_Delete);  // "X_GPU" is the subdevice type
  TF_Status* status = TF_NewStatus();
  TF_RegisterKernelBuilder(/*kernel_name*/"Convolution", builder, status);
  if (TF_GetCode(status) != TF_OK) { /* handle errors */ }
  ...
}
```
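To illustrate why dispatch needs the subdevice type at all, here is a toy Python registry keyed on (op name, device type, subdevice type). It is an invented model, not the `TF_KernelBuilder` machinery: two kernels share the "GPU" device type but never collide because the subdevice type keeps their keys distinct.

```python
kernels = {}  # (op_name, device_type, subdevice_type) -> kernel function

def register_kernel(op_name, device_type, subdevice_type, fn):
    """Register one kernel per (op, device, subdevice) key; duplicate keys
    conflict, mirroring TensorFlow's registration-conflict behavior."""
    key = (op_name, device_type, subdevice_type)
    if key in kernels:
        raise ValueError(f"kernel already registered for {key}")
    kernels[key] = fn

def dispatch(op_name, device_type, subdevice_type):
    """Look up the kernel for an op placed on a given (device, subdevice)."""
    return kernels[(op_name, device_type, subdevice_type)]

# Both target device type "GPU"; the subdevice type disambiguates them,
# so a plugin's "X_GPU" convolution never clashes with the CUDA one.
register_kernel("Convolution", "GPU", "CUDA_GPU", lambda: "cuda conv")
register_kernel("Convolution", "GPU", "X_GPU", lambda: "x-gpu conv")
```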
