libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member.
	(nvptx_open_device): Remove async_engines query, last used in
	r10-304-g1f4c5b9b; add unified-address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.
This commit is contained in:
Tobias Burnus
2022-10-24 16:58:43 +02:00
parent a096036589
commit 131d18e928
11 changed files with 247 additions and 17 deletions
+3
View File
@@ -77,6 +77,7 @@ typedef enum {
CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
} CUdevice_attribute;
@@ -113,6 +114,7 @@ enum {
#define CU_LAUNCH_PARAM_END ((void *) 0)
#define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
#define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
enum {
CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
CUresult cuMemAlloc (CUdeviceptr *, size_t);
#define cuMemAllocHost cuMemAllocHost_v2
CUresult cuMemAllocHost (void **, size_t);
CUresult cuMemHostAlloc (void **, size_t, unsigned int);
CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
#define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);