Intel® DevCloud for oneAPI SYCL programming project practice

Problem statement

Hardware and software environment used in the experiment

This experiment uses the oneAPI C++ compiler, which supports the SYCL programming model. Through the Intel DevCloud for oneAPI service, the CPU serves as the host (Host) and the GPU as the device (Device), so the assignment can be completed without installing any additional environment.

Output GPU device name via queue q:
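A minimal sketch of this step (assuming a queue constructed with the GPU selector, as in the experiments below):

#include <iostream>
#include <CL/sycl.hpp>
using namespace sycl;
int main() {
  // Select the GPU as the device for queue q
  queue q(gpu_selector_v);
  // Query the device bound to q and print its name
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}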

Intel® DevCloud programming environment registration

Register and log in to your Intel account

Navigate to the location shown in the picture and scroll down to the bottom

Connect and enter the JupyterLab environment

Enter the directory pointed to by the arrow

In this directory you can study the official documentation related to SYCL and oneAPI

This experiment uses the "Lab Exercise: Vector Add" section of SYCL_Program_Structure.ipynb in the oneAPI_Essentials/02_SYCL_Program_Structure/ directory. You can write and run your own code directly in this location without any additional installation steps.

How to obtain the running results

Fill in the code in the Vector Add cell described above

Run the cell

At this point the code is saved to a .cpp file in another location:
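In the oneAPI_Essentials notebooks this saving is typically done by a Jupyter %%writefile magic on the first line of the cell; a sketch (the exact file name and path may differ):

%%writefile lab/vector_add.cpp
// ... the SYCL vector-add code written in the cell goes here ...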

Then run the following cell to compile and execute the saved file
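That cell submits a shell script that compiles and runs the saved source on a compute node. Conceptually, the compile-and-run step amounts to invoking the oneAPI DPC++/C++ compiler with SYCL enabled (the driver is icpx -fsycl; older versions use dpcpp; the file name follows the %%writefile sketch above and is illustrative):

icpx -fsycl lab/vector_add.cpp -o vector_add
./vector_add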

Run matrix multiplication on the Dev-C++ platform

The code is as follows:

#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;
// Calculation on the CPU
double cpu_kernel(std::vector<float> &A, std::vector<float> &B, std::vector<float> &C, int M, int N, int K) {
    double duration = 0.0;
    std::chrono::high_resolution_clock::time_point s, e;
    s = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < M; i++) {
        for(int j = 0; j < N; j++) {
            float sum = 0.0f;
            for(int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    e = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<float, std::milli>(e - s).count();
    return duration;
}
void compute(const int M, const int N, const int K, const int block_size, const int iterations) {
  // matrix A: (M*K), matrix B: (K*N), matrix C = A*B: (M*N); here all are 1024*1024
  cout << "C matrix size: c(" << M << "," << N << ") = "
       << "A matrix a(" << M << "," << K << ") * "
       << "B matrix b(" << K << "," << N << ")\n";

  std::vector<float> A(M*K);
  std::vector<float> B(K*N);
  std::vector<float> C(M*N);
  for(int i = 0; i < M * K; i++) {
      A[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < K * N; i++) {
      B[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < M * N; i++) {
      C[i] = 0.0f;
  }
  // CPU average execution time
  double duration_cpu = 0.0;
  int warmup = 2;
  for(int run = 0; run < iterations + warmup; run++) {
      double duration = cpu_kernel(A, B, C, M, N, K);
      if(run >= warmup) duration_cpu += duration;
  }
  duration_cpu = duration_cpu / iterations;
  // Run time output
  printf("\nCPU calculation time: %lf (ms)\n", duration_cpu);
}
int main() {
  compute(1024, 1024, 1024, 4, 5);
  return 0;
}

Run result

Use buffers to implement matrix multiplication

In the buffer model, there are two ways to retrieve results from the device (GPU) back to the host: the first is to use a host_accessor, and the second is to rely on the buffer destruction mechanism.

Buffer destruction mechanism: the buffer is created inside a separate function scope. When execution leaves that scope, the buffer's destructor is called, which relinquishes ownership of the data and copies it back to host memory.
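For comparison, here is a minimal sketch showing both retrieval methods in one toy program; the vector name data, its size, and the default-constructed queue are illustrative, not part of the experiment code:

#include <vector>
#include <CL/sycl.hpp>
using namespace sycl;
int main() {
  queue q;
  std::vector<int> data(1024, 1);
  {
    buffer buf(data);  // the buffer takes ownership of the vector's storage
    q.submit([&](handler &h) {
      accessor a(buf, h);
      h.parallel_for(range<1>(1024), [=](id<1> i) { a[i] *= 2; });
    });
    // Method 1: host_accessor blocks until the kernel finishes and
    // makes the results visible on the host
    host_accessor ha(buf);
    int first = ha[0];
    (void)first;
  }
  // Method 2: leaving the scope above destroys the buffer,
  // which copies the results back into the vector `data`
  return 0;
}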

Complete code and comments:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <CL/sycl.hpp>
using namespace std;
using namespace sycl;
// Computation on the GPU
double gpu_kernel(std::vector<float> &A, std::vector<float> &B, std::vector<float> &C, int M, int N, int K, int block_size, sycl::queue &q) {
  // Round the number of rows and columns up to a multiple of block_size
  auto grid_rows = (M + block_size - 1) / block_size * block_size;
  auto grid_cols = (N + block_size - 1) / block_size * block_size;
  // The local range determines the size of each work-group
  auto local_ndrange = range<2>(block_size, block_size);
  // The global range covers the entire computation
  auto global_ndrange = range<2>(grid_rows, grid_cols);
  // Buffers take ownership of the host vectors; when they are destroyed at the
  // end of this function, the results are copied back to the host
  buffer buf1(A);
  buffer buf2(B);
  buffer buf3(C);
  double duration = 0.0;
  auto e = q.submit([&](sycl::handler &h) {
      // Accessors
      accessor A1(buf1, h);
      accessor B1(buf2, h);
      accessor C1(buf3, h);
      h.parallel_for<class k_name_t>(
          sycl::nd_range<2>(global_ndrange, local_ndrange), [=](sycl::nd_item<2> index) {
              // Get the row and column number from the index
              int row = index.get_global_id(0);
              int col = index.get_global_id(1);
              float sum = 0.0f;
              for (int i = 0; i < K; i++) {
                sum += A1[row * K + i] * B1[i * N + col];
              }
              C1[row * N + col] = sum;
          });
    });
  e.wait();
  // Profiling info is reported in nanoseconds; convert to milliseconds
  duration += (e.get_profiling_info<info::event_profiling::command_end>() -
               e.get_profiling_info<info::event_profiling::command_start>()) / 1000.0f / 1000.0f;
  return duration;
}
// Calculation on the CPU
double cpu_kernel(std::vector<float> &A, std::vector<float> &B, std::vector<float> &C, int M, int N, int K) {
    double duration = 0.0;
    std::chrono::high_resolution_clock::time_point s, e;
    s = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < M; i++) {
        for(int j = 0; j < N; j++) {
            float sum = 0.0f;
            for(int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    e = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<float, std::milli>(e - s).count();
    return duration;
}
// Verify the deviation between CPU and GPU results
int verify(std::vector<float> &cpu_res, std::vector<float> &gpu_res, int length){
    int err = 0;
    for(int i = 0; i < length; i++) {
       if( fabs(cpu_res[i] - gpu_res[i]) > 1e-3) {
          err++;
          printf("\n%lf, %lf", cpu_res[i], gpu_res[i]);
       }
    }
    return err;
}
int compute(const int M, const int N, const int K, const int block_size, const int iterations, sycl::queue &q) {
  // matrix A: (M*K), matrix B: (K*N), matrix C = A*B: (M*N); here all are 1024*1024
  cout << "C matrix size: c(" << M << "," << N << ") = "
       << "A matrix a(" << M << "," << K << ") * "
       << "B matrix b(" << K << "," << N << ")\n";

  std::vector<float> A(M*K);
  std::vector<float> B(K*N);
  std::vector<float> C(M*N);
  std::vector<float> C2(M*N);
  for(int i = 0; i < M * K; i++) {
      A[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < K * N; i++) {
      B[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < M * N; i++) {
      C[i] = 0.0f;
      C2[i] = 0.0f;
  }
  // GPU average execution time
  double duration_gpu = 0.0;
  // CPU average execution time
  double duration_cpu = 0.0;
  // Number of "warm-up" runs: 20 runs in total, but only runs 11-20 count toward the average
  int warmup = 10;
  for (int run = 0; run < iterations + warmup; run++) {
    double duration = gpu_kernel(A, B, C, M, N, K, block_size, q);
    if(run >= warmup) duration_gpu += duration;
  }
  // Average GPU running time
  duration_gpu = duration_gpu / iterations;
  // Average CPU running time; the CPU is slow, so it runs half as many iterations
  warmup = 2;
  for(int run = 0; run < iterations/2 + warmup; run++) {
      double duration = cpu_kernel(A, B, C2, M, N, K);
      if(run >= warmup) duration_cpu += duration;
  }
  duration_cpu = duration_cpu / (iterations/2);
  // Verify the results
  int errCode = verify(C2, C, M*N);
  printf("\nThere are %d errors in total\n", errCode);
  // Run time output
  printf("\nGPU calculation time: %lf (ms)\nCPU calculation time: %lf (ms)\n", duration_gpu, duration_cpu);
  return errCode;
}
int main() {
  // enable_profiling() is a SYCL queue property that enables profiling on the queue
  auto propList = sycl::property_list{sycl::property::queue::enable_profiling()};
  // Queue q uses the GPU device selector
  queue q(sycl::gpu_selector_v, propList);
  int errCode = compute(1024, 1024, 1024, 4, 10, q);
  return errCode;
}

Code explanation

In addition to the main function, this code contains four functions: gpu_kernel, cpu_kernel, verify, and compute.

Among them, gpu_kernel performs the matrix multiplication on the GPU and measures its running time; cpu_kernel does the same on the CPU; verify compares the GPU and CPU results; and compute randomly initializes the matrices, calls the other three functions, and prints the results.

The compute function uses two parameters, iterations and warmup.

iterations is the number of runs over which the average GPU/CPU time is computed, while warmup provides a "warm-up" phase. For example, with iterations=10 and warmup=10, the GPU kernel runs 20 times in total, and only runs 11-20 are averaged.

Run result

Matrix multiplication using USM explicit data movement

Explicit data movement refers to the USM implementation based on malloc_device, in which the developer moves data between host and device explicitly using memcpy.
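As a minimal sketch of this pattern before the full code (the size N and the default-constructed queue are illustrative):

#include <vector>
#include <CL/sycl.hpp>
using namespace sycl;
int main() {
  queue q;
  constexpr int N = 1024;
  std::vector<float> host_data(N, 1.0f);
  // malloc_device: allocate memory that lives on the device
  float *dev = malloc_device<float>(N, q);
  // Explicitly copy host -> device
  q.memcpy(dev, host_data.data(), N * sizeof(float)).wait();
  // Compute on the device
  q.parallel_for(range<1>(N), [=](id<1> i) { dev[i] *= 2.0f; }).wait();
  // Explicitly copy device -> host
  q.memcpy(host_data.data(), dev, N * sizeof(float)).wait();
  free(dev, q);  // USM free takes the queue (or its context)
  return 0;
}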

The following shows the full matrix multiplication implementation using this approach:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <CL/sycl.hpp>
using namespace std;
using namespace sycl;
// Verify the deviation between CPU and GPU results
int verify(float *cpu_res, float *gpu_res, int length){
    int err = 0;
    for(int i = 0; i < length; i++) {
       if( fabs(cpu_res[i] - gpu_res[i]) > 1e-3) {
          err++;
          printf("\n%lf, %lf", cpu_res[i], gpu_res[i]);
       }
    }
    return err;
}
int main() {
  // enable_profiling() is a SYCL queue property that enables profiling on the queue
  auto propList = sycl::property_list{sycl::property::queue::enable_profiling()};
  // Queue q uses the GPU device selector
  queue q(sycl::gpu_selector_v, propList);
  int block_size = 4;
  int M = 1024;
  int N = 1024;
  int K = 1024;
  // GPU calculation
  // Round the number of rows and columns up to a multiple of block_size
  auto grid_rows = (M + block_size - 1) / block_size * block_size;
  auto grid_cols = (N + block_size - 1) / block_size * block_size;
  // The local range determines the size of each work-group
  auto local_ndrange = range<2>(block_size, block_size);
  // The global range covers the entire computation
  auto global_ndrange = range<2>(grid_rows, grid_cols);
  // Data on the host
  float *A = static_cast<float *>(malloc(M*K*sizeof(float)));
  float *B = static_cast<float *>(malloc(K*N*sizeof(float)));
  float *C = static_cast<float *>(malloc(M*N*sizeof(float)));
  float *C2 = static_cast<float *>(malloc(M*N*sizeof(float)));
  for(int i = 0; i < M * K; i++) {
      A[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < K * N; i++) {
      B[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < M * N; i++) {
      C[i] = 0.0f;
      C2[i] = 0.0f;
  }
  // malloc_device: data must be migrated explicitly between device and host
  float *A_device = malloc_device<float>(M*K, q);
  float *B_device = malloc_device<float>(K*N, q);
  float *C_device = malloc_device<float>(M*N, q);
  // Copy the data on the host to the device
  q.memcpy(A_device, A, sizeof(float)*M*K).wait();
  q.memcpy(B_device, B, sizeof(float)*K*N).wait();
  q.memcpy(C_device, C, sizeof(float)*M*N).wait();
  double duration = 0.0;
  for(int run = 0; run < 20; run++){
      std::chrono::high_resolution_clock::time_point s, e;
      s = std::chrono::high_resolution_clock::now();
      auto e1 = q.submit([&](sycl::handler &h) {
          h.parallel_for<class k_name_t>(
              sycl::nd_range<2>(global_ndrange, local_ndrange), [=](sycl::nd_item<2> index) {
                  // Get the row and column number from the index
                  int row = index.get_global_id(0);
                  int col = index.get_global_id(1);
                  float sum = 0.0f;
                  for (int i = 0; i < K; i++) {
                    sum += A_device[row * K + i] * B_device[i * N + col];
                  }
                  C_device[row * N + col] = sum;
              });
        });
      e1.wait();
      // Copy the computed C matrix from the device back to C on the host
      q.memcpy(C, C_device, sizeof(float)*M*N).wait();
      e = std::chrono::high_resolution_clock::now();
      if(run >= 10) duration += std::chrono::duration<float, std::milli>(e - s).count();
  }
  printf("\nGPU calculation time: %lf (ms)\n", duration/10.0);
  duration = 0.0;
  for(int run = 0; run < 7; run++){
      std::chrono::high_resolution_clock::time_point s, e;
      s = std::chrono::high_resolution_clock::now();
      for(int i = 0; i < M; i++) {
          for(int j = 0; j < N; j++) {
              float sum = 0.0f;
              for(int k = 0; k < K; k++) {
                  sum += A[i * K + k] * B[k * N + j];
              }
              C2[i * N + j] = sum;
          }
      }
      e = std::chrono::high_resolution_clock::now();
      if(run >= 2) duration += std::chrono::duration<float, std::milli>(e - s).count();
  }
  printf("\nCPU calculation time: %lf (ms)\n", duration/5.0);
  // Verify the results
  int errCode = verify(C2, C, M*N);
  printf("\nThere are %d errors in total\n", errCode);
  free(A_device, q);
  free(B_device, q);
  free(C_device, q);
  free(A);
  free(B);
  free(C);
  free(C2);
  return errCode;
}

Code explanation:

This code implements matrix multiplication using the USM explicit data movement scheme. It first allocates memory on the host, then uses malloc_device to allocate memory on the device. After the host matrices are randomly initialized, the queue's memcpy copies their data into device memory, the computation is submitted to the accelerator through the queue, and the result is copied back into host memory. Finally, the results are checked with verify.

This code is also "warmed up" before timing: the GPU runs 10 warm-up rounds and is then timed over 10 rounds, while the CPU runs 2 warm-up rounds and is then timed over 5 rounds; the reported times are the averages over the timed rounds.

Run result:

Use USM implicit data movement to implement matrix multiplication

USM implicit data movement refers to using malloc_host and malloc_shared for memory allocation and management, so that data migrates between host and device without explicit memcpy calls. A minimal sketch of the pattern is shown below, followed by the full matrix multiplication code.
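A sketch of malloc_shared usage, assuming a default-constructed queue and an illustrative size N:

#include <iostream>
#include <CL/sycl.hpp>
using namespace sycl;
int main() {
  queue q;
  constexpr int N = 1024;
  // malloc_shared: the allocation is accessible from both host and device;
  // the runtime migrates the data implicitly, so no memcpy is needed
  float *data = malloc_shared<float>(N, q);
  for (int i = 0; i < N; i++) data[i] = 1.0f;  // initialize directly on the host
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] *= 2.0f; }).wait();
  std::cout << data[0] << "\n";                // read back directly on the host
  free(data, q);
  return 0;
}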

#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <CL/sycl.hpp>
using namespace std;
using namespace sycl;
// Computation on the GPU
double gpu_kernel(float *A, float *B, float *C, int M, int N, int K, int block_size, sycl::queue &q) {
  // Round the number of rows and columns up to a multiple of block_size
  auto grid_rows = (M + block_size - 1) / block_size * block_size;
  auto grid_cols = (N + block_size - 1) / block_size * block_size;
  // The local range determines the size of each work-group
  auto local_ndrange = range<2>(block_size, block_size);
  // The global range covers the entire computation
  auto global_ndrange = range<2>(grid_rows, grid_cols);
  double duration = 0.0;
  auto e = q.submit([&](sycl::handler &h) {
      h.parallel_for<class k_name_t>(
          sycl::nd_range<2>(global_ndrange, local_ndrange), [=](sycl::nd_item<2> index) {
              // Get the row and column number from the index
              int row = index.get_global_id(0);
              int col = index.get_global_id(1);
              float sum = 0.0f;
              for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
              }
              C[row * N + col] = sum;
          });
    });
  e.wait();
  // Profiling info is reported in nanoseconds; convert to milliseconds
  duration += (e.get_profiling_info<info::event_profiling::command_end>() -
               e.get_profiling_info<info::event_profiling::command_start>()) / 1000.0f / 1000.0f;
  return duration;
}
// Calculation on the CPU
double cpu_kernel(float *A, float *B, float *C, int M, int N, int K) {
    double duration = 0.0;
    std::chrono::high_resolution_clock::time_point s, e;
    s = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < M; i++) {
        for(int j = 0; j < N; j++) {
            float sum = 0.0f;
            for(int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    e = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<float, std::milli>(e - s).count();
    return duration;
}
// Verify the deviation between CPU and GPU results
int verify(float *cpu_res, float *gpu_res, int length){
    int err = 0;
    for(int i = 0; i < length; i++) {
       if( fabs(cpu_res[i] - gpu_res[i]) > 1e-3) {
          err++;
          printf("\n%lf, %lf", cpu_res[i], gpu_res[i]);
       }
    }
    return err;
}
int compute(const int M, const int N, const int K, const int block_size, const int iterations, sycl::queue &q) {
  // matrix A: (M*K), matrix B: (K*N), matrix C = A*B: (M*N); here all are 1024*1024
  cout << "C matrix size: c(" << M << "," << N << ") = "
       << "A matrix a(" << M << "," << K << ") * "
       << "B matrix b(" << K << "," << N << ")\n";
  // malloc_shared: data is migrated implicitly between device and host
  auto A = malloc_shared<float>(M * K, q);
  auto B = malloc_shared<float>(K * N, q);
  auto C = malloc_shared<float>(M * N, q);
  // malloc_host: data is allocated on the host
  auto C2 = malloc_host<float>(M * N, q);
  for(int i = 0; i < M * K; i++) {
      A[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < K * N; i++) {
      B[i] = rand() / double(RAND_MAX);
  }
  for(int i = 0; i < M * N; i++) {
      C[i] = 0.0f;
      C2[i] = 0.0f;
  }
  // GPU average execution time
  double duration_gpu = 0.0;
  // CPU average execution time
  double duration_cpu = 0.0;
  // Number of "warm-up" runs: 20 runs in total, but only runs 11-20 count toward the average
  int warmup = 10;
  for (int run = 0; run < iterations + warmup; run++) {
    double duration = gpu_kernel(A, B, C, M, N, K, block_size, q);
    if(run >= warmup) duration_gpu += duration;
  }
  // Average GPU running time
  duration_gpu = duration_gpu / iterations;
  // Average CPU running time; the CPU is slow, so it runs half as many iterations
  warmup = 2;
  for(int run = 0; run < iterations/2 + warmup; run++) {
      double duration = cpu_kernel(A, B, C2, M, N, K);
      if(run >= warmup) duration_cpu += duration;
  }
  duration_cpu = duration_cpu / (iterations/2);
  // Verify the results
  int errCode = verify(C2, C, M*N);
  printf("\nThere are %d errors in total\n", errCode);
  // Run time output
  printf("\nGPU calculation time: %lf (ms)\nCPU calculation time: %lf (ms)\n", duration_gpu, duration_cpu);
  // Release memory
  free(A, q);
  free(B, q);
  free(C, q);
  free(C2, q);
  return errCode;
}
int main() {
  // enable_profiling() is a SYCL queue property that enables profiling on the queue
  auto propList = sycl::property_list{sycl::property::queue::enable_profiling()};
  // Queue q uses the GPU device selector
  queue q(sycl::gpu_selector_v, propList);
  int errCode = compute(1024, 1024, 1024, 4, 10, q);
  return errCode;
}

Code explanation:

Like the first code, this program contains four functions in addition to main: gpu_kernel, cpu_kernel, verify, and compute.

As before, gpu_kernel performs the matrix multiplication on the GPU and measures its running time, cpu_kernel does the same on the CPU, verify compares the GPU and CPU results, and compute randomly initializes the matrices, calls the other three functions, and prints the results.

The difference is that this code uses malloc_shared and malloc_host to implement unified shared memory.

Run result:

Comparative analysis of the run results

Comparing the runs above shows that the DevCloud platform has a clear performance advantage. In addition, among the three DevCloud implementations, the advantages of USM can be summarized as follows:

SYCL buffers are powerful; however, replacing every pointer and array in a C++ program with buffers can burden the programmer, so in such cases you may consider unified shared memory (USM):

1. When porting C++ code to SYCL, you want to change as little code as possible

2. Use explicit USM allocation when you need to control data movement

Challenges encountered in the project, solutions and experiences

SYCL is a heterogeneous programming framework, and for a beginner who has never encountered it, understanding its concepts and programming model is a challenge. To address this, I carefully studied the narrated PPT provided by the teacher and the first four chapters of the official oneAPI documentation, which mainly cover three topics: an introduction to oneAPI, the SYCL programming framework, and SYCL's unified shared memory.

To deepen my understanding of oneAPI, I wrote four summary blog posts on CSDN based on the above material. They are:

oneAPI study notes 1: Introduction to SYCL

oneAPI study notes 2: jupyter official documentation (oneAPI_Intro) study notes

oneAPI study notes 3: jupyter official documentation (SYCL Program Structure) study notes

oneAPI study notes 4: jupyter official documentation (Unified Shared Memory) study notes

Despite this, my study of SYCL is still at an introductory stage and still faces many challenges, including but not limited to handling compatibility across different hardware, optimizing code for different devices, and the error handling specific to heterogeneous programming.

All in all, this oneAPI practice exposed me to Intel's cutting-edge technology, and I benefited greatly from it. Conducting the SYCL experiment was a challenging but fulfilling process. Along the way I improved my ability to tackle unfamiliar problems and exercised my ability to read and study official English technical documentation. I was also exposed to concepts I had never encountered before, such as high-performance computing (HPC) and heterogeneous programming. With time and practice, I hope to apply SYCL more proficiently and gain more experience from it.