Intel oneAPI Experiment: Matrix Multiplication Implementation

Introduction to oneAPI

oneAPI is a cross-platform, open and unified programming model designed to simplify and accelerate heterogeneous computing. It is led by Intel Corporation and supported by a wide range of partners. The goal of oneAPI is to provide a unified programming environment that enables developers to write high-performance applications on different hardware architectures, including CPUs, GPUs, FPGAs, and other accelerators.

oneAPI adopts a standards-based programming model and supports multiple languages, centered on C++ with SYCL (a Khronos standard for single-source heterogeneous programming) as implemented by DPC++ (Data Parallel C++), Intel's SYCL compiler. These provide rich parallel programming capabilities, allowing developers to effectively leverage different types of hardware to accelerate applications.

The core of oneAPI is its programming model and a set of open APIs (application programming interfaces) that cover various computing fields, including data parallelism, graph computing, machine learning, and deep learning. By using these APIs, developers can take advantage of the parallel computing capabilities of hardware to improve application performance and efficiency.

In addition to the programming model and API, oneAPI also provides a series of tools and libraries for optimizing and debugging applications. These tools and libraries can help developers analyze and optimize code to fully utilize the potential of the hardware.

In summary, oneAPI is a cross-platform programming model designed to simplify and accelerate heterogeneous computing: a unified programming environment plus an open set of APIs that let developers exploit different types of hardware. With oneAPI, developers can write high-performance parallel applications more easily and make full use of the available computing power.

Software and hardware environment used in the experiment

The experiment uses the Intel oneAPI Developer Cloud service, which provides direct access to the CPU and GPU hardware on the Developer Cloud platform.

Experimental task

  • Log in to the SYCL programming environment:
    1. Enter oneapi/essential
    2. Enter 02-SYCL-Program-Structure
  • Technology stack
  • Buffer:
    In oneAPI (SYCL), a buffer is a container that encapsulates data shared between the host and one or more devices. It provides an abstract view of memory: the runtime decides where the data actually lives and copies it between host and device as needed. A program never touches a buffer's storage directly; all reads and writes go through accessors, which lets the runtime track data dependencies.
  • Queue:
    A queue connects a program to a single device and is used to submit work to it. In oneAPI, queues manage asynchronous operations and parallel tasks: each call to submit() enqueues a command group, and the runtime schedules the groups on the device while respecting the dependencies between them, ensuring that tasks execute in a correct order.
  • Handler (command-group handler):
    The handler is the object passed into the lambda given to queue::submit(). It represents one command group and is used to configure it: creating accessors, launching kernels with parallel_for or single_task, and issuing memory operations. All device work in a command group is expressed through the handler.
  • Accessor:
    An accessor is the tool for reading and writing buffer data. In oneAPI, accessors declare how a kernel (or the host) will use a buffer: read_only, write_only, or read_write. They provide the indexing operators for element access and, at the same time, tell the runtime which data each kernel needs, so data movement and kernel ordering are handled automatically without worrying about the underlying memory layout. The sketch after this list shows how the four pieces fit together.
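A minimal vector-addition sketch showing how buffer, queue, handler, and accessor cooperate (illustrative only, not part of the assignment code; the default device selector is used):

#include <sycl/sycl.hpp>
#include <vector>
using namespace sycl;

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
    {
        // Buffers wrap the host data so the runtime can manage it
        buffer<float> A(a.data(), range<1>(1024));
        buffer<float> B(b.data(), range<1>(1024));
        buffer<float> C(c.data(), range<1>(1024));

        queue q;                           // queue: connection to a device
        q.submit([&](handler &h) {         // handler: one command group
            // Accessors declare how the kernel uses each buffer
            accessor accA(A, h, read_only);
            accessor accB(B, h, read_only);
            accessor accC(C, h, write_only);
            h.parallel_for(range<1>(1024),
                           [=](id<1> i) { accC[i] = accA[i] + accB[i]; });
        });
    }   // buffer destructors wait for the kernel and copy results back into c
    return 0;
}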

Task 2:

Follow these steps to write a SYCL program that multiplies two matrices:

1. Construct two sequences (vectors) of floating-point numbers, each of size N (e.g. N = 1024*1024), holding the input matrix values, and initialize them with random values.
2. Use buffers and accessors to allocate the matrix memory on the device (GPU).
3. Run a SYCL kernel that performs the multiplication of the two matrices in parallel, using the SYCL nd_range concept to define the kernel's execution range.
4. Use the SYCL queuing system to run the kernel on the device.
5. After the kernel has finished running, use an accessor to retrieve the results from the device (GPU) back to the host.

Verify that the program is correct by comparing its output against a reference result computed on the host.
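The listings below print the result matrix but do not include the comparison itself; a minimal host-side check could look like the following sketch (it assumes the flat N*N storage used in the code below and needs <vector> and <cmath>):

// Naive O(N^3) reference multiplication on the host, compared element-wise
bool verify(const std::vector<float> &m1, const std::vector<float> &m2,
            const std::vector<float> &result, size_t N) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float ref = 0.0f;
            for (size_t k = 0; k < N; k++)
                ref += m1[i * N + k] * m2[k * N + j];
            if (std::abs(result[i * N + j] - ref) > 1e-3f)
                return false;   // mismatch found
        }
    }
    return true;   // all N*N elements match
}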

Code explanation:

Part 1:

The program simply reuses the CustomDeviceSelector given in the example code, which assigns a rating to each device based on its type and name. The rating rules are as follows:

If the device is a GPU and its name contains the requested vendor string (passed to the constructor), the rating is 3.

If the device is a GPU but its name does not contain the vendor string, the rating is 2.

If the device is a CPU, the rating is 1.

class CustomDeviceSelector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) {
    int device_rating = 0;
    // GPU whose name contains the requested vendor string gets the top rating
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=
                         std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  }

 private:
  std::string vendorName_;
};
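The selector is used by passing it to a queue; the device with the highest rating is picked. A short usage sketch (assuming the usual includes):

CustomDeviceSelector selector("Intel");
queue q(selector);
std::cout << "Selected device: "
          << q.get_device().get_info<info::device::name>() << "\n";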

Part 2: Create three 1024*1024 float matrices. Because a vector<vector<float>> is not contiguous in memory (so it cannot back a SYCL buffer directly), each matrix is stored as a single flat vector of N*N elements, indexed as [i * N + j]. matrix1 and matrix2 are initialized with random values in the range 0 to 2.

//Matrices stored as flat N*N vectors (contiguous, so they can back SYCL buffers)
    vector<float> matrix1(N * N);
    vector<float> matrix2(N * N);
    vector<float> matrix3(N * N);

    //initialization
    random_device rd;
    mt19937 rng(rd());
    uniform_int_distribution<int> dist(0, 2);

    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            matrix1[i * N + j] = dist(rng);
            matrix2[i * N + j] = dist(rng);
        }
    }
    

Part 3: Create buffers for transferring data between the host and the device, define the string vendor_name with the desired device vendor, create three accessors, and use h.parallel_for to traverse the two-dimensional range {N, N} in parallel with a work-group size of {64, 64}. Each work-item does the following: 1. Get its global ID and use it as the matrix index. 2. Accumulate the dot product of the corresponding row of M1 and column of M2 in a loop and store the result in the output matrix M3.

//Create buffers
    buffer<float, 2> Matrix1_buffer(matrix1.data(), range<2>(N, N));
    buffer<float, 2> Matrix2_buffer(matrix2.data(), range<2>(N, N));
    buffer<float, 2> Output_buffer(matrix3.data(), range<2>(N, N));

    queue q(selector);
    q.submit([&](handler &h) {
      //# Create accessors for buffers
      accessor M1(Matrix1_buffer, h, read_only);
      accessor M2(Matrix2_buffer, h, read_only);
      accessor M3(Output_buffer, h, write_only);

      h.parallel_for(nd_range<2>({N, N}, {64, 64}), [=](nd_item<2> item) {
          //# Multiplication: dot product of one row of M1 and one column of M2
          size_t row = item.get_global_id(0);
          size_t col = item.get_global_id(1);
          float sum = 0.0f;
          for (size_t k = 0; k < N; ++k) {
              sum += M1[row][k] * M2[k][col];
          }
          M3[row][col] = sum;  // single write keeps write_only access valid
        });
    });

Source code

Task 2

%%writefile lab/vector_add.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
#include <random>

using namespace sycl;
using namespace std;

class CustomDeviceSelector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) {
    int device_rating = 0;
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=
                         std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  }

 private:
  std::string vendorName_;
};

int main() {
    const int N = 1024;
    
    //Matrices stored as flat N*N vectors (contiguous memory for SYCL buffers)
    vector<float> matrix1(N * N);
    vector<float> matrix2(N * N);
    vector<float> matrix3(N * N);

    //initialization
    random_device rd;
    mt19937 rng(rd());
    uniform_int_distribution<int> dist(0, 2);

    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            matrix1[i * N + j] = dist(rng);
            matrix2[i * N + j] = dist(rng);
        }
    }
    
    cout<<"\
matrix1:\
";
     for (size_t i=0;i Matrix1_buffer(reinterpret_cast(matrix1.data()), range<2>(N, N));
    buffer Matrix2_buffer(reinterpret_cast(matrix2.data()), range<2>(N, N));
    buffer Output_buffer(reinterpret_cast(matrix3.data()), range<2>(N, N));


    //Choose device
    std::string vendor_name = "Intel";
    CustomDeviceSelector selector(vendor_name);
    

    //Submit task to multiply matrices
    queue q(selector);
    q.submit([&](handler &h) {
      //# Create accessors for buffers
      accessor M1(Matrix1_buffer, h, read_only);
      accessor M2(Matrix2_buffer, h, read_only);
      accessor M3(Output_buffer, h, write_only);

      //# 64x64 work-groups exceed the device's 512 work-item limit (see
      //# reflection (1)); 16x16 was used in the final runs
      h.parallel_for(nd_range<2>({N, N}, {64, 64}), [=](nd_item<2> item) {
          //# Multiplication
          size_t row = item.get_global_id(0);
          size_t col = item.get_global_id(1);
          float sum = 0.0f;
          for (size_t k = 0; k < N; ++k) {
              sum += M1[row][k] * M2[k][col];
          }
          M3[row][col] = sum;  // single write keeps write_only access valid
        });
    });


    //Create a host accessor (waits for the kernel and exposes the results)
    host_accessor h_a(Output_buffer, read_only);

    
    cout << "\
matrix3:\
";
    for (size_t i = 0; i < N; i + + ) {
        for (size_t j = 0; j < N; j + + ) {
            cout << matrix3[i][j] << " ";
        }
        cout << "\
";
    }
    return 0;
}

Task 3: Unified shared memory (USM) provides a unified memory model between the host and the device (GPU). Modify the program to use unified shared memory for memory allocation and data transfer, instead of buffers and accessors. Bonus points are given if different kinds of USM are used. Rewrite the code accordingly and run the kernel as before to verify the results. (The implementation here stays close to Task 2: the buffer constructor property {property::buffer::use_host_ptr{}} is added so the buffer reuses the host allocation. Note that this is a buffer property rather than true USM; a malloc_shared sketch follows the listing.)

%%writefile lab/vector_add.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
#include <random>

using namespace sycl;
using namespace std;

class CustomDeviceSelector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) {
    int device_rating = 0;
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=
                         std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  }

 private:
  std::string vendorName_;
};

int main() {
    const int N = 1024;
    
    //Matrices stored as flat N*N vectors (contiguous memory for SYCL buffers)
    vector<float> matrix1(N * N);
    vector<float> matrix2(N * N);
    vector<float> matrix3(N * N);

    //initialization
    random_device rd;
    mt19937 rng(rd());
    uniform_int_distribution<int> dist(0, 2);

    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            matrix1[i * N + j] = dist(rng);
            matrix2[i * N + j] = dist(rng);
        }
    }
cout<<"matrix1:\
";
for (size_t i=0;i Matrix1_buffer(reinterpret_cast(matrix1.data()), range<2>(N, N));
    buffer Matrix2_buffer(reinterpret_cast(matrix2.data()), range<2>(N, N));
    buffer Output_buffer(reinterpret_cast(matrix3.data()), range<2>(N, N), {property::buffer::use_host_ptr{}});

    //Choose device
    std::string vendor_name = "Intel";
    CustomDeviceSelector selector(vendor_name);
    
    //Submit task to multiply matrices
    queue q(selector);
    q.submit([&](handler &h) {
      //# Create accessors for buffers
      accessor M1(Matrix1_buffer, h, read_only);
      accessor M2(Matrix2_buffer, h, read_only);
      accessor M3(Output_buffer, h, write_only);

      //# 16x16 = 256 work-items per group stays under the 512 limit
      h.parallel_for(nd_range<2>({N, N}, {16, 16}), [=](nd_item<2> item) {
          //# Multiplication
          size_t row = item.get_global_id(0);
          size_t col = item.get_global_id(1);
          float sum = 0.0f;
          for (size_t k = 0; k < N; ++k) {
              sum += M1[row][k] * M2[k][col];
          }
          M3[row][col] = sum;  // single write keeps write_only access valid
        });
    });

    //Create a host accessor (waits for the kernel and exposes the results)
    host_accessor h_a(Output_buffer, read_only);

    cout << "\nmatrix3:\n";
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            cout << h_a[i][j] << " ";
        }
        cout << "\n";
    }
    return 0;
}
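For the bonus, the same multiplication can be expressed with true USM instead of the use_host_ptr buffer property. The following is a minimal sketch using malloc_shared, one of the three USM allocation kinds (host, device, shared); it was not part of the experiment runs:

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
    constexpr size_t N = 1024;
    queue q;   // default selector; the CustomDeviceSelector works here too

    // Shared USM: one allocation visible to both host and device,
    // migrated on demand by the runtime
    float *m1 = malloc_shared<float>(N * N, q);
    float *m2 = malloc_shared<float>(N * N, q);
    float *m3 = malloc_shared<float>(N * N, q);

    for (size_t i = 0; i < N * N; ++i) { m1[i] = 1.0f; m2[i] = 2.0f; }

    // Same kernel as before, but indexing raw pointers instead of accessors
    q.parallel_for(nd_range<2>({N, N}, {16, 16}), [=](nd_item<2> item) {
        size_t row = item.get_global_id(0);
        size_t col = item.get_global_id(1);
        float sum = 0.0f;
        for (size_t k = 0; k < N; ++k)
            sum += m1[row * N + k] * m2[k * N + col];
        m3[row * N + col] = sum;
    }).wait();   // no accessors to track dependencies, so wait explicitly

    std::cout << "m3[0] = " << m3[0] << "\n";   // expect 2*N = 2048
    free(m1, q); free(m2, q); free(m3, q);
    return 0;
}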

Task 4: Analysis of running time: task 1 ran in 22 s and task 3 in 24 s.
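The report does not show how these times were collected; one way is to wrap the kernel submission and its completion with std::chrono. A minimal sketch with a trivial kernel (the matrix kernel can be substituted directly):

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
using namespace sycl;

int main() {
    constexpr size_t N = 1024;
    queue q;
    float *data = malloc_shared<float>(N, q);

    auto t0 = std::chrono::steady_clock::now();
    q.parallel_for(range<1>(N), [=](id<1> i) { data[i] = i; }).wait();
    auto t1 = std::chrono::steady_clock::now();   // wait() so the kernel is included

    std::cout << "kernel time: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    free(data, q);
    return 0;
}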

Run results

Experimental reflection

(1) When dividing work-groups with nd_range, make sure the work-group size is appropriate. In the experiment the work-group size was initially set to 64*64, which raised an nd_range error: the device's maximum work-group size is 512 work-items, so the size was changed to 16*16. Note also that the work-group size must divide N evenly.
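The limit does not have to be guessed; it can be queried from the device at run time:

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
    queue q;
    // Upper bound on the number of work-items per work-group for this device
    std::cout << "max work-group size: "
              << q.get_device().get_info<info::device::max_work_group_size>()
              << "\n";
    return 0;
}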

(2) Be aware that when N is large, printing the full matrix runs into the notebook's output limits.

(3) The experiment gives a concrete picture of how the host and the device coordinate to complete a parallel computation.

(4) Problems and challenges faced: with a large N, the notebook's limit on how fast data can be sent to IOPub is hard to work around, so printing a 1024*1024 matrix causes problems. With nd_range, resizing the work-group is error-prone: the work-group size has to be chosen to suit N, otherwise the program fails at run time, and the segmentation fault that appeared was probably also caused by an unsuitable nd_range work-group size.