OpenMP shared memory multi-core parallel computing

1. Reference materials

openMP_demo
Getting started with OpenMP
OpenMP Tutorial (1) In-depth analysis of the OpenMP reduction clause

2. Introduction to OpenMP

1. Introduction to OpenMP

OpenMP (Open Multi-Processing) is a multi-threaded programming solution for shared-memory parallel systems, supporting C/C++. OpenMP provides a high-level abstract description of parallel algorithms: the compiler parallelizes the program automatically according to the pragma directives added to it, spreading the computation across multiple processor cores to improve execution efficiency. Using OpenMP greatly reduces the difficulty and complexity of parallel programming. When the compiler does not support OpenMP, the program degenerates into an ordinary serial program; the OpenMP directives in the code do not affect normal compilation and execution.

Many mainstream compilation environments have OpenMP built in. In Visual Studio, enabling OpenMP is simple: right-click the project -> Properties -> Configuration Properties -> C/C++ -> Language -> OpenMP Support and select “Yes”.

2. Shared memory model

OpenMP is designed for multi-processor and multi-core shared memory machines. The number of processing units (CPU cores) determines the parallelism of OpenMP.

3. Hybrid parallel model

OpenMP is suitable for single-node parallelism; combining MPI with OpenMP achieves distributed-memory parallelism across nodes, often called the hybrid parallel model (a minimal sketch follows the list below).

  • OpenMP is used for computationally intensive work on each node (one computer);
  • MPI is used to implement communication and data sharing between nodes.
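
A minimal sketch of the hybrid model, assuming an MPI installation with the mpicc compiler wrapper (the file name hybrid.c is illustrative): each MPI rank spawns its own OpenMP thread team. Compile with mpicc -fopenmp hybrid.c -o hybrid and run with mpirun -np 2 ./hybrid.

#include <stdio.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);                /* MPI: one process per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#pragma omp parallel                       /* OpenMP: threads within the node */
    printf("rank %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}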

4. Fork-Join model

OpenMP uses the Fork-Join model of parallel execution.

  • Fork: the master thread creates a team of parallel threads;
  • Join: the threads of the team carry out their computation in the parallel region, then synchronize and terminate, leaving only the master thread.

Multiple threads run concurrently inside each parallel region, while the code between parallel regions is executed serially by the master thread.

5. barrier synchronization mechanism

barrier synchronizes the threads in a parallel region: when a thread reaches the barrier it stops and waits until every thread in the team has reached it, and only then does execution continue, achieving thread synchronization.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int th_id, nthreads;

#pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);

#pragma omp barrier
        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
Hello World from thread 10
Hello World from thread 3
Hello World from thread 2
Hello World from thread 6
Hello World from thread 4
Hello World from thread 7
Hello World from thread 0
Hello World from thread 5
Hello World from thread 11
Hello World from thread 8
Hello World from thread 1
Hello World from thread 9
There are 12 threads

3. Common operations

1. Common commands

# Install the OpenMP runtime library (libomp is LLVM's runtime; GCC ships its own libgomp)
sudo apt-get install libomp-dev

# Use gcc to compile OpenMP programs
gcc -fopenmp demo.c -o demo

# Compile OpenMP program using g++
g++ -fopenmp demo.cpp -o demo

2. Important operations

(1) Parallel region: use the #pragma omp parallel directive to define a parallel region.

(2) Thread number: use the omp_get_thread_num() function to obtain the number of the current thread.

(3) Total number of threads: use the omp_get_num_threads() function to obtain the total number of threads.

(4) Data sharing: use clauses such as private and shared to declare the sharing attributes of variables (see the sketch after this list).

(5) Synchronization: use the #pragma omp barrier directive to synchronize threads.
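
A minimal sketch of the private and shared clauses (the variable names are illustrative): each thread gets its own copy of tid, while total is a single variable visible to all threads, so its concurrent update must be protected.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int tid = -1;    // private: each thread works on its own copy
    int total = 0;   // shared: a single copy seen by all threads

#pragma omp parallel private(tid) shared(total)
    {
        tid = omp_get_thread_num();
        #pragma omp atomic   // protect the concurrent update of the shared variable
        total += 1;
        printf("thread %d checked in\n", tid);
    }
    printf("%d threads checked in\n", total);
    return 0;
}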

3. Check whether OpenMP is supported

#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    printf("support openmp\n");
#else
    printf("not support openmp\n");
#endif
    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
support openmp
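
The _OPENMP macro expands to the release date (yyyymm) of the OpenMP specification the compiler implements, so it can also be printed to identify the supported version:

#include <stdio.h>

int main(void)
{
#ifdef _OPENMP
    printf("OpenMP version macro: %d\n", _OPENMP);  // e.g. 201511 for OpenMP 4.5
#endif
    return 0;
}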

4. Hello World

#include <stdio.h>

int main(void)
{
#pragma omp parallel
    {
        printf("Hello, world.\n");
    }

    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.
Hello, world.

Since the number of threads is not specified, the implementation default is used, which is typically the number of logical CPU cores (12 in this run).

#include <stdio.h>

int main(void)
{
    // Specify the number of threads
#pragma omp parallel num_threads(6)
    {
        printf("Hello, world.\n");
    }

    return 0;
}
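
Besides the num_threads clause, the team size can also be set with the omp_set_num_threads() runtime function or, without recompiling, the OMP_NUM_THREADS environment variable. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(6);  // same effect as num_threads(6) on the next parallel region
#pragma omp parallel
    {
        printf("Hello, world.\n");
    }
    return 0;
}

Running the program as OMP_NUM_THREADS=6 ./demo sets the default team size from the shell.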

5. #pragma omp parallel for

#pragma omp parallel for parallelizes the for loop that follows it, dividing the loop iterations among the threads of the team; omp_get_thread_num() returns the id of the current thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel for
    for (int i = 0; i < 12; i++) {
        printf("OpenMP Test, th_id: %d\n", omp_get_thread_num());
    }

    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
OpenMP Test, th_id: 8
OpenMP Test, th_id: 3
OpenMP Test, th_id: 1
OpenMP Test, th_id: 9
OpenMP Test, th_id: 5
OpenMP Test, th_id: 0
OpenMP Test, th_id: 6
OpenMP Test, th_id: 11
OpenMP Test, th_id: 2
OpenMP Test, th_id: 7
OpenMP Test, th_id: 4
OpenMP Test, th_id: 10
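
To make the division of work visible, the loop body can print which thread executes which iteration; a sketch with a team of 4 threads and 8 iterations:

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel for num_threads(4)
    for (int i = 0; i < 8; i++) {
        printf("iteration %d -> thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}

With the default schedule each thread typically receives a contiguous block of iterations, e.g. thread 0 handles i = 0 and 1.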

6. reduction operation

6.1 Introduction

#include <stdio.h>

int main(void)
{
    int sum = 0;
#pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
        sum += i;   // data race: all threads read and write sum concurrently
    }
    printf("%d", sum);
    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
1173yoyo@yoyo:~/PATH/TO$ ./demo
2521yoyo@yoyo:~/PATH/TO$ ./demo
3529yoyo@yoyo:~/PATH/TO$ ./demo
2174yoyo@yoyo:~/PATH/TO$ ./demo
1332yoyo@yoyo:~/PATH/TO$ ./demo
1673yoyo@yoyo:~/PATH/TO$ ./demo
1183yoyo@yoyo:~/PATH/TO$

Executed multiple times, the result differs on each run because the threads race on the same variable: the line sum += i (that is, sum = sum + i) performs a read-modify-write, and when several threads do this concurrently, updates to sum are lost. reduction solves this problem.
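
For contrast, here is a sketch of another standard fix, an atomic update: it is correct but serializes every addition, so the reduction shown next is usually faster.

#include <stdio.h>

int main(void)
{
    int sum = 0;
#pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
        #pragma omp atomic   // one thread at a time updates sum
        sum += i;
    }
    printf("%d\n", sum);     // always 5050
    return 0;
}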

6.2 Introduction to reduction

reduction(operator: variable)

Let’s take summation as an example:

#include <stdio.h>

int main(void)
{
    int sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }
    printf("%d", sum);
    return 0;
}
yoyo@yoyo:~/PATH/TO$ gcc -fopenmp demo.c -o demo
yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$ ./demo
5050yoyo@yoyo:~/PATH/TO$

In the above code, reduction(+:sum) gives each thread its own private copy of sum, and every thread accumulates into its own copy, so there is no data race: each thread works on different data. The + in the clause names the reduction operator. A reduction simply combines many values step by step with one operator until a single value remains that cannot be reduced further.

Here the operator is +, so at the end of the parallel region the threads' private sums are combined with +: thread 1's sum is added to thread 2's sum, and so on, and the result is assigned to the global variable sum, which then holds the correct result.

If there are 4 threads, there are 4 thread-private copies of sum, one per thread, and the result of the reduction is:

(((sum_1 + sum_2) + sum_3) + sum_4)
where sum_i is the partial sum computed by thread i.
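
Conceptually, reduction(+:sum) behaves like the following hand-written version, in which each thread accumulates a private partial sum and the partial sums are combined once per thread at the end. A sketch for illustration:

#include <stdio.h>

int main(void)
{
    int sum = 0;
#pragma omp parallel
    {
        int local_sum = 0;          // private partial sum, one per thread
        #pragma omp for
        for (int i = 1; i <= 100; i++) {
            local_sum += i;
        }
        #pragma omp critical        // combine the partial sums one thread at a time
        sum += local_sum;
    }
    printf("%d\n", sum);            // 5050
    return 0;
}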