赵隐剑老师_并行编程入门实践课程 01¶

该笔记对应以下 b 站视频，主要讲 linux、vim、OpenMP 的使用
其它 OpenMP 教程
- OpenMP101
- OpenMP tutorial by Blaise Barney, Lawrence Livermore National Laboratory
- Guide into OpenMP, 这篇笔记主要参考的是这个

课程信息:

输入图片说明

记笔记需要的 chatgpt prompt
- fortran (png) -> cpp

1. Fortran Code Recognition: Recognize the Fortran code shown in this Vim screenshot and convert it into text form.

2. Modern C++ Equivalent: Rewrite the Fortran code in modern C++ using best practices. Specifically:

- Use appropriate C++ constructs and libraries.

- Ensure the C++ code is idiomatic, clean, and adheres to current standards (e.g., C++17 or C++20).

- Provide comments to explain how the Fortran logic maps to the equivalent C++ constructs.

Please ensure the converted C++ code is practical, efficient, and easily understandable.

并行程序分类¶

共享内存式 (shared memory)
- OpenMP
- 多线程

输入图片说明

分配内存式 (distributed memory)
- 理论上没有上限
- cpu与自己节点的内存快速交互
- 通过网络沟通各个节点，交互数据
- 并行进程之间的信息交互需要人工编写

输入图片说明

openmp 重要函数¶

fortran 版本

use omp_lib ! provides OpenMP functions in Fortran

call omp_set_num_threads() ! sets the number of OpenMP threads that will be used in the subsequent parallel regions
omp_get_thread_num() ! retrieves the unique ID of the calling thread (ranging from 0 to num_threads-1)

! `!$omp` 表示该行是 OpenMP 的并行编程指令，而不是普通注释 

! define a parallel region in Fortran. Code inside this region is executed by multiple threads
!$omp parallel  
!$omp end parallel

! sets a parallel region that specifically uses 3 threads
!$omp parallel num_threads(3)

! creates a parallel region where `a`, `b`, and `c` are private variables, meaning each thread gets its own copy
!$omp parallel private(a, b, c)

! `default(none)` 强制程序员显式地声明每个变量的共享属性, `p`, `q`, and `r` are shared by all threads
!$omp parallel default(none) shared(p, q, r)
!$omp parallel default(private) shared(a, b, c)
! `i` is private to each thread but initialized with the value of `i` from outside the parallel region.
!$omp parallel firstprivate(i)

对应c++
- use omp_lib -> #include <omp.h> in C++
- call omp_set_num_threads(X) -> omp_set_num_threads(X);
- omp_get_thread_num() -> omp_get_thread_num();
- !$omp parallel ... !$omp end parallel -> #pragma omp parallel ... (block of code) ... end block
- Clauses on !$omp parallel translate similarly to clauses on #pragma omp parallel in C++.
example

#include <iostream>
#include <omp.h>

int main() {
    // Setting the number of OpenMP threads: omp_set_num_threads(<number>);     
    omp_set_num_threads(4); 

    // Getting the thread number:
    int thread_num = omp_get_thread_num(); // Typically called inside a parallel region
    // Note: Outside a parallel region, usually omp_get_thread_num() returns 0.

    // define a parallel region 
    #pragma omp parallel
    {
        // Code inside this block runs in parallel
        int my_id = omp_get_thread_num();
        // ... do parallel work ...
    }

    // Equivalent to !$omp parallel num_threads(3):
    #pragma omp parallel num_threads(3)
    {
        int my_id = omp_get_thread_num();
        // This block runs in parallel with exactly 3 threads
    }

    // Equivalent to !$omp parallel private(a, b, c):
    int a = 0, b = 1, c = 2; 
    #pragma omp parallel private(a, b, c)
    {
        // Each thread has its own a, b, c
        // The values of a, b, c are uninitialized or unspecified in this new scope,
        // For clarity, re-declare inside the block if needed:
        a = 10; b = 20; c = 30; 
    }

    // Equivalent to !$omp parallel default(none) shared(p, q, r):
    int p = 100, q = 200, r = 300;
    #pragma omp parallel default(none) shared(p, q, r)
    {
        // p, q, r are shared among threads.
        // Any other variable used in this scope must be explicitly declared or specified.
        p++;
    }

    // Equivalent to !$omp parallel default(private) shared(a, b, c):
    // default(private) means that any variable not explicitly listed as shared is private.
    // We must ensure a, b, c are known and shared:
    a = 1; b = 2; c = 3;
    #pragma omp parallel default(private) shared(a, b, c)
    {
        // Here a, b, c are shared. Any other variable is private by default.
        a++; // affects the shared variable
    }

    // Equivalent to !$omp parallel firstprivate(i):
    // This means i is private, but initialized from the value outside the parallel region.
    int i = 42;
    #pragma omp parallel firstprivate(i)
    {
        // Each thread gets its own copy of i, initialized to 42
        i += omp_get_thread_num();
    }

    return 0;
}

OpenMP cpp 语法¶

编译: g++ -fopenmp

其它编译器:

Compiler / Platform	Compiler	Flag
Intel Linux Opteron/Xeon	icc, icpc, ifort	-qopenmp
PGI Linux Opteron/Xeon	pgcc, pgCC, pgf77, pgf90	-mp
GNU Linux Opteron/Xeon	gcc, g++, g77, gfortran	-fopenmp

在 C 和 C++ 中，所有 OpenMP 指令都通过 #pragma omp 指定，并带有参数，以换行符结束。
通常，#pragma omp 只适用于其后紧跟的一条语句，但 barrier 和 flush 指令是例外，它们没有关联的语句。

`parallel`¶

parallel 用于启动一个并行块。它会创建一个包含 N 个 thread 的 team（N 的大小在运行时确定，通常由 CPU 核心数量决定，但也可能受其他因素影响）。这些线程会同时执行下一条语句（或者如果下一条语句是一个用 {...} 包裹的代码块，则执行整个代码块）。在语句执行完成后，所有线程会合并为一个线程。

语法: #pragma omp parallel

举例，

  #pragma omp parallel
  {
    // Code inside this region runs in parallel.
    printf("Hello!\n");
  }

Parallelism conditionality clause `if`¶

举例

  extern int parallelism_enabled;
  #pragma omp parallel for if(parallelism_enabled)
  for(int c=0; c<n; ++c)
    handle(c);

如果 parallelism_enabled 的值\为零，则处理 for 循环的线程team的线程数量将始终为 1。

`for` loop¶

语法: #pragma omp for (只在当前的 parallel team 中在不同线程上分配任务，如果想在新的 team 上分配任务，就用 #pragma omp parallel 创建新的 block)

举例:

 #pragma omp for
 for(int n=0; n<10; ++n)
 {
   printf(" %d", n);
 }
 printf(".\n");

创建新的 thread team:

 #pragma omp parallel
 {
  #pragma omp for
  for(int n=0; n<10; ++n) printf(" %d", n);
 }
 printf(".\n");

等价于

 #pragma omp parallel for
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n")

也可以直接指定线程数

 #pragma omp parallel num_threads(3)
 {
   // This code will be executed by three threads.

   // Chunks of this loop will be divided amongst
   // the (three) threads of the current team.
   #pragma omp for
   for(int n=0; n<10; ++n) printf(" %d", n);
 }

或者 #pragma omp parallel for num_threads(3)

`schedule`¶

for 循环的调度分配算法是可以被显式控制的

 #pragma omp for schedule(static)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

OpenMP 提供了五种调度类型 (scheduling types)：static、dynamic、guided、auto 和（从 OpenMP 4.0 开始支持的）runtime。此外，从 OpenMP 4.5 开始，新增了三种调度修饰符 (scheduling modifiers)：monotonic、nonmonotonic 和 simd。

static 是默认的调度类型。在进入循环时，每个线程会独立决定自己将处理的循环块。

此外，还有 dynamic 调度类型。循环开始时并不会提前固定分配，而是线程在运行时向 OpenMP 运行时库动态请求任务。每个线程会向 OpenMP 运行时库请求一个迭代号，处理完后再请求下一个，如此往复。这种方式可以避免线程因分配的工作完成而闲置。例如，对于一个 for 循环有 10 次迭代，假设有 2 个线程：线程 1 请求并处理迭代 0, 线程 2 请求并处理迭代 1, 线程 1 完成后继续请求迭代 2，依此类推。

当与 ordered 子句一起使用，或者循环中的不同迭代可能需要不同的执行时间时，这种调度方式最为有用。

还可以指定块大小（chunk size）以减少对运行时库的调用次数：

 #pragma omp for schedule(dynamic, 3)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");

这个时候每个线程每次动态地索要三个 iteration number。

还有其它不常用的 scheduling type: * guided: #pragma omp parallel for schedule(guided, 2)。动态分配任务块，但任务块的大小会随着循环的进展逐渐减小，起初，任务块的大小较大（通常与循环迭代总数相关），然后逐步减小到指定的最小块大小，这里是 2。适用于: 循环的迭代耗时不均，但倾向于后续迭代任务较少或较轻。 * auto: 由OpenMP自动决定调度策略 #pragma omp parallel for schedule(auto) * runtime: #pragma omp parallel for schedule(runtime)。用户可以通过环境变量或 API 配置。

scheduling modifiers (没什么用?)： *monotonic , #pragma omp for schedule(monotonic:static, 4), monotonic 修饰符确保每个线程内部的迭代顺序是递增的，但不保证不同线程之间的执行顺序 * nonmonotonic, 使用 nonmonotonic 修饰符时，线程以未指定的顺序执行分配给它们的块 * simd，当循环被标记为 SIMD（单指令多数据）循环时，编译器根据硬件特性（如向量寄存器的大小）自动决定最佳的块大小

`ordered`¶

循环中的 ordered 子句保证始终有一个线程在处理编号最低的未处理任务。

在一个 ordered 循环中只能有一个 ordered 块，不能多也不能少。此外，包含该循环的 for 结构必须带有 ordered 子句。

 #pragma omp for ordered schedule(dynamic)
 for(int n=0; n<100; ++n)
 {
   files[n].compress();

   #pragma omp ordered
   send(files[n]);
 }

这个循环用于“压缩”100个文件，其中一些文件可以并行压缩，但确保文件的“发送”严格按照顺序进行。
如果负责压缩文件 7 的线程完成了压缩，但文件 6 尚未发送，则该线程会等待，直到文件 6 被发送后才发送文件 7，并在发送前不会开始压缩另一个文件。

`collapse`¶

没有 collapse 时，#pragma omp for 只对最外层循环进行并行化。

 #pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }

`reduction`¶

在并行编程中，多个线程可能需要同时更新同一个共享变量（如求和）。如果不加以控制，可能会导致数据竞争，从而产生不正确的结果。reduction 子句通过为每个线程分配一个私有的临时变量来避免这种问题。每个线程在其私有变量上执行累加操作，最后 OpenMP 负责将所有线程的私有变量合并到共享变量中。

 int sum=0;
 #pragma omp parallel for reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

`section`¶

每个 section 最多只能被分配到 1 个线程:

graph LR
    subgraph OpenMP Parallel Region
        T1[线程 1] --> S1[Section 1]
        T2[线程 2] --> S2[Section 2]
        T3[线程 3] --> S3[Section 3]
    end

#pragma omp sections 在当前的 team 下运行:

 #pragma omp sections
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

使用 #pragma omp parallel sections 创建新的 team:

 #pragma omp parallel sections // starts a new team
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

或者

 #pragma omp parallel // starts a new team
 {
   //Work0(); // this function would be run by all threads.

   #pragma omp sections // divides the team into sections
   { 
     // everything herein is run only once.
     { Work1(); }
     #pragma omp section
     { Work2();
       Work3(); }
     #pragma omp section
     { Work4(); }
   }

   //Work5(); // this function would be run by all threads.
 }

`SIMD`¶

float a[8], b[8];
 ...
 #pragma omp simd
 for(int n=0; n<8; ++n) a[n] += b[n];

`collapse`¶

  #pragma omp simd collapse(2)
  for(int i=0; i<4; ++i)
    for(int j=0; j<4; ++j)
      a[j*4+i] += b[i*4+j];

`reduction`¶

It allows to accumulate a shared variable without the atomic clause, but the type of accumulation must be specified. It will often produce faster executing code than by using the atomic clause.

 int sum=0;
 #pragma omp simd reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

`for simd`¶

for 和 simd 构造可以结合使用，将循环的执行划分为多个线程，同时在每个线程内部利用 SIMD 指令对分配到的循环片段进行向量化处理

  float sum(float* table)
  {
    float result=0;
    #pragma omp parallel for simd reduction(+:result)
    for(int n=0; n<1000; ++n) result += table[n];
    return result;
  }

offloading support¶

Offloading 指的是程序的部分代码可以不仅在计算机自身的 CPU 上执行，还可以在连接到计算机的其他硬件（例如显卡）上执行。

`declare target`, `end declare target`¶

declare target 和 end declare target 指令用于标记代码，在这部分代码中，无论是变量还是函数/子程序的声明，都会被编译为可供设备（如 GPU）使用的版本。

#pragma omp declare target
int x;
void murmur() { x+=5; }
#pragma omp end declare target

这会为 x 和 murmur 创建一个或多个版本。其中一组存在于主机计算机上，另一组则存在于设备上并可以在设备上运行。这两组函数和变量是独立的，可能包含彼此不同的值。

Thread-safety (i.e. mutual exclusion)¶

`atomic`¶

用于保护一条语句免于数据竞争

 #pragma omp atomic
 counter += value;

`critical`¶

critical 限制了关联语句/代码块的执行，使得在任何时间点只有一个线程可以执行该代码块。

critical 构造可以选择性地包含一个全局名称，用于标识该 critical 。同一时间内，两个线程不能同时执行具有相同名称的 critical 。

#pragma omp critical(dataupdate)
{
    datastructure.reorganize();
}
...
#pragma omp critical(dataupdate)
{
    datastructure.reorganize_again();
}

在此例中，任意时刻只有一个名为 dataupdate 的 critical 区域可以被执行，并且只有一个线程可以执行该区域。也就是说，reorganize 和 reorganize_again 函数不能同时被调用。

critical 区域的名称在整个程序中都是全局的，因此，如果在多个模块中定义了相同名称的 critical 区域，这些区域也不能同时被执行。

By default, all variables are shared except those declared within the parallel block.
Variables declared within the parallel block are naturally private to each thread
loop iteration variable in a #pragma omp for loop is automatically private to each thread. This is true regardless of where you defined that iteration variable.

private: each thread has their own copy of it shared: each thread accesses the same variable firstprivate: 初始化每个线程的私有变量为主线程变量的值。 lastprivate: 将并行区域中最后一次迭代中线程的私有变量值复制回主线程的共享变量，当并行区域结束后需要保留最后一次迭代的结果时，lastprivate 是合适的选择。 default: default(none)，default(shared)

 int a, b=0;
 #pragma omp parallel for private(a) shared(b)
 for(a=0; a<50; ++a)
 {
   #pragma omp atomic
   b += a;
 }

例: 求 $\pi$¶

输入图片说明

$\pi$ 可以用 $\frac{4}{1+x^2}$ 函数在 0 到 1 范围内的积分求解实现。推导: $x=\tan\theta$。

串行版本¶

fortran 程序

pi.f90

program main
  implicit none
  integer(8) :: n, i
  real :: pi, dx, x
  real, parameter :: pi0 = &
  3.14159265358979323846264338327950288419716939937510582097

  n = 10000000000_8
  dx = 1.0 / real(n)

  pi = 0.0
  do i = 0, n
     x = dx * i
     pi = pi + 4.0 * dx / (1.0 + x * x)
  end do

  write(*,*) pi, abs(pi - pi0) / pi0
end program main

# 在 Fortran 中，`REAL` 类型的默认大小是 4 字节（32 位），使用 `-fdefault-real-8` 将所有 `REAL` 和 `COMPLEX` 类型自动提升为双精度（64 位）。
gfortran -fdefault-real-8 pi.f90

cpp 程序

#include <iostream>
#include <cmath>
#include <cstdint>

// We choose a constexpr double for pi0 for clarity and precision.
// This is a known approximation of π with many digits.
constexpr double pi0 = 3.14159265358979323846264338327950288419716939937510582097;

// This program approximates π by integrating the function f(x) = 4/(1+x²) from x = 0 to x = 1.
// The integral of f(x) from 0 to 1 is π.
int main() {
    // Use a 64-bit integer type for n, given the large range.
    // In Fortran, `integer(8)` typically corresponds to a 64-bit integer.
    // The original Fortran code used n = 10,000,000,000 (ten billion) as a demonstration.
    std::int64_t n = 10000000000LL; 

    // dx = 1.0 / n
    double dx = 1.0 / static_cast<double>(n);

    // Initialize pi accumulator
    double pi = 0.0;

    // Perform the summation (Riemann sum)
    // Note: This is a very large loop and would be time-consuming in practice.
    for (std::int64_t i = 0; i <= n; ++i) {
        double x = dx * static_cast<double>(i);
        // Increment pi by the area of the small rectangle: f(x)*dx
        pi += (4.0 * dx) / (1.0 + x * x);
    }

    // Compute the relative error
    double relative_error = std::fabs(pi - pi0) / pi0;

    // Print results
    // The first output is the computed approximation of pi.
    // The second output is the relative error compared to pi0.
    std::cout << pi << " " << relative_error << "\n";

    return 0;
}

并行版本¶

fortran code

program main
  use omp_lib
  implicit none

  integer(8) :: n, i, argn
  real :: pi, dx, x, timer
  character(len=2) :: arg
  ! A high-precision approximation of π for error comparison
  real, parameter :: pi0 = 3.14159265358979323846264338327950288419716939937510582097

  ! Print a separator line for clarity
  write(*,*) "--------"

  ! Get command-line argument and interpret it as an integer (number of threads)
  call get_command_argument(1, arg)
  read(arg,'(I2)') argn
  write(*,*) "argn:", argn

  ! Set the number of OpenMP threads based on the command-line argument
  call omp_set_num_threads(argn)

  ! Set the number of intervals for the integral approximation
  n = 5000000000_8
  write(*,*) "n:", n

  ! Compute dx = 1/n
  dx = 1.0 / real(n)
  ! Start the timer using OpenMP wall time function
  timer = omp_get_wtime()

  ! Initialize pi accumulator
  pi = 0.0

  ! Begin the parallel region and distribute the loop among threads
  ! - default(firstprivate): variables inside are private by default
  ! - reduction(+:pi): pi is summed across threads at the end
  !$omp parallel default(firstprivate) reduction(+:pi)
  !$omp do
  do i = 0, n
     x = dx * i
     ! Add the integrand value (1/(1+x²)) to pi
     pi = pi + 1.0 / (1.0 + x*x)
  end do
  !$omp end do
  !$omp end parallel

  ! After the parallel region, scale pi by 4.0*dx to get the approximation of π
  pi = pi * 4.0 * dx

  ! Compute elapsed time
  timer = omp_get_wtime() - timer

  ! Print out results
  ! omp_get_max_threads() returns the maximum number of threads available
  write(*,*) "omp_get_max_threads:", omp_get_max_threads()
  ! Print the computed pi, relative error, and the reference pi0
  write(*,*) pi, abs(pi - pi0) / pi0
  write(*,*) pi0
  write(*,*) "timer:", timer

end program main

gfortran -O3 -fdefault-real-8 -fopenmp pi.f90

cpp 程序

#include <iostream>
#include <string>
#include <cmath>
#include <omp.h>

int main(int argc, char* argv[]) {
    // A high-precision approximation of π for error comparison
    constexpr double pi0 = 3.14159265358979323846264338327950288419716939937510582097;

    // Print a separator line for clarity
    std::cout << "--------\n";

    // Default number of threads if not provided as a command-line argument
    int argn = 1;
    if (argc > 1) {
        // Convert the first argument to an integer (number of threads)
        argn = static_cast<int>(std::stoll(argv[1]));
    }
    std::cout << "argn: " << argn << "\n";

    // Set the number of OpenMP threads
    omp_set_num_threads(argn);

    // Set the number of intervals for the integral approximation
    // The '8' suffix in Fortran code indicates octal, but here we assume it's just a literal.
    // For extremely large n, ensure your system can handle it.
    long long n = 5000000000LL; 
    std::cout << "n: " << n << "\n";

    // Compute dx = 1/n
    double dx = 1.0 / static_cast<double>(n);

    // Start the timer using OpenMP wall time function
    double start_time = omp_get_wtime();

    // Initialize pi accumulator
    double pi = 0.0;

    // Parallelize the loop using OpenMP
    // Using a reduction clause on pi ensures a thread-safe accumulation.
    // Variables that need to be private are typically loop indices or computed within the loop.
    #pragma omp parallel for reduction(+:pi)
    for (long long i = 0; i <= n; ++i) {
        double x = dx * static_cast<double>(i);
        // Add the integrand value (1/(1+x²)) to pi
        pi += 1.0 / (1.0 + x * x);
    }

    // After the parallel region, scale pi by 4.0*dx to get the approximation of π
    pi = pi * 4.0 * dx;

    // Compute elapsed time
    double timer = omp_get_wtime() - start_time;

    // Print out results
    // omp_get_max_threads() returns the maximum number of threads available
    std::cout << "omp_get_max_threads: " << omp_get_max_threads() << "\n";

    // Print the computed pi, relative error, and the reference pi0
    double relative_error = std::fabs(pi - pi0) / pi0;
    std::cout << pi << " " << relative_error << "\n";
    std::cout << pi0 << "\n";
    std::cout << "timer: " << timer << "\n";

    return 0;
}

g++ -O3 -fopenmp -std=c++17 pi.cpp  && ./a.out 8

赵隐剑老师_并行编程入门实践课程 01¶

并行程序分类¶

openmp 重要函数¶

OpenMP cpp 语法¶

parallel¶

Parallelism conditionality clause if¶

for loop¶

schedule¶

ordered¶

collapse¶

reduction¶

section¶

SIMD¶

collapse¶

reduction¶

for simd¶

offloading support¶

declare target, end declare target¶

Thread-safety (i.e. mutual exclusion)¶

atomic¶

critical¶

Controlling which data to share between threads¶

例: 求 \(\pi\)¶

串行版本¶

并行版本¶

`parallel`¶

Parallelism conditionality clause `if`¶

`for` loop¶

`schedule`¶

`ordered`¶

`collapse`¶

`reduction`¶

`section`¶

`SIMD`¶

`collapse`¶

`reduction`¶

`for simd`¶

`declare target`, `end declare target`¶

`atomic`¶

`critical`¶