Features of OpenMP Nested Parallelism

1 Execution Model

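OpenMP uses the fork-join model of parallel execution. When a thread encounters a parallel construct, it creates a team consisting of itself and zero or more additional threads. The encountering thread becomes the master of the new team; the other threads are the team's worker threads. All team members execute the code inside the parallel construct. A thread that finishes its work inside the construct waits at the implicit barrier at the end of the construct. Once every team member has reached the barrier, the threads can leave it. The master thread continues executing the user code after the parallel construct, while the worker threads wait to be summoned to join other teams.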

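OpenMP parallel regions can be nested inside one another. If nested parallelism is disabled, the new team created by a thread that encounters a parallel construct inside a parallel region consists of the encountering thread only. If nested parallelism is enabled, the new team may consist of more than one thread.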

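The OpenMP runtime library maintains a pool of threads that can serve as worker threads in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, it checks the pool and takes idle threads from it to be the team's workers. If the pool does not hold enough idle threads, the master thread may get fewer workers than it asked for. When the team finishes executing the parallel region, the worker threads return to the pool.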

2 Controlling Nested Parallelism

Nested parallelism can be controlled at run time by setting various environment variables before the program is executed.

2.1 `OMP_NESTED`

Nested parallelism is enabled or disabled by setting the OMP_NESTED environment variable or by calling omp_set_nested().

The following example has three levels of nested parallel constructs.

Example 1 Nested parallelism

 #include <omp.h>  
 #include <stdio.h>  
 void report_num_threads(int level)  
 {  
     #pragma omp single  
     {  
         printf("Level %d: number of threads in the team - %d\n",  
                   level, omp_get_num_threads());  
     }  
  }  
 int main()  
 {  
     omp_set_dynamic(0);  
     #pragma omp parallel num_threads(2)  
     {  
         report_num_threads(1);  
         #pragma omp parallel num_threads(2)  
         {  
             report_num_threads(2);  
             #pragma omp parallel num_threads(2)  
             {  
                 report_num_threads(3);  
             }  
         }  
     }  
     return(0);  
 }  

With nested parallelism enabled, compiling and running this program produces the following (sorted) output:

% setenv OMP_NESTED TRUE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2

Compare this with the output of the same program when nested parallelism is disabled:

% setenv OMP_NESTED FALSE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1

2.2 `SUNW_MP_MAX_POOL_THREADS`

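The OpenMP runtime library maintains a pool of threads that can serve as worker threads in parallel regions. The SUNW_MP_MAX_POOL_THREADS environment variable controls the number of threads in the pool. The default value is 1023.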

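The pool contains only non-user threads created by the runtime library. It does not include the initial thread or any thread created explicitly by the user program. If this environment variable is set to zero, the pool is empty and all parallel regions are executed by one thread.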

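The following example shows that a parallel region may get fewer threads if the pool does not hold enough of them. The code is the same as above. Keeping all parallel regions active at the same time takes 8 threads, so the pool must contain at least 7 (the initial thread is not part of the pool). If SUNW_MP_MAX_POOL_THREADS is set to 5, two of the four innermost parallel regions may not get all the worker threads they request. One possible result is shown below.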

% setenv OMP_NESTED TRUE
% setenv SUNW_MP_MAX_POOL_THREADS 5
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1

2.3 `SUNW_MP_MAX_NESTED_LEVELS`

The SUNW_MP_MAX_NESTED_LEVELS environment variable controls the maximum depth of nested active parallel regions that require more than one thread.

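Any active parallel region whose active nesting depth exceeds the value of this environment variable is executed by a single thread. A parallel region is considered active if it has no IF clause or if its IF clause evaluates to true. The default maximum number of active nesting levels is 4.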

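The following code creates four levels of nested parallel regions. If SUNW_MP_MAX_NESTED_LEVELS is set to 2, the nested parallel regions at depths 3 and 4 are executed by a single thread.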

 #include <omp.h>  
 #include <stdio.h>  
 #define DEPTH 5  
 void report_num_threads(int level)  
 {  
     #pragma omp single  
     {  
         printf("Level %d: number of threads in the team - %d\n",  
                level, omp_get_num_threads());  
     }  
 }  
 void nested(int depth)  
 {  
     if (depth == DEPTH)  
         return;  
   
     #pragma omp parallel num_threads(2)  
     {  
         report_num_threads(depth);  
         nested(depth+1);  
     }  
 }  
 int main()  
 {  
     omp_set_dynamic(0);  
     omp_set_nested(1);  
     nested(1);  
     return(0);  
 }  

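Compiling and running this program with a maximum nesting level of 4 produces the following possible output. (Actual results depend on how the operating system schedules threads.)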

% setenv SUNW_MP_MAX_NESTED_LEVELS 4
% a.out |sort +2n
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2
Level 4: number of threads in the team - 2

Running with the nesting level set to 2 produces the following possible result:

% setenv SUNW_MP_MAX_NESTED_LEVELS 2
% a.out |sort 
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1
Level 4: number of threads in the team - 1

Again, these examples show only some of the possible results. Actual results depend on how the operating system schedules threads.

3 Using OpenMP Library Routines in Nested Parallel Regions

Calling the following OpenMP routines inside nested parallel regions requires careful thought.

- omp_set_num_threads()
- omp_get_max_threads()
- omp_set_dynamic()
- omp_get_dynamic()
- omp_set_nested()
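- omp_get_nested()

The "set" calls affect only subsequent parallel regions, at the same or inner nesting levels, that are encountered by the calling thread. They do not affect parallel regions encountered by other threads.

The "get" calls return the values set by the calling thread. When a thread becomes the master of a team executing a parallel region, all other team members inherit the master's values. When the master thread exits a nested parallel region and continues executing the enclosing parallel region, its values revert to what they were in the enclosing region just before the nested region was executed.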

Example 2 Calling OpenMP routines inside parallel regions

 #include <stdio.h>  
 #include <omp.h>  
 int main()  
 {  
     omp_set_nested(1);  
     omp_set_dynamic(0);  
     #pragma omp parallel num_threads(2)  
     {  
         if (omp_get_thread_num() == 0)  
             omp_set_num_threads(4);       /* line A */  
         else  
             omp_set_num_threads(6);       /* line B */  
   
         /* The following statement will print out 
          * 
          * 0: 2 4 
          * 1: 2 6 
          * 
          * omp_get_num_threads() returns the number 
          * of the threads in the team, so it is 
          * the same for the two threads in the team. 
          */  
         printf("%d: %d %d\n", omp_get_thread_num(),  
                omp_get_num_threads(),  
                omp_get_max_threads());  
   
         /* Two inner parallel regions will be created 
          * one with a team of 4 threads, and the other 
          * with a team of 6 threads. 
          */  
         #pragma omp parallel  
         {  
             #pragma omp master  
             {  
                 /* The following statement will print out 
                  * 
                  * Inner: 4 
                  * Inner: 6 
                  */  
                 printf("Inner: %d\n", omp_get_num_threads());  
             }  
             omp_set_num_threads(7);      /* line C */  
         }  
         /* Again two inner parallel regions will be created, 
          * one with a team of 4 threads, and the other 
          * with a team of 6 threads. 
          * 
          * The omp_set_num_threads(7) call at line C 
          * has no effect here, since it affects only 
          * parallel regions at the same or inner nesting 
          * level as line C. 
          */  
   
         #pragma omp parallel  
         {  
             printf("count me.\n");  
         }  
     }  
     return(0);  
 }  

Compiling and running this program produces the following as one possible result:

% a.out
0: 2 4
Inner: 4
1: 2 6
Inner: 6
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.
count me.

4 Some Tips on Using Nested Parallelism

Nesting parallel regions provides a direct way to let more threads take part in the computation.

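For example, suppose your program contains two levels of parallelism and the degree of parallelism at each level is 2. Suppose also that your system has four CPUs and you want to use all four to speed up the program. Parallelizing only one of the levels uses just two CPUs, so you want to parallelize both levels.

Nested parallel regions can easily create too many threads and oversubscribe the system. Set SUNW_MP_MAX_POOL_THREADS and SUNW_MP_MAX_NESTED_LEVELS appropriately to limit the number of threads in use and keep system resources from being exhausted.

Creating nested parallel regions adds overhead. If there is enough parallelism at the outer level and the load is balanced, it is generally more efficient to use all the threads at the outer level of the computation than to create nested parallel regions at the inner levels.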

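For example, suppose your program contains two levels of parallelism. The degree of parallelism at the outer level is 4 and the load is balanced. Your system has four CPUs and you want to use all four to speed up the program. In that case, using all 4 threads for the outer level generally performs better than using 2 threads for the outer parallel region and the other 2 as worker threads for the inner parallel regions.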

Example (a variant of Example 1 in which every thread, not just one per team, prints the team size):

 #include <stdio.h>
 #include <omp.h>

 /* Called by every thread in the team, so each team prints one line per thread. */
 void report_num_threads(int level)
 {
     printf("level %d: number of threads in the team - %d\n",
            level, omp_get_num_threads());
 }

 int main(void)
 {
     omp_set_nested(1);            /* enable nested parallelism */
     /* omp_set_dynamic(0); */     /* left disabled, as in the original listing */
     #pragma omp parallel num_threads(2)
     {
         report_num_threads(1);
         #pragma omp parallel num_threads(2)
         {
             report_num_threads(2);
             #pragma omp parallel num_threads(2)
             {
                 report_num_threads(3);
             }
         }
     }
     return 0;
 }
