__syncthreads() is a CUDA built-in function used to synchronize the threads within a block.
__syncthreads() is your garden-variety thread barrier. Any thread reaching the barrier waits until all of the other threads in that block also reach it. It is designed for avoiding race conditions when loading shared memory, and the compiler will not move memory reads/writes around a __syncthreads().
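The most important point to understand here is that, as observed in practice, the threads that can reach a given __syncthreads() wait for the other threads that can reach that point, rather than for every other thread in the block. Note, however, that the CUDA Programming Guide only allows __syncthreads() in conditional code when the condition evaluates identically across the entire thread block; otherwise the code is likely to hang or produce unintended side effects.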
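To make the race-condition point concrete, here is a minimal sketch of a single block reversing an array through shared memory. The names kBlockSize, reverse_block, and run_reverse_block are hypothetical and chosen only for this illustration; error checking is omitted for brevity. Without the barrier, a thread could read a tile element before the thread that owns it has written it.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64;  // illustrative block size (assumption)

// Reverses kBlockSize ints in place using a single thread block.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[kBlockSize];
    int tid = threadIdx.x;

    tile[tid] = data[tid];   // each thread stages one element in shared memory
    __syncthreads();         // every write to tile is complete past this point
    data[tid] = tile[kBlockSize - 1 - tid];  // read an element written by another thread
}

// Hypothetical host helper: reverses 0..kBlockSize-1 on the GPU and returns the result.
std::vector<int> run_reverse_block()
{
    std::vector<int> host(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, kBlockSize * sizeof(int));
    cudaMemcpy(dev, host.data(), kBlockSize * sizeof(int), cudaMemcpyHostToDevice);
    reverse_block<<<1, kBlockSize>>>(dev);
    cudaMemcpy(host.data(), dev, kBlockSize * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling run_reverse_block() from a host program should return the input sequence in reverse order. Every thread in the block executes the barrier here, which is the case the quoted description covers.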
The typical program structure when using __syncthreads() is as follows:
__shared__ float val[N];   // element type and size N are placeholders
...
if (index < n)
{
    if (/* tid condition */)
    {
        // do something with val
    }
    __syncthreads();   // barrier sits outside the tid branch
    // do something with val
    __syncthreads();
}
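In this structure, every thread in the block that enters the index < n branch reaches both __syncthreads() calls, so the threads in the block are synchronized.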
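As a concrete instance of this structure, the following sketch sums an array within one block. The names kThreads, block_sum, and run_block_sum are hypothetical, and the reduction itself is only an assumed example of "do something with val". One deliberate difference from the outline above: out-of-range threads store 0 instead of skipping the barriers, so every thread in the block executes every __syncthreads().

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // illustrative block size, a power of two (assumption)

// Each block sums up to kThreads input elements into out[blockIdx.x].
__global__ void block_sum(const int *in, int *out, int n)
{
    __shared__ int val[kThreads];
    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the input contribute 0 rather than leaving early.
    val[tid] = (index < n) ? in[index] : 0;
    __syncthreads();

    for (int stride = kThreads / 2; stride > 0; stride /= 2) {
        if (tid < stride) {              // the "tid condition"
            val[tid] += val[tid + stride];
        }
        __syncthreads();                 // outside the branch: all threads reach it
    }

    if (tid == 0) out[blockIdx.x] = val[0];
}

// Hypothetical host helper: sums 1..n on the GPU with one block (requires n <= kThreads).
int run_block_sum(int n)
{
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<1, kThreads>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For example, run_block_sum(100) sums the integers 1 through 100 while 28 of the 128 threads fall outside the input range.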
__shared__ float val[N];   // element type and size N are placeholders
...
if (index < n)
{
    if (/* tid condition */)
    {
        // do something with val
        __syncthreads();   // barrier inside the first branch
    }
    else
    {
        // do something with val
        __syncthreads();   // barrier inside the second branch
    }
}
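This structure splits the block's threads into two groups. Each group performs some operations on shared memory and calls __syncthreads() inside its own branch, with the intent of synchronizing within that group. A problem that easily arises with this structure is that if both groups write to the same shared-memory address, the value that ends up stored may be inconsistent. To prevent this error, those writes need to use atomic operations.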
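The atomic-operation fix can be sketched as follows, with two groups of threads updating the same shared-memory word. The names kGroupThreads, two_group_counter, and run_two_group_counter are hypothetical. This sketch also swaps the per-branch barriers for a single barrier after the if/else, an assumption made so that every thread in the block reaches the same __syncthreads().

```cuda
#include <cuda_runtime.h>

constexpr int kGroupThreads = 64;  // illustrative block size (assumption)

// Even threads add 1 and odd threads add 2 to the same shared counter.
__global__ void two_group_counter(int *out)
{
    __shared__ int val;
    int tid = threadIdx.x;

    if (tid == 0) val = 0;   // one thread initializes the shared word
    __syncthreads();

    if (tid % 2 == 0) {
        atomicAdd(&val, 1);  // first group writes to val
    } else {
        atomicAdd(&val, 2);  // second group writes to the same address
    }
    __syncthreads();         // single barrier, placed outside both branches

    if (tid == 0) *out = val;
}

// Hypothetical host helper: launches one block and returns the final counter value.
int run_two_group_counter()
{
    int *d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_out, sizeof(int));
    two_group_counter<<<1, kGroupThreads>>>(d_out);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

With 32 even and 32 odd threads, the atomic additions yield 32 * 1 + 32 * 2 = 96 regardless of the order in which the threads run. Replacing atomicAdd with a plain val += ... would allow updates to be lost.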
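__shared__ float val[N];   // element type and size N are placeholders
...
if (index < n)
{
    if (/* tid condition */)
    {
        // do something with val
        __syncthreads();   // only threads that satisfy the tid condition reach this barrier
    }
    // do something with val
}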