Synchronized Code — Unlocking the Power of Accelerators and Parallel Programs, by TonyM | October 2022

Locks, semaphores, critical sections, and other synchronization constructs used to prevent crashes and bugs


In our previous article, An Introduction to Accelerators and Parallel Programming, we talked about the basics of parallel programming, including some simple ways to take a program and run it on multiple CPUs or accelerators. Of course, that post only scratched the surface of the parallel and accelerated programming challenges.

In general, this type of programming uses an asynchronous model of execution, which means that multiple computations will be happening in a system at the same time. The challenge is making sure this execution is fast and accurate.

In this post, I will talk about making sure that programs are running correctly using synchronization.

First, a quick detour on the word “thread”, because it is much easier to type “thread” than “a part of your program that executes concurrently with other parts of your program on a single piece of hardware.”

In a parallel program, you write code that can run on multiple processors simultaneously. Each piece of code that runs simultaneously is a thread. This may sound complicated, but luckily, most programming paradigms let us focus on what we want to parallelize without worrying about the fine-grained details of how threads are created and managed.

Synchronization constructs enforce the order of execution of our threads. When we allow threads to run, we need to consider a number of things to ensure that our program behaves as we expect, including:

  1. Are multiple threads trying to access a shared resource at the same time?
  2. Do multiple threads need to complete their work before moving on to the next part of the program?
  3. Are there issues that will cause our program to hang such that all the threads are waiting for each other for some reason?

Why does synchronization matter?

Imagine we want to do something simple like count the number of people in a stadium, and we have three people to do the counting. A simple algorithm would be:

1. Divide the stadium into sections

2. Assign each of the three people a set of sections to count

3. Each person counts the people in their assigned sections

4. Add up each person’s count into a total for the entire stadium

It sounds straightforward, but there’s a catch: to update the total in step 4, each person needs to read the current total, add their own count to it, and then write the updated total down on paper for everyone to see. So there are really three actions (read, add, write) in updating the total.

Figure 1

Per Figure 1, the correct answer to this problem is 18. Each thread has a variable, local_count, which is created and used only in that thread, which means each thread’s count of its sections is independent of the other threads. However, when the threads go to update the shared total, a potential problem occurs.

Let’s say thread 1 and thread 2 try to update total simultaneously. It is possible that they both read a value of 0 for the total and then add their local_count to it, which means they think the updated value of total should be 0+4=4 or 0+6=6, respectively. When they both go to write total, the result may be either 4 or 6.

Either way, the total after both threads update should actually be 10. This is a data race, and it is why synchronization matters for correctness: we need to ensure that shared variables are updated one at a time. Here’s a simple example of what it might look like using OpenMP:

There are three different OpenMP compiler directives at work in this example.

  • omp parallel sections – tells the compiler to run each section within the following block of code in parallel (i.e., each section on its own thread). Execution waits at the end of the block until every section has completed.
  • omp section – tells the compiler that the following block of code is a self-contained section that can be run on a thread.
  • omp critical – tells the compiler to allow only one thread at a time to execute the following block of code, which here is just a one-line addition to the total.

To compile the code, I used the Intel® oneAPI DPC++/C++ compiler, which supports OpenMP, using the following command:

> icx count_sections_omp.cpp -o serial.exe
> icx -fopenmp count_sections_omp.cpp -o parallel.exe

The first command compiles without OpenMP enabled, while the second command tells the compiler to use OpenMP pragmas. My test system, in this case, is my HP Envy 16″ laptop powered by an Intel® Core™ i7–12700H processor with 32GB of RAM. Running the two executables looks like this:

Elapsed time in milliseconds: 1570 ms
Elapsed time in milliseconds: 520 ms

You can see the difference in runtime between parallel code and code running on only one CPU.

Just for fun, I removed the #pragma omp critical directive from my code and ran it several times. After a few tries, I got this output:

Elapsed time in milliseconds: 520 ms

In this case, you can see that the local count from the first section (with a value of 4) was somehow missed, because the threads did not add to the total in an orderly manner.

Resource-Sharing Synchronization

There are several synchronization constructs that protect access to resources, and I won’t be able to give code examples for all of them. To give you a feel for them, here are some common ones and what they’re typically used for:

  • Critical section — allows only one thread at a time to run the code it protects.
  • Lock/Mutex – protects a code section by requiring each thread to explicitly acquire the lock before running the code. It differs from a critical section in that there are cases where multiple threads may simultaneously access a shared resource (for example, the reader/writer paradigm discussed below).
  • Semaphore – given a predefined number N, allows at most N threads to run the protected code simultaneously.

Note that synchronization, if used inappropriately, can lead to incorrect results and problems where threads fail to progress or hang.

One of the most frequently used parallel programming paradigms is reader/writer. This synchronization is used when two types of users access a shared value in memory:

  • Readers – need to see the shared value
  • Writers – need to update the shared value

If you think about it, multiple readers may look at the value at once, because they will all see the same value. However, when a writer needs to update the value, it must prevent all others from reading or writing it. This ensures that the program presents a consistent value to all readers and writers.

To make it more concrete, let’s look at an example of what it looks like in SYCL and how reader/writer usage affects the behavior and performance of your program.

Reader/Writer SYCL Example

Let’s see how SYCL uses reader/writer synchronization to control access to an array.
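The article’s listing is not reproduced here; the sketch below is my reconstruction of what Read() and Write() plausibly look like, with the bodies of the simulated work and the WORK_ITERS value being assumptions rather than the original code:

```cpp
#include <sycl/sycl.hpp>

// Tuned so one task takes roughly a second; adjust for your machine.
constexpr size_t WORK_ITERS = 100000000;

void Read(sycl::queue &q, sycl::buffer<float, 1> &buf) {
    // Asynchronous submission: returns before the task completes.
    q.submit([&](sycl::handler &h) {
        // read accessor: many reader tasks may run concurrently
        auto acc = buf.get_access<sycl::access::mode::read>(h);
        h.single_task([=] {
            float v = 0;
            for (size_t i = 0; i < WORK_ITERS; ++i)  // simulated work
                v += acc[0];
        });
    });
}

void Write(sycl::queue &q, sycl::buffer<float, 1> &buf) {
    q.submit([&](sycl::handler &h) {
        // write accessor: requires exclusive access, so these tasks serialize
        auto acc = buf.get_access<sycl::access::mode::write>(h);
        h.single_task([=] {
            for (size_t i = 0; i < WORK_ITERS; ++i)  // simulated work
                acc[0] += 1.0f;
        });
    });
}
```

The key design point is that the SYCL runtime builds the dependency graph from the accessor modes: it is the accessor, not any explicit lock in user code, that decides which submitted tasks may overlap.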

You can see that the read and write functions are basically the same. Some key things to understand:

  • Read() creates a sycl::access::mode::read accessor
  • Write() creates a sycl::access::mode::write accessor
  • Each function submits its work to the queue of the chosen compute device. The submission is asynchronous, so the call returns before the task is completed.
  • Both functions enqueue the same task to be executed

Also, to make the test easier to follow, I tuned the doWork() task to run for about a second on my particular machine via the WORK_ITERS variable. If you test it yourself, you can adjust it to make the test faster or slower.

Now that we have our main read and write operations, let’s look at how read vs write access mode affects a program:
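The article’s main program is also not shown. Assuming Read() and Write() helpers that each submit one roughly one-second task using a read or write accessor (my reconstruction, not the original code), the driver plausibly looks like this:

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

constexpr size_t NUM_ACCOUNTS = 8;

// Declarations for the helpers described in the text (bodies assumed):
// each submits one ~1-second task with a read or write accessor.
void Read(sycl::queue &q, sycl::buffer<float, 1> &buf);
void Write(sycl::queue &q, sycl::buffer<float, 1> &buf);

int main() {
    sycl::queue q;
    std::printf("Running on device: %s\n",
                q.get_device().get_info<sycl::info::device::name>().c_str());

    std::vector<float> data(NUM_ACCOUNTS, 0.0f);
    sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(NUM_ACCOUNTS));

    auto start = std::chrono::steady_clock::now();

    for (size_t i = 0; i < NUM_ACCOUNTS; ++i)  // eight reads: may run in parallel
        Read(q, buf);
    for (size_t i = 0; i < NUM_ACCOUNTS; ++i)  // eight writes: must serialize
        Write(q, buf);
    for (size_t i = 0; i < NUM_ACCOUNTS; ++i)  // eight more reads: parallel again
        Read(q, buf);

    q.wait();  // block until every queued task has finished

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("Elapsed time in milliseconds: %lld ms\n", (long long)ms);
    return 0;
}
```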

For this program, the lines of interest are the three loops, each of which iterates NUM_ACCOUNTS (8) times. Considering how reader/writer synchronization works, the runtimes should be:

  • First loop: eight reads of our buffer (one second, happen in parallel)
  • Second loop: eight writes to our buffer (eight seconds, run in sequence)
  • Third loop: eight reads of our buffer (one second, happen in parallel)

Note that the wait on the queue after the loops causes the program to block until all asynchronous operations have completed before proceeding.

Once again, I used the Intel DPC++ Compiler to compile my code and run this:

>icx -fsycl read_write_sycl.cpp
Running on device: 12th Gen Intel(R) Core(TM) i7-12700H
Elapsed time in milliseconds: 11105 ms

Our elapsed runtime is close to our expected value of 10 seconds.

Reader/Writer vs Critical Section

To understand the value of reader/writer, imagine if, instead of the reader/writer construct, we used a locking mechanism where both reads and writes are treated as requiring exclusive access.

To simulate this, I updated the accessor in Read() to use write access mode (single access at a time), changing the first line below to the second:

auto acc = buf.get_access<sycl::access::mode::read>(h);
auto acc = buf.get_access<sycl::access::mode::write>(h);

Recompiling my code and running it again, I get the following output:

Running on device: 12th Gen Intel(R) Core(TM) i7–12700H
Elapsed time in milliseconds: 24035 ms

Here’s how the code should now run, compared to the reader/writer version:

  • First loop: eight reads of our buffer (eight seconds, run in sequence)
  • Second loop: eight writes to our buffer (eight seconds, run in sequence)
  • Third loop: eight reads of our buffer (eight seconds, run in sequence)

This might suggest that reader/writer is always better, but that is not true. A reader/writer lock is a more complex synchronization construct with more runtime overhead, so keep this in mind when choosing which synchronization construct to use.

If you are interested in SYCL and how it helps you protect data with synchronization constructs, you can watch this Introduction to SYCL Basics video.

Accelerators and parallel programming can give us faster applications and programs when used properly. But as with most things in life, benefits don’t always come for free. In this case, as we seek to make our programs run faster and on more diverse compute hardware, we must also learn and understand the APIs that help us meet the current challenges of accelerator and parallel computing.

This post just scratches the surface of synchronization and the pitfalls of accelerator programming. Next time I’ll talk about why a basic understanding of your target accelerator architecture may be important to you as you program for performance.

Want to connect? If you want to see what random tech news I’m reading, you can follow me on Twitter. Also, check out Code Together, an Intel podcast for developers that I host, where we talk tech.
