Part 9 - Parallelizing computations

Posted on Feb 13, 2023
(Last updated: May 26, 2024)

Concurrent programming introduces:

  • The potential for parallel execution (faster, better resource usage)
  • The risk of race conditions (incorrect, unpredictable computations)

The challenge of concurrent programming is introducing parallelism without affecting correctness.

Let’s define what it means to parallelize a computation:

A task $(F, D)$ consists of computing the result $F(D)$ of applying function $F$ to input data $D$.

A parallelization of $(F, D)$ is a collection $(F_1, D_1), (F_2, D_2), \dots$ of tasks such that $F(D)$ equals the composition of $F_1(D_1), F_2(D_2), \dots$.
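
For example, take $F = \mathit{sum}$ and $D = [1, \dots, 100]$: the tasks $(\mathit{sum}, [1, \dots, 50])$ and $(\mathit{sum}, [51, \dots, 100])$ are a parallelization of $(F, D)$, since $\mathit{sum}(D)$ equals the sum of the two partial results.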

Synchronization

Synchronization is required to ensure correctness, but it also introduces computational overhead.

In shared-memory concurrency:

  • Synchronization is based on locking.

  • Locking synchronizes data from cache to main memory, which may involve a 100x overhead.

  • Other costs with locking may include context switching (wait/signal) and system calls (mutual exclusion primitives).
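
As a concrete (if simplified) illustration of the costs listed above, here is a minimal Java sketch of lock-based synchronization; the class and method names are just for this example. The synchronized methods provide mutual exclusion and make updates to the shared counter visible across threads, which is where the overhead is paid.

```java
// Two threads incrementing a shared counter, synchronized with a lock.
public class LockedCounter {
    private long count = 0;

    // Mutual exclusion plus memory visibility: entering and leaving the
    // lock is where the synchronization overhead is paid.
    public synchronized void increment() {
        count++;
    }

    public synchronized long get() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        LockedCounter counter = new LockedCounter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) counter.increment();
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.get()); // always 2000000 thanks to the lock
    }
}
```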

In message-passing concurrency:

  • Synchronization is based on messages.

  • Exchanging small messages is efficient, but sending around large data is quite expensive (still goes through main memory).

  • Other costs associated with message passing may include extra acknowledgment messages and mailbox management (removing unprocessed messages).
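
For comparison, here is a minimal message-passing sketch in Java. Java has no built-in process mailboxes, so a BlockingQueue stands in for one; the names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Synchronization by exchanging a message through a mailbox.
public class MailboxExample {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> mailbox = new ArrayBlockingQueue<>(16);

        // "Server" thread: blocks until a message arrives in its mailbox.
        Thread server = new Thread(() -> {
            try {
                System.out.println("received: " + mailbox.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();

        // "Client" (here, the main thread): synchronizes with the server by
        // sending a message. A small message like this string is cheap;
        // shipping large data this way is not, since it still goes through
        // main memory.
        mailbox.put("hello");
        server.join();
    }
}
```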

An important note about processes: creating a new process is generally expensive compared to a sequential function call within the same process, since it involves:

  • Reserving memory

  • Registering the new process with the runtime system

  • Setting up the process’s local memory (stack and mailbox)

Even if process creation is optimized, the cost of spawning should be weighed against the speedup that can be obtained by the additional parallelism.

In particular, once the number of processes greatly exceeds the number of available processors, spawning more of them yields diminishing returns.

Now let’s cover the techniques that address these trade-offs.

Fork/join parallelism

  • Forking: spawning child processes and assigning them smaller tasks

  • Joining: waiting for the child processes to complete and combining their results

Note that the order in which we wait at a join node for forked children does not affect the total waiting time.

In order to obtain good performance using fork/join parallelism:

  • After forking child tasks, keep some work for the parent task before it joins the children.

  • For the same reason, as a rule, use invoke and invokeAll only at the top level.

  • Perform small enough tasks sequentially in the parent task, and fork child tasks only when there is a substantial chunk of work left.

  • Make sure different tasks can proceed independently – minimize data dependencies.

The advantages of parallelism may only be visible with several physical processors, and on very large inputs.
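
To make the guidelines above concrete, here is a minimal Java fork/join sketch that sums an array (the class name and the cutoff value are arbitrary choices for illustration). The parent forks one child, keeps the other half of the work for itself, falls back to sequential work below the cutoff, and uses invoke only at the top level.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join parallelism: summing an array with a ForkJoinPool.
public class SumTask extends RecursiveTask<Long> {
    private static final int CUTOFF = 10_000;   // below this, just work sequentially
    private final long[] data;
    private final int lo, hi;

    public SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= CUTOFF) {                 // small task: do it sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                                           // fork one child...
        long rightSum = new SumTask(data, mid, hi).compute();  // ...keep work for the parent
        return left.join() + rightSum;                         // join and combine
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        // invoke only at the top level, as recommended above
        long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);
    }
}
```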

Pools and work stealing

Process pools are a technique to address the problem of using an appropriate number of processes.

A pool creates a number of worker processes upon initialization. The number of workers is chosen according to the resources actually available to run them in parallel – a detail that pool users need not know about:

  • As long as more work is available, the pool deals a work assignment to a worker that is available.

  • The pool collects the results of the workers’ computations.

  • When all work is completed, the pool terminates and returns the overall result.

This kind of pool is called a dealing pool: it actively deals work to workers.

Workers are servers that run as long as the pool that created them does. A worker can be in one of two states:

  • Idle: waiting for work assignments from the pool.

  • Busy: computing a work assignment.

As soon as a worker completes its work assignment, it sends the result to the pool and goes back to being idle.

A pool keeps track of:

  • The remaining work – not assigned yet

  • The busy workers

  • The idle workers

The pool also stores:

  • A split function, used to extract a single work item

  • A join function, used to combine partial results

  • The overall result of the computation that is underway

The pool terminates and returns the result of the computation when there are no pending work items, and all workers are idle (thus all work has been done). As long as there is some pending work and some idle workers, the pool deals work to some of those idle workers.

When there are no pending work items or all workers are busy, the pool can only wait for workers to send back results.
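
As a rough sketch of a dealing pool in Java (all names here are illustrative, and the executor's internal queue plays the role of the pool's pending work): work items are dealt to a fixed set of workers, and their partial results are combined with a join function as they arrive. The split function is implicit, since the work is assumed to be already split into items.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.function.BinaryOperator;

// A dealing pool sketch: a fixed set of workers, work items handed out as
// workers become available, partial results combined with a join function.
public class DealingPool {
    public static <T> T run(List<Callable<T>> workItems,   // the remaining work
                            BinaryOperator<T> join,        // combines partial results
                            T initial,
                            int nWorkers) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(nWorkers);
        CompletionService<T> results = new ExecutorCompletionService<>(workers);
        try {
            // Submit every work item; the executor hands each one to the
            // next available worker.
            for (Callable<T> item : workItems) results.submit(item);

            // Collect the workers' results and combine them as they arrive.
            T overall = initial;
            for (int i = 0; i < workItems.size(); i++) {
                overall = join.apply(overall, results.take().get());
            }
            return overall;   // all work done: terminate and return the result
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: combine four work items using two workers.
        List<Callable<Integer>> items = List.of(() -> 1, () -> 2, () -> 3, () -> 4);
        System.out.println(run(items, Integer::sum, 0, 2)); // prints 10
    }
}
```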

Dealing pools work well if:

  • The workload can be split in even chunks.

  • The workload does not change over time (it would change, for example, if users sent new tasks or canceled tasks dynamically).

Under these conditions, the workload is balanced evenly between workers, so as to maximize the amount of parallel computation.

In realistic applications, however, these conditions are often not met:

  • It may be hard to predict reliably which tasks take more time to compute.

  • The workload may be highly dynamic.

Stealing pools use a different approach to allocating tasks to workers, one that better addresses these challenging conditions.

A stealing pool associates a queue with every worker process. The pool distributes new tasks by adding them to the workers’ queues. When a worker becomes idle:

  • First, it gets the next task from its own queue.

  • If its queue is empty, it can directly steal tasks from the queue of another worker that is currently busy.

With this approach, workers adjust dynamically to the current working conditions, without requiring a supervisor that can reliably predict the workload required by each task.

With stealing, the pool may even send all tasks to one default thread, letting other idle threads steal directly from it, which simplifies the pool and reduces the synchronization costs it incurs.
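
Here is a conceptual sketch of the per-worker queue that makes this possible (illustrative names, not a production implementation): the owner takes work from one end of its own deque, while idle workers steal from the other end, which reduces contention between the owner and the thieves.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedDeque;

// A per-worker task queue supporting work stealing.
public class WorkerQueue {
    private final ConcurrentLinkedDeque<Runnable> tasks = new ConcurrentLinkedDeque<>();

    public void push(Runnable task) {            // the pool adds new tasks here
        tasks.addFirst(task);
    }

    public Optional<Runnable> next() {           // the owner: next task from its own queue
        return Optional.ofNullable(tasks.pollFirst());
    }

    public Optional<Runnable> steal() {          // an idle worker: steal from the other end
        return Optional.ofNullable(tasks.pollLast());
    }
}
```

This is essentially how Java’s ForkJoinPool behaves: each worker thread keeps forked subtasks in its own deque, and idle workers steal from the opposite end, so the fork/join example shown earlier already runs on a stealing pool.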