Parallel and Distributed Computing

The Problem. You have 1,000 images to compress. At 10 ms per image, one machine working through them one at a time takes 10 seconds. Spread the same work across enough workers and the job can finish in a fraction of that. The difference is strategy.

Each section explains one strategy and lets you run real Python to see it in action. Check your understanding at the bottom once you’ve read all three.

🟡 Serial Computing
1 processor · tasks run one after another

A single processor handles one task at a time. Task 2 cannot start until Task 1 finishes. Total time = sum of every task. Simple and predictable, but it doesn't scale.

graph LR
    T1["⏳ Task 1"] -->|done| T2["⏳ Task 2"] -->|done| T3["⏳ Task 3"] -->|done| T4["⏳ Task 4"] -->|done| R(["✅ Result"])

Code Runner Challenge

Run serial processing and observe the timing. Try changing the unit values and notice that total time always equals the sum.
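The page's actual starter code isn't shown here, so the sketch below is a minimal stand-in: the task names and durations are invented, and `time.sleep` stands in for real work. It demonstrates the serial property the challenge asks you to observe, that total time equals the sum of the parts.

```python
# Hypothetical serial workload: names and durations are made up.
import time

def run_task(name, duration_s):
    """Simulate one unit of work by sleeping for its duration."""
    time.sleep(duration_s)
    return f"{name} done"

tasks = {"Task 1": 0.1, "Task 2": 0.2, "Task 3": 0.15, "Task 4": 0.05}

start = time.perf_counter()
results = [run_task(name, d) for name, d in tasks.items()]  # one after another
elapsed = time.perf_counter() - start

print(results)
print(f"Total: {elapsed:.2f}s (sum of durations = {sum(tasks.values()):.2f}s)")
```

Change any duration and the total shifts by exactly that amount; no task ever overlaps another.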

🔵 Parallel Computing
4+ cores · same machine · shared memory · no network cost

Multiple cores inside the same machine work on different tasks simultaneously. Because cores share memory directly, there is zero network overhead. Total time ≈ slowest single task, not the sum.

graph TD
    Q(["📋 Task Queue"]) --> C1["🖥️ Core 1<br/>Task 1"]
    Q --> C2["🖥️ Core 2<br/>Task 2"]
    Q --> C3["🖥️ Core 3<br/>Task 3"]
    Q --> C4["🖥️ Core 4<br/>Task 4"]
    C1 --> R(["✅ Result"])
    C2 --> R
    C3 --> R
    C4 --> R

Code Runner Challenge

Run parallel processing with ThreadPoolExecutor. Try making Task B much heavier and notice that the other tasks finish while it runs.
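Since the runner's code isn't reproduced here, this is a minimal sketch of the same idea with invented task durations. `time.sleep` releases the GIL, so the sleeping "tasks" genuinely overlap on a thread pool, and the total time tracks the slowest task rather than the sum.

```python
# Hypothetical parallel workload: names and durations are made up.
import time
from concurrent.futures import ThreadPoolExecutor

def run_task(name, duration_s):
    time.sleep(duration_s)  # sleep releases the GIL, so tasks truly overlap
    return f"{name} done"

tasks = {"Task A": 0.1, "Task B": 0.2, "Task C": 0.15, "Task D": 0.05}

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_task, n, d) for n, d in tasks.items()]
    results = [f.result() for f in futures]  # results in submission order
elapsed = time.perf_counter() - start

print(results)
print(f"Total: {elapsed:.2f}s (slowest single task = {max(tasks.values()):.2f}s)")
```

Make Task B 1.0 instead of 0.2 and the total grows to roughly 1.0 s, not 1.3 s: the other three finish in its shadow.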

🟢 Distributed Computing
many machines · network overhead · virtually unlimited scale

Separate machines, called nodes, collaborate over a network. Each node works independently, then sends its result back. The network handshake adds latency to every task, but you gain scale no single machine can match. Distributed computing is essential when the problem is too large for one machine.

graph TD
    Q(["📋 Task Queue"]) --> W1["⚙️ Node 1<br/>Task 1"]
    Q --> W2["⚙️ Node 2<br/>Task 2"]
    Q --> W3["⚙️ Node 3<br/>Task 3"]
    Q --> W4["⚙️ Node 4<br/>Task 4"]
    W1 -->|"📡 network"| R(["✅ Result"])
    W2 -->|"📡 network"| R
    W3 -->|"📡 network"| R
    W4 -->|"📡 network"| R

Code Runner Challenge

Run the distributed simulation with asyncio. Try changing NETWORK_MS to 500 and watch overhead dominate when tasks are light.
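The runner's script isn't shown on this page, so here is a minimal sketch of what such a simulation could look like: the node names, work durations, and the `NETWORK_MS` default are invented. Each task pays a fixed network round-trip on top of its actual work, which is why light tasks suffer most from a slow network.

```python
# Hypothetical distributed simulation: names and timings are made up.
import asyncio
import time

NETWORK_MS = 50   # per-task network overhead; try 500 and watch it dominate
WORK_MS = {"img-1": 100, "img-2": 200, "img-3": 150, "img-4": 50}

async def process_on_node(name, work_ms):
    await asyncio.sleep(NETWORK_MS / 1000)  # handshake: send task to the node
    await asyncio.sleep(work_ms / 1000)     # the node does the actual work
    return f"{name} done"

async def main():
    start = time.perf_counter()
    # gather runs all node coroutines concurrently, preserving input order
    results = await asyncio.gather(
        *(process_on_node(name, ms) for name, ms in WORK_MS.items())
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results)
print(f"Total: {elapsed:.2f}s (network overhead + slowest task)")
```

With NETWORK_MS = 50 the total is about 0.25 s, barely worse than pure parallel; at 500 the handshake alone costs more than all the work combined.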


Check Your Understanding

Now that you've seen all three paradigms, answer without looking back!

❓ A bakery has one oven and bakes one tray of cookies at a time. Tray 2 goes in only after tray 1 comes out. Which computing paradigm does this model?

❓ You are rendering a 3D film. Every frame is completely independent, and your workstation has a 16-core CPU. Which strategy cuts render time most?

❓ Training a state-of-the-art AI model requires 40 TB of data, far more than the RAM of any single machine. Which approach is necessary?

Scenario Challenges

❓ Scenario A: You have a 200 GB dataset to process on a machine with 12 cores. All tasks are fully independent.

❓ Scenario B: A genomics pipeline must search a 500 TB DNA database. The largest single server available has 2 TB of RAM.

❓ Scenario C: You need to run a single, indivisible 30-second calculation that cannot be broken into sub-tasks at all.

Scenario D: Code Challenge

The script below processes 8 images one at a time, which is too slow. Convert the serial loop to run in parallel using ThreadPoolExecutor. The built-in checker will measure your speedup and tell you if you passed.

💡 Hint 1: What kind of problem is this?
The 8 images are completely independent of each other; compressing image 3 doesn't need any result from image 2. When tasks are independent, they're perfect candidates for parallel execution. You need a Python tool that can run the same function on multiple inputs at the same time.
💡 Hint 2: Which module and class?
Python's concurrent.futures module is already imported at the top. The class you want is ThreadPoolExecutor; it manages a pool of worker threads and distributes tasks across them automatically.
from concurrent.futures import ThreadPoolExecutor
💡 Hint 3: How do you create a pool?
Use a with block so the pool is cleaned up automatically when done. max_workers controls how many threads run at once; set it to the number of tasks or your CPU core count, whichever is smaller.
with ThreadPoolExecutor(max_workers=4) as pool:
    # submit work here
💡 Hint 4: How do you map tasks to the pool?
pool.map(fn, iterable) works like the built-in map(): it calls fn once for each item in the iterable, but runs the calls in parallel across the worker threads. Wrap it in list() to collect all results.
results = list(pool.map(compress_image, images))
Notice that compress_image already takes a single argument (image_id), so it plugs straight into pool.map without any changes.
💡 Hint 5: What exactly needs to change?
Only the lines between the two timing markers need to change. Replace the entire serial for loop (including the results = [] line) with a single with block. Everything else, including the function definition, the images list, and the checker, stays exactly as-is.
# BEFORE
results = []
for img_id in images:
    results.append(compress_image(img_id))

# AFTER: swap these four lines for two
🔑 Hint 6: Full solution (try on your own first!)
Replace the serial block with:
start = time.perf_counter()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compress_image, images))

elapsed = time.perf_counter() - start
That's it: two lines of parallel code replace four lines of serial code, and the checker at the bottom does the rest.

Code Runner Challenge

Convert the serial `for` loop to parallel using `ThreadPoolExecutor`. The checker at the bottom will automatically detect whether your code runs faster than the serial baseline and report your speedup.
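The challenge's starter script and checker live in the runner and aren't reproduced here, so the sketch below is a self-contained stand-in: `compress_image`, its 50 ms sleep, and the 1.5x pass threshold are all invented for illustration. It shows the serial baseline, the two-line parallel replacement from the hints, and a simple speedup check side by side.

```python
# Hypothetical stand-in for the Scenario D script: the real page's
# compress_image and checker may differ.
import time
from concurrent.futures import ThreadPoolExecutor

def compress_image(image_id):
    time.sleep(0.05)  # pretend compression takes 50 ms per image
    return f"image-{image_id}.jpg"

images = list(range(8))

# Serial baseline: one image at a time
start = time.perf_counter()
serial_results = [compress_image(i) for i in images]
serial_s = time.perf_counter() - start

# Parallel rewrite: the four-line loop becomes a two-line with-block
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(compress_image, images))
parallel_s = time.perf_counter() - start

speedup = serial_s / parallel_s
print(f"Serial {serial_s:.2f}s, parallel {parallel_s:.2f}s, {speedup:.1f}x faster")
print("PASS" if speedup > 1.5 else "FAIL")
```

With 8 images and 4 workers, the pool runs two rounds of four, so the parallel version takes roughly a quarter of the serial time, and the results come back in the same order.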


Simulating the Difference

Build your own task queue, then watch all three approaches race.

  • Click a task to increase its weight (1→2→…→7→1 dots)
  • Shift+click or drag to multi-select, then click any selected task to cycle all at once
  • Deselect to ungroup
  • Serial: 1 worker, sequential. Parallel: 4 cores, no overhead. Distributed: 4 nodes, 600 ms handshake but 1.5× faster processing. Try 1–2 dots (parallel wins) vs 3+ dots (distributed wins).
[Interactive simulator: Serial (1 worker) · Parallel (4 cores, 700 ms/unit) · Distributed (4 nodes, 450 ms/unit + 📡 600 ms network)]

TL;DR

  • Serial computing executes tasks one after another on a single processor.
  • Parallel computing divides tasks across multiple cores on the same machine, allowing simultaneous execution without network overhead.
  • Distributed computing spreads tasks across multiple machines, which can provide more resources but incurs network communication overhead.
  • The best approach depends on the problem size, resource availability, and performance requirements. For small tasks, parallel computing may outperform distributed due to lower overhead, while for larger tasks, distributed computing can leverage more resources for faster completion.