Why does my multiprocessing Python program hang?

Tags: Python
Published: March 2, 2023
Author: yanbc

Help! My Python program hangs

Have you ever had a Python program that just won't behave? You run it, you wait, you wait some more... and nothing happens. It's like the code has a mind of its own and refuses to cooperate.
Well, the other day at work, my colleague Bob came up to me with one such program. It was a multiprocessing script that just wouldn't finish. Bob was pulling his hair out, wondering what was going on.
I took a look at Bob's code, scratched my head for a moment, and then it hit me. It all came down to Python's multiprocessing queues. And that's what I want to talk to you about today.
But before we dive into the nitty-gritty, let me show you the code that was causing all the trouble.
```python
from multiprocessing import Process, Queue


def worker(q: Queue):
    data = '1' * 105656
    q.put(data, block=False)
    print(f"worker ended")


def main():
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print("worker started")
    # q.get()
    p.join()
    print("worker joined")
    print('all done')


if __name__ == "__main__":
    main()
```

Why multiprocessing?

If you've ever worked on a CPU-bound Python program, you may have heard of the Global Interpreter Lock (GIL). It's a mechanism that prevents multiple threads from executing Python bytecode simultaneously, which can limit performance on multi-core CPUs.
One way to work around the GIL is to use multiprocessing, which lets you execute Python code across multiple processes instead of threads. The multiprocessing library makes it relatively painless to write code that takes advantage of multiple CPU cores.
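To make this concrete, here is a minimal sketch of farming CPU-bound work out to a process pool. The fib function is just a stand-in for any heavy computation:

```python
from multiprocessing import Pool


def fib(n: int) -> int:
    # Deliberately slow, CPU-bound recursion.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


if __name__ == "__main__":
    # Each fib(30) runs in its own process, so one interpreter's
    # GIL doesn't serialize the others.
    with Pool() as pool:
        print(pool.map(fib, [30, 30, 30, 30]))
```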
Bob was so excited about his new multiprocessing program. He had heard that multiprocessing could harness the power of multiple cores to make his program run faster, and he couldn't wait to try it out.
But when he ran the code, here was the output he got:
```
worker started
worker ended
```
Judging from the printed messages, the child process had finished and exited. But the main program just hung there, refusing to exit. He waited and waited, but nothing happened. (You are encouraged to try it out and see for yourself!) Bob was stumped. What could be the problem? He checked the code, but it looked perfectly fine. He scratched his head, puzzled.
And the strange part is that if you pop the message off the queue before calling p.join(), the program exits normally.
Spoiler alert: you should always empty a queue before exiting a program.
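Here is Bob's script with that one change applied; uncommenting the q.get() call is all it takes for the program to exit cleanly:

```python
from multiprocessing import Process, Queue


def worker(q: Queue):
    data = '1' * 105656
    q.put(data, block=False)
    print("worker ended")


def main():
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print("worker started")
    q.get()   # drain the queue first...
    p.join()  # ...and now the join returns
    print("worker joined")
    print('all done')


if __name__ == "__main__":
    main()
```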

The investigation

As if the previous problem wasn't enough, Bob noticed another strange behavior in this program. It would exit normally if he sent a small message, but if he sent a larger one, it would just hang there, as if lost in space. He turned to me for help. To pin down the exact data size that would not block the program, I ran the following script:
```python
from multiprocessing import Process, Queue
import time


def worker(q: Queue, data_size: int):
    # data = [42] * data_size    # max data_size: 32726
    # data = [3.14] * data_size  # max data_size: 7278
    # data = [True] * data_size  # max data_size: 65386
    data = '1' * data_size       # max data_size: 65514
    q.put(data, block=False)


def main():
    q = Queue()
    right = 105656
    left = 0
    while True:
        if left >= right - 1:
            break
        data_size = (left + right) // 2
        print(f"data size: {data_size}")
        p = Process(target=worker, args=(q, data_size))
        p.start()
        time.sleep(0.1)
        if p.is_alive():
            # it hangs
            right = data_size
        else:
            # it exits
            left = data_size
        q.get()
        p.join()
    print(f'max data_size: {left}')


if __name__ == "__main__":
    main()
```
The program exited whenever the data size was no larger than 65514, which is suspiciously close to 65536, or 2^16. Coincidence? I don't think so. As you can see, I also tried a few other data types, and the maximum data sizes all turned out to be close to 65536 divided by some small integer.
| data type | max data size | factor |
| --- | --- | --- |
| char | 65514 | 65536 / 65514 = 1.0003358060872485 |
| integer | 32726 | 65536 / 32726 = 2.0025667664853635 |
| float | 7278 | 65536 / 7278 = 9.004671613080516 |
| boolean | 65386 | 65536 / 65386 = 1.0022940690667728 |

The queues are to blame

Ah, the plot thickens! The culprit behind Bob's hanging program was the multiprocessing queue. You see, on Linux systems Python implements multiprocessing.Queue on top of pipes. And pipes have a default buffer size of 16 pages, which comes to 65536 bytes (16 × 4096). Writing a message larger than this will block the writer.
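You can observe this limit without any multiprocessing at all. The following sketch (Linux-specific, and independent of Bob's code) puts the write end of a raw pipe into non-blocking mode and checks how many bytes it accepts before it would block:

```python
import fcntl
import os

r, w = os.pipe()
# Make the write end non-blocking so this demo can't hang.
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Try to write far more than the buffer can hold.
written = os.write(w, b"x" * 200_000)
print(f"the pipe accepted {written} bytes")  # typically 65536 on Linux

os.close(r)
os.close(w)
```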
To make things worse, the queue starts a background thread (the feeder thread) to do the actual data writing. The intention, I guess, was to let your code keep executing without waiting on the slow pipe I/O. But joining this thread then becomes part of the cleanup process. In Bob's code, the child process begins this cleanup right after printing "worker ended". It internally calls join on the feeder thread, but the thread won't quit because it still has data to write. This join call is what blocks the child process, and consequently blocks the main process from exiting.
Emptying the queue by calling q.get() essentially removes this data from the buffer. It allows the feeder thread to finish writing the data to the underlying pipe and then exit. That's why p.join() on the child process returns if you pop the message off the queue before calling it.
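For completeness, the standard library also offers an escape hatch: Queue.cancel_join_thread() tells the queue not to wait for the feeder thread at process exit. Here is a minimal sketch of it applied to Bob's worker; be warned that any data still sitting in the buffer can be silently lost:

```python
from multiprocessing import Process, Queue


def worker(q: Queue):
    q.put('1' * 105656, block=False)
    # Don't block process exit on the feeder thread.
    # Warning: data not yet flushed to the pipe may be lost.
    q.cancel_join_thread()


if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()  # returns even though the queue was never drained
    print('all done')
```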
But why does the program start hanging past 65514 instead of 65536 for the char data type? Well, after some digging into the source code, here's what I found:
  1. Before the feeder thread actually writes the data, it first tells the receiving end how many bytes to expect. This header normally takes up 4 bytes.
  2. All data is serialized before being written to the pipe. Pickle is the default serializer.
Here is a script to confirm my findings:
```python
from typing import Any
import pickle as pkl
import struct


def get_raw_data(data: Any) -> bytes:
    '''
    This is a simplified version of the data preparing process.
    For more information, check out
    multiprocessing.connection.Connection._send_bytes
    '''
    serialized_data = pkl.dumps(data)
    header = struct.pack("!i", len(serialized_data))
    return header + serialized_data


if __name__ == "__main__":
    char_data = '1' * 65514
    int_data = [42] * 32726
    float_data = [3.14] * 7278
    bool_data = [True] * 65386
    print(f"char data byte size: {len(get_raw_data(char_data))}")
    print(f"integer data byte size: {len(get_raw_data(int_data))}")
    print(f"float data byte size: {len(get_raw_data(float_data))}")
    print(f"boolean data byte size: {len(get_raw_data(bool_data))}")
```
It prints out:
```
char data byte size: 65536
integer data byte size: 65536
float data byte size: 65536
boolean data byte size: 65536
```
Case closed.

The takeaway lesson

When using multiprocessing in Python, you need to be aware of how multiprocessing queues work. Multiprocessing queues can cause your program to hang if you don't handle them properly, and this can be particularly tricky to debug.
One important thing to remember is that you should always empty the queue before exiting your program. If there are any items left in the queue, your program may not exit cleanly and can hang indefinitely.
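If you don't know how many items are left, a small drain helper like the sketch below does the job before you join the producer. Items still in flight through a feeder thread may not be visible yet, so treat this as a sketch rather than a bulletproof shutdown routine:

```python
import queue  # multiprocessing queues raise queue.Empty


def drain(q) -> None:
    """Pop everything currently visible in the queue."""
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            break
```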
Another thing to keep in mind is that when you put an object onto a multiprocessing queue, the object might not actually leave the process's memory space even though the function call has returned. This can lead to out-of-memory errors that are difficult to diagnose.