How to kill zombie processes using GPU?

Minh Nguyen Le

Last updated on Jun 30, 2022 3 min read Linux, OS

Overview

What is a zombie process?

As you know, when we start an application on Linux, the OS creates a process, and this process can start other processes. The process that starts other processes is referred to as the parent, and the new processes are referred to as its children. Linux keeps information about every process in a table called the process table. The parent and its children run almost independently, but they sometimes share resources (input, output) or context. When a child finishes its job, the kernel sends a SIGCHLD signal to the parent. The parent then reads the child's exit code (with wait), after which the kernel removes the child's entry from the process table and releases what the child was using. Sometimes this cleanup does not happen: the parent never collects the exit code, or the parent dies unexpectedly. Strictly speaking, a child that has exited but has not been collected is a zombie, and a child that outlives its parent is an orphan; this post follows the common habit of calling both of them zombie processes.
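
To make the two cases concrete, here is a minimal sketch that filters the output of ps. The helper names list_zombies and list_orphans are made up for this example, and the orphan check assumes orphans are adopted by PID 1, which is not true on every system (some reparent them to a per-user service manager instead).

```shell
# Both helpers read `ps -eo pid,ppid,stat,comm`-style text on stdin
# (columns: PID PPID STAT COMMAND, first line is the header).

# Print "PID PPID COMMAND" for processes whose state starts with Z
# (exited, but not yet collected by the parent).
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Print "PID PPID COMMAND" for processes named "$1" whose parent is PID 1.
# Assumption: orphans are reparented to PID 1 on this machine.
list_orphans() {
    awk -v cmd="$1" 'NR > 1 && $2 == 1 && $4 == cmd { print $1, $2, $4 }'
}

# Usage on a live system:
#   ps -eo pid,ppid,stat,comm | list_zombies
#   ps -eo pid,ppid,stat,comm | list_orphans python
```

A process listed by list_orphans is only a candidate: many ordinary daemons also have PID 1 as their parent, which is why the helper filters by command name.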

The problem of zombie processes

Because these children (zombies) outlive their parent, the resources they use are not released, so other processes cannot use them. To free the resources, we have to kill the children manually by their process IDs. The main question is how to find those IDs. The next section answers it.
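
Once a PID is known, the manual kill itself can be scripted. The sketch below is one possible way to do it, not a standard tool: stop_pid is a hypothetical helper that sends the default SIGTERM first and falls back to SIGKILL (kill -9) only if the process is still there after a grace period, which is 5 seconds unless a second argument is given.

```shell
# stop_pid PID [GRACE_SECONDS]
# Ask the process to exit with SIGTERM, wait up to GRACE_SECONDS,
# then force it with SIGKILL if it is still alive.
stop_pid() {
    pid=$1
    grace=${2:-5}                        # seconds to wait before SIGKILL
    kill "$pid" 2>/dev/null || return 1  # no such process, or not permitted
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # gone after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0       # exited during the last wait
    kill -9 "$pid" 2>/dev/null                   # still alive: force kill
}

# Usage:
#   stop_pid 25078        # wait up to 5 seconds before kill -9
#   stop_pid 25078 1      # wait 1 second only
```

Killing a process owned by another user requires root, so the function would have to run under sudo in that case.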

Killing zombie processes using GPU

As a research engineer, I regularly use GPUs to train deep learning models and check resource usage with the nvtop command. Each process using a GPU normally has an entry in the nvtop table; this post calls that process the parent. The entry contains information about the process, for example PID (the parent's process ID), USER (the user that owns the parent), GPU (the ID of the GPU used by the parent), and so on.

To kill a process that is using a GPU, we simply run kill PID or kill -9 PID. In some cases this does not work, for example for the process with PID=18309 in figure 1. This is because the process (the parent) is already dead (indicated by N/A in the USER column), but its orphaned children are still alive and hold the resources (in this case, the zombie processes hold about 85% of GPU MEM). To find the child processes, execute sudo fuser -v /dev/nvidia*, which lists every process using the GPU device files, grouped by device. For example, running sudo fuser -v /dev/nvidia* on my training server gives output like this:

$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        nguyenlm  15909 F.... nvtop
                     nguyenlm  20717 F.... nvtop
                     nguyenlm  21042 F.... nvtop
                     root      24536 F.... nvtop
                     nguyenlm  24787 F...m tensorboard
                     nguyenlm  25078 F...m python
                     nguyenlm  25079 F...m python
                     nguyenlm  25080 F...m python
                     nguyenlm  25081 F...m python
                     nguyenlm  25082 F...m python
                     nguyenlm  25085 F...m python
                     nguyenlm  32199 F...m python

From the output, a dozen processes are using GPU 0 (python, nvtop, tensorboard). We could simply kill all of them by PID with the kill command mentioned above to release the resources. There is an easier way, based on one observation: zombie processes usually have consecutive PIDs. Looking at the output carefully, we see a group of processes with PIDs ranging from 25078 to 25082, and those are the zombie PIDs.
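
The consecutive-PID observation can be automated. The following is a rough sketch under a few assumptions: consecutive_pids is a hypothetical helper, it expects text laid out like the fuser -v listing above (PID third from the end of each line, COMMAND last), and it treats any run of two or more consecutive PIDs as a candidate group.

```shell
# consecutive_pids [COMMAND]
# Read `fuser -v`-style text on stdin and print each run of two or more
# consecutive PIDs belonging to COMMAND (default: python), one run per line.
consecutive_pids() {
    # Keep only the PID column of lines whose last field is COMMAND.
    awk -v cmd="${1:-python}" \
        '$NF == cmd && $(NF-2) ~ /^[0-9]+$/ { print $(NF-2) }' |
    sort -n -u |    # the same PID can appear under several device files
    awk '
        NR > 1 && $1 == last + 1  { run = run " " $1; n++ }   # extend run
        NR == 1 || $1 != last + 1 { if (n > 1) print run; run = $1; n = 1 }
        { last = $1 }
        END { if (n > 1) print run }
    '
}

# Usage (assumption: fuser writes the verbose table to stderr on your
# system, hence the 2>&1):
#   sudo fuser -v /dev/nvidia* 2>&1 | consecutive_pids python
```

For the listing above this prints the single line 25078 25079 25080 25081 25082. A run of consecutive PIDs is only a hint, since the workers of a healthy training job can look the same, so check the list before killing anything.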
