How to kill zombie processes using the GPU?
Overview
What is a zombie process?
As you know, when we start an application on Linux, the OS creates a process, and this process can start other processes. The process that starts other processes is referred to as the parent, and the new processes are referred to as its children. Linux keeps the information about processes in a table called the process table. The parent and the children run almost independently, but they sometimes share resources (input, output) or contexts. When a child finishes its job, the kernel sends a SIGCHLD signal to the parent. The parent then reads the exit code of the child, and the child's entry is removed from the process table, which completes the cleanup of the child. Until the parent does so, the finished child stays in the process table as a zombie. Sometimes, however, the parent never reads the exit code, or the parent dies unexpectedly; in the latter case the children outlive their parent and Linux refers to them as orphaned processes. Strictly speaking, a zombie has already exited while an orphan is still running, but in this post I use zombie loosely for the leftover children that outlive their parent.
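To make the zombie state concrete, here is a minimal Python sketch, assuming a Linux system with /proc mounted. The function names process_state and demo_zombie are my own, chosen for illustration. The sketch forks a child that exits immediately, polls until the kernel reports the child in state Z, and then reaps it with os.waitpid, which is the "parent reads the exit code" step described above.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so the state is the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits at once, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with an arbitrary exit code

    # Parent: poll until the kernel marks the unreaped child as a zombie.
    deadline = time.monotonic() + 5.0
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    # Reading the exit status (reaping) removes the process-table entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    state, code = demo_zombie()
    print(f"state before reaping: {state}, exit code: {code}")
```

A zombie in this strict sense holds only its process-table entry, and it disappears as soon as its parent reaps it or the parent itself exits.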
The problem of zombie processes
Because these children (the zombies) outlive the parent, the resources they hold are not released, and hence other processes cannot use those resources. To overcome this problem, we need to kill the children manually by their ids. But the main question is: how can we find the ids of the children? To answer that question, let's continue to the next sections.
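Before turning to the GPU-specific approach, one generic way to find the children of a process is to scan /proc. The sketch below assumes Linux with /proc mounted; read_status and children_of are hypothetical helper names, not part of any library. It reads the Name, State and PPid fields of /proc/<pid>/status for every process and keeps those whose parent id matches.

```python
import os


def read_status(pid):
    """Return the key/value pairs of /proc/<pid>/status as a dict of strings."""
    info = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            info[key] = value.strip()
    return info


def children_of(parent_pid):
    """List (pid, state, name) for every process whose parent is parent_pid."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            info = read_status(int(entry))
            ppid = int(info["PPid"])
        except (OSError, KeyError, ValueError):
            continue  # the process vanished or could not be read
        if ppid == parent_pid:
            # State looks like "S (sleeping)" or "Z (zombie)"; keep the letter.
            found.append((int(entry), info["State"].split()[0], info["Name"]))
    return sorted(found)


if __name__ == "__main__":
    # Orphans are commonly adopted by PID 1, alongside ordinary daemons.
    for pid, state, name in children_of(1):
        print(pid, state, name)
```

On many systems orphaned processes are re-parented to PID 1, so children_of(1) is a reasonable starting point, although some setups hand orphans to a different reaper process instead.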
Killing zombie processes using GPU
Working as a research engineer, I usually use GPUs to train deep learning models and check the resources in use with the nvtop command. Each process using a GPU has an entry in the nvtop table, and I refer to that process as the parent process here. The entry holds some information about the process, for example PID (the parent's process id), USER (the user that the parent belongs to), GPU (the id of the GPU used by the parent), and so on.
To kill a process using the GPU we simply use the command kill PID or kill -9 PID, but in some cases we cannot kill the process that way, for example the process with PID=18309 in Figure 1. This is because the process (the parent) is already dead (indicated by N/A in the USER column), but its children (orphaned) are still alive and hold the resources (in this case, the zombie processes are holding about 85% of GPU MEM). To find the child processes, you have to execute sudo fuser -v /dev/nvidia* and all processes using the GPUs will be listed per GPU device. For example, running the sudo fuser -v /dev/nvidia* command on my training server gives output like this:
$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        nguyenlm  15909 F.... nvtop
                     nguyenlm  20717 F.... nvtop
                     nguyenlm  21042 F.... nvtop
                     root      24536 F.... nvtop
                     nguyenlm  24787 F...m tensorboard
                     nguyenlm  25078 F...m python
                     nguyenlm  25079 F...m python
                     nguyenlm  25080 F...m python
                     nguyenlm  25081 F...m python
                     nguyenlm  25082 F...m python
                     nguyenlm  25085 F...m python
                     nguyenlm  32199 F...m python
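If you want to work with this listing programmatically, the following is a small sketch of a parser, assuming the text has the layout shown above (a header row, then rows of user, PID, access flags and command, with the device name prefixed to the first row of each device). FuserEntry and parse_fuser_verbose are hypothetical names introduced for illustration; the sketch parses text you have already captured and does not call fuser itself.

```python
from typing import NamedTuple


class FuserEntry(NamedTuple):
    device: str
    user: str
    pid: int
    access: str
    command: str


def parse_fuser_verbose(text):
    """Parse `fuser -v` style text into a list of FuserEntry records."""
    entries = []
    device = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "USER":
            continue  # skip blank lines and the header row
        if fields[0].endswith(":"):
            device = fields[0][:-1]  # a new device section starts on this row
            fields = fields[1:]
        if len(fields) < 4 or not fields[1].isdigit():
            continue  # not a process row (for example the shell prompt line)
        user, pid, access = fields[0], int(fields[1]), fields[2]
        entries.append(FuserEntry(device, user, pid, access, " ".join(fields[3:])))
    return entries


if __name__ == "__main__":
    sample = """USER PID ACCESS COMMAND
/dev/nvidia0: nguyenlm 24787 F...m tensorboard
nguyenlm 25078 F...m python
nguyenlm 25079 F...m python
"""
    for entry in parse_fuser_verbose(sample):
        print(entry)
```

Filtering the result by command, for example keeping only the python rows, gives the candidate PIDs to inspect.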
From the output, we have a dozen processes using GPU=0 (python, nvtop, tensorboard). We can simply kill them all by their PIDs with the kill command mentioned above to release the resources. However, an observation makes this easier: zombie processes usually have consecutive ids, so if we look at the output carefully we see a group of processes with ids ranging from 25078 to 25082, and those are actually the zombie PIDs.
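The consecutive-id observation can be sketched in Python as follows. This is only an illustration of the heuristic, not a definitive tool: consecutive_runs and terminate are hypothetical helper names, and the grace period is an arbitrary choice. The first function groups PIDs into runs of consecutive integers; the second mirrors kill PID followed by kill -9 PID, and needs the same permissions those commands would need.

```python
import os
import signal
import time


def consecutive_runs(pids, min_length=2):
    """Group PIDs into runs of consecutive integers, keeping runs of min_length or more."""
    runs, current = [], []
    for pid in sorted(set(pids)):
        if current and pid == current[-1] + 1:
            current.append(pid)
        else:
            if len(current) >= min_length:
                runs.append(current)
            current = [pid]
    if len(current) >= min_length:
        runs.append(current)
    return runs


def terminate(pid, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL (kill -9)."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the process is already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID still exists
        except ProcessLookupError:
            return
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass


if __name__ == "__main__":
    # PIDs of the python rows in the listing above.
    python_pids = [25078, 25079, 25080, 25081, 25082, 25085, 32199]
    for run in consecutive_runs(python_pids):
        print("candidate group:", run)
```

Because consecutive ids are only a heuristic, it is worth reviewing each candidate group before passing its PIDs to terminate.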