We had a Java Spring Boot microservice that was not doing any heavy lifting, yet on every environment we deployed it to, it consistently ran at close to 200% CPU usage.
To debug the cause of this unexpected behavior, we followed the steps below and found the reason behind it.
Checking CPU Usage
How did we know the service had high CPU usage in the first place?
You can use the kubectl top command to see the resource usage of pods.
$kubectl top pods -n a_team | grep the-service
NAME CPU(cores) MEMORY(bytes)
the-service-5cdd47bc56-b2qlc 2000m 1122Mi
the-service-5cdd47bc56-x7tx5 1998m 1681Mi
We had 2 CPU cores assigned to each pod of the service.
Here you can see the 200% CPU usage reported as ~2000m in the CPU(cores) column (1000m equals one core).
Taking the Thread Dump
In the kubectl top output above, you can see that the CPU usage of the-service is very high. It did not matter how many pods we ran; all of them showed very high CPU usage. Since there was no load on the service, something inside the service itself had to be consuming resources like this.
From this we knew we had to look at a thread dump of the service, but the app was running in Kubernetes and the Docker base image it used did not ship jmap with its Java installation, nor could we install a tool like jstack into the running pod.
We had another option though: send a kill -3 (SIGQUIT) signal to the Java process of the service, which does not kill the app but makes the JVM print a thread dump to the console/logs.
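As an aside, if exec access is not available either, roughly the same information can be collected from inside the application via the JMX ThreadMXBean API. This is just a minimal sketch of that alternative, not what we used here:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Ask the JVM for stack traces of all live threads
        // (the two flags also request locked monitors and synchronizers)
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            // Each ThreadInfo prints a header and stack similar to a kill -3 dump entry
            System.out.print(info);
        }
    }
}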
To capture the logs, keep a kubectl logs command running in one terminal. A sample command looks like this:
$kubectl logs -f --tail 100 the-service-5cdd47bc56-b2qlc -n a_team
#a_team: namespace
#the-service-5cdd47bc56-b2qlc: one of the running pods of the service
To send the kill signal to the Java process, we need to exec into the pod. For that, run in another terminal:
$kubectl exec --stdin --tty the-service-5cdd47bc56-b2qlc -n a_team -- /bin/bash
Once inside the pod's bash shell, we looked at the running processes with the top command.
$top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 app 20 0 9380672 832056 13020 S 197.9 1.3 26358:44 java
...
9105 app 20 0 56192 1984 1444 R 0.0 0.0 0:00.01 top
In the top output you can see the java process with PID 1 running at close to 200% CPU usage.
Before sending a kill -QUIT signal to this process, we also need to look at which threads were consuming CPU at this level.
To show per-thread details, press SHIFT + H while top is running:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
50 app 20 0 9380672 832004 13020 R 99.7 1.3 13053:15 java
26 app 20 0 9380672 832004 13020 R 99.3 1.3 13050:38 java
...
Here you can see that the threads with PIDs 50 and 26 were each taking close to 100% of a CPU core.
Now that we have the thread and process PIDs, we are ready to take a dump, a thread dump. For that, just run the kill command:
$kill -QUIT {Java Process PID, in our example 1}
When you run this command from inside the pod, the terminal that is tailing the logs will print the thread dump.
Finding the Culprit
Once you have the dump, you need to find the threads that are causing the high CPU usage.
For that, convert the thread PIDs to their hex representation:
50 -> 0x32
26 -> 0x1a
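If you prefer not to convert by hand, any quick snippet will do; for example, in Java (purely illustrative, using the PIDs from our top output):
public class PidToHex {
    public static void main(String[] args) {
        // top shows thread IDs in decimal; the nid field in the thread dump is in hex
        int[] threadPids = {50, 26};
        for (int pid : threadPids) {
            System.out.println(pid + " -> 0x" + Integer.toHexString(pid)); // 50 -> 0x32, 26 -> 0x1a
        }
    }
}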
Then search the thread dump for the stack traces whose nid matches these hex values. Those are the threads you are looking for.
"Thread-4" #17 prio=5 os_prio=0 tid=0x00007f33c5133000 nid=0x1a runnable [0x00007f33a02de000]
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.sleep(Native Method)
at com.ateam.theservice.MyClass.run(MyClass.java:73)
at java.lang.Thread.run(Thread.java:748)
"Thread-10" #42 prio=5 os_prio=0 tid=0x00007f33c5096800 nid=0x32 runnable [0x00007f3359ff6000]
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.sleep(Native Method)
at com.ateam.theservice.MyClass.run(MyClass.java:73)
at java.lang.Thread.run(Thread.java:748)
In our case, the problematic threads were running an infinite while(true) loop, inside which they waited for a condition. Between condition checks the threads were supposed to sleep for some time, but the sleep time had been misconfigured to 0, so the threads never actually slept and kept spinning forever, leaving the CPU, and us, wide awake.
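The code looked roughly like the sketch below; the names and details are simplified reconstructions of what the dump pointed at, but the pattern is the same:
public class MyClass implements Runnable {
    // Assumption for this sketch: the interval came from configuration and was set to 0 by mistake
    private final long sleepMillis = 0;
    private volatile boolean conditionMet = false;

    @Override
    public void run() {
        while (true) {
            if (conditionMet) {
                // ... do the actual work ...
            }
            try {
                // Thread.sleep(0) returns immediately, so the loop busy-spins
                // and pins a CPU core, which is exactly what top and the dump showed
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
Restoring a sane, non-zero sleep interval (or, better, replacing the polling loop with a proper blocking wait) is all it takes to stop such a thread from burning a full core.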