Long JVM pauses without GC

Folks using Java for systems engineering might already be familiar with GC (Garbage Collection) and how GC logs can be important for debugging performance and related issues. GC activity can severely impact the dynamics of system built in Java. If it’s a distributed system, then impact could be worse since there are several moving components of the system that need to talk to each other.

In this post, I will give insight into a very important aspect of GC logging that engineers should be aware of when trying to understand the logs and looking for potential issues.

Stop-the-world GC operationscan lead to long JVM pauses as application threads are suspended while JVM does garbage collection.

Let’s look at a sample snippet from GC log indicating GC activity and the corresponding JVM pause:

[GC pause (G1 Evacuation Pause) (young)
Desired survivor size 54525952 bytes, new threshold 8 (max 15)
- age   1:   10867224 bytes,   10867224 total
- age   2:    5046712 bytes,   15913936 total
- age   3:   17522776 bytes,   33436712 total
- age   4:    2174200 bytes,   35610912 total
- age   5:    1407704 bytes,   37018616 total
- age   6:   11410824 bytes,   48429440 total
- age   7:    3577488 bytes,   52006928 total
- age   8:   15038040 bytes,   67044968 total
- age   9:   17797760 bytes,   84842728 total
 2064.416: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 11711, predicted base time: 24.56 ms, remaining time: 175.44 ms, target pause time: 200.00 ms]
 2064.416: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 215 regions, survivors: 26 regions, predicted young region time: 159.57 ms] 
.............
.............
.............
.............
.............
[Eden: 1720.0M(608.0M)->0.0B(9720.0M) Survivors: 208.0M->104.0M Heap: 3161.5M(16.0G)->1373.6M(16.0G)]
[Times: user=0.62 sys=0.12, real=0.05 secs] 
Total time for which application threads were stopped: 0.0514601 seconds, Stopping threads took: 0.0000901 seconds

Looking at the last two lines tells us how long the GC activity took and the time application threads were stopped for.

[Times: user=0.62 sys=0.12, real=0.05 secs]
Total time for which application threads were stopped: 0.0514601 seconds, Stopping threads took: 0.0000901 seconds

As indicated by the log, the pause lasted 0.05secs. Like this we can see other GC activity in the log to find out significant pauses.

Until very recently, I typically had one major purpose when looking at GC logs:

Find out the substantial pauses (as indicated by total time for which application threads were stopped ) and then try to understand from code, heap dumps etc why our application is experiencing long GC pause.

However, another equally important metric that I used to ignore is the time taken to stop the threads and whether the pause was due to GC or not . Let me give some important background before diving into this metric.

JVM Safepoints

JVM has a concept of safepoint. Safepoint is a point in execution of a thread where state of the thread is well defined. All the object references used by the thread are accounted for by the JVM. Ensuring that a thread has entered a safepoint stage allows JVM to run some bookkeeping and other operations that might manipulate heap and stack space. Therefore, once the thread exits safepoint stage, any operation(s) run by JVM in the meantime wouldn’t have changed the thread’s view of memory.

However, if JVM ran any maintenance operation without ensuring that thread(s) have reached a safe stage, there is a possibility of corrupting the thread’s view of memory. As an example, JVM might collect a live object if it was not accounted for as of that time. By knowing the state of thread (what is there in registers, stack, heap etc), safepoints help JVM avoid such problems.

What JVM operations require threads to enter safepoint?

Few operations that require threads to enter safepoint stage:

  • Stop the world GC operations
  • Heap dump
  • Flushing the code cache
  • Method de-optimization

How does a thread reach safepoint stage?

JVM doesn’t force a thread to enter safepoint stage. Instead, it lets threads poll a safepoint flag at some points in the byte-code. Specifically, the flag indicates whether the thread should enter safepoint or not. Accordingly, the thread suspends itself if the flag indicates so. The JVM implementation inserts “safepoint polls” at appropriate places. It is up to the JVM to ensure that following are highly optimized to minimize the impact on application performance:

  • Number of safepoint polls.
  • The cost of polling/checking the flag.

Few examples of states of a threads when it has already reached safepoint stage:

  • Thread waiting on monitor/lock/blocking IO is in a safepoint state. Before the transition to blocked happened (to wait for I/O or lock), the thread was brought to safepoint state.
  • A thread executing native code via JNI – even though the thread is running, before the execution was transferred to JNI land, the thread was brought to a safepoint state.

Time taken by threads to reach safepoint

Now that we have some reasonable context on safepoints, let’s get back to their relationship with GC logs. Since JVM doesn’t force the threads to enter safepoint, we don’t know the number of instructions the thread executes before it hits a safepoint poll and enters safepoint. Thus it is possible (in theory at least) that threads might taken their own sweet time to reach safepoint.

One thing that readers should remember is that using the JVM option XX:+PrintGCApplicationStoppedTime will report pause times in GC logs along with the time taken to pause the threads. Note that these two metrics will be reported in the GC logs regardless of the fact that JVM operation that triggered the pause was GC or not.

Going back to the GC log snippet shown earlier:

Total time for which application threads were stopped: 0.0514601 seconds, Stopping threads took: 0.0000901 seconds

We should interpret it as follows: the threads were stopped for 0.05secs out of which 0.00009secs were taken to stop the threads.

If threads are taking a long time to reach safepoint, the “Stopping threads took” metric will be higher and consequently “total time for which application threads were stopped” will also be higher since it subsumes the former.

If a thread is taking a long time to reach safepoint , it will derail the JVM — since the threads that have reached the safepoint will be suspended forever until (1) offending thread reaches safepoint and (2) JVM executes the operation that triggered the safepoint in first place and finally (3) JVM lets threads exit safepoint and resume execution.

Some major factors that can increase the time taken by threads to reach safepoint:

  • High degree of page faults – during a page fault, a thread is suspended by the OS until the requested page is retrieved from the backing store, entered into the page table etc. The suspended thread(s) will not be able to check for safepoint polls until they resume execution. So if the JVM has really initiated a safepoint, all the other threads that have reached safepoint will have to wait until this thread reaches safepoint and JVM completes the safepoint operation.
  • High ratio of number of threads to cores — increases the CPU contention and excessive context switching.
  • Uncounted loops – Loops where the counter variable is of INT type are treated as uncounted loops by the JVM and it does not insert a safepoint poll in the loop to speed up the execution of such loops. So if an enormous computation is being done in an uncounted loop, the calling thread will not be able to check for safepoint poll until loop execution is over.

Takeaways for examining GC logs

Make sure we are using at least the following JVM options:

  • -XX:+PrintGCDetails
  • -XX:+PrintGCTimeStamps
  • -XX:+PrintGCApplicationStoppedTime

At a high level, we should look for a couple of patterns:

GC activity and corresponding pauses

[GC pause (G1 Evacuation Pause) (young)
Desired survivor size 54525952 bytes, new threshold 8 (max 15)
- age   1:   10867224 bytes,   10867224 total
- age   2:    5046712 bytes,   15913936 total
............
............
............
[Times: user=0.62 sys=0.12, real=0.05 secs] 
Total time for which application threads were stopped: 0.0514601 seconds, Stopping threads took: 0.0000901 seconds

More specifically, this pattern will have info on GC activity and then corresponding pause time followed by next GC activity and pause time. Something like:

  • GC activity
  • Pause (including the time taken to stop the threads)
  • GC activity
  • Pause (including the time taken to stop the threads)
  • and so on….

Look at both “total time taken for which the application threads were stopped” and “stopping threads took” .

Pauses without GC activity

By now the readers should know that apart from GC, there are other JVM internal operations as well that will require threads to reach a safepoint. Since we are using -XX:+PrintGCApplicationStoppedTime option, such pauses will also be present in the GC log. Between each successive pause, there will be no GC activity and that is how we need to identify such pauses from the log. The following snippet shows an example:

Total time for which application threads were stopped: 46.7196326 seconds, Stopping threads took: 46.6686622 seconds
Total time for which application threads were stopped: 43.5977492 seconds, Stopping threads took: 43.4309684 seconds
Total time for which application threads were stopped: 30.3403226 seconds, Stopping threads took: 30.2303471 seconds
Total time for which application threads were stopped: 5.6498432 seconds, Stopping threads took: 5.5128792 seconds
Total time for which application threads were stopped: 1.2957247 seconds, Stopping threads took: 1.2336227 seconds

As shown above, between each successive pause, there is no GC activity. So such pauses are not GC related.

Couple of things that we should think about investigating:

  • Firstly, we need to find out what is causing the JVM to initiate safepoint. The following JVM options can be helpful. They will print the details of the safepoint (including reason, timings etc).
    • -XX:+PrintSafepointStatistics
    • -XX:PrintSafepointStatisticsCount=1
    • Note that information for these options is logged to stdout(not GC log).
  • Secondly, we need to see if threads took a long time to reach safepoint. This is true for the example shown above:
    • In one case, it took 46secs to stop the threads whereas the total time for which the threads were stopped was just a little more than 46secs.
    • Clearly, whatever safepoint operation led to the above pause lasted for negligible period.
    • We need to investigate why threads took a long time to reach safepoint.

Few additional options that can be useful:

  • -XX:+SafepointTimeout
  • XX:SafepointTimeoutDelay=100

Using these options will help us debug the case where the threads are taking longer than usual to reach safepoint. This will timeout the safepoint sync if a thread fails to reach the safepoint in 100ms. It will also print information where exactly thread is spending time (e.g page fault trap) that is preventing it from reaching safepoint.

我来评几句
登录后评论

已发表评论数()

相关站点

+订阅
热门文章