Optimizing Suspending Functions in Kotlin

This is the second post in a series in which I share lessons learned implementing a performance-oriented cooperative multitasking library with the low level Kotlin coroutines API.

For this second post, I'll outline some challenges that arise when writing suspending functions with the low level coroutines API that might be used in tight loops or other high traffic use cases, and how to overcome them. For more background, see the first post in this series.

Note : Code and results in this post are based on Kotlin 1.2.31.

The full code with tests to reproduce the results from this post yourself is available at https://github.com/pronghorn-tech/blog-post-queues

Setting the Stage

Let's imagine we want to spawn two coroutines. The first takes from a queue whenever work is available, processes it as quickly as possible, and suspends execution when the queue is empty. The other coroutine does the opposite, adding to the queue and suspending execution when full. Those familiar with the high level Kotlin coroutine API will probably recognize this as very similar to that library's Channel interface. In the interest of keeping things simple, let's further assume that we expect both coroutines to run on the same thread.

Kotlin Channel Implementation

Using the previously mentioned Channels, this might look something like the following:

val channel = Channel<Long>(1024)
val context = newSingleThreadContext("workerThread")

launch(context) {
    while (true) {
        val value = channel.receive()
        process(value)
    }
}

launch(context) {
    var x = 0L
    while (true) {
        channel.send(x)
        x += 1
    }
}

Code Snippet 1

This is hopefully pretty straightforward, so let's consider how well it performs. Throughout this post we'll look at both throughput and memory allocations, since in a high performance JVM application avoiding gratuitous allocations can minimize garbage collection pauses and thus avoid latency spikes. Quoted throughput numbers are from an i7 6700k test machine, with memory allocations measured via VisualVM.

Kotlin Channel Implementation Performance
Throughput after warm-up for this standard library implementation is approximately 19.8 million values processed per second. There are also roughly 24 bytes allocated per processed value, which has more to do with boxing of the Longs than anything else. All that is pretty good, but perhaps we can do better.

We'll start with an implementation that looks something like a BlockingQueue, but instead of blocking it will suspend when empty and full. Taking a cue from the queue, we'll start with this interface:

interface SuspendingQueue<T> {
    suspend fun take(): T
    suspend fun add(value: T): Boolean
}

Code Snippet 2

There's nothing particularly interesting about this interface except that both functions are marked with the suspend modifier, which requires them to be called from within a coroutine and allows them to suspend execution of said coroutine.

An extremely naïve implementation of this interface that wraps a standard ArrayBlockingQueue might look like this:

class NaiveSuspendingQueue<T>(capacity: Int) : SuspendingQueue<T> {
    private val underlying: Queue<T> = ArrayBlockingQueue<T>(capacity)
    private var fullWaiter: Continuation<Unit>? = null
    private var emptyWaiter: Continuation<T>? = null

    override suspend fun add(value: T): Boolean {
        val emptyWaiter = this.emptyWaiter
        if (emptyWaiter != null) {
            this.emptyWaiter = null
            emptyWaiter.resume(value)
        }
        else {
            while (!underlying.offer(value)) {
                suspendCoroutine<Unit> { continuation -> fullWaiter = continuation }
            }
        }
        return true
    }

    override suspend fun take(): T {
        val result = underlying.poll()
        if (result != null) {
            val fullWaiter = this.fullWaiter
            if (fullWaiter != null) {
                this.fullWaiter = null
                fullWaiter.resume(Unit)
            }
            return result
        }
        else {
            val suspendResult = suspendCoroutine<T> { continuation -> emptyWaiter = continuation }
            return suspendResult
        }
    }
}

Code Snippet 3

Ok, so there's a lot to unpack, even in this naïve implementation.

The add function starts by checking whether there is a coroutine waiting for the value it's about to add. If so, rather than interact with the underlying queue at all, it immediately provides that value to the waiting coroutine by calling resume on its Continuation. This direct call to resume immediately interrupts the current execution and switches to the waiting coroutine that suspended to create the Continuation.

If instead there is no waiting coroutine, the value is offered to the underlying queue. In the event that the queue is full, we suspend execution via suspendCoroutine. We'll get into exactly what is happening under the covers of that function in a bit, but for the moment assume it gives us access to a Continuation object with which we can resume execution where we left off later. In this case, when resumed the offer will be attempted again, suspending as necessary until it succeeds. It's worth noting that the Continuation has a type parameter indicating what sort of value it expects to receive upon resumption. In this case we just want to be triggered to attempt our offer again, so a Continuation<Unit> will suffice.
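To see those mechanics in isolation, here's a tiny, self-contained sketch of suspendCoroutine capturing a Continuation<Unit> and resuming it later from ordinary code. It uses the current kotlin.coroutines package names rather than the 1.2-era kotlin.coroutines.experimental ones, and startCoroutine plus the log list are just scaffolding for the demo:

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

var savedContinuation: Continuation<Unit>? = null
val log = mutableListOf<String>()

// Park the current coroutine, stashing its continuation for later.
suspend fun waitForSignal() {
    suspendCoroutine<Unit> { continuation -> savedContinuation = continuation }
}

fun main() {
    val body: suspend () -> Unit = {
        log.add("before suspend")
        waitForSignal()              // suspends here until resume(Unit) is called
        log.add("after resume")
    }
    // startCoroutine needs a completion Continuation; we ignore the result here.
    body.startCoroutine(Continuation(EmptyCoroutineContext) { })

    log.add("suspended")             // runs while the coroutine is parked
    savedContinuation?.resume(Unit)  // execution jumps back into the coroutine
    println(log)                     // [before suspend, suspended, after resume]
}
```

The key observation is that resume(Unit) transfers control back into the suspended coroutine synchronously, exactly the behavior the naïve queue relies on.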

The take function starts by polling the queue for the next value. If a result was immediately available that means we've potentially relieved the full condition. Thus a quick check lets us resume any coroutine suspending due to the queue being full. This again happens directly in place.

Otherwise, if the queue is empty take suspends execution, saving its Continuation. Note that this time we expect to be provided a value, so we use a Continuation<T>.

Naïve Implementation Performance
Throughput for this implementation is virtually identical to the Kotlin Channel implementation on my test machine at 19.7 million per second, but when looking at memory allocation something has clearly gone horribly awry! We're now chewing through almost 140 bytes per processed value.

So what's going on here? With VisualVM we can see that we're now allocating a ton of instances of classes named NaiveSuspendingQueue$take$1 and NaiveSuspendingQueue$add$1. While not immediately obvious, in order to arbitrarily suspend execution Kotlin generates a state machine class for each suspending function, and a new instance of it is allocated on every call. Next we'll look at ways to mitigate this.

Fewer Allocations Implementation

So what can be done about the allocations happening in our naïve implementation? The cleanest way to avoid them is to indicate to the compiler that the state machine isn't necessary because we don't intend to suspend in the middle of the function. According to the Kotlin coroutines specification, if we only invoke suspend functions in tail position, the compiler can skip creating the state machine. Indeed, if we look at the high level Kotlin coroutine API, it makes heavy use of this technique.

The astute observer will notice that in Code Snippet 3 the take function had an unnecessary assignment to suspendResult. Without it, the tail call optimization would have been applied; it was included only to demonstrate this issue.

Improving add is a little more complicated. The need to loop until success when attempting to offer the value makes a simple tail call impossible. Often the best option is to isolate the suspending action into its own function and invoke it in tail position only if necessary.

Applying these improvements yields:

class FewerAllocationsQueue<T>(capacity: Int) : SuspendingQueue<T> {
    private val underlying: Queue<T> = ArrayBlockingQueue<T>(capacity)
    private var fullWaiter: Continuation<Unit>? = null
    private var emptyWaiter: Continuation<T>? = null

    override suspend fun add(value: T): Boolean {
        val emptyWaiter = this.emptyWaiter
        if (emptyWaiter != null) {
            this.emptyWaiter = null
            emptyWaiter.resume(value)
            return true
        }
        else if (underlying.offer(value)) {
            return true
        }
        return suspendAdd(value)
    }

    private tailrec suspend fun suspendAdd(value: T): Boolean {
        suspendCoroutine<Unit> { fullWaiter = it }
        if (!underlying.offer(value)) {
            return suspendAdd(value)
        }
        return true
    }

    override suspend fun take(): T {
        val result = underlying.poll()
        if (result != null) {
            val fullWaiter = this.fullWaiter
            if (fullWaiter != null) {
                this.fullWaiter = null
                fullWaiter.resume(Unit)
            }
            return result
        }
        return suspendTake()
    }

    private suspend fun suspendTake(): T {
        return suspendCoroutine { emptyWaiter = it }
    }
}

Code Snippet 4

The biggest change here was the addition of the suspendAdd function to isolate the suspend that we've now placed at the end of add, and suspendTake, which serves the same purpose for take.

Hint : We could have left the suspendCoroutine call at the end of take, but isolating it is often a good idea: the bytecode generated by a suspend call is substantial, and a separate function gives the JIT a chance to optimize and/or inline suspendTake on its own.

Hint : If you're using IntelliJ, a quick way to determine if your function is generating a state machine is to use the "show Kotlin bytecode" feature of the Kotlin plugin. There are several indications in this generated bytecode that a state machine has been generated. Within the function in question itself you can look for the easy to see line LDC "call to 'resume' before 'invoke' with coroutine", or calls to setLabel and getLabel on the generated state machine class. You can also look at the end of the file and find the generated class itself which will look something like packageName/className$functionName$1 extends kotlin/coroutines/experimental/jvm/internal/CoroutineImpl.

Fewer Allocations Implementation Performance
After implementing these minor tweaks to prevent unnecessary state machine creation, our new queue is up to 26.1 million values per second, and our memory allocations match those of the Kotlin Channel implementation, about 24 bytes per value.

Hint : An alternative strategy for removing potentially extraneous state machine allocations is to use the inline modifier to force inlining of nested suspend functions. However, this should be done only after careful testing since forced inlining can impact JIT performance in non-obvious ways. As the compiler does not recognize this as an optimization, it will generate a NOTHING_TO_INLINE warning. Also note that the inline modifier requires any fields accessed in the inlined method to be visible at the call site, but this can be worked around with use of the @PublishedApi annotation.
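As a sketch of that inline strategy (the class and function names here are hypothetical, and the example uses current kotlin.coroutines package names, not the 1.2-era experimental ones):

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

class Signal {
    // @PublishedApi keeps this internal field reachable from the inlined body
    // at external call sites.
    @PublishedApi
    internal var waiter: Continuation<Unit>? = null

    // Inlining await means no separate state machine class is generated for it;
    // the suppressed warning is the one mentioned above.
    @Suppress("NOTHING_TO_INLINE")
    suspend inline fun await() {
        suspendCoroutine<Unit> { waiter = it }
    }

    // Resume whichever coroutine is currently parked in await, if any.
    fun fire() {
        val w = waiter ?: return
        waiter = null
        w.resume(Unit)
    }
}
```

Usage is the same as any other suspend function: a coroutine calls await(), parks, and continues when some other code calls fire().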

Scheduler Based Implementation

At this point it's worth considering exactly how execution flows with the current implementation. Since both add and take resume any waiting Continuation in place as soon as possible, our earlier pair of coroutines (one sender and one receiver) ping-pong back and forth, and in practice the underlying queue barely gets touched. What we currently have more closely resembles the standard library's Rendezvous Channel.

In order to allow the sender coroutine to actually place multiple items into the queue before switching to the receiver coroutine, we must introduce some other mechanism to control the resumption of our coroutines. For a brief primer on this, see the previous post in this series An Introduction to Cooperative Multitasking with Coroutines. The actual implementation details of a scheduler are beyond the scope of this post, though expect that to be a topic for a future post in this series. For now, let's assume the following very simple interface:

interface Scheduler {
    fun <T> scheduleContinuation(continuation: Continuation<T>,
                                 value: T)
}

Code Snippet 5

For the purposes of this post, assume that when a coroutine suspends, control yields to a Scheduler which does nothing more than resume the next coroutine that has indicated it is ready to resume.
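To make that assumption concrete, here's a minimal sketch of such a scheduler: a run-to-exhaustion loop that resumes queued continuations one at a time. The names and structure are illustrative, not Pronghorn's actual implementation, and it uses current kotlin.coroutines package names:

```kotlin
import java.util.ArrayDeque
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

interface Scheduler {
    fun <T> scheduleContinuation(continuation: Continuation<T>, value: T)
}

// A run-to-exhaustion scheduler: scheduled continuations are queued and
// resumed one at a time on the calling thread.
class SimpleScheduler : Scheduler {
    private val ready = ArrayDeque<() -> Unit>()

    override fun <T> scheduleContinuation(continuation: Continuation<T>, value: T) {
        // Defer the resume instead of calling it in place.
        ready.add { continuation.resume(value) }
    }

    fun runUntilIdle() {
        while (ready.isNotEmpty()) {
            ready.poll().invoke()
        }
    }
}

var received = -1L

fun main() {
    val scheduler = SimpleScheduler()
    val body: suspend () -> Unit = {
        received = suspendCoroutine { cont -> scheduler.scheduleContinuation(cont, 42L) }
    }
    body.startCoroutine(Continuation(EmptyCoroutineContext) { })
    scheduler.runUntilIdle()   // resumes the parked coroutine with the value 42
    println(received)          // 42
}
```

The important design difference from the earlier queues is that scheduleContinuation only records the intent to resume; control stays with the caller until the scheduler's loop runs.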

Swapping our queue over to utilize this scheduler involves very few changes, resulting in the following implementation:

class SchedulerQueue<T>(val scheduler: Scheduler,
                        capacity: Int) : SuspendingQueue<T> {
    private val underlying: Queue<T> = ArrayBlockingQueue<T>(capacity)
    private var fullWaiter: Continuation<Unit>? = null
    private var emptyWaiter: Continuation<T>? = null

    override suspend fun add(value: T): Boolean {
        val emptyWaiter = this.emptyWaiter
        if (emptyWaiter != null) {
            this.emptyWaiter = null
            scheduler.scheduleContinuation(emptyWaiter, value)
            return true
        }
        else if (underlying.offer(value)) {
            return true
        }
        return suspendAdd(value)
    }

    private tailrec suspend fun suspendAdd(value: T): Boolean {
        suspendCoroutine<Unit> { fullWaiter = it }
        if (!underlying.offer(value)) {
            return suspendAdd(value)
        }
        return true
    }

    private suspend fun suspendTake(): T {
        return suspendCoroutine { emptyWaiter = it }
    }

    override suspend fun take(): T {
        val result = underlying.poll()
        if (result != null) {
            val fullWaiter = this.fullWaiter
            if (fullWaiter != null) {
                this.fullWaiter = null
                scheduler.scheduleContinuation(fullWaiter, Unit)
            }
            return result
        }

        return suspendTake()
    }
}

Code Snippet 6

Scheduler Based Queue Implementation Performance
With this change we're up to 31.2 million values per second, which is rapidly approaching the throughput of the raw underlying ArrayBlockingQueue. Since our stated use case has both our coroutines running on the same thread, we can swap in a much more efficient wait-free queue. In this case we'll use a simple ring buffer based queue. The details of this implementation are beyond the scope of this post, but the full code is available at https://github.com/pronghorn-tech/common/blob/master/src/main/kotlin/tech/pronghorn/collections/RingBufferQueue.kt.

With this much faster underlying queue we're able to push through a much more impressive 124.7 million values per second.
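For the curious, here's a stripped-down, single-threaded sketch of the ring buffer idea; the linked RingBufferQueue is the real implementation, and everything below is illustrative. Capacity is rounded up to a power of two so that wrapping an index is a single bitwise AND instead of a modulo:

```kotlin
// Single-threaded ring buffer sketch. Not thread-safe, and null is used
// internally as the empty marker, so T should be non-nullable.
class RingBufferSketch<T>(requestedCapacity: Int) {
    // Round capacity up to the next power of two (minimum 2).
    private val capacity = Integer.highestOneBit((requestedCapacity - 1) shl 1).coerceAtLeast(2)
    private val mask = capacity - 1
    private val buffer = arrayOfNulls<Any>(capacity)
    private var head = 0   // next slot to poll from
    private var tail = 0   // next slot to offer into

    fun offer(value: T): Boolean {
        if (tail - head == capacity) return false   // full
        buffer[tail and mask] = value
        tail += 1
        return true
    }

    @Suppress("UNCHECKED_CAST")
    fun poll(): T? {
        if (head == tail) return null               // empty
        val slot = head and mask
        val value = buffer[slot] as T
        buffer[slot] = null                         // avoid retaining the reference
        head += 1
        return value
    }
}
```

Compared to ArrayBlockingQueue, this drops all locking and volatile traffic, which is safe only because our stated use case keeps both coroutines on one thread.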

Coroutine Intrinsics Implementation

We've made a massive improvement in performance since we started, but there's still a little more we can squeeze out if we're willing to dive in and utilize the coroutine intrinsics available in the low level coroutine API. I would recommend reading the documentation before using these methods, as they come with significant warnings about their misuse. In fact the documentation states "The contents of kotlin.coroutines.experimental.intrinsics package are hidden from auto-completion in Kotlin plugin for IDEA to protect them from accidental usage."

All that said, if we look at the code for the suspendCoroutine method we've been using to suspend our coroutines up until now, we see that it's creating an extra SafeContinuation wrapper around our Continuation that would allow us to resume that coroutine from the same stack frame. Since we're not doing that, we can instead utilize the low level suspendCoroutineOrReturn intrinsic. If we want to take that even a step further, there is also a suspendCoroutineUninterceptedOrReturn intrinsic, which bypasses the coroutine interceptor capability of the coroutine framework. Since we're handling our own scheduling directly, we can utilize this method.

Our final implementation looks like this:

class IntrinsicsQueue<T>(val scheduler: Scheduler,
                         capacity: Int) : SuspendingQueue<T> {
    private val underlying: Queue<T> = RingBufferQueue(capacity)
    private var fullWaiter: Continuation<Unit>? = null
    private var emptyWaiter: Continuation<T>? = null

    private fun offer(value: T): Boolean {
        val emptyWaiter = this.emptyWaiter
        if (emptyWaiter != null) {
            this.emptyWaiter = null
            scheduler.scheduleContinuation(emptyWaiter, value)
            return true
        }
        return underlying.offer(value)
    }

    override suspend fun add(value: T): Boolean {
        if (offer(value)) {
            return true
        }
        return suspendAdd(value)
    }

    private tailrec suspend fun suspendAdd(value: T): Boolean {
        suspendCoroutineUninterceptedOrReturn<Unit> { continuation ->
            fullWaiter = continuation
            COROUTINE_SUSPENDED
        }
        if (underlying.offer(value)) {
            return true
        }
        return suspendAdd(value)
    }

    private suspend fun suspendTake(): T {
        return suspendCoroutineUninterceptedOrReturn { continuation ->
            emptyWaiter = continuation
            COROUTINE_SUSPENDED
        }
    }

    private fun poll(): T? {
        val result = underlying.poll()
        if (result != null) {
            val fullWaiter = this.fullWaiter
            if (fullWaiter != null) {
                this.fullWaiter = null
                scheduler.scheduleContinuation(fullWaiter, Unit)
            }
        }
        return result
    }

    override suspend fun take(): T {
        val result = poll()
        if (result != null) {
            return result
        }
        else {
            return suspendTake()
        }
    }
}

Code Snippet 7

The main change here is utilizing suspendCoroutineUninterceptedOrReturn in place of suspendCoroutine. It is important to note that in order to actually suspend execution, the lambda passed to suspendCoroutineUninterceptedOrReturn must return the special COROUTINE_SUSPENDED value from the kotlin.coroutines.experimental.intrinsics package. Otherwise, whatever value the lambda returns is handed straight back to the caller without suspending.
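The return-or-suspend contract can be shown in miniature with the following sketch. The function and variable names are hypothetical, and it uses the current kotlin.coroutines.intrinsics package; in 1.2 these lived under kotlin.coroutines.experimental.intrinsics:

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.intrinsics.COROUTINE_SUSPENDED
import kotlin.coroutines.intrinsics.suspendCoroutineUninterceptedOrReturn
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine

var parked: Continuation<Int>? = null
var result = -1

// If a value is already available the lambda returns it and the coroutine
// never suspends; otherwise it stashes the continuation and returns
// COROUTINE_SUSPENDED, which commits the suspension.
suspend fun takeOrSuspend(available: Int?): Int =
    suspendCoroutineUninterceptedOrReturn { continuation ->
        if (available != null) {
            available                 // fast path: no suspension, no interception
        } else {
            parked = continuation     // park until someone resumes us with a value
            COROUTINE_SUSPENDED
        }
    }

fun main() {
    val body: suspend () -> Unit = {
        check(takeOrSuspend(1) == 1)  // returns immediately
        result = takeOrSuspend(null)  // suspends here
    }
    body.startCoroutine(Continuation(EmptyCoroutineContext) { })
    println(result)                   // still -1: the coroutine is parked
    parked?.resume(7)                 // hand it a value; execution continues
    println(result)                   // 7
}
```

This is essentially what suspendAdd and suspendTake do above, minus the SafeContinuation wrapper and interceptor dispatch that suspendCoroutine would add.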

The other change made in this implementation is further breaking down add and take. As before, smaller functions give the JIT more opportunities to optimize, and these particular changes provide an opportunity to discuss some potential gotchas concerning when the compiler can and can't optimize tail calls. The following gotchas apply as of Kotlin 1.2.31, but there are open issues for them in the Kotlin bug tracker, so hopefully they'll be addressed in future releases.

Tail call gotcha #1 : a suspend call on the right side of a returned || condition is not optimized
While the following seems equivalent to the add method in this last implementation, the suspendAdd call is not considered in tail position, and so is not optimized.

override suspend fun add(value: T): Boolean {
    return offer(value) || suspendAdd(value)
}

Code Snippet 8

Tail call gotcha #2 : return lifted out of conditional is not optimized
In the take function of Code Snippet 7 we have a conditional where both branches contain a return statement. IntelliJ will helpfully suggest that the return be lifted out of the conditional, resulting in the following:

return if (result != null) {
    result
}
else {
    suspendTake()
}

Code Snippet 9

However, this is another instance where the compiler will not recognize the suspend call as being in tail position, resulting in unoptimized code.

Tail call gotcha #3 : suspend call on the right side of elvis operator is not optimized
Similar to the previous examples, the following implementation of take would not be tail call optimized:

override suspend fun take(): T {
    return poll() ?: suspendTake()
}

Code Snippet 10

Coroutine Intrinsics Implementation Performance
Our final implementation utilizing the low level coroutine intrinsics manages 134.6 million values per second.

Conclusions

(Chart: Implementation Performance)

Throughout this post we've seen ways to improve the performance of code utilizing Kotlin coroutines. Most of these improvements are applicable whether you're utilizing Kotlin's own high level API or your own custom coroutine library.

The single-threaded use case presented in this post may not have obvious application, but the Pronghorn Coroutine library makes use of this pattern. Also, using these fundamentals it's possible to create multi-threaded implementations with the same or better performance. Look out for future posts in this series for more info.