Running 1 million concurrent tasks: how much memory does each language require?

By: Piotr Kołaczkowski, a database systems developer at DataStax (United States)

See the original text: https://pkolaczk.github.io/memory-consumption-of-async/

In this blog post, we compare the memory consumption of programs written in popular languages, including Rust, Go, Java, C#, Python, Node.js, and Elixir, when handling a large number of concurrent tasks.

Not long ago, I needed to compare several programs that each handle a large number of network connections. I noticed that their memory consumption differed dramatically, by more than 20x!

Some of these programs rarely consumed more than 100 MB, while others reached 3 GB at 10,000 concurrent connections. However, these programs are complex and their frameworks differ in character, so comparing them directly would not yield meaningful conclusions; it would not be an apples-to-apples comparison.

So I came up with the idea of creating a synthetic benchmark to compare the memory consumption of various programming languages.

Benchmark

Below, we write the following program in each of the programming languages:

Start N concurrent tasks, each of which waits 10 seconds, and exit once all tasks have completed. The number of tasks N is controlled by a command-line argument.

Thanks to ChatGPT, even programming languages I rarely use were no obstacle: each program took only a few minutes to write. For convenience, everything has been published on GitHub, together with the test code.

Address: https://github.com/pkolaczk/async-runtimes-benchmarks

Rust

In Rust, I wrote three variants. The first uses traditional threads:

let mut handles = Vec::new();
for _ in 0..num_threads {
    let handle = thread::spawn(|| {
        thread::sleep(Duration::from_secs(10));
    });
    handles.push(handle);
}
for handle in handles {
    handle.join().unwrap();
}

The other two variants use asynchronous mode, with the tokio and async-std runtimes respectively.

Here is the tokio version:

let mut tasks = Vec::new();
for _ in 0..num_tasks {
    tasks.push(task::spawn(async {
        time::sleep(Duration::from_secs(10)).await;
    }));
}
for task in tasks {
    task.await.unwrap();
}

Finally, because the async-std variant is very similar to the tokio one, it is not shown here.
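For completeness, the async-std variant presumably looks like this (a sketch based on the tokio version above; the exact code is in the linked repository):

```rust
// Sketch only: mirrors the tokio variant, using async-std's task API.
// `num_tasks` comes from the command-line argument, as in the other variants.
let mut tasks = Vec::new();
for _ in 0..num_tasks {
    tasks.push(task::spawn(async {
        task::sleep(Duration::from_secs(10)).await;
    }));
}
for task in tasks {
    task.await;
}
```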

Go

In Go, goroutines are the basic building block of concurrency. We do not launch them bare; instead, we combine them with a WaitGroup:

var wg sync.WaitGroup
for i := 0; i < numRoutines; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        time.Sleep(10 * time.Second)
    }()
}
wg.Wait()

Java

Java has traditionally used threads for such problems, but JDK 21 introduces virtual threads, a concept similar to goroutines. We therefore created two variants of the benchmark. I was also curious how Java threads compare to Rust threads.

List<Thread> threads = new ArrayList<>();
for (int i = 0; i < numTasks; i++) {
    Thread thread = new Thread(() -> {
        try {
            Thread.sleep(Duration.ofSeconds(10));
        } catch (InterruptedException e) {
        }
    });
    thread.start();
    threads.add(thread);
}
for (Thread thread : threads) {
    thread.join();
}

Here is the variant with virtual threads. Notice how similar it is to the previous one: almost identical!

List<Thread> threads = new ArrayList<>();
for (int i = 0; i < numTasks; i++) {
    Thread thread = Thread.startVirtualThread(() -> {
        try {
            Thread.sleep(Duration.ofSeconds(10));
        } catch (InterruptedException e) {
        }
    });
    threads.add(thread);
}
for (Thread thread : threads) {
    thread.join();
}

C#

C#, like Rust, has first-class support for async/await:

List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    Task task = Task.Run(async () =>
    {
        await Task.Delay(TimeSpan.FromSeconds(10));
    });
    tasks.Add(task);
}
await Task.WhenAll(tasks);

Node.JS

const delay = util.promisify(setTimeout);
const tasks = [];

for (let i = 0; i < numTasks; i++) {
    tasks.push(delay(10000));
}

await Promise.all(tasks);

Python

Python also uses the async/await feature:

async def perform_task():
    await asyncio.sleep(10)


tasks = []

for task_id in range(num_tasks):
    task = asyncio.create_task(perform_task())
    tasks.append(task)

await asyncio.gather(*tasks)

Elixir

Elixir is also well known for its asynchronous capabilities:

tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)

Test environment

  • Hardware: Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz
  • OS: Ubuntu 22.04 LTS, Linux p5520 5.15.0-72-generic
  • Rust: 1.69
  • Go: 1.18.1
  • Java: OpenJDK “21-ea” build 21-ea+22-1890
  • .NET: 6.0.116
  • Node.JS: v12.22.9
  • Python: 3.10.6
  • Elixir: Erlang/OTP 24 erts-12.2.1, Elixir 1.12.2

All programs are started using release mode (if available). Other options are left as default.
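The post does not state here exactly how memory was measured. One common way to capture the peak memory of a run on Linux (an assumption on my part; the program name and argument are placeholders) is GNU time in verbose mode:

```shell
# Report the peak resident set size (in KiB) of one benchmark run.
# "./benchmark 1000000" is a placeholder for the actual program and task count.
/usr/bin/time -v ./benchmark 1000000 2>&1 | grep "Maximum resident set size"
```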

Test results

Minimum footprint

Let's start with the smallest footprint. Since some runtimes require memory for themselves, we first launch only a single task.

Fig.1: Peak memory required to start a task

In this minimal case, two classes of programs are clearly visible:

  • Programs compiled statically to native binaries, like Rust and Go, require very little memory at startup;
  • Programs that run on a managed runtime or interpreter consume considerably more memory at startup. Python does well in this case, but the gap between the two groups is still about an order of magnitude.

Surprisingly, .NET has the worst minimal footprint, though this should be tunable with a few settings. Since I am not deeply familiar with .NET development and tuning, let me know in the comments if you have any tips. Overall, I did not see much difference between debug mode and release mode.

10,000 concurrent tasks

Fig.2: Peak memory required to start 10,000 concurrent tasks

Here come some surprises! One might have expected threads to be the big loser of this benchmark, and that is indeed true for Java threads, which consume almost 250 MB of RAM.

But Rust with threads does much better: Rust uses native Linux threads, and they turn out light enough that at 10k threads the memory consumption is still lower than the idle footprint of many other runtimes. Java's async tasks and virtual threads also appear lighter than its native threads, but that advantage does not show yet with only 10k simple tasks; we need to put more pressure on them.

Another surprise here is Go. Goroutines are supposed to be very lightweight, yet they actually consume over 50% more RAM than Rust threads. Honestly, I was expecting a much bigger difference in favor of Go.

So a first conclusion can be drawn here: at 10,000 concurrent tasks, threads are still a very competitive choice. The Linux kernel must be doing something right here.

Go has also lost its supposed lightweight edge over the Rust async programs in this benchmark: it now consumes more than 6x the memory of the best Rust program, and it is even beaten by Python.

The last surprise is that at 10k tasks, .NET's memory consumption barely rose above its idle baseline. Probably it is just reusing memory pre-allocated at startup, or its idle footprint is so high that 10,000 tasks is too little pressure to be noticeable.

100,000 concurrent tasks

On my system, it was impossible to start 100,000 threads, so the thread-based variants had to be dropped from this round. This could probably be fixed by changing system settings, but after trying for an hour I gave up. So at 100,000 concurrent tasks, you probably don't want threads.
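For reference, these are the usual Linux knobs that cap thread creation. I have not verified that raising them is sufficient on this system, and the values below are only illustrative:

```shell
# System-wide thread and PID limits (illustrative values)
sudo sysctl -w kernel.threads-max=2000000
sudo sysctl -w kernel.pid_max=4194304
# Each thread stack needs its own memory mappings
sudo sysctl -w vm.max_map_count=2000000
# Per-user limit on processes/threads for the current shell
ulimit -u 1000000
```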

At 100,000 concurrent tasks, the Go program was beaten not only by Rust but also by Java, C#, and Node.JS, dropping to fourth place.

At one point I suspected the .NET program was cheating, because its memory usage still had not risen. I double-checked that it really started the requested number of tasks, and it did. And it still finishes after about 10 seconds, so it is not blocking the main loop. Pure black magic. Well done, .NET!

1 million concurrent tasks

Now for the climax of the test: an extreme run with 1 million concurrent tasks.

At 1 million tasks, Elixir crashed outright with:

  • ** (SystemLimitError) a system limit has been reached

PS: Some commenters pointed out that the process limit can be raised. After adding the --erl '+P 1000000' argument to the Elixir invocation, it worked fine.

Finally, we see that the memory consumption of the C# program did rise. But it is still a strong contender for today's title; it even manages to slightly beat one of the Rust async runtimes!

To my surprise, the gap between Go and the other languages widened further. Go now loses to the winner (Rust with tokio) by more than 12x in memory, and even to Java by more than 2x, which contradicts the common belief that the JVM is a memory hog and Go is lightweight.

The overall winner is Rust with tokio, whose memory consumption remained unmatched across all tests, especially in the final 1-million-task run.

Conclusion

As we have observed, a large number of concurrent tasks can consume a significant amount of memory even when they perform no complex operations. Language runtimes make different trade-offs: some are lightweight and efficient for a small number of tasks but scale poorly to hundreds of thousands of tasks.

Conversely, runtimes with a high initial overhead, such as .NET and Java, handle large workloads effortlessly. Note that not every runtime can cope with a huge number of concurrent tasks using its default settings.

Of course, this test looked only at memory consumption; in real high-concurrency applications, other factors such as task startup time and communication speed matter just as much. Notably, at 1 million tasks the overhead of launching them becomes visible: most programs needed more than 12 seconds to complete.

If you are interested, stay tuned for upcoming benchmarks, where we will explore other aspects in depth.