Could Worker Threads speed up common Node.js applications in 2019?

How are threads relevant to Node.js performance?

I have been interested in ways of offloading CPU-intensive work from the Node.js main Event Loop thread ever since I noticed the promising webworker-threads library many years ago in the Node.js v0.10.x days. And since the Node v10.5.0 release, Worker Threads have been available in the standard library of a stable Node.js version, albeit still marked as experimental. The usefulness of performing work in Node.js threads other than the main thread boils down to two main reasons, discussed below.

1) CPU intensive processing in the Node.js main thread kills throughput

A single Node.js process utilizing one CPU core can process at least in the order of 10 000 requests per second, if completing each request only requires a small amount of CPU work and e.g. your database engine or another backend service does the heavy lifting. This matches the workload of many simple REST API backends, e.g. those with straightforward CRUD APIs.

However, say that just a few times per second you get a request which requires a relatively innocent 100ms of main thread CPU work to complete, for a total of 200-300ms every second. While executing that CPU work for those few requests, you will have missed the chance to perform the database queries for at least several thousands of simple requests and your total throughput will be massively reduced!

I’ve found that almost all real-world backend services contain at least some CPU-intensive operations. Typical examples include:

  • Serializing to JSON large responses to API requests, such as those resulting from GraphQL or objection.js relation graph queries

  • Evaluating complex business logic rules on large data volumes (e.g. dynamic pricing of many products)

  • Collecting and transforming larger bunches of data from the database for e.g. generating some single report or document

  • Parsing large complex documents to an easily queryable representation, e.g. when web scraping

  • etc

If we could run these kinds of tasks outside the main event loop thread using Worker Threads, the main thread could be operating at peak performance, dispatching queries to the database and returning results to clients etc. all the time!

2) The main thread alone can’t fully utilize modern CPUs

All modern high-performance CPUs have more than one CPU core, and each core might also be utilized more efficiently by running two different threads on it concurrently via hyperthreading. If all your backend code runs in the single Node.js main thread, you can utilize only one of the many CPU cores, and not even that one optimally, as full utilization via hyperthreading would require at least two active threads. Even the smallest fixed-performance Amazon EC2 instance type that is suitable for a horizontal auto-scaling setup, c5.large, offers two hyperthreads of the same physical CPU core as vCPUs. A single-threaded workload is only able to utilize about 80% of the CPU power of c5.large and other similar 2-vCPU EC2 instances.

One can better utilize the server instance with multiple logical CPU cores by running multiple Node.js processes on the same server instance and load-balancing incoming requests between them. This kind of cluster of processes is very easy to run using e.g. the pm2 process manager, something which I highly recommend if running Node.js applications on a multi-core or hyperthreading single-core server instance.

The drawback of the multi-process cluster model is that it requires lots of RAM, as certain memory items such as the JIT-generated machine code will be duplicated in each process. An even bigger factor is that the different processes in the cluster can’t co-operate on when to garbage collect, so one process might fail to allocate sufficient memory for processing a request, even if another process could easily free up memory by running its GC. Worker Threads could utilize multiple CPU cores in a single Node.js process, with less memory usage than a process cluster setup.

Note that a cluster setup won’t make the more expensive requests finish any faster: no matter how many CPUs you have, each request is executed by a single process in the cluster, and thus uses only one CPU core at a time. Some expensive requests could, however, be completed faster by splitting them into parts that run in parallel in separate Worker Threads.

Talking to Worker Threads can be slow

Although the performance potential of Node.js Worker Threads is clear, in my older experiments I have found it excessively hard to realize that potential. Similarly to Web Workers in browsers, Node.js Worker Threads can’t be used as straightforwardly as threads in Java or Goroutines in Go to just perform some work on your data in parallel. This is because unlike pretty much all objects and other data in those languages, regular Javascript objects cannot be shared between the main Event Loop thread and Worker Threads. Thus, Worker Threads cannot directly use as input the request body or database query result the main thread has received. Similarly, the main thread cannot directly access any output produced by a worker thread to e.g. send it to a client or the database.

Instead of sharing data directly, the main thread must send input to the workers and workers must send results back to the main thread by using the postMessage method. See e.g. this tutorial for a practical example.

The trouble is that, at least historically, postMessage has been pretty slow with any larger amounts of data. So slow, in fact, that with any kind of complex data, it has been faster to JSON.stringify the data on the sending end, transfer it as a string, and JSON.parse it on the receiving end. But, as noted before, JSON (de)serialization can be pretty slow, and well, if the operation you are trying to offload to a worker is JSON parsing or serializing, having to do the same operation also in the main thread to get the data to the worker pretty much nullifies any benefit you could otherwise get.

Note how all the examples I gave above of CPU-intensive tasks in real-world Node.js applications are expensive specifically because of the large amounts of data involved and/or the complexity of the data. The artificial small-data-but-expensive-processing examples you see in so many worker tutorials, like Fibonacci series generation with a bad algorithm, just aren’t that common in the real world [1]. So, the performance of transferring large and complex data back and forth with the worker threads is key to getting any benefit from offloading processing to worker threads.

Which kinds of tasks can actually benefit from Worker Threads, then?

Well, that all depends on how quick postMessage is with different kinds of data nowadays. There don’t seem to be any up-to-date benchmarks on the subject, at least none I could find by Googling, for either Node.js or Chrome (which would be relevant because of the shared V8 Javascript engine). Node.js performance has improved so much in other areas that benchmarks from 2016 probably don’t do justice to the present situation.

So, I will do some benchmarking of my own. Maybe even the JSON parsing / serialization that we all have in pretty much any app could benefit? We’ll see - stay tuned for the results!

Remarks

[1] As for why slow code with trivial data but expensive processing is seldom something you would improve by parallelizing: if the input data is small(ish), it can only carry very little information, and it can only take a restricted number of different values, at least ones that occur in practice. Thus, you can often leverage memoization of previously calculated results, or even precalculate the results for all expected input values. As a practical example, think of formatting for display the date values that occur (usually repeatedly) in database data. Or, if the data is not quite that small but still simple in structure, it’s probably quite easy to manipulate in a C++ add-on as well, and you don’t really need the power of Javascript for it. And chances are that somebody else has already implemented an asynchronous, parallel-processing native library for the expensive part, as is the case for cryptography, data (de)compression, image resizing, ML, and so on.
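The date-formatting case can be sketched with a simple memoization cache; the formatter, cache, and sample rows below are my own illustration:

```javascript
// Memoized date formatting: repeated timestamps from database rows hit
// the cache instead of re-running the (relatively expensive) formatter.
// A minimal sketch; a real app might cap the cache size.
const formatCache = new Map();

function formatDateMemoized(isoDate) {
  let formatted = formatCache.get(isoDate);
  if (formatted === undefined) {
    formatted = new Date(isoDate).toLocaleDateString('en-US', {
      year: 'numeric', month: 'long', day: 'numeric',
    });
    formatCache.set(isoDate, formatted);
  }
  return formatted;
}

// Database data tends to repeat the same handful of dates many times.
const rows = ['2019-06-01', '2019-06-01', '2019-06-02', '2019-06-01'];
console.log(rows.map(formatDateMemoized));
console.log(`cache entries: ${formatCache.size}`); // → cache entries: 2
```

Four rows, but only two formatter invocations: the cache turns the repeated values into map lookups, with no threads involved at all.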