
High Performance Compute for the JVM - A Prerequisite for DL4J (Part 2)

Unlike most popular deep learning frameworks, DL4J was created with Java principles and the JVM in mind. Originally, its backends were all in Java, but those days are long gone, and ND4J now uses native backends for both CPU and CUDA.

CUDA is of special interest to our users, since it dramatically boosts performance in parallel computations. This significantly lowers the time required for tuning and training models. While we seamlessly support NVIDIA’s cuDNN library of deep learning primitives, we also make the power and performance of the GPUs accessible to end users who don't have cuDNN installed.

This kind of stack - all in Java with a “native” backend - comes with its own set of unique challenges. A typical workload is a long sequence of independent native operations applied to the same data, driven from a managed-memory environment. That combination creates friction, so we constantly look for ways to improve performance without breaking code universality for end users.

Under the hood


Deep learning operations are mostly linear algebra. Nearly all of the magic can be represented as a sequence of algebraic transformations applied to the data in a specific order.
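As a toy illustration of that idea (plain Java, unrelated to the actual ND4J codebase), a dense layer's forward pass is just two such transformations applied in order: a matrix-vector product followed by an element-wise activation.

```java
// Toy sketch: y = relu(W * x + b), i.e. two algebraic transformations
// (a matrix-vector product, then an element-wise nonlinearity) in sequence.
public class DenseForward {
    static double[] forward(double[][] w, double[] x, double[] b) {
        double[] y = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += w[i][j] * x[j];      // matrix-vector product
            }
            y[i] = Math.max(0.0, sum);      // ReLU activation
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] w = {{1, 2}, {3, 4}};
        double[] x = {1, 1};
        double[] b = {-10, 0};
        System.out.println(java.util.Arrays.toString(forward(w, x, b))); // [0.0, 7.0]
    }
}
```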


Memory tricks & cheats

Memory Reuse

One of the significant hurdles here is allocation cost. In a C/CUDA application, device/host memory is ideally allocated once and then reused for as long as possible. The situation is very different in Java, where local-scope variables are heavily used, and where such use may even be considered “good practice”.

To deal with this challenge, we’ve added a special caching layer that guarantees memory reuse over time. The idea is simple: when the JVM releases an INDArray object, the native device/host memory backing it isn’t freed, but stored for reuse. When allocation of a similar memory chunk is requested some time later, that request is served directly from the cache, lowering allocation time to a flat constant on the order of tens of nanoseconds. According to our tests, the average cache hit rate for a typical deep learning workload is somewhere around 95-100%, depending on the size of the cache.
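The mechanism can be sketched roughly like this (a simplified, hypothetical BufferCache in plain Java; the real caching layer manages native device/host buffers and is considerably more involved):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a size-keyed allocation cache. Released buffers
// are pooled by size; a later request for the same size is served from the
// pool in constant time instead of paying the native allocation cost.
public class BufferCache {
    private final Map<Integer, ArrayDeque<long[]>> pools = new HashMap<>();
    long cacheHits = 0, cacheMisses = 0;

    // Called when an array is released: keep its buffer for reuse.
    void release(long[] buffer) {
        pools.computeIfAbsent(buffer.length, k -> new ArrayDeque<>()).push(buffer);
    }

    // Called on allocation: serve from the pool when a matching chunk exists.
    long[] allocate(int length) {
        ArrayDeque<long[]> pool = pools.get(length);
        if (pool != null && !pool.isEmpty()) {
            cacheHits++;
            return pool.pop();       // constant-time reuse, no fresh allocation
        }
        cacheMisses++;
        return new long[length];     // stands in for an expensive native malloc
    }

    public static void main(String[] args) {
        BufferCache cache = new BufferCache();
        long[] a = cache.allocate(256);    // miss: fresh allocation
        cache.release(a);
        long[] b = cache.allocate(256);    // hit: same chunk served from cache
        System.out.println(a == b);        // true
    }
}
```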

Immutable buffers

There are also a few cases where the typical workflow involves creating arrays with the same contents over and over again. To optimize for this, we have defined a special case with immutable buffers. This mostly applies to Op parameters and INDArray shapeInformation buffers (shapeInformation holds the rank, dimension sizes, ordering, and so on). The first time such an immutable buffer is requested, it is initialized and then moved to the device constant memory space, for faster access from kernels at runtime. The cache hit rate for this layer is 100% after the first training iteration.
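In spirit, this layer behaves like a memoization cache keyed by shape. The following is a hypothetical sketch only: the class name and the packed layout are illustrative, and the real implementation additionally uploads the cached buffer to device constant memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an immutable-buffer cache: a shape descriptor
// (rank, dimension sizes, ordering) is built once per distinct shape, and
// every later request with the same shape returns the cached copy.
public class ShapeInfoCache {
    private static final Map<String, int[]> CACHE = new ConcurrentHashMap<>();

    static int[] shapeInfo(int[] shape, char order) {
        String key = java.util.Arrays.toString(shape) + order;
        return CACHE.computeIfAbsent(key, k -> {
            // Illustrative packed descriptor: [rank, dim0, dim1, ..., order]
            int[] info = new int[shape.length + 2];
            info[0] = shape.length;
            System.arraycopy(shape, 0, info, 1, shape.length);
            info[info.length - 1] = order;
            return info;
        });
    }

    public static void main(String[] args) {
        int[] first = shapeInfo(new int[]{4, 3}, 'c');
        int[] second = shapeInfo(new int[]{4, 3}, 'c');
        System.out.println(first == second); // true: served from cache
    }
}
```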

Dimensional information

The last caching layer is the TAD (Tensor Along Dimension) information cache. As mentioned earlier, deep learning involves sequences of transformations on data, often along specific dimensions. This caching layer stores dimensional information to speed up navigation within an INDArray along a given dimension or set of dimensions. The cache keeps this information partially in device constant memory and partially in device global memory. The cache hit rate for this layer is also 100% after the first training iteration.
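To give a feel for what "dimensional information" means, here is a hypothetical sketch (not ND4J's actual representation) for the simplest case, a C-order 2D matrix: every tensor along a dimension is fully described by a start offset, a length, and an element stride, and deriving these once per (shape, dimension) pair is what makes caching worthwhile.

```java
// Hypothetical sketch of TAD-style dimensional information for a
// rows x cols C-order (row-major) matrix. Each sub-tensor along a
// dimension is described by a start offset, a length and a stride.
public class TadInfo {
    final long[] offsets;   // start offset of each sub-tensor
    final int length;       // elements per sub-tensor
    final int stride;       // distance between consecutive elements

    TadInfo(long[] offsets, int length, int stride) {
        this.offsets = offsets;
        this.length = length;
        this.stride = stride;
    }

    static TadInfo alongDimension(int rows, int cols, int dim) {
        if (dim == 1) {                          // rows: contiguous vectors
            long[] off = new long[rows];
            for (int i = 0; i < rows; i++) off[i] = (long) i * cols;
            return new TadInfo(off, cols, 1);
        } else {                                 // columns: strided by cols
            long[] off = new long[cols];
            for (int j = 0; j < cols; j++) off[j] = j;
            return new TadInfo(off, rows, cols);
        }
    }

    public static void main(String[] args) {
        TadInfo rows = alongDimension(3, 4, 1);
        System.out.println(java.util.Arrays.toString(rows.offsets)); // [0, 4, 8]
    }
}
```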

Operations combination

As discussed in Part 1 of this post, all op executions involve internal parallelism. Here we address the most important of all CUDA mechanics: memory bandwidth.

Let us consider how an op works from the perspective of a single CUDA kernel thread:

  • A thread gets an array element from global memory
  • The operation is applied to this element
  • The result is written back to global memory

Thousands of such threads perform these steps at the same time, in lockstep. This is known as the SIMD/SIMT execution model: Single Instruction, Multiple Data / Single Instruction, Multiple Threads.
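The per-thread steps above can be mimicked in plain Java, with each index standing in for one CUDA thread (an illustration of the pattern only, not how ND4J launches kernels):

```java
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

// Illustrative SIMT sketch: every index i plays the role of one CUDA
// thread, which reads element i, applies the op, and writes the result
// back. On a GPU, thousands of these run simultaneously.
public class SimtSketch {
    static void apply(double[] data, DoubleUnaryOperator op) {
        IntStream.range(0, data.length).parallel().forEach(i -> {
            double x = data[i];               // 1. read element from (global) memory
            double y = op.applyAsDouble(x);   // 2. apply the operation
            data[i] = y;                      // 3. write the result back
        });
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4};
        apply(data, x -> x * 2);
        System.out.println(java.util.Arrays.toString(data)); // [2.0, 4.0, 6.0, 8.0]
    }
}
```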

However, one can easily imagine a situation where two consecutive operations are, in fact, applied to the same array. That results in two kernel calls, two global memory reads, two op applications and two global memory writes, which is certainly not ideal. Our solution here is “automatic op combination”, which produces the following scenario:

  • A thread gets an array element from global memory
  • Operation A is applied to this element
  • Operation B is applied to this element
  • The result is written back to global memory

In this manner, we save the second global memory read/write cycle, effectively doubling the usable memory bandwidth.
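The savings can be demonstrated with a small sketch that counts element reads and writes (illustrative plain Java, not ND4J's fusion machinery):

```java
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

// Sketch of op combination: running op A and op B as two separate passes
// costs two read/write round trips per element; running the composed ops
// in a single pass reads and writes each element only once.
public class FusedOps {
    static long memoryAccesses = 0;   // counts element reads + writes

    static void run(double[] data, DoubleUnaryOperator... ops) {
        IntStream.range(0, data.length).forEach(i -> {
            double x = data[i]; memoryAccesses++;             // one read per element
            for (DoubleUnaryOperator op : ops) x = op.applyAsDouble(x);
            data[i] = x; memoryAccesses++;                    // one write per element
        });
    }

    public static void main(String[] args) {
        DoubleUnaryOperator a = x -> x + 1, b = x -> x * 3;

        double[] separate = {1, 2, 3};
        run(separate, a);             // two separate launches:
        run(separate, b);             // 2 passes * 2 accesses * 3 elements = 12
        long separateCost = memoryAccesses;

        memoryAccesses = 0;
        double[] fused = {1, 2, 3};
        run(fused, a, b);             // one fused launch: 2 * 3 = 6 accesses
        System.out.println(java.util.Arrays.equals(separate, fused)); // true
        System.out.println(separateCost + " vs " + memoryAccesses);   // 12 vs 6
    }
}
```

Same results, half the memory traffic, and one kernel launch instead of two.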

In the latest release (0.6.0) we’ve introduced an initial version of a more sophisticated execution pipeline that lets us combine multiple operations into single fused blocks. This feature boosts performance by reducing the memory bandwidth required for individual operations and by reducing the total number of calls going to the GPU.

In subsequent releases, support for such recombinations will be extended, based on the evolving demands of the use cases we see from our customer base.



Skymind is the company behind Deeplearning4j, the only commercial-grade, open-source, distributed deep-learning library written for the JVM.
