Pinning data to GPU in Tensorflow and PyTorch

Posted on Tue. 02 October 2018 in Tutorial by Laura Domine

The traditional and recommended data pipeline for deep learning involves pre-processing the data on the CPU (data augmentation, cropping, etc.), then loading small batches of pre-processed data onto the GPU. There are several good reasons for this:

  • The datasets are often huge and cannot fit on the GPU memory.
  • The networks are big and the memory transfer overhead is negligible compared to the network computations.

However, this does not always apply. If the dataset is small enough to fit in GPU memory, or if the network computation time is of the same order as the memory transfer overhead, we start to think about doing the pre-processing directly on the GPU.

Figure: the CPU/GPU pipeline, with pre-processing on the CPU, training on the GPU and the resulting idle times (figure from the Tensorflow documentation)

Some context on our use case: we want to train a network on 3D images that are too big to be fed directly to the network. Our current pipeline is to crop the big images on the CPU before feeding the crops one by one to the network training on the GPU. First, the extraction of crops turns out to be expensive on the CPU (of the same order of magnitude as our network computations) and easily parallelizable on a GPU. Second, this scheme involves many small CPU-GPU memory transfers (one per crop), which we would like to avoid as they cost a lot of time. Instead, we want to transfer a handful of big images to the GPU in one shot, crop them on the GPU and feed them to the network without going back to the CPU.

The cropping part involves writing our own custom CUDA kernel and integrating it in Tensorflow or PyTorch. We won't talk about this here. Let's focus on the data movement part.

To summarize what we want to achieve without the context details:

  1. Load a batch of data on CPU
  2. Transfer the batch to GPU
  3. For each image in that batch:
    1. Do some pre-processing on GPU, which outputs a batch of possibly unknown length (e.g. the number of crops might not be deterministic).
    2. Pin the data to the GPU (i.e. prevent it from going back to CPU).
    3. Use the pre-processed batch to do further computations on minibatches (such as training a network).

We will go over a toy example of this pipeline using both Tensorflow and PyTorch.

Important warning If you work with more traditional 2D images you might want to use the recent DALI library from NVIDIA. It solves exactly this issue: pre-processing the data on GPU before feeding it to a deep learning framework. They have bindings to TensorFlow and PyTorch, too. In our case 3D images are not (yet) supported by DALI and no short-term implementation is planned, which explains why we have to tackle this 'by hand'.

In Tensorflow

It turns out to be surprisingly hard in Tensorflow. First, it is hard to determine whether a Tensorflow operation has a GPU implementation: short of searching the Github repository for a CUDA kernel corresponding to the operation, there is no obvious way to know in advance. Second, it turns out that most pre-processing operations, such as tf.train.batch, are implemented on CPU only. (Random sidenote: random operations such as tf.random_uniform or tf.random_crop also seem to be only available on CPU.) Of course, Tensorflow recommends doing the pre-processing on the CPU... What it means for us: we might do our pre-processing on GPU, but as soon as we try to batch it for the actual computation it will be sent back to CPU.
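
One runtime check that can help (a small sketch, not part of the original pipeline) is to enable device placement logging, which makes Tensorflow print the device each operation was assigned to:

In [ ]:
import tensorflow as tf

# Ask Tensorflow to log, for every operation in the graph,
# the device (CPU or GPU) it ends up being placed on.
a = tf.random_uniform((3, 3))
b = tf.reduce_sum(a)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(b)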

The only way to pin data to the GPU in Tensorflow is to declare it as a tf.constant. This gives rise to a convoluted but working pipeline: load a batch of data onto the GPU as a tf.constant, do the pre-processing on the GPU, then use a placeholder for the index that defines a minibatch. This approach, suggested in this blog post, works well, but one detail is left out: how do we change our batch of data? Once it has been consumed by the network, how do we move on to the next batch and declare it as another tf.constant? How do we run the network on that new constant? As you may know, once the graph has been defined, Tensorflow freezes it and always runs the same graph.

The answer is to do some surgery on the Tensorflow computation graph: for each batch of data, remove the node for the tf.constant and replace it with a new one holding the next batch.

Let's demonstrate how to do it in practice with a toy example. First let us define our data: an array of shape (100, 4, 4). We want to transfer it to the GPU in batches of 20, do some pre-processing and then feed the images to the network one by one.

In [ ]:
import numpy as np
import tensorflow as tf

# The size of each initial batch.
BATCH_SIZE = 20
# The size of the minibatch which we want to pre-process.
MINIBATCH_SIZE = 1
# Initial number of images/data.
N = 100

# Create the dataset in CPU memory
np_data = np.array(range(N*4*4), dtype=np.float32).reshape(N, 4, 4)

Now we define the computation graph:

In [ ]:
with tf.Graph().as_default() as g:
    # Load a batch of data on GPU
    tf_data = tf.constant(np_data[0:BATCH_SIZE], dtype=tf.float32, name='data')
    # Index of the minibatch inside the current batch
    ix = tf.placeholder(shape=(), dtype=tf.int32, name='ix')
    # ix = tf.constant(0, dtype=tf.int32, name='ix')
    # Select a single image from that batch = shape (1, 4, 4)
    batch = tf.slice(tf_data, [MINIBATCH_SIZE * ix, 0, 0], [MINIBATCH_SIZE, -1, -1], name='batch')
    # ...
    # Do some pre-processing here on the batch, which outputs a minibatch of size (4, 2, 2)
    # ...
    minibatch = tf.reshape(batch, (-1, 2, 2))[:4]
    # Do something with the minibatch - here dummy computation
    # If we wanted to work on the minibatch slice by slice, we 
    # could have another index placeholder
    outp = tf.reduce_sum(tf.square(minibatch), name='outp')
    # Save graph definition
    gdef = g.as_graph_def()

ix is a placeholder for the index inside the current batch. The batch data is defined as a tf.constant to force it to stay on the GPU once it has been moved there. We use tf.slice to extract the data corresponding to our index ix from the initial batch, for the pre-processing step. After pre-processing we end up with a minibatch made of several images. outp performs some dummy computation on this minibatch. We save the graph definition in the gdef variable for our later surgery.

In [ ]:
tf.reset_default_graph()

with tf.Session() as sess:
    # Set tf.AUTO_REUSE to be allowed to re-import the graph at each batch
    with tf.variable_scope('', reuse=tf.AUTO_REUSE):
        # Loop over batches of data of size BATCH_SIZE
        for idx in range(N // BATCH_SIZE):
            new_data = tf.constant(np_data[BATCH_SIZE*idx:BATCH_SIZE*(idx+1)], dtype=tf.float32, name='data%d' % idx)
            tf.import_graph_def(gdef, input_map={'data:0': new_data}, name='')
            # If we wanted to train a network we should save/restore weights
            # at this level.
            # sess.run(tf.global_variables_initializer())
            # For each batch, we are going to run the computation graph on a MINIBATCH_SIZE sample
            for i in range(BATCH_SIZE // MINIBATCH_SIZE):
                # Tensors imported after the first batch carry a '_<idx>' suffix in their names
                suffix = '' if idx == 0 else '_%d' % idx
                o_tensor = tf.get_default_graph().get_tensor_by_name('outp%s:0' % suffix)
                ix_tensor = tf.get_default_graph().get_tensor_by_name('ix%s:0' % suffix)
                o = sess.run([o_tensor], feed_dict={ix_tensor: i})

The key to the surgery on the TF computation graph lies in tf.import_graph_def. We use the keyword argument input_map to map the data:0 constant node to a new constant node which holds the next batch of data. Note that the name argument should be set to an empty string, otherwise all the nodes will get an additional name scope prefixed to their names.
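
As a quick illustration (these lines reuse the gdef and new_data from the cells above and are not an extra step of the pipeline):

In [ ]:
# Default behaviour: imported nodes are prefixed with the 'import' scope,
# so the output would have to be fetched as 'import/outp:0'.
tf.import_graph_def(gdef, input_map={'data:0': new_data})
# With name='', the original node names are kept; on subsequent imports,
# colliding names are made unique with a numeric suffix ('outp_1:0', 'outp_2:0', ...),
# which is why the loop above looks up 'outp_%d:0' for idx > 0.
tf.import_graph_def(gdef, input_map={'data:0': new_data}, name='')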

Warning: tf.import_graph_def only restores the graph, it does not restore variable values. If we wanted to train a real network, we would have to store all the weights for each batch of data and restore them after the surgery on the graph. For the sake of simplicity we leave this to the reader. Note that this can be yet another downside of the method, since storing/restoring weights involves additional memory transfers between CPU and GPU.

Profiling If we profile this with the nvprof profiler, we can see that there are only 5 host-to-device transfers (i.e. CPU to GPU), as expected. There are however still 100 transfers from device to host (GPU to CPU): every time we call sess.run in Tensorflow, all the tensors that were requested are brought back to the CPU after the computation graph has been executed, and each of them takes one call to CUDA memcpy DtoH (in our case we only asked for the output tensor).

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   29.61%  113.89us       100  1.1380us  1.0880us  1.5040us  [CUDA memcpy DtoH]
                    1.59%  6.1120us         5  1.2220us  1.1200us  1.4080us  [CUDA memcpy HtoD]

As you can see, any data transfer takes at least about 1 microsecond, no matter how small the data is. Let us increase the dataset to 1000 entries and the batch size to 200, keeping the same 1:5 ratio between the batch size and the dataset size.
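
Concretely, this only means changing the two constants at the top of the script (a trivial sketch; the rest of the code is unchanged):

In [ ]:
# Same pipeline, scaled up: 1000 images transferred to the GPU in batches of 200.
BATCH_SIZE = 200
N = 1000

Now we can clearly see the difference: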

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   30.19%  1.1380ms      1000  1.1370us  1.0870us  4.3200us  [CUDA memcpy DtoH]
                    0.30%  11.296us         5  2.2590us  2.2400us  2.2720us  [CUDA memcpy HtoD]

Despite the data size being 10 times bigger, the average time of each HtoD call is only about twice as long. If we had kept our 'naive' scheme of sending the minibatch data one by one to the GPU, the transfer time would have grown like the current DtoH transfers, i.e. by a factor of 10. So this strategy already cuts the memory transfer time needed to achieve our goal almost in half!

In PyTorch

PyTorch is meant to be more flexible and to have more of a DIY spirit than Tensorflow, so it is not surprising that this pipeline is much easier to achieve in PyTorch.

In [ ]:
import numpy as np
import torch
import torch.utils.data

# The size of each initial batch.
BATCH_SIZE = 20
# The size of the minibatch which we want to pre-process.
MINIBATCH_SIZE = 1
# Initial number of images/data.
N = 100

Then we create a dataset:

In [ ]:
# Create a dataset on CPU
np_data = np.array(range(N*4*4), dtype=np.float32).reshape(N, 4, 4)
# Load to Torch tensor
data = torch.from_numpy(np_data)
dataset = torch.utils.data.TensorDataset(data)

Creating batches from the dataset is simple, and with pin_memory we can ask the DataLoader to put the batches in pinned (page-locked) host memory, which speeds up the subsequent transfer to the GPU:

In [ ]:
# Prepare batches
batch = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, pin_memory=True)
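
Side note: pinned memory mainly pays off when the copy to the GPU is made asynchronous, so that it can overlap with GPU computation. A minimal variant of the transfer used in the loop below (an optional sketch, not required for the rest of the example) would be:

In [ ]:
# Pinned (page-locked) batches can be copied to the GPU asynchronously.
for (image,) in batch:
    image = image.cuda(non_blocking=True)
    # ... pre-processing and training would go here, as in the loop below ...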

Now we can iterate over the batches and do our pre-processing:

In [ ]:
# Iterate over batches
for i, data in enumerate(batch):
    image, = data
    # Load the batch to GPU
    image = image.cuda()
    # Slice into chunks
    chunks = torch.chunk(image, BATCH_SIZE // MINIBATCH_SIZE, dim=0)
    for c in chunks:
        # ...
        # Do some pre-processing and output a minibatch
        # ...
        minibatch = c.view((-1, 2, 2))[:4]
        # If we wanted to work on the minibatch images one by one we could use
        # torch.chunk again.
        output = torch.sum(torch.sqrt(minibatch))

Profiling Running nvprof gives us:

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   17.43%  113.44us       100  1.1340us  1.0880us  1.6000us  [CUDA memcpy DtoH]
                   13.04%  84.896us        13  6.5300us     992ns  67.968us  [CUDA memcpy HtoD]

We have 8 unexpected calls to CUDA memcpy HtoD, which come from the torch.utils.data.DataLoader call. They are independent of the batch size and the dataset size. The rest of the calls are as expected, and if you look back at the code, it is much cleaner than the TF equivalent.