ThreadProcessor is an API to multi-thread data loading from
larcv files for efficient training of a deep neural network. I recently added a wiki page that describes how the internals work. There are a few notebooks in our tutorial that can be used as a reference. One for simple access to data contents, another on image classification and one more on semantic segmentation training.
I hope that was enough references to learn about it :) because this blog post is not about how to use it, but instead I share some findings from simple speed profiling test with different configurations. This study was prompted by reports from Taritree Wongjirad after finding some cases where the performance of the code became bad = slow (thanks Tari!).
Below you see different configurations I compared the results for. The tests were run on my macbook pro (16GB RAM, SSD, early 2013) as well as dell XPS15 (32GB RAM, SSD, model 9560) and show similar results (agreement on each measurement within ~10%). The actual data rate measured strongly depend on a) hardware specs and b) configurations (such as RAID mode). So take numbers as a grain of salt, and try to measure this on your system (click here for instructions).
The reason why we have multi-threaded file reader is so that we can CPU for data read while GPU is busy training a network. However, multi-threading can speed up data read speed even in CPU-only mode because our data is heavily compressed in our file = multiple threads can parallelize decoding of compressed data and file read tasks. Below, you see the comparison of an average data read speed measured every 1 second. The vertical axis shows the amount of data decoded and made available on RAM and the horizontal axis shows time elapsed since the beginning of executing the script. We can see there's almost a linear gain when we increase number of threads to 2, and then still a good fraction of increase but no longer linear when increasing to 3 threads. The increase from 3 to 4 threads is still clear but much smaller. Then finally 5 threads do not help any more.
Data access, copy vs. pointer wrapper
Here we vary two configurations:
RandomAccess configuration parameter of
make_copy configuration parameter of
dataloader2, a dedicated python API. For
RandomAccess, there are 3 possible values:
0 = no random entry access from the input file(s),
1 = completely randomize the entry to access, or
2 = access the random slice of input data. Among these options,
0 is the cheapest since it follows the order of data entries in a file.
1 is the most expensive as it requires a file header to move from one entry to another. The fact that each entry size varies in our data format makes this even slower.
2 is a simple intermediate solution between them by making only the first entry of a particularly sized data slice random.
make_copy is a configuration specific to
dataloader2, a python layer, and does not affect underlying
ThreadProcessor C++ API. It is
False by default in which case numpy array of loaded data is a mere pointer wrapper on underlying C array that is owned by
ThreadProcessor. When set to
dataloader2 prepares a dedicated
numpy array buffer to hold loaded data from C++ API and runs an explicit data copy. Although it can be expensive due to copying, it can be useful sometimes when you want to massage
numpy data without affecting underlying C++ data handling.
The plot above shows the data-read speed in MB/s (mega-byte-per-second) measured as a function of elapsed time in seconds. The legends show configuration of two parameters. We can clearly see that the biggest slow down is caused by setting
RandomAccess=1, almost 8 times slower than the two optimal configurations (red and purple). Comparing green (Copy+Random Slice) vs. purple (Wrap+Random Slice), we can see how data copy can affect the speed performance by almost a factor of 2. Finally, from the comparison of red (Wrap+No Random) and purple (Wrap+Random Slice), we see there is virtually no slow down by allowing slicing point to be randomly set. Therefore we see that
make_copy=False (latter is default) seems like a good point to sit.
Do It Yourself
Here's asciinema (loving it!) video for running the test script at a GPU tower I often use.