Training Semantic-Segmentation Data Set (v0.1.0)

Posted on Fri. 05 January 2018 in semantic segmentation by Kazuhiro Terao

In this notebook I demonstrate training a semantic segmentation algorithm that separates track and shower particles at the pixel level, using practice samples from the public data set. To learn about this sample, or what the challenge is about, you might want to start from this post. Here's an outline.

  1. U-ResNet (semantic segmentation network instance)
  2. IO configuration
  3. Training the net
  4. Training results/summary
  5. Extra: a bit more about U-ResNet

I usually write a closing remark in my notebook. But in this one, I write it at the beginning because it has a better chance of being read by you :) If you have a question, comment, or request regarding this notebook, the software, or the data used, please feel free to contact us! Your input is super valuable. If you find the software used in this notebook useful (larcv, u-resnet), please give them stars! You can make us really happy by explicitly cheering us up :)

OK, back to business. Before anything else, let's start with the basic (usual) imports.

In [1]:
import ROOT
from larcv import larcv
larcv.ThreadProcessor
from larcv.dataloader2 import larcv_threadio
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import os,sys,time

# tensorflow/gpu start-up configuration
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=2
import tensorflow as tf
Welcome to ROOTaaS 6.06/06
env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=2

U-ResNet

In this tutorial, we use U-ResNet, a type of convolutional neural network for the semantic segmentation task. It was developed and is used within the MicroBooNE experiment, which has shown beautiful track/shower separation images on both data and simulation at conferences. Below is a result of running a trained U-ResNet on real MicroBooNE detector data (source).

[Figure: MicroBooNE event display, panels "Data" (input image) and "Network" (U-ResNet output)]

This work was originally implemented using Caffe, and the details are being prepared for public release. However, we made the network design available in our group repository, which is what we use in this notebook. We do not describe the details of U-ResNet here (well, only briefly in the last section). That should be documented in the U-ResNet repository, and I might make a separate blog post about it. For now, just think of it as a generic semantic segmentation algorithm that can be trained for track/shower separation today.

Apologies for the long preface! Let's pull the code first.

In [2]:
%%bash
if [ ! -d "u-resnet" ]; then
    git clone https://github.com/DeepLearnPhysics/u-resnet
fi
Cloning into 'u-resnet'...

Let's add the u-resnet Python library to the system path, and then we are ready to import.

In [3]:
# u-resnet directory
sys.path.insert(0,'u-resnet/lib')
import ssnet_trainval as api
t = api.ssnet_trainval()

Software preparation is done. We will use the ssnet_trainval Python class. It's an interface to run training and/or inference (i.e. analysis) of U-ResNet. It needs a configuration file to pass parameters, which we cover in the next part.
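As a quick preview of how we will drive this class later in the notebook, here is a minimal sketch of the call sequence. The method names (override_config, initialize, batch_process) are the ones used below; the configuration file name is a hypothetical placeholder, and in this notebook we will actually replace batch_process with a hand-written training loop.

import sys
sys.path.insert(0, 'u-resnet/lib')
import ssnet_trainval as api

t = api.ssnet_trainval()
t.override_config('uresnet_train.cfg')  # hypothetical path; we build the real one with tempfile below
t.initialize()                          # builds the network and the larcv_threadio readers
t.batch_process()                       # runs ITERATIONS training (or inference) steps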

IO Configuration

Here we go over 3 kinds of configurations.

  1. larcv_threadio python API to read larcv data for training
  2. larcv_threadio python API to read larcv data for testing during training (optional)
  3. U-ResNet configuration to run training (U-ResNet specific)

Regardless of the network choice, the first two should be useful. The last one is U-ResNet specific, so it won't apply if you have your own network (or your own implementation of U-ResNet).

Configuring input: training

This is the larcv_threadio configuration, and you can look at this quick start to get the big picture. The only difference from the image classification example is that, this time, the label data is also an image! So we use the same BatchFillerImage2D to read it out of the larcv file. We will use practice_train_2k.root provided in the public data set.

In [4]:
io_config = \
"""
MainIO: {
  Verbosity:    3
  EnableFilter: false
  RandomAccess: 2
  RandomSeed:   123
  InputFiles:   ["/stage/drinkingkazu/public_data/segment2d/without_weights/practice_train_2k.root"]
  ProcessType:  ["BatchFillerImage2D","BatchFillerImage2D"]
  ProcessName:  ["main_data","main_label"]
  NumThreads: 4
  NumBatchStorage: 4

  ProcessList: {
    main_data: {
      Verbosity: 3
      ImageProducer: "data"
      Channels: [0]
    }
    main_label: {
      Verbosity: 3
      ImageProducer: "segment"
      Channels: [0]
    }
  }
}      
"""

import tempfile
train_io_config = tempfile.NamedTemporaryFile('w')
train_io_config.write(io_config)
train_io_config.flush()

We use the handy Python tempfile module to create a named file that only exists during its lifetime (and hence only in this process).

You might have noticed other slight differences. First, the outermost block is named MainIO, as opposed to ThreadProcessor back in the image classification example. This is purely conventional: the u-resnet code identifies the training image supplier with the keyword MainIO. Also, you see the two BatchFillerImage2D instances are named main_data and main_label. This is to distinguish them from the other BatchFillerImage2D instances we will run for fetching test data samples, which come next!
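If you want to sanity-check this configuration on its own, you can drive it directly with larcv_threadio, just like in the image classification example. Below is a minimal sketch under that assumption (the configure dictionary keys follow the larcv dataloader2 quick start; adjust as needed). Note that 'filler_name' must match the outermost block name, MainIO, and the data and label batch shapes printed at the end should agree.

from larcv.dataloader2 import larcv_threadio
import numpy as np
import time

proc = larcv_threadio()
proc.configure({'filler_name': 'MainIO',
                'verbosity':   0,
                'filler_cfg':  train_io_config.name})
proc.start_manager(10)   # batch size 10 for this quick check
time.sleep(2)            # give the reader threads a moment to spin up
proc.next()              # fill one batch
image_batch = proc.fetch_data('main_data').data()
label_batch = proc.fetch_data('main_label').data()
print(np.shape(image_batch), np.shape(label_batch))  # data and label shapes should match
proc.stop_manager()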

Configuring input: testing

Wait, what is testing? During training, to watch out for over-fitting, we will continuously monitor the network's performance on the given task using a statistically separate data set. For this, we use practice_test_2k.root provided in the public data set.

In [5]:
io_config = \
"""
TestIO: {
  Verbosity:    3
  EnableFilter: false
  RandomAccess: 2
  RandomSeed:   123
  InputFiles:   ["/stage/drinkingkazu/public_data/segment2d/without_weights/practice_test_2k.root"]
  ProcessType:  ["BatchFillerImage2D","BatchFillerImage2D"]
  ProcessName:  ["test_data","test_label"]
  NumThreads: 2
  NumBatchStorage: 2

  ProcessList: {
    test_data: {
      Verbosity: 3
      ImageProducer: "data"
      Channels: [0]
    }
    test_label: {
      Verbosity: 3
      ImageProducer: "segment"
      Channels: [0]
    }
  }
}

"""

import tempfile
test_io_config = tempfile.NamedTemporaryFile('w')
test_io_config.write(io_config)
test_io_config.flush()

The configuration is very similar to the training one, except that the outermost block is renamed TestIO, again a u-resnet-specific convention. Also, to distinguish them from the BatchFillerImage2D instances in MainIO, we name these test_data and test_label. But the rest is pretty much the same.

U-ResNet configuration

This is a U-ResNet specific configuration file.

In [6]:
train_config = \
"""
NUM_CLASS          3
BASE_NUM_FILTERS   16
MAIN_INPUT_CONFIG  '{:s}'
TEST_INPUT_CONFIG  '{:s}'
LOGDIR             'ssnet_train_log'
SAVE_FILE          'ssnet_checkpoint/uresnet'
LOAD_FILE          ''
AVOID_LOAD_PARAMS  []
ITERATIONS         8000
MINIBATCH_SIZE     20
NUM_MINIBATCHES    1
DEBUG              False
TRAIN              True
TF_RANDOM_SEED     123
USE_WEIGHTS        False
REPORT_STEPS       200
SUMMARY_STEPS      20
CHECKPOINT_STEPS   100
CHECKPOINT_NMAX    20
CHECKPOINT_NHOUR   0.4
KEYWORD_DATA       'main_data'
KEYWORD_LABEL      'main_label'
KEYWORD_WEIGHT     ''
KEYWORD_TEST_DATA  'test_data'
KEYWORD_TEST_LABEL 'test_label'
KEYWORD_TEST_WEIGHT ''
"""

import tempfile
ssnet_config = tempfile.NamedTemporaryFile('w')
ssnet_config.write(train_config.format(train_io_config.name, test_io_config.name))
ssnet_config.flush()

Let's go over the parameters, in case you want to use or customize the u-resnet code for yourself. Otherwise skip to the training.

  • NUM_CLASS ... Number of classification categories (3 = background, shower, track).
  • BASE_NUM_FILTERS ... U-ResNet is made of a block of ResNet modules. This parameter defines the number of filters in the initial convolution layer (then it doubles per ResNet module). You can think of this as defining the overall scale factor for the neuron count in the whole network.
  • MAIN_INPUT_CONFIG ... a configuration file for larcv_threadio to read data from larcv file (either training or inference)
  • TEST_INPUT_CONFIG ... a configuration file for larcv_threadio to read data from larcv file (only for training)
  • LOGDIR ... where tensorboard log is created
  • SAVE_FILE ... where checkpoints will be saved
  • LOAD_FILE ... if provided, load the network's initial parameter values (weights) from this file (for training or inference)
  • AVOID_LOAD_PARAMS ... a list of parameters whose values won't be loaded from the weights file provided by LOAD_FILE.
  • ITERATIONS ... number of iterations to run for training or inference
  • MINIBATCH_SIZE ... number of data samples fed into the processor (GPU/CPU) at once.
  • NUM_MINIBATCHES ... if >1, the gradient is accumulated over the number of mini-batches defined by this parameter.
  • DEBUG ... more verbose output from u-resnet if True
  • TF_RANDOM_SEED ... random number seed for tensorflow
  • USE_WEIGHTS ... if enabled, a pixel-wise loss weighting factor is applied (requires KEYWORD_WEIGHT).
  • REPORT_STEPS ... reports accuracy and loss whenever the iteration count is a multiple of this value.
  • SUMMARY_STEPS ... stores a tensorboard log whenever the iteration count is a multiple of this value.
  • CHECKPOINT_STEPS ... stores a checkpoint whenever the iteration count is a multiple of this value.
  • CHECKPOINT_NMAX ... number of most recent checkpoints to be kept from deletion
  • CHECKPOINT_NHOUR ... a checkpoint made every N hours (N given by this value) is kept from deletion
  • KEYWORD_DATA ... BatchFillerImage2D's instance name to fetch image data for training.
  • KEYWORD_LABEL ... BatchFillerImage2D's instance name to fetch label data for training.
  • KEYWORD_WEIGHT ... BatchFillerImage2D's instance name to fetch weight data for training.
  • KEYWORD_TEST_DATA ... BatchFillerImage2D's instance name to fetch image data for testing.
  • KEYWORD_TEST_LABEL ... BatchFillerImage2D's instance name to fetch label data for testing.
  • KEYWORD_TEST_WEIGHT ... BatchFillerImage2D's instance name to fetch weight data for testing.

Phew, that was long! In short, we will use a batch size of 20 images and run for 8000 iterations. Since our input file holds 2000 images, this is about 80 epochs. For tutorial purposes, this should be enough (I'm speaking from experience, don't look for justification in this text!).
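Just to make that arithmetic explicit (a trivial check, not part of the u-resnet code):

iterations     = 8000
batch_size     = 20
images_in_file = 2000
print(iterations * batch_size / float(images_in_file))  # 80.0 epochs over the training file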

Training the net

We first override the default configuration with the custom one defined above, then initialize the u-resnet API.

In [7]:
t.override_config(ssnet_config.name)
t.initialize()
KEYWORD_LABEL....... main_label
LEARNING_RATE....... -1
REPORT_STEPS........ 200
KEYWORD_TEST_DATA... test_data
KEYWORD_DATA........ main_data
ITERATIONS.......... 8000
MAIN_INPUT_CONFIG... /tmp/tmpRRSKJw
KEYWORD_TEST_WEIGHT. 
SAVE_FILE........... ssnet_checkpoint/uresnet
CHECKPOINT_NHOUR.... 0.4
ANA_OUTPUT_CONFIG... 
CHECKPOINT_STEPS.... 100
USE_WEIGHTS......... False
TF_RANDOM_SEED...... 123
MINIBATCH_SIZE...... 20
KEYWORD_WEIGHT...... 
TEST_INPUT_CONFIG... /tmp/tmpktF1Ot
TEST_BATCH_SIZE..... 10
BASE_NUM_FILTERS.... 16
DEBUG............... False
SUMMARY_STEPS....... 20
NUM_CLASS........... 3
LOAD_FILE........... 
AVOID_LOAD_PARAMS... []
TRAIN............... True
LOGDIR.............. ssnet_train_log
CHECKPOINT_NMAX..... 20
NUM_MINIBATCHES..... 1
KEYWORD_TEST_LABEL.. test_label
 setting verbosity 3
 setting verbosity 3

Next, actually all you have to do is execute t.batch_process() and it will run 8000 training iterations. Then we can watch how the loss and accuracy change (= get bored and go watch a movie instead). But that is a bit boring. So I propose we play with one image from test_10k.root during training.

In [ ]:
ENTRY=2
def get_entry(entry):
    # image
    chain_image2d = ROOT.TChain("image2d_data_tree")
    chain_image2d.AddFile('test_10k.root')
    chain_image2d.GetEntry(entry)
    cpp_image2d = chain_image2d.image2d_data_branch.as_vector().front()
    # label
    chain_label2d = ROOT.TChain("image2d_segment_tree")
    chain_label2d.AddFile('test_10k.root')
    chain_label2d.GetEntry(entry)
    cpp_label2d = chain_label2d.image2d_segment_branch.as_vector().front()    
    return (np.array(larcv.as_ndarray(cpp_image2d)), np.array(larcv.as_ndarray(cpp_label2d)))

image2d, label2d = get_entry(ENTRY)
fig, (ax0,ax1) = plt.subplots(1,2,figsize=(16,8), facecolor='w')
ax0.imshow(image2d, interpolation='none', cmap='jet', vmin=0, vmax=1000, origin='lower')
ax0.set_title('image',fontsize=24)
ax1.imshow(label2d, interpolation='none', cmap='jet', vmin=0, vmax=3.1, origin='lower')
ax1.set_title('label',fontsize=24)
plt.show()

On the left, you see the input data image. On the right is the label, the answer that our algorithm is supposed to generate. This is the same image frequently referred to when we browsed the file contents, the event with 2 protons and 2 gamma rays. During the network training, we will periodically run the network to analyze this event image.

In order to add this small task, we write a simple training loop below ourselves instead of using u-resnet's prepared batch_process function. We will analyze the above image at certain steps, defined in image_dump_steps below.

If you want to jump to the results of the training, click here.

In [ ]:
input_shape  = [1,image2d.size]
image_data = np.array(image2d).reshape(input_shape)

image_dump_steps = np.concatenate((np.arange(0,100,20), 
                                   np.arange(100,400,100), 
                                   np.arange(400,1000,200), 
                                   np.arange(1000,20000,500))).astype(np.int32)

while t.current_iteration() < t.iterations():
    t.train_step()
    if t.current_iteration() in image_dump_steps:
        print('Image dump @ iteration {:d}'.format(t.current_iteration()))
        
        softmax, = t.ana(input_data = image_data)
        fig, (ax0,ax1,ax2) = plt.subplots(1,3,figsize=(24,8), facecolor='w')
        # image
        ax0.imshow(image2d, interpolation='none', cmap='jet', vmin=0, vmax=1000, origin='lower')
        ax0.set_title('image',fontsize=24)
        
        ax1.imshow(softmax[0,:,:,0], interpolation='none', cmap='jet', vmin=0, vmax=1.0, origin='lower')
        ax1.set_title('background score',fontsize=24)
        
        ax2.imshow(softmax[0].argmax(axis=2), interpolation='none', cmap='jet', vmin=0., vmax=3.1, origin='lower')
        ax2.set_title('classification', fontsize=24)
        plt.savefig('iteration_{:04d}.png'.format(t.current_iteration()))
        plt.show()
        plt.close()
Image dump @ iteration 20
Image dump @ iteration 40
Image dump @ iteration 60
Image dump @ iteration 80
saved @ ssnet_checkpoint/uresnet-100
Image dump @ iteration 100
@ iteration 200
Train set: loss=1716.585449   acc. all=0.991888   acc. nonzero=0.098823   
Test set: loss=1186.800781   acc. all=0.994053   acc. nonzero=0.096482   
saved @ ssnet_checkpoint/uresnet-200
Image dump @ iteration 200
saved @ ssnet_checkpoint/uresnet-300
Image dump @ iteration 300
@ iteration 400
Train set: loss=1704.628906   acc. all=0.990824   acc. nonzero=0.116905   
Test set: loss=1576.977783   acc. all=0.991348   acc. nonzero=0.126500   
saved @ ssnet_checkpoint/uresnet-400
Image dump @ iteration 400
saved @ ssnet_checkpoint/uresnet-500
@ iteration 600
Train set: loss=1089.086426   acc. all=0.993586   acc. nonzero=0.144060   
Test set: loss=1231.338867   acc. all=0.992651   acc. nonzero=0.182390   
saved @ ssnet_checkpoint/uresnet-600
Image dump @ iteration 600
saved @ ssnet_checkpoint/uresnet-700
@ iteration 800
Train set: loss=870.101807   acc. all=0.994499   acc. nonzero=0.197649   
Test set: loss=971.724487   acc. all=0.994194   acc. nonzero=0.231123   
saved @ ssnet_checkpoint/uresnet-800
Image dump @ iteration 800
saved @ ssnet_checkpoint/uresnet-900
@ iteration 1000
Train set: loss=791.630127   acc. all=0.995163   acc. nonzero=0.277875   
Test set: loss=1155.722412   acc. all=0.993097   acc. nonzero=0.306119   
saved @ ssnet_checkpoint/uresnet-1000
Image dump @ iteration 1000
saved @ ssnet_checkpoint/uresnet-1100
@ iteration 1200
Train set: loss=926.961670   acc. all=0.994331   acc. nonzero=0.367506   
Test set: loss=935.218201   acc. all=0.994130   acc. nonzero=0.362160   
saved @ ssnet_checkpoint/uresnet-1200
saved @ ssnet_checkpoint/uresnet-1300
@ iteration 1400
Train set: loss=725.397339   acc. all=0.995700   acc. nonzero=0.461138   
Test set: loss=811.109619   acc. all=0.995306   acc. nonzero=0.472196   
saved @ ssnet_checkpoint/uresnet-1400
saved @ ssnet_checkpoint/uresnet-1500
Image dump @ iteration 1500