Using lartpc_mlreco3d
¶
Before starting anything, it is good practice to look at your dataset in an event display. This chapter is strictly about the I/O part of lartpc_mlreco3d (independently of everything else: models, etc.) and how to use it to visualize your dataset.
If needed, edit the paths to the lartpc_mlreco3d library and to the data folder.
import os
SOFTWARE_DIR = '%s/lartpc_mlreco3d' % os.environ.get('HOME')
DATA_DIR = os.environ.get('DATA_DIR')
The usual imports and setting the right PYTHONPATH… click if you need to see them.
import sys, os
# set software directory
sys.path.insert(0, SOFTWARE_DIR)
import numpy as np
import yaml
import torch
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=False)
from mlreco.visualization import scatter_points, plotly_layout3d
from mlreco.visualization.gnn import scatter_clusters, network_topology, network_schematic
from mlreco.utils.ppn import uresnet_ppn_type_point_selector
from mlreco.utils.cluster.dense_cluster import fit_predict_np, gaussian_kernel
from mlreco.main_funcs import process_config, prepare
from mlreco.utils.gnn.cluster import get_cluster_label
from mlreco.utils.deghosting import adapt_labels_numpy as adapt_labels
from larcv import larcv
/usr/local/lib/python3.8/dist-packages/MinkowskiEngine/__init__.py:36: UserWarning:
The environment variable `OMP_NUM_THREADS` not set. MinkowskiEngine will automatically set `OMP_NUM_THREADS=16`. If you want to set `OMP_NUM_THREADS` manually, please export it on the command line before running a python script. e.g. `export OMP_NUM_THREADS=12; python your_program.py`. It is recommended to set it below 24.
Welcome to JupyROOT 6.22/09
Configuration¶
You need to specify a configuration, in YAML syntax, which tells lartpc_mlreco3d how you want to access the data: how many images per batch (batch_size), the path to your dataset, and which quantities you want to retrieve from it. You can even limit the I/O to a specific list of entry numbers using event_list.
cfg = """
iotool:
batch_size: 32
shuffle: False
num_workers: 4
collate_fn: CollateSparse
dataset:
name: LArCVDataset
data_keys:
- DATA_DIR/wire_mpvmpr_2020_04_test_small.root
limit_num_files: 10
#event_list: '[6436, 562, 3802, 6175, 15256]'
schema:
input_data:
- parse_sparse3d_scn
- sparse3d_reco
- sparse3d_reco_chi2
segment_label:
- parse_sparse3d_scn
- sparse3d_pcluster_semantics_ghost
cluster_label:
- parse_cluster3d_clean_full
- cluster3d_pcluster
- particle_pcluster
- particle_mpv
- sparse3d_pcluster_semantics
particles_label:
- parse_particle_points_with_tagging
- sparse3d_pcluster
- particle_corrected
""".replace('DATA_DIR', DATA_DIR)
Now that the configuration is defined, you can feed it to lartpc_mlreco3d:
cfg = yaml.load(cfg, Loader=yaml.Loader)
# pre-process configuration (checks + certain non-specified default settings)
process_config(cfg)
# prepare function configures necessary "handlers"
hs = prepare(cfg)
Config processed at: Linux tur012 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$CUDA_VISIBLE_DEVICES="0"
{ 'iotool': { 'batch_size': 32,
'collate_fn': 'CollateSparse',
'dataset': { 'data_keys': [ '/sdf/home/l/ldomine/lartpc_mlreco3d_tutorials/book/data/wire_mpvmpr_2020_04_test_small.root'],
'limit_num_files': 10,
'name': 'LArCVDataset',
'schema': { 'cluster_label': [ 'parse_cluster3d_clean_full',
'cluster3d_pcluster',
'particle_pcluster',
'particle_mpv',
'sparse3d_pcluster_semantics'],
'input_data': [ 'parse_sparse3d_scn',
'sparse3d_reco',
'sparse3d_reco_chi2'],
'particles_label': [ 'parse_particle_points_with_tagging',
'sparse3d_pcluster',
'particle_corrected'],
'segment_label': [ 'parse_sparse3d_scn',
'sparse3d_pcluster_semantics_ghost']}},
'minibatch_size': 32,
'num_workers': 4,
'shuffle': False}}
Loading file: /sdf/home/l/ldomine/lartpc_mlreco3d_tutorials/book/data/wire_mpvmpr_2020_04_test_small.root
Loading tree sparse3d_reco
Loading tree sparse3d_reco_chi2
Loading tree sparse3d_pcluster_semantics_ghost
Loading tree cluster3d_pcluster
Loading tree particle_pcluster
Loading tree particle_mpv
Loading tree sparse3d_pcluster_semantics
Loading tree sparse3d_pcluster
Loading tree particle_corrected
Found 101 events in file(s)
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py:478: UserWarning:
This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
Iterate¶
One way to iterate through the dataset is using next:
data = next(hs.data_io_iter)
You can see that data is a dictionary whose keys match the names specified in the configuration block above:
data.keys()
dict_keys(['input_data', 'segment_label', 'cluster_label', 'particles_label', 'index'])
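Each call to next returns one batch of batch_size (here 32) images, and the index key tells you which file entries ended up in the batch. A minimal check, assuming the iterator can be advanced repeatedly as above:
print(data['index'])  # file entry numbers contained in this batch
# data = next(hs.data_io_iter)  # uncomment to advance to the next batch of 32 entries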
Taking a look at the images¶
It is time to run this configuration and see what we get out of it! We use Plotly to visualize the 3D images.
entry = 0
Let’s select the first entry.
# Column 3 of each collated array is the batch id, i.e. which of the
# batch_size images in the batch a given row belongs to
clust_label = data['cluster_label'][data['cluster_label'][:, 3] == entry]
input_data = data['input_data'][data['input_data'][:, 3] == entry]
segment_label = data['segment_label'][data['segment_label'][:, 3] == entry, -1]
particles_label = data['particles_label'][data['particles_label'][:, 3] == entry]
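For reference, in these collated arrays the first three columns are the voxel coordinates, column 3 is the batch id used for the selection above, and the remaining columns are the features requested in the schema. A quick shape check (a sketch using the arrays just defined):
print(input_data.shape)       # rows are voxels; columns are x, y, z, batch id, then the requested features
print(segment_label.shape)    # one semantic label per voxel (only the last column was kept above)
print(particles_label.shape)  # one row per point of interest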
Input data¶
Let us visualize the input_data first. These are the energy deposits detected by the LArTPC: the values might be the true deposited energy if you are looking at true labels, or a reconstructed energy, depending on what you load into input_data.
trace = []
trace+= scatter_points(input_data,markersize=1,color=input_data[:, -2], cmin=0,cmax=500)
trace[-1].name = 'input_data'
trace[-1].marker.colorscale='viridis'
trace+= scatter_points(input_data[segment_label < 5],markersize=1,color=input_data[segment_label < 5, -2], cmin=0,cmax=500)
trace[-1].name = 'input_data (true noghost)'
trace[-1].marker.colorscale='viridis'
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
Semantic labels¶
Let us look at the semantic labels next: for each voxel, there is a class label which takes integer values in [0, 5]. Class 5 marks ghost points, which is why the plots select segment_label < 5 to show the image without ghosts.
trace = []
trace+= scatter_points(input_data,markersize=1,color=segment_label, cmin=0,cmax=5, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'Semantic segmentation labels (including ghost points)'
trace+= scatter_points(input_data[segment_label < 5],markersize=1,color=segment_label[segment_label < 5], cmin=0,cmax=5, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'Semantic segmentation labels (w/o ghost points)'
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
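You can also tally the semantic classes numerically rather than by eye; a minimal sketch using the arrays defined above:
values, counts = np.unique(segment_label, return_counts=True)
print(dict(zip(values.astype(int).tolist(), counts.tolist())))  # number of voxels per semantic class (5 = ghost)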
Points of interest¶
Finally, particles_label holds the coordinates of points of interest. They are displayed as big dots in the visualization, and each dot's color corresponds to the true semantic class attributed to that point of interest.
trace = []
trace+= scatter_points(input_data[segment_label < 5],markersize=1,color=segment_label[segment_label < 5], cmin=0,cmax=5, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'Semantic segmentation labels (w/o ghost points)'
trace += scatter_points(particles_label, markersize=5, color=particles_label[:, 4], cmin=0, cmax=10, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = "True point labels"
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
Particle instances¶
A particle instance is a cluster of voxels that belong to an individual particle. Each color here represents a different particle instance.
trace = []
trace+= scatter_points(clust_label[:, :3],markersize=1,color=clust_label[:, 6], cmin=0, cmax=50, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'True cluster labels'
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
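To count the instances rather than eyeball the colors, you can look at the distinct values of column 6 (the cluster id used for the colors above); a sketch:
cluster_ids = np.unique(clust_label[:, 6])  # note: -1, if present, typically flags voxels without a valid cluster id
print(len(cluster_ids), 'distinct cluster ids in this entry')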
Interaction groups¶
Particle instances can then be grouped into interactions. Each color is a different interaction in this visualization.
trace = []
trace+= scatter_points(clust_label[:, :3],markersize=1,color=clust_label[:, 7], cmin=0, cmax=50, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'True interaction labels'
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
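Similarly, column 7 is the interaction id used above. Here is a sketch that isolates the voxels of a single interaction:
interaction_ids = np.unique(clust_label[:, 7])
one_interaction = clust_label[clust_label[:, 7] == interaction_ids[0]]
print(len(interaction_ids), 'interactions;', len(one_interaction), 'voxels in the first one')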
Neutrino vs cosmics¶
There are two types of interactions: the ones due to a neutrino traversing the detector volume (i.e. our signal!), and the ones due to cosmic rays (background).
trace = []
trace+= scatter_points(clust_label[:, :3],markersize=1,color=clust_label[:, 8], cmin=0, cmax=50, colorscale=plotly.colors.qualitative.D3)
trace[-1].name = 'Nu / cosmic labels'
fig = go.Figure(data=trace,layout=plotly_layout3d())
fig.update_layout(legend=dict(x=1.0, y=0.8))
iplot(fig)
More about the I/O configuration¶
Keep reading if you want to understand more about that YAML configuration block.
lartpc_mlreco3d expects as input a ROOT file created with LArCV, a C++ library to process LArTPC images. You can explore this file with ROOT alone, if you know how to use ROOT, but the most intuitive way to visualize it is to use the I/O module of lartpc_mlreco3d.
What are parsers?¶
Images are stored in ROOT + LArCV format. Parsers are functions in lartpc_mlreco3d that read the information stored in the ROOT file, select just what we are interested in (e.g. particle information or energy depositions) and format it in a way that is friendly to our chain.
They only need to know the names of certain quantities stored in the ROOT file (as TTrees, if you are familiar with ROOT).
I/O Configuration - a brief gist¶
lartpc_mlreco3d uses the YAML format for its configuration. Here is a skeleton config that shows only the parameters that will matter the most for you:
iotool:
  batch_size: 16
  (...)
  sampler:
    name: RandomSequenceSampler
    (...)
  dataset:
    (...)
    data_keys:
      - DATA_DIR/wire_mpvmpr_2020_04_test_small.root
    (...)
    schema:
      input_data:
        - parse_sparse3d_scn
        - sparse3d_pcluster
      (...)
The main things to pay attention to in this data I/O configuration block are:
- the batch size
- the randomization (default: none if the sampler is commented out, enabled if you include the RandomSequenceSampler)
- the dataset filename
- the schema, or list of parsers and their individual configurations
The schema has a simple format: it is a collection of named entries, each formatted as follows:
my_custom_data_name:
  - parse_whatever      # this is the parser name
  - sparse3d_pcluster   # this string will be the parser's first argument
  - sparse3d_reco       # this string will be the parser's second argument
  - # etc
The parser’s arguments are the names of quantities stored in the input ROOT file.
A real life example would look like this:
input_data:
  - parse_sparse3d_scn
  - sparse3d_pcluster
This tells us that we want a field called input_data (a name of our choice) in the input data dictionary. For the sake of example, let's call this input dictionary data_blob.
The parser name needs to come first in the list. Hence, data_blob['input_data'] will be the output of the parser parse_sparse3d_scn. That parser will receive as arguments whatever names follow in the list; here there is just one, sparse3d_pcluster.
You can have as many such items in the schema as you like; each of them will be available in the input data dictionary under the name that you specify.
Note: some stages of the chain might expect a specific name in the input dictionary. Sometimes this is configurable in the network configuration.
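To see this correspondence concretely with the configuration loaded at the top of this chapter, you can compare the schema entry names to the keys of a batch dictionary (a sketch using the cfg and data objects defined earlier):
schema = cfg['iotool']['dataset']['schema']
print(sorted(schema.keys()))  # names chosen in the configuration
print(sorted(data.keys()))    # the same names come back in each batch, plus 'index'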
I/O Configuration - a real example¶
Let us show a real I/O configuration example:
iotool:
  batch_size: 16
  shuffle: False
  num_workers: 8
  collate_fn: CollateSparse
  sampler:
    name: RandomSequenceSampler
  dataset:
    name: LArCVDataset
    data_keys:
      - ./wire_mpvmpr_2020_04_test_small.root
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d_scn
        - sparse3d_pcluster
      segment_label:
        - parse_sparse3d_scn
        - sparse3d_pcluster_semantics
      cluster_label:
        - parse_cluster3d_clean_full
        - cluster3d_pcluster
        - particle_pcluster
        - sparse3d_pcluster_semantics
      particles_label:
        - parse_particle_points
        - sparse3d_pcluster
        - particle_corrected
These are typical inputs that you would be looking at:
- parse_sparse3d_scn gets you the actual 3D image, i.e. the voxels and various values for each voxel. For each voxel it will parse as many features as you provide branch names. Here each voxel in input_data will have a single feature coming from sparse3d_pcluster (the true energy depositions). Each voxel in segment_label will have a single feature coming from sparse3d_pcluster_semantics (which holds the true semantic labels). A concrete check with the configuration loaded earlier in this chapter is sketched after this list.
- parse_cluster3d_clean_full parses particle-wise information and aggregates it. For each non-zero voxel it will provide you with the energy deposit, cluster id, interaction id, neutrino vs cosmic label, etc.
- parse_particle_points retrieves the coordinates and semantic class of points of interest for PPN.
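As a concrete check of the first point: with the configuration loaded at the top of this chapter, input_data was given two tree names (sparse3d_reco and sparse3d_reco_chi2), so the collated array should have 3 coordinate columns, 1 batch id column and 2 feature columns. A sketch:
print(data['input_data'].shape[1])  # expected: 3 (x, y, z) + 1 (batch id) + 2 features = 6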
Now you should be all set to browse through the images in a dataset and loop over them.
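Finally, if you want a full pass over the file rather than calling next by hand, you can loop over the underlying data loader. This is a sketch, assuming hs.data_io is the DataLoader built by prepare() as in this chapter's setup:
for batch in hs.data_io:
    print(batch['index'])  # each batch is the same kind of dictionary as `data` above
    break                  # remove this line to really go through the whole dataset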