Streaming Input Dataset

We already covered the basics of an iterable dataset (Python basics) and pytorch's DataLoader (pytorch introduction). In this notebook, we introduce a few datasets that will be used during the hands-on sessions and practice looping over them using DataLoader!

import torch
import numpy as np
SEED=123
np.random.seed(SEED)
torch.manual_seed(SEED)
<torch._C.Generator at 0x7fe5c4b35110>

MNIST dataset

MNIST is widely used in introductory machine learning (ML) courses and lectures. Most, if not all, ML libraries provide an easy way (API) to access MNIST and many other publicly available datasets. This is true in pytorch as well. The MNIST dataset is available as a Dataset instance from torchvision.

Creating MNIST Dataset

torchvision is a supporting module that provides many image-related APIs, including an interface to (and management of) the MNIST dataset. Let's see how we can construct it:

import os
from torchvision import datasets, transforms
# Data file download directory
LOCAL_DATA_DIR = './mnist-data'
os.makedirs(LOCAL_DATA_DIR,exist_ok=True)
# Use prepared data handler from pytorch (torchvision)
dataset = datasets.MNIST(LOCAL_DATA_DIR, train=True, download=True,
                         transform=transforms.Compose([transforms.ToTensor()]))
/usr/local/lib/python3.8/dist-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)

Here, MNIST is also a type of Dataset (how? through class inheritance). All torch Dataset instances have two useful and common functions: a length representation and data element access via index.

print( len(dataset)  )
print( type(dataset[0]) )
60000
<class 'tuple'>
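To make the inheritance point concrete, here is a minimal sketch (purely hypothetical, not part of torchvision) of a custom Dataset subclass that supports the same len() and indexing behavior:

import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical Dataset: random gray-scale 'images' with integer labels."""
    def __init__(self, num_samples=100):
        self._data   = torch.rand(num_samples, 1, 28, 28)
        self._labels = torch.randint(0, 10, (num_samples,))
    def __len__(self):
        # enables len(dataset)
        return len(self._data)
    def __getitem__(self, index):
        # enables dataset[index]; mimic MNIST's (data, label) tuple
        return self._data[index], int(self._labels[index])

toy = ToyDataset()
print(len(toy), type(toy[0]))   # 100 <class 'tuple'>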

That being said, how each data element is presented depends on the particular Dataset implementation. In the case of MNIST, it is a tuple of length 2: data and label.

ENTRY=0
data, label = dataset[ENTRY]
print('Type of data  :', type(data),  'shape', data.shape)
print('Type of label :', type(label), 'value', label)
Type of data  : <class 'torch.Tensor'> shape torch.Size([1, 28, 28])
Type of label : <class 'int'> value 5

Each MNIST entry is a gray-scale image of a hand-written digit in 28x28 pixels. Note that the data shape is [1,28,28]. This follows the convention in pytorch for image data represented as \((Channel,Height,Width)\), or in short \((C,H,W)\). Let's visualize it using matplotlib.pyplot.imshow, which can take an \((H,W)\) array for a gray-scale image.

import matplotlib.pyplot as plt
%matplotlib inline

# Draw data
data = data.reshape(data.shape[1:])
plt.imshow(data,cmap='gray')
plt.show()
../_images/Python-06-MNIST-CIFAR10_9_0.png

Let us define a function that can list images and labels in the dataset.

def plot_dataset(dataset,num_image_per_class=10):
    import numpy as np
    # Determine the list of classes (use dataset.classes if available)
    num_class = 0
    classes = []
    if hasattr(dataset,'classes'):
        classes=dataset.classes
        num_class=len(classes)
    else: # brute force: scan the dataset for unique labels
        for data,label in dataset:
            if label in classes: continue
            classes.append(label)
        num_class=len(classes)
    
    # Compose one big image: one row per class, num_image_per_class columns
    shape = dataset[0][0].shape
    big_image = np.zeros(shape=[3,shape[1]*num_class,shape[2]*num_image_per_class],dtype=np.float32)
    
    finish_count_per_class=[0]*num_class
    for data,label in dataset:
        if finish_count_per_class[label] >= num_image_per_class: continue
        img_ctr = finish_count_per_class[label]
        # Paste this image into its class row, at the next free column
        big_image[:,shape[1]*label:shape[1]*(label+1),shape[2]*img_ctr:shape[2]*(img_ctr+1)]=data
        finish_count_per_class[label] += 1
        if np.sum(finish_count_per_class) == num_class*num_image_per_class: break
    import matplotlib.pyplot as plt
    fig,ax=plt.subplots(figsize=(8,8),facecolor='w')
    ax.tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False,labelleft=False,labelbottom=False)
    # imshow expects (H,W,C), so move the channel axis to the end
    plt.imshow(np.transpose(big_image,(1,2,0)))
    # Write the class name to the right of each row
    for c in range(len(classes)):
        plt.text(big_image.shape[2]+shape[2]*0.5,shape[1]*(c+0.6),str(classes[c]),fontsize=16)
    plt.show()

Visualize!

plot_dataset(dataset)
../_images/Python-06-MNIST-CIFAR10_13_0.png

Creating DataLoader

Since the MNIST dataset supports len() and indexed access, we can create a pytorch DataLoader with it!

import torch
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=20,
                                     shuffle=True,
                                     num_workers=1,
                                     pin_memory=True)

Review: the first argument is your dataset, and it can be any object that implements two special methods: __len__ and __getitem__. In case you wonder, these methods allow you to call len(dataset) and to access dataset elements by dataset[X] where X is an index integer.
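In fact, subclassing Dataset is not strictly required; any object with those two methods will do. Here is a minimal, hypothetical sketch of a plain Python class that a DataLoader can consume directly:

import torch

class MinimalSource:
    """Hypothetical data source: no torch Dataset inheritance at all."""
    def __init__(self, n=10):
        self._n = n
    def __len__(self):
        return self._n
    def __getitem__(self, index):
        # return a (data, label) pair, like MNIST
        return torch.full((1, 28, 28), float(index)), index

tiny_loader = torch.utils.data.DataLoader(MinimalSource(), batch_size=5)
for data, label in tiny_loader:
    print(data.shape, label)   # torch.Size([5, 1, 28, 28]) plus a batch of 5 labels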

Details (skip if you wish): other constructor arguments

The other constructor arguments used above are listed here; a small sketch after the list illustrates their effect:

  • batch_size … the size of the subset of data (a batch) provided at once

  • shuffle … whether or not to randomize the order in which the data is served (False provides the data in its original order)

  • num_workers … number of parallel data-reader processes to run (to make data reading faster using the multiprocessing module)

  • pin_memory … speed up data transfer to the GPU by avoiding the need to copy data from pageable memory to page-locked (pinned) memory. If you are not sure about the details, set it to True when using a GPU.
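As a minimal sketch (reusing the MNIST dataset constructed above, with illustrative batch_size and num_workers values), the snippet below contrasts two DataLoader configurations; pin_memory only pays off when a GPU is actually used.

import torch
# A minimal sketch contrasting two DataLoader configurations (illustrative values)
loader_plain = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=False,
                                           num_workers=0, pin_memory=False)
loader_fancy = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True,
                                           num_workers=2, pin_memory=True)

data, label = next(iter(loader_plain))
print(data.shape)   # batch_size controls the leading dimension: [32, 1, 28, 28]
print(label[:5])    # first five labels in the original dataset order (shuffle=False)

data, label = next(iter(loader_fancy))
print(label[:5])    # five labels drawn in a randomized order (shuffle=True)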

Data streaming with DataLoader

So let’s play with it! First of all, it has the concept of “length”.

print('length of DataLoader:',len(loader))
print('By the way, batch size * length =', 20 * len(loader))
length of DataLoader: 3000
By the way, batch size * length = 60000
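In general, the length of a DataLoader is the number of batches it will yield: ceil(len(dataset)/batch_size), unless drop_last=True, in which case the incomplete final batch is dropped and the length becomes floor(len(dataset)/batch_size). A minimal sketch with a hypothetical batch size of 64, which does not divide 60,000 evenly:

import math
# batch size 64 does not divide 60,000: 60,000 / 64 = 937.5
loader_keep = torch.utils.data.DataLoader(dataset, batch_size=64)                  # keeps the last partial batch
loader_drop = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)  # drops the last partial batch
print(len(loader_keep), math.ceil(len(dataset)/64))   # 938 938
print(len(loader_drop), len(dataset)//64)             # 937 937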

Here, 3,000 batches times 20 samples per batch gives exactly the 60,000 samples in the dataset: each element of the DataLoader is one batch. Yep, as you guessed, DataLoader is iterable:

# Create an infinitely cycling iterator for playing in this notebook
from itertools import cycle
batch_iter = cycle(loader)

for i in range(10):
    batch = next(batch_iter)
    print('Iteration',i)
    print(batch[1]) # accessing the labels

… and this is how the data looks:

print('Shape of an image batch data',batch[0].shape)

… which is, quite naturally, a batch of 20 images of 28x28 pixels (each with 1 channel).

CIFAR10

CIFAR10 is yet another public dataset of 32x32-pixel RGB photographs. It contains 10 classes like MNIST, but the images are much more complex than gray-scale, hand-written digits.

from torchvision import datasets, transforms
# Data file download directory
LOCAL_DATA_DIR = './cifar10-data'
# Create the dataset
dataset = datasets.CIFAR10(LOCAL_DATA_DIR, train=True, download=True,
                           transform=transforms.Compose([transforms.ToTensor()]))

plot_dataset(dataset)
Files already downloaded and verified
../_images/Python-06-MNIST-CIFAR10_23_1.png

Nothing new in terms of how-to, but let’s also create a DataLoader with CIFAR10.

loader = torch.utils.data.DataLoader(dataset,batch_size=8,shuffle=True,num_workers=1,pin_memory=True)

batch = next(cycle(loader))

Let's take a look at the batch data. Recall that the shape of each image is \((C,H,W)\), whereas matplotlib.pyplot.imshow takes the format \((H,W,C)\), just like an ordinary photograph is stored. We use the torch permute function to swap the axes.

photos,labels=batch
for idx in range(len(photos)):
    photo = photos[idx].permute(1,2,0)
    label = labels[idx]
    print(dataset.classes[label])
    plt.imshow(photo)
    plt.show()