.. _components_label:

=============================
Components of Reproducible-ML
=============================

In this section, we show how the components of a Machine Learning (ML) pipeline
are embedded in Reproducible-ML
and how they relate to one another. As in the overview in the :ref:`concept_label` section,
we start with basic data pre-processing and go through all steps of the pipeline until
we reach the output.

.. figure:: /images/overview/Overview.png
   :align: center

   Machine learning pipeline overview


Basic Data Pre-Processing
=========================

In Reproducible-ML, basic pre-processing is implemented in two steps: the handler module,
which takes the raw data as input and outputs TensorFlow features, and the conversion, which
takes TensorFlow features and writes them to TensorFlow record files. We call this
process serialization.

Storing the data in binary files (i.e. :file:`.tfrecord`) brings several advantages.
One of them is that an image and its annotations are stored in a single block of memory
instead of being stored separately. Overall, binary files
make reading the data much more efficient.


.. figure:: /images/Components/PreProcessing.png
   :align: center

   Detailed view of Basic Data Pre-Processing

Handlers
--------

Within the handler module, operations such as image scaling, slicing and reshaping are performed. Each handler
inherits from a base class called *FeatureHandler*. Have a look at :ref:`handler_label` for more
information concerning the base class. Let's consider a simple example: we have raw input
images of size [91,91], and we want to rescale them to [128,128] and return them as TensorFlow
bytes features. The following code snippet shows such a handler::

    import os

    import numpy as np
    import scipy.misc

    # FeatureHandler and utils are provided by Reproducible-ML.

    class TestHandler(FeatureHandler):
        """Handler to rescale images from [91,91] to [128,128].

        Loads images and returns them as byte features.

        Attributes:
            img_folder (str): Path to folder containing the images.
            img_shape (list of int): Desired shape for the images.
            img_dtype (str): Desired data type for the images.
        """
        def __init__(self, img_folder, img_shape, img_dtype):
            super(TestHandler, self).__init__(delegate_to=None, shape=[])
            self.img_folder = img_folder
            self.img_shape = img_shape
            self.img_dtype = getattr(np, img_dtype)

        def handle(self, row, key):
            # Assumption: row[key] holds the file name of the image inside
            # img_folder; adapt this lookup to the layout of your raw data.
            img_path = os.path.join(self.img_folder, row[key])
            img = np.load(img_path)

            if self.img_shape != list(img.shape):
                # Note: scipy.misc.imresize is only available in older
                # SciPy releases.
                img = scipy.misc.imresize(img, self.img_shape)

            img_feature = {}
            img_feature[key] = utils.bytes_feature(img.tobytes())

            return img_feature


The image features are grouped in Python dictionaries; for this simple case the key could simply
be "image". The function that transforms the rescaled images into byte features is implemented in
the *Utils* module. Have a look at :ref:`utils_label` for other possible features.
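
A bytes feature stores only the raw buffer of the array, so the shape and the data type
have to be known when the data is read back. The following minimal sketch illustrates this
with NumPy alone; it does not use the *Utils* module, and the image content and variable
names are made up for illustration:

.. code-block:: python

   import numpy as np

   # Hypothetical rescaled image, standing in for the output of a handler.
   img = (np.arange(128 * 128) % 256).astype(np.uint8).reshape(128, 128)

   # Serialize: the byte string carries no shape or dtype information.
   raw = img.tobytes()

   # Deserialize: shape and dtype must be supplied again by the reader.
   restored = np.frombuffer(raw, dtype=np.uint8).reshape(128, 128)

   print(len(raw))                       # 16384, one byte per uint8 pixel
   print(np.array_equal(img, restored))  # True

This is presumably why the handler keeps ``img_shape`` and ``img_dtype`` as attributes:
the same values are needed later to decode the stored bytes.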

Conversion
----------

The conversion binarizes the TensorFlow features and saves them as
:file:`.tfrecord` files. A template for the conversion step, called *write_records*, is presented
here::

    def write_records(record_dir,
                      record_pattern,
                      keys_to_handlers,
                      samples_per_record,
                      split_to_size,
                      rows_gen):

        record_pattern = mkdir_and_join(record_dir, record_pattern)

        for split in ["train", "eval", "test"]:

            rgen = rows_gen(split)

            # Number of full records and size of the final, partial record.
            n_records, n_samples_left = div_mod(split_to_size[split],
                                                samples_per_record)

            for idx in range(n_records):

                writer, record_path = get_writer(record_pattern, split, idx)
                write_samples(samples_per_record, rgen, keys_to_handlers, writer)

            # Write the remaining samples into one additional record.
            if n_samples_left > 0:

                writer, _ = get_writer(record_pattern, split, n_records)
                write_samples(n_samples_left, rgen, keys_to_handlers, writer)

For more information about the functions *div_mod*, *get_writer* and *write_samples*, please
have a look at :ref:`module.serialize_label`; their names should be largely self-explanatory.
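
The split into full records and one partial record can be illustrated with Python's
built-in ``divmod``. This is only a sketch of the arithmetic; whether *div_mod* is
implemented this way is an assumption, and the numbers are made up for illustration:

.. code-block:: python

   samples_per_record = 100
   split_size = 1050  # hypothetical number of samples in one split

   n_records, n_samples_left = divmod(split_size, samples_per_record)

   # Sizes of the record files that would be written for this split.
   record_sizes = [samples_per_record] * n_records
   if n_samples_left > 0:
       record_sizes.append(n_samples_left)

   print(n_records, n_samples_left)  # 10 50
   print(len(record_sizes))          # 11

Every sample ends up in exactly one record, and only the last record can be smaller
than ``samples_per_record``.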


Serialization
-------------

To finalize the basic data pre-processing, we bring together the handler and the conversion; we
call the combination serialization. We need to choose which handler to use,
e.g. *TestHandler*, and which kind of serialization will take place, e.g. serialization from an NPZ file::

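    def config(npz_file):

        # Placeholder path; point this to the folder containing your images.
        img_folder = "/path/to/images"
        img_shape = [128, 128]
        img_dtype = "uint8"

        keys_to_descriptions = {"image": "Digit Image"}
        keys_to_handlers = {"image": TestHandler(img_folder, img_shape, img_dtype)}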


    def serialize(npz_file, key):

        serialize_npz(npz_file=npz_file, key=key)

You can test the basic pre-processing step by producing :file:`.tfrecord` files from the MNIST
data set following the first two steps of :ref:`MNIST_label`.


Data Input Pipeline
===================

Since Reproducible-ML uses the TensorFlow Estimator module :cite:`DBLP:journals/corr/abs-1708-02637`,
the first two phases, **Extract** and **Transform**, are captured in an input function. A simplified
input function could therefore be::


    def input_fn(batch_size=32):  # illustrative default batch size

        def fn():

            # Extract: read the serialized samples from the record files.
            dataset = dataset_from_records("/path/to/dataset/train-*.tfrecord")
            # Transform: parse each sample, then batch, shuffle and prefetch.
            dataset = dataset.map(map_func=map_fn())
            dataset = dataset.batch(batch_size)
            dataset = shuffle_repeat_prefetch(dataset)

            return dataset

        return fn

The data is now ready to be loaded onto an accelerator device to execute the core ML.

The Core of Machine Learning
============================

This part is explained by means of a Generative Adversarial Network (GAN) :cite:`Gan`.
GANs are ML algorithms consisting of two contesting neural networks, a Generator and
a Discriminator/Critic. The Generator learns to generate samples while the Critic
tries to distinguish between generated and real samples. For more information about GANs
please refer to :cite:`Gan`.

As introduced in :ref:`concept_label` we follow the idea of
:cite:`Domingos:2012:FUT:2347736.2347755`, that a learning process consists of combinations
of three components:

**1. Representation:** The Generator and the Critic network. Here is a simplified example of a
Generator network that takes in a noise vector and returns a generated 28x28 image::

    def generator_fn(gan_type):

        def generator(noise):

            with tf.name_scope("generator") as scope:

                layer = layers.fully_connected(noise, 1024)
                layer = layers.fully_connected(layer, 7*7*256)
                layer = tf.reshape(layer, [-1, 7, 7, 256])
                layer = layers.conv2d_transpose(layer, 64, [4,4], stride=2) # 14x14
                layer = layers.conv2d_transpose(layer, 32, [4,4], stride=2) # 28x28
                layer = layers.conv2d(layer, 1, [4,4], stride=1,
                                      activation_fn=tf.tanh,
                                      normalizer_fn=None)

                return layer
        return generator

And an example of a Critic network, which takes an image as input and returns a value based on
which real and fake samples can be distinguished::

    def critic_fn(gan_type):

        def critic(images):

            with tf.name_scope("critic") as scope:

                layer = layers.conv2d(images, 64, [4,4], stride=2) # 14x14
                layer = layers.conv2d(layer, 128, [4,4], stride=2) # 7x7
                layer = layers.conv2d(layer, 1, [4,4], stride=2) # 4x4

                return tf.reduce_mean(layer)

        return critic
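
The spatial sizes of the feature maps in both networks can be checked with a few lines of
arithmetic. The sketch below assumes that all layers use ``"SAME"`` padding, which is not
stated in the examples; the helper names are made up for illustration:

.. code-block:: python

   import math


   def conv_same(size, stride):
       """Output size of a convolution with 'SAME' padding."""
       return math.ceil(size / stride)


   def conv_transpose_same(size, stride):
       """Output size of a transposed convolution with 'SAME' padding."""
       return size * stride


   # Generator: 7x7 map after the reshape, two stride-2 transposed
   # convolutions, then a final stride-1 convolution.
   size = 7
   generator_sizes = [size]
   for stride in (2, 2):
       size = conv_transpose_same(size, stride)
       generator_sizes.append(size)
   generator_sizes.append(conv_same(size, 1))

   # Critic: 28x28 input image, three stride-2 convolutions.
   size = 28
   critic_sizes = [size]
   for stride in (2, 2, 2):
       size = conv_same(size, stride)
       critic_sizes.append(size)

   print(generator_sizes)  # [7, 14, 28, 28]
   print(critic_sizes)     # [28, 14, 7, 4]

Under this assumption the Generator ends at the 28x28 resolution that the Critic expects
as input.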


**2. Evaluation:** As the evaluation metric we use the Wasserstein loss introduced in :cite:`Wgan`,
which approximates the earth mover's distance.

**3. Optimization:** To find the best target representation, i.e. the best Generator and the best
Critic, we use the RMSProp algorithm :cite:`DBLP:journals/corr/DauphinVCB15`.
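
The Wasserstein loss from step 2 can be written down in a few lines. The following sketch
uses plain Python lists in place of tensors and made-up critic scores; it only illustrates
the formulas and is not the TFGAN implementation:

.. code-block:: python

   from statistics import mean


   def critic_loss(real_scores, fake_scores):
       """Wasserstein critic loss: mean fake score minus mean real score."""
       return mean(fake_scores) - mean(real_scores)


   def generator_loss(fake_scores):
       """Wasserstein generator loss: negative mean score on fake samples."""
       return -mean(fake_scores)


   # Hypothetical critic outputs for a batch of real and generated samples.
   real_scores = [2.0, 1.0, 3.0]
   fake_scores = [-1.0, 0.0, -2.0]

   print(critic_loss(real_scores, fake_scores))  # -3.0
   print(generator_loss(fake_scores))            # 1.0

The Critic lowers its loss by scoring real samples higher than generated ones, while the
Generator lowers its loss by raising the Critic's score on generated samples.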

It is important to note that, since Reproducible-ML is based on the TensorFlow Estimator,
changing **2.** and/or **3.** is very easy, which makes Reproducible-ML flexible. At the same
time, you do not need to dig deeply into the code to change either of them, which also makes
it very simple.
Let's have a look at a possible training process::

    def main():

        config = tf.estimator.RunConfig(model_dir=model_dir)

        gan_estimator = tfgan.estimator.GANEstimator(
            generator_fn=generator_fn(gan_type),
            discriminator_fn=critic_fn(gan_type),
            generator_loss_fn=tfgan.losses.wasserstein_generator_loss,
            discriminator_loss_fn=tfgan.losses.wasserstein_discriminator_loss,
            generator_optimizer=tf.train.RMSPropOptimizer(gen_lr),
            discriminator_optimizer=tf.train.RMSPropOptimizer(crit_lr),
            config=config
        )
        gan_estimator.train(input_fn())

Have a look at :cite:`tflosses` for different kinds of evaluation metrics and
:cite:`TfOp` for other optimization techniques.

The output of this training process consists of generated images. Have a look at some
sample output in :ref:`exp_label`, and then you can start with your own ML pipeline.


Experiment Tracking
===================

As mentioned in :ref:`concept_label`, tracking experiments is of great importance. To integrate
this part into our framework as easily as possible, we use Sacred. Following the
description in :cite:`Sacred`, the core abstraction is the *Experiment* class
in combination with the *configuration* of an experiment.
Let's have a look at a simple example::

    from sacred import Experiment

    ex = Experiment()

    @ex.config
    def config():
        samples_per_record = 10
        compression = "GZIP"


    @ex.automain
    def main(samples_per_record, compression):
        print(samples_per_record)
        print(compression)

You can now either run the code using the default settings:

.. code-block:: console

   python -m test_sacred -m sacred

which leads to the following output:

.. code-block:: console

   WARNING - test_sacred - No observers have been added to this run
   INFO - test_sacred - Running command 'main'
   INFO - test_sacred - Started run with ID "32"
   10
   GZIP
   INFO - test_sacred - Completed after 0:00:00

or override the configuration values from the command line:

.. code-block:: console

   python -m test_sacred with samples_per_record=20 compression="ZIP" -m sacred

leading to the following output:

.. code-block:: console

   WARNING - test_sacred - No observers have been added to this run
   INFO - test_sacred - Running command 'main'
   INFO - test_sacred - Started run with ID "33"
   20
   ZIP
   INFO - test_sacred - Completed after 0:00:00

We can open SacredBoard via:

.. code-block:: console

   sacredboard -m sacred

The output displayed on SacredBoard for the second run will therefore look like this:

.. figure:: /images/Components/sacred.png
   :align: center

   Screenshot of SacredBoard

Sacred offers further useful features that are used in Reproducible-ML.
Please have a look at :cite:`SacredOnline` for the complete Sacred documentation.