.. _components_label: ============================= Components of Reproducible-ML ============================= In this section, we will show how the components of a Machine Learning (ML) pipeline are embedded in Reproducible-ML and highlight their relationship towards each other. Similar to the overview in :ref:`concept_label` section we start with basic data pre-processing and go through all steps of the pipeline until we reach the output. .. figure:: /images/overview/Overview.png :align: center Machine learning pipeline overview Basic Data Pre-Processing ========================= In Reproducible-ML the basic pre-processing is implemented with the help of two steps. The Handler module which takes as input the raw data and outputs TensorFlow features and the conversion which takes TensorFlow features and outputs TensorFlow record files. We call this process serialization. Storing the data in binary files (i.e. :file:`.tfrecords`) brings several advantages. One of them is the image and its annotations are stored in a single block of memory compared to storing each image and its annotations separately. Overall, using binary files makes the reading of data much more efficient. .. figure:: /images/Components/PreProcessing.png :align: center Detailed view of Basic Data Pre-Processing Handlers -------- Within the handler module, things like image scaling, slicing, reshaping etc. are done. Each handler inherits from a base class called *FeatureHandler*. Have a look at :ref:`handler_label` for more information concerning the base class. Lets consider an easy example, we have as raw input images of size [91,91] and we want to rescale them to [128,128] and return the images in TensorFlow bytes feature. The following code snipped shows such a Handler:: class TestHandler(FeatureHandler): """Handler to rescale images from [91,91] to [128,128] Loads images using an returns byte features. Attributes: img_folder (str): Path to folder containing the images. img_shape (list of int): Desired shape for the images. img_dtype (str): Desired data type for the images. """ def __init__(self, img_folder, img_shape, img_dtype): super(TestHandler, self).__init__(delegate_to=None, shape=[]) self.img_folder = img_folder self.img_shape = img_shape self.img_dtype = getattr(np, img_dtype) def handle(self, row, key): img = np.load(img_path) if self.img_shape != list(img.shape): img = scipy.misc.imresize(img, self.img_shape) img_feature = {} img_feature[key] = utils.bytes_feature(img.tostring()) return img_feature The image features are grouped as python dictionaries and for this simple case the key could simply be "image". The function which transforms the rescaled images to byte features is implemented in the *Utils* module. Have a look at :ref:`utils_label` for other possible features. Conversion ---------- The Conversion is simply done by binarizing the TensorFlow features and save them as :file:`.tfrecods`. A template for the conversion step called *write_records* is presented here:: def write_records(record_dir, record_pattern, keys_to_handlers, samples_per_record, split_to_size, rows_gen): record_pattern = mkdir_and_join(record_dir, record_pattern) for split in ["train", "eval", "test"]: rgen = rows_gen(split) n_records, n_samples_left = div_mod(split_to_size[split], samples_per_record) for idx in range(n_records): writer, record_path = get_writer(record_pattern, split, idx) write_samples(samples_per_record, rgen, keys_to_handlers, writer) if n_samples_left > 0: writer, _ = get_writer(record_pattern, split, n_records) write_samples(n_samples_left, rgen, keys_to_handlers, writer) For more information concerning the functions *div_mod*, *get_writer* and *write_samples* please have a look at :ref:`module.serialize_label`, however the names should be self-explanatory. Serialization ------------- To finalize the basic data pre-processing we bring together the handler and the conversion what we then call serialization. We need to choose which kind of handler we are going to use e.g. *TestHandler* and what kind of serialization will take place i.e. serialization from a NPZ file:: def config(npz_file): img_shape = [128, 128] img_dtype = "uint8" keys_to_descriptions = {"image": "Digit Image"} keys_to_handlers = {"image": TestHandler()} def serialize(npz_file, key): serialize_npz(npz_file=npz_file, key=key) You can test the basic pre-processing step by producing :file:`.tfrecord` files from the MNIST data set following the first two steps of :ref:`MNIST_label`. Data Input Pipeline =================== Since Reproducible-ML uses the TensorFlow Estimator Module :cite:`DBLP:journals/corr/abs-1708-02637`, the first two phases **Extract** and **Transfrom** are captured in an input function. A simplified input function could therefore be:: def input_fn(): def fn(): dataset = dataset_from_records("/path/to/dataset/train-*.tfrecord") dataset = tf.data.TFRecordDataset(tf.data.TFRecordDataset) dataset = dataset.map(map_func=map_fn()) dataset = dataset.batch(batch_size) dataset = shuffle_repeat_prefetch(dataset) return dataset return fn The data is now ready for being loaded onto any accelerator device to execute the core ML. The Core of Machine Learning ============================ This part is explained by means of a Generative Adversarial Network (GAN) :cite:`Gan`. GANs are ML algorithms consisting of two contesting neural networks, a Generator and a Discriminator/Critic. The Generator learns to generate samples while the Critic tries to distinguish between generated and real samples. For more information about GANs please refer to :cite:`Gan`. As introduced in :ref:`concept_label` we follow the idea of :cite:`Domingos:2012:FUT:2347736.2347755`, that a learning process consists of combinations of three components: **1. Representation:** The Generator and the Critic network. Here a simplified example of a Generator network which takes in a noise vector and returns a generated 28x28 image:: def generator_fn(gan_type): def generator(noise): with tf.name_scope("critic") as scope: layer = layers.fully_connected(noise, 1024) layer = layers.fully_connected(layer, 7*7*256) layer = tf.reshape(layer, [-1, 7, 7, 256]) layer = layers.conv2d_transpose(layer, 64, [4,4], stride=2) # 14x14 layer = layers.conv2d_transpose(layer, 32, [4,4], stride=2) # 28x28 layer = layers.conv2d(layer, 1, [4,4], stride=1, activation_fn=tf.tanh, normalizer_fn=None) return layer return generator And an example of a Critic network which takes as input an image and returns a value based on which real and fake samples are be distinguished:: def critic_fn(gan_type): def critic(images): with tf.name_scope("critic") as scope: layer = layers.conv2d(images, 64, [4,4], stride=2) # 14x14 layer = layers.conv2d(layer, 128, [4,4], stride=2) # 7x7 layer = layers.conv2d(layer, 1, [4,4], stride=2) # 7x7 return tf.reduce_mean(layer) return critic **2. Evaluation:** As evaluation metric we use the Wasserstein-loss introduced in :cite:`Wgan`, in which an approximation of the earth mover's distance is used. **3. Optimization:** To find the best target representation i.e. the best Generator and the best Critic we use the RMSProp algorithm :cite:`DBLP:journals/corr/DauphinVCB15`. It is important to notice that since Reproducible-ML is based on the TensorFlow Estimator, changing **2.** and/or **3.** is very easy, therefore it makes Reproducible-ML flexible. At the same time, change either of them you do not need to dig deeply in the code, which makes it also very simple. Let`s have a look at a possible training process:: def main(): config = tf.estimator.RunConfig(model_dir=model_dir) gan_estimator = tfgan.estimator.GANEstimator( generator_fn=generator_fn(gan_type), discriminator_fn=critic_fn(gan_type), generator_loss_fn=tfgan.losses.wasserstein_generator_loss, discriminator_loss_fn=tfgan.losses.wasserstein_discriminator_loss, generator_optimizer=tf.train.RMSPropOptimizer(gen_lr), discriminator_optimizer=tf.train.RMSPropOptimizer(crit_lr), config=config ) gan_estimator.train(input_fn()) Have a look at :cite:`tflosses` for different kinds of evaluation metrics and :cite:`TfOp` for other optimization techniques. The output of this training process are generated images. Have a look at some sample output :ref:`exp_label`, and then you can start with your own ML pipeline. Experiment Tracking =================== As mentioned in :ref:`concept_label` tracking of experiments is of great importance. To integrate this part into our framework as easy as possible, we use Sacred. Following the description of :cite:`Sacred` the core abstraction is the *Experiment* class in combination with the *configuration* of an experiment. Let`s have a look at a simple example:: from sacred import Experiment ex = Experiment() @ex.config def config(): samples_per_record = 10 compression = "GZIP" @ex.automain def main(samples_per_record, compression): print(samples_per_record) print(compression) You can know either run the code using the default settings: .. code-block:: console python -m test_sacred -m sacred which leads to the following output: .. code-block:: console WARNING - test_sacred - No observers have been added to this run INFO - test_sacred - Running command 'main' INFO - test_sacred - Started run with ID "32" 10 GZIP INFO - test_sacred - Completed after 0:00:00 or run it from the command line: .. code-block:: console python -m test_sacred with samples_per_record=20 compression="ZIP" -m sacred leading to the following output: .. code-block:: console WARNING - test_sacred - No observers have been added to this run INFO - test_sacred - Running command 'main' INFO - test_sacred - Started run with ID "33" 20 ZIP INFO - test_sacred - Completed after 0:00:00 We can open SacredBoard via: .. code-block:: console sacredboard -m sacred The output displayed on SacredBoard for the second run will therefore look like this: .. figure:: /images/Components/sacred.png :align: center Print screen of ScaredBoard There are also other nice and important features from Sacred which are used in Reproducible-ML. Please have a look at :cite:`SacredOnline` for a complete documentation of Sacred.