Introduction
*******************

The core of this project is made up of three important components:

* The :class:`~core.datasample.DataSample` class holds all data relevant for exactly one simulation.
* A number of :class:`~core.pipeline_block.PipelineBlock` classes, each performing a single task for a simulation (such as a necessary preprocessing step or a metric calculation).
* The :class:`~core.pipeline.Pipeline` class, which is responsible for handling all PipelineBlocks and running them on each DataSample.

The pipeline is parallelized using the `parsl library <https://parsl.readthedocs.io/en/stable/>`_; see :ref:`parsl <parsl-page>` for details.

While following this documentation, we strongly suggest that you open :code:`src/run_sample.py` as a simple first example.
Once some of the basic functionality is clear, you can look at :code:`src/run_nonrigid.py` or :code:`src/run_nonrigid_with_us_and_rendering` for more complete examples.

This documentation is still incomplete. If you find errors or are missing important information, please `report it as an issue <https://gitlab.com/nct_tso_public/nonrigid-data-generation-pipeline>`_ or, even better, create a pull request!


The DataSample
====================

The :class:`~core.datasample.DataSample` class represents a single simulation. It stores all of its information in a single folder and gives read and write access to these files.
A DataSample is considered "valid" as long as none of the PipelineBlocks have reported an issue with it. As soon as an issue is encountered (such as a simulation that doesn't converge, or intersecting triangles), the PipelineBlock should raise an exception, which is stored with the DataSample. In this case, subsequent PipelineBlocks in the Pipeline will not be called on this DataSample.
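
The validity rule can be illustrated with a minimal, self-contained sketch. Everything in it (the ``Sample`` class, the three block functions and ``run_blocks``) is a hypothetical stand-in written for this page, not the project's actual API; it only mirrors the behaviour described above.

.. code-block:: python

    class Sample:
        """Hypothetical stand-in for DataSample: tracks validity and a log."""

        def __init__(self, sample_id):
            self.sample_id = sample_id
            self.error = None   # the stored exception, if any block failed
            self.log = []       # names of the blocks that actually ran

        @property
        def valid(self):
            return self.error is None


    def mesh_block(sample):
        sample.log.append("mesh")


    def simulation_block(sample):
        sample.log.append("simulation")
        if sample.sample_id == 1:
            # Report the issue by raising, as a block is expected to do.
            raise RuntimeError("simulation did not converge")


    def metrics_block(sample):
        sample.log.append("metrics")


    def run_blocks(sample, blocks):
        """Call each block in order; stop once the sample is invalid."""
        for block in blocks:
            if not sample.valid:
                break
            try:
                block(sample)
            except Exception as exc:
                sample.error = exc   # keep the issue with the sample
        return sample


    blocks = [mesh_block, simulation_block, metrics_block]
    good = run_blocks(Sample(0), blocks)
    bad = run_blocks(Sample(1), blocks)

    print(good.valid, good.log)
    print(bad.valid, bad.log, repr(bad.error))

In this sketch the metrics block never runs on the failed sample, while the other sample passes through all three blocks.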

The DataSample class also stores logs for the given sample, as well as meta-info such as statistics for easier access.

Note that the DataSample is the only thing that is passed into the :code:`PipelineBlock::run()` functions. This means that it is passed between Python processes by the parallelization library parsl. Usually, this does not change much for you as the user, but there may be edge cases where it matters. In python_apps, you can write things to the DataSample class (for example by adding a new file list). By returning the changed sample from these functions, we make sure that subsequent blocks can read the changes that you've made. See :ref:`data_sample` for more details.
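
The following sketch shows why returning the changed sample matters. It does not use parsl: the process boundary is only mimicked with ``copy.deepcopy``, on the assumption that a worker process operates on a copy of the sample rather than on the caller's object. The ``Sample`` class, the ``file_lists`` attribute and the file names are hypothetical.

.. code-block:: python

    import copy


    class Sample:
        """Hypothetical stand-in for DataSample."""

        def __init__(self, path):
            self.path = path
            self.file_lists = {}


    def add_file_list(sample):
        """Plays the role of a block's run(): change the sample, return it."""
        sample.file_lists["surfaces"] = ["surface_0.obj", "surface_1.obj"]
        return sample


    def call_in_other_process(func, sample):
        """Mimic a process boundary: the worker only ever sees a copy."""
        worker_copy = copy.deepcopy(sample)
        result = func(worker_copy)
        return copy.deepcopy(result)


    original = Sample("data/sample_000000")
    returned = call_in_other_process(add_file_list, original)

    print(original.file_lists)   # the caller's object is untouched
    print(returned.file_lists)   # the returned copy carries the change

Only the returned object contains the new file list, which is why a block hands the modified sample back instead of relying on in-place changes.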

The PipelineBlock
====================

Each piece of functionality of the Pipeline is implemented in a subclass of the :class:`~core.pipeline_block.PipelineBlock` class. Examples could be:

* generating a mesh
* extracting a surface
* adding noise

The PipelineBlock may run Python code or arbitrary bash code. Because of parallelization, each PipelineBlock should act independently (i.e. without referencing other PipelineBlocks).
To add your own functionality, subclass the PipelineBlock. See :ref:`building_pipeline_blocks` for more details.
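
As a rough illustration of the subclassing pattern, consider the sketch below. The base class here is a simplified, hypothetical stand-in, and a plain dictionary plays the role of the DataSample; the real :class:`~core.pipeline_block.PipelineBlock` interface is described in :ref:`building_pipeline_blocks` and may differ. The block keeps all of its settings on itself and touches nothing but the sample it is given.

.. code-block:: python

    import random


    class PipelineBlock:
        """Hypothetical, simplified stand-in for the real base class."""

        def run(self, sample):
            raise NotImplementedError


    class AddNoiseBlock(PipelineBlock):
        """Adds Gaussian noise to a list of values stored in the sample."""

        def __init__(self, sigma, seed=0):
            self.sigma = sigma
            self.seed = seed

        def run(self, sample):
            # A private generator keeps the block independent of global state.
            rng = random.Random(self.seed)
            sample["noisy_points"] = [
                p + rng.gauss(0.0, self.sigma) for p in sample["points"]
            ]
            return sample


    # A plain dict stands in for the DataSample in this sketch.
    sample = {"points": [0.0, 1.0, 2.0]}
    block = AddNoiseBlock(sigma=0.1, seed=42)
    result = block.run(sample)

    print(result["noisy_points"])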

The Pipeline
====================

The :class:`~core.pipeline.Pipeline` class controls the process of calling each PipelineBlock on each DataSample. Depending on how the Pipeline is configured, this workload may be spread out across multiple threads, processes or computers. For a single DataSample, the blocks are guaranteed to be called in order, but there is no guarantee that sample i will be finished before sample i+1. For debugging purposes, the :code:`--run_sequential` flag processes the samples in order.
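
The ordering guarantee can be sketched with Python's standard library. Here a thread pool stands in for parsl, and the block names and ``process_sample`` function are hypothetical: the blocks run in a fixed order within each sample, while the order in which whole samples finish is left to the scheduler.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Hypothetical block names, listed in the order they must run.
    BLOCKS = ["generate_mesh", "extract_surface", "add_noise"]


    def process_sample(sample_id):
        """Run all blocks on one sample, strictly in order."""
        history = []
        for name in BLOCKS:
            history.append(name)   # a real block would do its work here
        return sample_id, history


    completion_order = []
    histories = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_sample, i) for i in range(8)]
        for future in as_completed(futures):
            sample_id, history = future.result()
            completion_order.append(sample_id)
            histories[sample_id] = history

    # The completion order may differ from run to run.
    print(completion_order)
    # The per-sample block order is always the same.
    print(histories[0])

Code that consumes the results should therefore not assume that samples arrive in index order unless the samples are processed sequentially.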

The pipeline also aggregates the statistics from all samples into a single file for an easier overview and, if you configure it to do so, generates plots of these statistics.