Parsl
The pipeline is intended to be run in parallel. We use the parsl library for this, and we recommend reading through at least some of its documentation. We parallelize at the PipelineBlock level, i.e. whenever a PipelineBlock processes a DataSample, that call runs in its own Python worker, potentially on a different cluster node from the one on which you started the script. At the moment, we support local execution (use the --run_local command line flag) and Slurm (the default), but other parsl configurations can be added easily (see core/parsl_config.py). You can also disable parallelization via the --run_sequential flag (recommended for debugging and development).
Note that at the moment, the same Python environment with all dependencies must be source-able both on the system where you start the script and on the cluster nodes. See core/parsl_config.py for details.
Also note that all used cluster nodes must have access to the output folder and it must be mounted (or sym-linked) under the same path. Theoretically, parsl can take care of passing files around between systems automatically, but we do not use this functionality at the moment.
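The three execution modes described above can be sketched as a small command line parser. This is only an illustration, not the pipeline's actual launcher: the flag names come from this document, while build_parser, select_mode and the rule that --run_sequential takes precedence over --run_local are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in this document (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Pipeline launcher (illustrative sketch)")
    parser.add_argument("--run_local", action="store_true",
                        help="run parsl workers on the local machine instead of Slurm")
    parser.add_argument("--run_sequential", action="store_true",
                        help="disable parallelization (useful for debugging)")
    parser.add_argument("--disable_caching", action="store_true",
                        help="re-run every block even if its input is unchanged")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Map the parsed flags to an execution mode.

    Assumption: sequential execution wins over local execution if both are given.
    """
    if args.run_sequential:
        return "sequential"
    if args.run_local:
        return "local"
    return "slurm"  # the default according to this document
```

With no flags, select_mode returns "slurm"; passing both --run_sequential and --run_local returns "sequential" under the precedence assumed here.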
App-Dependency vs Data-Dependency
Parsl can work with data-dependencies (an app can run when all the files and parameters it needs are ready) or app-dependencies (an app runs when the previous apps are done). Our PipelineBlocks may generate files that aren’t known in advance, i.e. the block itself may decide how many files it generates. This is especially true for the simulation block, which may output files for N frames until convergence, where N cannot be known before the actual simulation. Data-dependency is therefore usually impractical, and we use app-dependency instead.
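The idea of app-dependency can be illustrated without parsl. The sketch below uses Python's concurrent.futures as a stand-in: the downstream step waits for the upstream step to finish and then discovers the generated files itself, instead of being handed a fixed list of file names. The functions simulate, postprocess and run_chain are hypothetical, and n_frames is passed in only to stand in for a convergence criterion that the real simulation would evaluate internally. In parsl itself, a comparable dependency is expressed by passing one app's future as an argument to the next app.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def simulate(out_dir: Path, n_frames: int) -> Path:
    """Stand-in for a simulation block that writes one file per frame."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(n_frames):
        (out_dir / f"frame_{i:04d}.txt").write_text(f"frame {i}\n")
    return out_dir


def postprocess(sim_dir: Path) -> int:
    """Downstream block: finds the frames on disk and returns how many there are."""
    return len(list(sim_dir.glob("frame_*.txt")))


def run_chain(root: Path, n_frames: int) -> int:
    """Run simulate, then postprocess, ordered by task completion only."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sim_future = pool.submit(simulate, root / "sim", n_frames)
        # App-dependency: wait for the upstream task as a whole,
        # not for individual output files.
        sim_dir = sim_future.result()
        post_future = pool.submit(postprocess, sim_dir)
        return post_future.result()
```

The downstream task never needs to know the number of frames ahead of time, which is the property that makes app-dependency a better fit here than data-dependency.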
Caching and Checkpointing:
In general, we have caching enabled. This means that when you start the pipeline multiple times, parsl will only re-run an app (i.e. a PipelineBlock’s run function) if the input (i.e. the DataSample) has changed. It is quite difficult to determine what exactly “changed” means in this context: some modifications to the DataSample (changing time stamps or logs, for example) should not be considered a “change” by the caching system. In practice, we use the DataSample.serialization_info() function to determine which parts of the DataSample are considered when deciding whether a change has occurred. If you modify the DataSample class, you will likely also need to modify this function. During development, it may be useful to disable caching via the --disable_caching command line flag.
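The role of serialization_info() can be sketched as follows. This is not the pipeline's DataSample class or parsl's internal hashing; ToyDataSample, its fields and cache_key are hypothetical and only show the principle that volatile fields are left out of whatever identifies the input.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ToyDataSample:
    """Hypothetical stand-in for DataSample, for illustration only."""
    name: str
    parameters: dict
    last_modified: float = 0.0                # volatile: must not affect caching
    log: list = field(default_factory=list)   # volatile: must not affect caching

    def serialization_info(self) -> dict:
        # Only the fields listed here count as "the input" for caching.
        return {"name": self.name, "parameters": self.parameters}


def cache_key(sample: ToyDataSample) -> str:
    """Derive a stable key from the serialization info (assumed JSON-serializable)."""
    payload = json.dumps(sample.serialization_info(), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two samples that differ only in last_modified or log produce the same key, while a change to parameters produces a different key. A field that is added to the class but not to serialization_info() would be silently ignored by such a scheme, which is why the two need to be kept in sync.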
Because we use app-dependency rather than data-dependency (see App-Dependency vs Data-Dependency), checkpointing does not know about every file that is generated. This means that if you delete an output file after a run, parsl will not notice and might not re-run the corresponding task. After deleting files, you may thus need to manually disable caching for the next run to re-generate them.
Troubleshooting:
It is very likely that your first parallel runs will fail, for example because the environment is not set up correctly on the remote nodes.
We’re still in the process of making this more transparent. For now, it is best to look closely at the logs that parsl writes to DATA_PATH/runinfo/RUN_ID and to use --run_sequential and --run_local for debugging.
Inside parsl’s log directory there are general logs, startup files that are used to configure the environment (make sure these make sense and, for example, load conda correctly or set your $PYTHONPATH correctly), and error logs that show what went wrong on the remote cluster. For further debugging, we point to the official parsl docs.
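When inspecting logs, it can help to locate the most recent run directory quickly. The sketch below is a small convenience helper, not part of the pipeline: it assumes only the DATA_PATH/runinfo/RUN_ID layout mentioned above, picks the most recently modified run directory, and lists every file beneath it. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def latest_run_dir(data_path: Path) -> Optional[Path]:
    """Return the most recently modified directory under data_path/runinfo, if any."""
    runinfo = data_path / "runinfo"
    if not runinfo.is_dir():
        return None
    run_dirs = [p for p in runinfo.iterdir() if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


def list_log_files(run_dir: Path) -> List[Path]:
    """Return all files below run_dir, sorted by path, for manual inspection."""
    return sorted(p for p in run_dir.rglob("*") if p.is_file())
```

A typical use would be to call latest_run_dir(Path(DATA_PATH)) and print the result of list_log_files on it, then open the startup files and error logs by hand.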