Toward Efficient Large-Scale Data Processing using Computational Workflow Systems
John Darrell Van Horn, Ph.D.


Numerous software packages now exist for the analysis of human neuroimaging data.  Though many are meant to be used as self-contained, end-to-end analysis solutions, many such tools have aspects which a researcher would either like to harness (where the software is particularly accurate and robust) or avoid (where the software performs inefficiently or is known to be error-prone).  Instead, users may wish to break the boundaries imposed by monolithic software solutions and create heterogeneous data processing workflows that take advantage of the best aspects of different tools.  To do so, user-friendly software environments are needed to take the processing steps provided by one tool and link their outputs as the inputs to another, different step.  The result would be an efficient, multi-tool workflow, capable of processing large quantities of neuroimaging data on cloud or cluster computing resources.  The LONI Pipeline, now in its sixth revision, offers this flexible approach and has been employed in large-scale analyses contributing to hundreds of peer-reviewed research articles.  It allows users to rapidly create data processing workflows via a streamlined graphical interface in which software library “modules” may be linked into data processing streams.  Processing jobs are managed in parallel on multi-CPU compute systems.  Here, I will provide the background and rationale behind the development of the LONI Pipeline, showcase some of its core functionality, and illustrate results from several key neuroimaging studies.  Such a system is ideally suited to individual laboratories as well as multi-center collaborations, and to any project confronted by the need for large-scale data processing.
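The core idea above — wrapping heterogeneous tools as "modules" whose outputs feed the inputs of downstream steps, with independent steps run in parallel — can be sketched in a few lines of Python. This is a minimal illustration of the general workflow pattern, not the LONI Pipeline's actual API; the `Module` class, `run_workflow` function, and the placeholder tool names are all hypothetical.

```python
# Hypothetical sketch of a module-based workflow: each Module wraps one
# processing step; edges from outputs to inputs form a dependency graph,
# and independent modules run in parallel. Not the LONI Pipeline API.
from concurrent.futures import ThreadPoolExecutor

class Module:
    """A processing step: a function plus the modules it depends on."""
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, list(inputs)

def run_workflow(modules, workers=4):
    """Execute modules in dependency order, parallelizing independent ones."""
    results, pending = {}, list(modules)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            # Modules whose inputs are all available can run now, concurrently.
            ready = [m for m in pending
                     if all(d.name in results for d in m.inputs)]
            futures = {m: pool.submit(m.func,
                                      *(results[d.name] for d in m.inputs))
                       for m in ready}
            for m, fut in futures.items():
                results[m.name] = fut.result()
            pending = [m for m in pending if m not in ready]
    return results

# Usage: chain steps from different (pretend) tools; the strings stand
# in for image files that a real workflow would pass between programs.
strip   = Module("strip",   lambda: "brain.nii")
segment = Module("segment", lambda img: f"seg({img})", inputs=[strip])
stats   = Module("stats",   lambda seg: f"stats({seg})", inputs=[segment])
out = run_workflow([strip, segment, stats])
print(out["stats"])  # stats(seg(brain.nii))
```

In a real workflow system the functions would invoke command-line neuroimaging tools and the scheduler would dispatch jobs to a cluster or cloud backend rather than local threads, but the dependency-driven execution loop is the same.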