Virtual Research Environment (VRE)
PhenoMeNal provides Virtual Research Environments (VREs) for interoperable and scalable metabolomics analysis. End-users, such as researchers and research teams, educators, SMEs, and other users, can create on demand, through a simple user interface, an environment of tools, services, and data supporting their research needs. The hardware setup and software deployment required to operate these facilities are handled transparently by the VRE, so users can focus on the analysis rather than the technicalities (see Figure).
Figure: Responsibilities when carrying out contemporary metabolomics data analysis.
(Left:) Today’s situation: scientists are responsible for everything, including the computer hardware, installing all necessary software, and carrying out the actual analysis. All execution is limited by the resources of a single computer.
(Right:) The PhenoMeNal approach: Software tools are available as containers without the need for installations, with data in agreed-upon interoperable file formats. The VRE can be started on single computers or on cloud resources, and the scientists benefit from only needing to deal with the analysis as the technical implementations are handled by the VRE.
The PhenoMeNal VRE portal provides a site that enables users to interact with the components of PhenoMeNal to deploy their own VRE (see figure below).
Figure: Main components of PhenoMeNal
Compute Infrastructure: creation of Virtual Machine Images (VMIs) for deploying infrastructure on cloud providers; the essential building blocks on top of which tools run.
Containers: all the required tools, plus documentation that allows external tool makers to containerize their own tools independently. Containerization is a requirement for PhenoMeNal to be able to deploy a piece of software on top of the infrastructure.
Data: prepackaged in agreed formats to be usable within PhenoMeNal, prepared by software that is containerized as well, allowing users to prepare their own data for use within PhenoMeNal.
PhenoMeNal Architecture
The PhenoMeNal VRE consists of the following main components:
- Software tools which are standardised and wrapped as software containers
- Standardised and interoperable data formats
- VRE contextualisation scripts to launch it on Infrastructure-as-a-Service (IaaS) resources from public providers such as Google Cloud Platform and Amazon Web Services, on private OpenStack installations, or on standalone computers
PhenoMeNal implements a microservices architecture, where data analysis consists of connecting tools together to form an analysis pipeline. Data formats are agreed upon and follow open standards such as mzML (Mass Spectrometry Markup Language), nmrML (NMR Markup Language) and ISA-Tab, simplifying the communication and data exchange between tools. These tools are available as containers that can be deployed without manual installation and dependency management. In an elastic IT environment, these containers scale out to run analyses in parallel on multiple compute nodes. All technical details are transparent to the metabolomics researcher.
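The pipeline idea above can be sketched as a chain of containerized tools whose agreed-upon formats must line up. This is a minimal illustrative sketch, not PhenoMeNal's actual API: the step structure, the `validate_pipeline` helper and the consumed/produced formats attached to each tool are assumptions for demonstration; only the container names and the mzML/nmrML/ISA-Tab formats come from the text.

```python
# Hypothetical sketch of a PhenoMeNal-style pipeline: each containerized tool
# declares which agreed-upon format it consumes and which it produces, so
# adjacent tools can be checked for compatibility before the pipeline runs.
from dataclasses import dataclass

@dataclass
class Step:
    container: str   # container image providing the tool
    consumes: str    # input format (e.g. mzML, nmrML, ISA-Tab)
    produces: str    # output format handed to the next tool

def validate_pipeline(steps):
    """Check that each tool's output format matches the next tool's input."""
    for prev, nxt in zip(steps, steps[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(f"{prev.container} emits {prev.produces}, "
                             f"but {nxt.container} expects {nxt.consumes}")
    return True

# Illustrative chain; the format annotations are made up for the example.
pipeline = [
    Step("pwiz", consumes="RAW", produces="mzML"),
    Step("ipo", consumes="mzML", produces="mzML"),
    Step("multivariate", consumes="mzML", produces="ISA-Tab"),
]
print(validate_pipeline(pipeline))  # → True
```

Because every tool speaks a standard format, steps can be added, removed or swapped without the neighbouring tools needing to know anything about each other's internals.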
The stack-based diagram below explains the PhenoMeNal architecture and chosen implementations:
Figure: The PhenoMeNal architecture (right) with selected implementations, depicted as a stack diagram and aligned to general microservice-based architectures (left).
On the lowest level is the actual hardware: a local computer, or a virtual cluster running in a cloud miles away. The user employs provisioning software to prepare and equip the virtual cluster with the necessary software layers. This starts with a system kernel, which controls the very basics of the computer system. The kernel is the intermediary between the (possibly virtual) hardware and the OS; it deals with resource management, load balancing, runtime scheduling and more. Every single node runs its own kernel and OS, with a cluster OS layered on top as an abstraction layer, making it appear as if all the nodes were part of one big computer. Combining the fundamental functions provided by the kernel with a cluster OS of choice results in a virtual cluster with combined resources and the ability to split workloads between nodes as if they were all part of the same physical machine. The operating system then takes over and handles most of the communication.
With the operating system in place, the desired services can be installed. To mount and run containers holding microservices, a container engine is needed. Its main function is to support the launching, scaling, management and termination of its containers. It is through the container engine's API that all container orchestration software operates.
Containers are pieces of a program running within a closed virtual environment that contains only the files the program needs to function. This makes a container entirely independent of the surrounding software environment, which is advantageous because it can be moved to and run on any operating system that has the required container engine. In this use case, where microservices are wrapped in software containers, this means they are easy to add, remove and rearrange to build the desired workflow.
The microservices running within these containers are all independent functions, usually taken from existing software packages. Containerizing these functions brings several benefits, quick launch being one of the most important. This enables fast and simple scaling as required: additional virtual nodes can be added to the virtual cluster, provisioned with all the software needed and then supplied with the necessary containers. In a fraction of the time it would take to build, configure and install additional physical machines, a virtual cluster can accommodate heavier workloads.
The PhenoMeNal stack
PhenoMeNal is built to run on private machines as well as with any Infrastructure-as-a-Service provider. It uses the MANTL suite of tools for most of its functions, with Terraform as the infrastructure builder of choice, giving the user simple script-based control over the launch and management of their infrastructure. Ansible is used as provisioning software, installing the kernel, operating system and engines. Mesos and/or Kubernetes gather the cluster of nodes into a single workspace and function as its kernel and OS. Docker is the container environment of choice, and Ansible supplies its engine along with the dependencies. The functions of Kubernetes and Mesos overlap, but within PhenoMeNal the main function of Kubernetes is container orchestration. The desired analysis functions are downloaded as small independent Docker containers and mounted through Kubernetes' orchestration tools. The figure below shows an overview of the interacting components inside a running VRE.
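The layering described above can be summarized as a simple mapping from architectural layer to the implementation the project chose. This is purely illustrative data restating the paragraph; the layer names are informal labels, not official PhenoMeNal terminology.

```python
# Informal summary of the PhenoMeNal stack described in the text:
# each architectural layer and the implementation chosen for it.
phenomenal_stack = {
    "infrastructure builder": "Terraform (part of the MANTL suite)",
    "provisioning software": "Ansible",
    "cluster OS / kernel": "Mesos and/or Kubernetes",
    "container engine": "Docker",
    "container orchestration": "Kubernetes",
}

for layer, implementation in phenomenal_stack.items():
    print(f"{layer:24s} -> {implementation}")
```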
Figure: Overview of the interacting components inside a running PhenoMeNal VRE deployed using Mesos. The control nodes are redundant services enabling the functions of the systems in a fault-tolerant way. The Edge nodes manage the network connection with the user. PhenoMeNal currently uses Jupyter and Galaxy as graphical web front-ends, with traffic routed through these Edge nodes. Workflow engines such as Galaxy manage dependency graphs and communicate with Kubernetes, which handles the orchestration of containers using the Docker engine.
Workflows
PhenoMeNal uses workflows as integrators of other VMIs and containers, via a graphical workflow designer (Galaxy) and a textual one (Jupyter).
Galaxy is a workflow environment tool that allows researchers to concatenate common bioinformatics tools to create pipelines or workflows. It uses the original code and binaries of bioinformatics tools (developed elsewhere) and provides tool wrappers for them so that Galaxy's user interface and API can interact with those tools. In contrast to a classical Galaxy installation, where most tools are executed serially on the same machine where Galaxy is running, PhenoMeNal enables scalable analysis on multiple compute nodes using microservices by connecting Galaxy to Kubernetes.
Figure: The flow implemented for deploying the Galaxy runtime into a Kubernetes (k8s) container orchestration (CO) system.
Initially, (1) the user requests Galaxy (through its UI or API) to run a job with certain data. (2) Definitions added to our Galaxy instance allow the implemented k8s runner for Galaxy to map the tool required in the job to a container. All this information is passed by the k8s runner for Galaxy to the master node of the CO in the form of a k8s Job API object, using the pykube Python library to communicate. (3) The master node allocates the k8s Job to a node, according to availability of resources. (4) The node, using the Job definition, requests the required container image (if not locally available) from the PhenoMeNal docker registry. (5) The node, with the container obtained, runs the k8s Job, while the k8s runner for Galaxy periodically queries the master about the status of the job. (6) Once requests to the k8s master's REST API endpoint show that the job is done, the runner signals Galaxy, which collects the results through the shared filesystem and exposes them to the user.
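Step (2) above can be illustrated by the kind of Kubernetes `batch/v1` Job object the runner submits. This is a hedged sketch, not the actual k8s runner code: the helper function, its parameters, the job name and the command are made-up examples; only the Job API object, the registry hostname and the `lcmsmatching` container name come from this document.

```python
# Illustrative sketch: mapping a Galaxy tool to a Kubernetes batch/v1 Job
# manifest, similar in shape to what the k8s runner for Galaxy submits via
# pykube. Function name and arguments are hypothetical.
def galaxy_tool_to_k8s_job(tool_id, image, command, job_name):
    """Build a Kubernetes Job manifest that runs a containerized tool once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "metadata": {"labels": {"galaxy-tool": tool_id}},
                "spec": {
                    "containers": [{
                        "name": tool_id,
                        # images are pulled from the PhenoMeNal registry (step 4)
                        "image": ("docker-registry.phenomenal-h2020.eu/"
                                  f"phnmnl/{image}"),
                        "command": command,
                    }],
                    # a Job runs the tool to completion rather than restarting it
                    "restartPolicy": "Never",
                },
            },
        },
    }

job = galaxy_tool_to_k8s_job(
    tool_id="lcmsmatching",
    image="lcmsmatching",
    command=["lcmsmatching", "--help"],  # example invocation only
    job_name="galaxy-job-0001",
)
print(job["spec"]["template"]["spec"]["containers"][0]["image"])
```

The master node only ever sees this declarative object; where the Job lands and when the image is pulled are decided by Kubernetes, which is what keeps the runner itself simple.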
Jupyter is a system to combine text (including e.g. mathematical equations) and code in an easy-to-read document that renders in a web browser. Within PhenoMeNal, we use Jupyter as one of the ways of consuming the microservices developed within the consortium. When launching the VRE, users can open Jupyter and then either invoke services directly in an interactive fashion, or schedule long-running jobs using a workflow system of their own.
Continuous Integration System
PhenoMeNal hosts a Jenkins continuous integration system that serves as an integration point where source code is collected, tools are built, containers are assembled, tests can be run to ensure correctness and interoperability, and where results can be pushed to public or private registries.
Docker Registry
PhenoMeNal hosts a docker registry to make containers publicly available for the research community. Currently we have 29 containers hosted on our docker registry, listed in the table below:
| batman | galaxy-k8s-runtime | nmrglue |
| bioc_devel_base | ipo | nmrmlconv |
| bioc_devel_core | isajson-validator | nmrpro |
| biosigner | isatab-validator | pwiz |
| ex-bfr | isatab2json | rstudio |
| ex-blankfilter | iso2flux | rtest |
| ex-cv | json2isatab | univariate |
| ex-featureselection | lcmsmatching | |
| ex-log2transformation | metfrag-cli | |
| ex-merger | midcor | |
| ex-splitter | multivariate | |
Table: List of docker containers in the PhenoMeNal docker registry
Any of these containers can be retrieved from any docker installation through the command:
docker pull docker-registry.phenomenal-h2020.eu/phnmnl/<container>
Public Galaxy instance
The PhenoMeNal Public Galaxy VRE runs on top of a Kubernetes cluster. The pre-provisioned PhenoMeNal Galaxy docker image runs inside a Kubernetes Replication Controller/Pod and communicates through the Kubernetes service account with the master nodes to submit jobs to the cluster. This docker image contains all the tools that have been dockerized, “galaxified” and tested with sample datasets to check that they work adequately (currently manually, in the future via automatic integration tests in the PhenoMeNal continuous integration system). Within this public instance, we provide shared workflows and data sets within Galaxy that any user can try.
