Docker for Bioinformatics

Portable, Scalable, and Reproducible Bioinformatics Workflows

Credit: Composite image by Abhilesh Dhawanjewar featuring models by Sean Kenney (Docker whale) and CaptainVerbalCelery (DNA helix, via LEGO IDEAS), with background from Adobe Stock.

Bioinformatics analysis often involves complex pipelines built from rapidly evolving software tools, each with its own set of dependencies. System incompatibilities, version mismatches, and dependency conflicts can be a nightmare, making running and sharing bioinformatics pipelines a challenging task. These challenges not only waste valuable research time but also contribute to irreproducible workflows, where results depend as much on the computing environment as on the analysis itself. Docker offers a powerful solution by packaging software and its dependencies into portable, reproducible containers, ensuring that your bioinformatics pipelines run consistently, whether on your local machine, an HPC cluster, or the cloud.

What is Docker?

Imagine you’re baking a cake, but every time you try, your kitchen is missing key ingredients or uses a different oven that bakes at the wrong temperature. Docker is like a self-contained baking kit that comes with all the right ingredients, tools, and even its own portable oven, ensuring your cake turns out exactly the same no matter where you bake it. In bioinformatics, Docker does the same for software by packaging tools, dependencies, and environments so that analyses run reliably across different computing platforms.

How can Docker help?

The FAIR (Findable, Accessible, Interoperable, and Reusable) principles provide guidelines for maximizing the value of research data. Docker aligns bioinformatics workflows with these principles: containerized tools are findable and accessible through public registries, interoperable across operating systems and platforms, and reusable because the software and its environment travel together.

Getting Started

To begin, install Docker on your system. Docker is available for all major operating systems, and the installers can be downloaded from the official website. For Windows and macOS users, the recommended approach is to install the Docker Desktop application, while Linux users can install Docker natively for a more lightweight setup. Docker Desktop creates a Linux virtual machine (VM) on Windows and macOS to run containers, whereas on a Linux machine Docker runs natively, without the need for a VM.

To test whether Docker is installed correctly, run the following command in your terminal:

# Check Docker version
docker --version

# Test with a simple hello-world container
docker run hello-world

If Docker is installed correctly, the first command will print the installed version, and the second will fetch the hello-world image and run it in a container, printing a message that confirms Docker is working.

Understanding Key Docker Concepts

A few terms come up repeatedly in what follows:

  • Image: A read-only template containing software, dependencies, and configuration (e.g., biocontainers/samtools).
  • Container: A running instance of an image; many containers can be created from a single image.
  • Volume: A mechanism for sharing directories between the host and a container, used here to move data in and out.
  • Registry: A repository of images, such as Docker Hub or Quay.io, from which images are pulled.
  • Dockerfile: A text file of instructions for building your own image (covered later in this post).

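You can inspect the images and containers on your own machine at any time with the standard Docker CLI:

# List images downloaded to your machine
docker images

# List running containers
docker ps

# List all containers, including stopped ones
docker ps -a
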
Running Bioinformatics Tools with Docker

We will use the popular tool samtools as an example to demonstrate how to run bioinformatics tools using Docker. samtools is a widely used tool for working with Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) files.

  1. Pull a Docker Image

    Here, we will pull a Docker image for samtools from Docker Hub.

    # Pull the samtools image from Docker Hub
    docker pull biocontainers/samtools:v1.9-4-deb_cv1
    

    This command will download the samtools image and its dependencies to your local machine. Note that many BioContainers images on Docker Hub do not provide a latest tag, so it is safest to specify a version tag explicitly. We can then use the image to create containers.
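
    To confirm the pull worked, we can list the image locally:

    # List locally available samtools images
    docker images biocontainers/samtools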

  2. Run a single command non-interactively

    We will use samtools to view the first few lines of a BAM file. Replace /data_dir with the path to the folder containing your BAM file (align.bam).

    # Run samtools view on a BAM file
    docker run --rm -v /data_dir:/data biocontainers/samtools:v1.9-4-deb_cv1 samtools view /data/align.bam | head
    
    • --rm: Removes the container after execution
    • -v /data_dir:/data: Mounts the local directory /data_dir to the container directory /data

    This command is useful for running single commands without needing an interactive shell. The --rm flag ensures that the container is removed after the command finishes.
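
    The same pattern works for any one-off command. For instance, here is a sketch that saves alignment statistics to a file on the host (assuming align.bam exists in /data_dir; flagstat is a standard samtools subcommand):

    # Compute alignment statistics; stdout is redirected to a host file
    docker run --rm -v /data_dir:/data biocontainers/samtools:v1.9-4-deb_cv1 \
      samtools flagstat /data/align.bam > flagstat.txt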

  3. Run a command interactively

    We can also run an interactive shell within the container to execute multiple commands.

    # Start an interactive shell in the samtools container
    docker run -it --name samtools_container -v /data_dir:/data biocontainers/samtools:v1.9-4-deb_cv1 /bin/bash
    
    • -it: Starts an interactive terminal session
    • /bin/bash: Launches the bash shell in the container

    Here, we also used the --name flag to give the container a name (samtools_container) for easy reference.

    We can now run multiple samtools commands within the container:

    # Check samtools version
    samtools --version
    
    # Index a reference genome
    samtools faidx /data/ref.fa
    

    We can add --rm to the interactive docker run command to remove the container after exiting the shell.
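
    Because the container is named, it can also be managed after exiting the shell:

    # Restart the stopped container and re-attach to its shell
    docker start -ai samtools_container

    # Remove the container when it is no longer needed
    docker rm samtools_container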

Composing Docker Workflows

Docker’s real power shines when we use it to compose complex workflows with multiple tools. By chaining together containers, we can create reproducible pipelines that can be easily shared and run on different systems.

Let’s consider the first step of most bioinformatics workflows: quality control of sequencing reads. This step is often performed using tools like fastqc (FastQC is a quality control tool that analyzes raw sequence data from high-throughput sequencing runs), with the results conveniently summarized using tools like multiqc (MultiQC aggregates results and quality metrics from multiple bioinformatics analysis reports, often including those from FastQC, into a single, interactive summary report, facilitating comparison across numerous samples or steps). Both tools can be run in Docker containers, allowing us to easily check the quality of our sequencing data.

To try out the pipeline with real data, we can use test FASTQ files from the nf-core/test-datasets repository that are ideal for quick pipeline tests. We can run the following commands one-by-one on the command line or save them in a bash script to download the test data to the ~/docker-bioinf/data/raw_data directory. You can replace this with any other directory of your choice.

# Create directory for test data
mkdir -p ~/docker-bioinf/data/raw_data
cd ~/docker-bioinf/data/raw_data

# Download test FASTQ files
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fastq/test_1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fastq/test_2.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fastq/test2_1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/illumina/fastq/test2_2.fastq.gz

Check the contents of the ~/docker-bioinf/data/raw_data directory to confirm that the files have been downloaded successfully.

# Check the contents of the directory
ls ~/docker-bioinf/data/raw_data

Bioinformatics Pipeline using bash scripts

We can chain together multiple Docker commands to construct a lightweight, portable pipeline for quality control of sequencing reads. The following bash script runs fastqc on all FASTQ files in the ~/docker-bioinf/data/raw_data directory and generates a summary report using multiqc. The script creates a new directory called qc_reports to store the output reports.

#!/bin/bash
# Create the output directory
mkdir -p ~/docker-bioinf/data/qc_reports

# Pull the FastQC and MultiQC containers
docker pull quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0
docker pull quay.io/biocontainers/multiqc:1.28--pyhdfd78af_0

# Run FastQC on all FASTQ files in the raw_data directory
docker run --rm -v ~/docker-bioinf/data:/data quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0 \
  bash -c 'fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports'

# Run MultiQC to aggregate FastQC reports
docker run --rm -v ~/docker-bioinf/data:/data quay.io/biocontainers/multiqc:1.28--pyhdfd78af_0 \
  multiqc /data/qc_reports -o /data/qc_reports

Note the use of bash -c when running the fastqc command, which ensures that the command is executed in a shell inside the container, thereby enabling the expansion of wildcards (e.g. *.fastq.gz).
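
To see why this matters, compare the invocation below, which omits the shell. This is a sketch of the failure mode rather than a command to run: the host shell finds no files matching /data/raw_data/*.fastq.gz (that path exists only inside the container), so fastqc receives the unexpanded glob and finds no input files.

# Without bash -c, the glob reaches fastqc as a literal string
docker run --rm -v ~/docker-bioinf/data:/data quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0 \
  fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports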

The fastqc and multiqc reports will be saved in the ~/docker-bioinf/data/qc_reports directory, and you can view them using any web browser. The fastqc reports will be in HTML format, while the multiqc report will be an interactive HTML file (multiqc_report.html) that aggregates the results from all the fastqc reports.

Bioinformatics Pipeline using docker compose

For more complex workflows, Docker Compose provides a convenient way to define and run multi-step, multi-container workflows with built-in dependency management. The images and commands for the bioinformatics tools are defined in a docker-compose.yml file, making the pipeline easier to manage and reproduce.

Let’s revisit the earlier bioinformatics task: running FastQC on raw FASTQ files and summarizing the results using MultiQC. Instead of invoking each tool manually with separate docker run commands, we can streamline the workflow using a docker-compose.yml file:

services:
  fastqc:
    image: quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0
    volumes:
      - ~/docker-bioinf/data:/data
    entrypoint: bash -c
    command: >
      "mkdir -p /data/qc_reports &&
      fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports"

  multiqc:
    image: quay.io/biocontainers/multiqc:1.28--pyhdfd78af_0
    volumes:
      - ~/docker-bioinf/data:/data
    command: multiqc /data/qc_reports -o /data/qc_reports
    depends_on:
      fastqc:
        condition: service_completed_successfully

This docker-compose.yml file defines the following key sections:

  • services: One entry per pipeline step, here fastqc and multiqc.
  • image: The versioned container image each service runs.
  • volumes: Bind-mounts the host data directory into each container at /data.
  • entrypoint / command: What each service executes; fastqc uses bash -c so the wildcard expands inside the container.
  • depends_on with condition: service_completed_successfully: Ensures multiqc starts only after fastqc finishes without errors.

To run the pipeline, execute the following command in the same directory as the docker-compose.yml file:

docker compose up

💡 Tip: If you’re re-running the pipeline and want a clean start, use docker compose down to remove the containers, or add the --force-recreate flag when running up (e.g. docker compose up --force-recreate).
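
A few other Compose subcommands are handy while iterating on the pipeline, all part of the standard Docker Compose CLI:

# Run a single service (and the services it depends on)
docker compose run --rm multiqc

# View the output of the most recent run
docker compose logs

# Stop and remove the pipeline's containers
docker compose down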

Beyond Pre-Built Images: Introducing Dockerfiles

While pre-built Docker images are incredibly helpful, they may not always meet your specific needs. Perhaps you want to install a specific version of a tool or dependency, package custom scripts alongside the tools, or a pre-built image simply isn’t available. In such cases, you can create your own Docker images using a Dockerfile, a text file that contains instructions for building a Docker image. It specifies the base image, the software to install, and any configuration needed to set up the environment.

We can create our own custom docker image for the same fastqc and multiqc pipeline using a Dockerfile. This provides us complete control to customize the environment, install additional dependencies, and package our scripts along with the tools.

# Use lightweight linux base
FROM debian:bullseye-slim

# Prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    openjdk-11-jdk \
    python3 \
    python3-pip \
    bash \
    wget \
    unzip \
    perl \
    libperl-dev && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install FastQC
RUN wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip && \
    unzip fastqc_v0.12.1.zip && \
    mv FastQC /opt/fastqc && \
    chmod +x /opt/fastqc/fastqc && \
    ln -s /opt/fastqc/fastqc /usr/local/bin/fastqc && \
    rm fastqc_v0.12.1.zip

# Install MultiQC
RUN pip3 install --no-cache-dir multiqc

# Set working directory
WORKDIR /data

# Make shell commands easier to write
ENTRYPOINT ["bash", "-c"]

The Dockerfile contains the instructions to build an environment with the tools and necessary dependencies for our analysis. The key sections are:

  • FROM: Starts from a lightweight Debian base image.
  • ENV: Disables interactive prompts during package installation.
  • RUN: Installs the system dependencies, FastQC, and MultiQC.
  • WORKDIR: Sets /data as the default working directory.
  • ENTRYPOINT: Sets bash -c as the entrypoint, so a command can be passed to the container as a single string.

A complete list of Dockerfile instructions can be found in the Dockerfile reference.

Next, we build the Docker image using the docker build command. The -t flag allows us to tag the image with a name (e.g., my_fastqc_multiqc).

# Build the Docker image
docker build -t my_fastqc_multiqc .
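
Before running the pipeline, it is worth a quick sanity check that the build succeeded and that both tools are on the image's PATH. Since the ENTRYPOINT is bash -c, the check is passed as a single quoted string:

# Confirm the image exists locally
docker images my_fastqc_multiqc

# Print the tool versions from inside the container
docker run --rm my_fastqc_multiqc 'fastqc --version && multiqc --version'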

Once the image is built, we can run the same pipeline as before, substituting our custom image name for the pre-built ones. Because the image sets bash -c as its ENTRYPOINT, each command is now passed as a single quoted string:

mkdir -p ~/docker-bioinf/data/qc_reports

# Run FastQC on all FASTQ files in the raw_data directory
docker run --rm -v ~/docker-bioinf/data:/data my_fastqc_multiqc \
  'mkdir -p /data/qc_reports && fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports'

# Run MultiQC to aggregate FastQC reports
docker run --rm -v ~/docker-bioinf/data:/data my_fastqc_multiqc \
  'multiqc /data/qc_reports -o /data/qc_reports'
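
Since both tools now live in the same image, the whole QC step can also be collapsed into a single container run, equivalent to the two commands above:

# Run FastQC and MultiQC back-to-back in one container
docker run --rm -v ~/docker-bioinf/data:/data my_fastqc_multiqc \
  'mkdir -p /data/qc_reports && fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports && multiqc /data/qc_reports -o /data/qc_reports'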

Best Practices for Docker in Bioinformatics

  1. Always Use Specific Image Versions:

    Use a versioned image tag (like quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0) instead of the latest tag. This ensures your workflow always uses the exact same version of the tool every time it’s run and avoids unexpected changes in behavior, guaranteeing reproducibility.

  2. Leverage Biocontainers:

    Before searching elsewhere or attempting to build an image, check repositories like Quay.io/biocontainers. Utilizing these standardized, pre-built images for common bioinformatics tools saves significant effort and aligns your workflow with community standards.

  3. Handle Data Appropriately:

    • Separate data from the container: Use volumes to mount data directories from the host system into the container, rather than copying data into the container’s filesystem. This keeps the container immutable and promotes reuse with different datasets.
    • Mount reference and other input data as read-only volumes to prevent accidental modifications.
    • Use named volumes to persist data between container runs and to share data between multiple containers. Both patterns appear in the sketch below.
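
    As a concrete example, here is a sketch combining a read-only bind mount with a named volume (qc_results is a hypothetical volume name; the same :ro suffix applies to reference data):

    # Create a named volume for the QC reports
    docker volume create qc_results

    # Mount the raw reads read-only (:ro) and write reports to the named volume
    docker run --rm \
      -v ~/docker-bioinf/data/raw_data:/data/raw_data:ro \
      -v qc_results:/data/qc_reports \
      quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0 \
      bash -c 'fastqc /data/raw_data/*.fastq.gz -o /data/qc_reports'

    # The named volume persists across runs and can be shared with other containers
    docker run --rm -v qc_results:/data/qc_reports \
      quay.io/biocontainers/multiqc:1.28--pyhdfd78af_0 \
      multiqc /data/qc_reports -o /data/qc_reports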

Finding Bioinformatics Tool Containers

These registries host a large number of pre-built Docker images for bioinformatics tools:

  • BioContainers (biocontainers.pro): Community-built images for thousands of bioinformatics tools, hosted on Quay.io and Docker Hub.
  • Quay.io: Home of the versioned biocontainers images used throughout this post.
  • Docker Hub: The default public registry, which also hosts the biocontainers organization.

Resources and Further Reading

Docker Essentials and Learning:

Guided Lessons:

Reproducibility in Research Practices:

Conclusion

Bioinformatics workflows often suffer from “dependency hell”, wherein conflicts between software libraries, incompatible versions, and platform-specific quirks can make setting up and running analyses a frustrating experience. Containerization technologies like Docker provide a powerful solution by encapsulating the software and its dependencies, along with any necessary configuration, into a single, portable package. This ensures that analyses run consistently across different environments, making workflows more reproducible, scalable, and shareable. Whether you’re running a single tool or a complex pipeline, Docker helps keep your research reliable and accessible.

Happy Dockering!
