Docker for Bioinformatics

Portable, Scalable and Reproducible Bioinformatics workflows

Bioinformatics analysis often involves complex pipelines built from rapidly evolving software tools, each with its own set of dependencies. System incompatibilities, version mismatches and dependency conflicts make running and sharing bioinformatics pipelines a challenging task. These problems not only waste valuable research time but also contribute to irreproducible workflows, where results depend as much on the computing environment as on the analysis itself. Docker offers a powerful solution by packaging software and its dependencies into portable, reproducible containers, so that your bioinformatics pipelines run consistently, whether on your local machine, an HPC cluster, or the cloud.

What is Docker?

Imagine you’re baking a cake, but every time you try, your kitchen is missing key ingredients or uses a different oven that bakes at the wrong temperature. Docker is like a self-contained baking kit that comes with all the right ingredients, tools, and even its own portable oven, ensuring your cake turns out exactly the same no matter where you bake it. In bioinformatics, Docker does the same for software by packaging tools, dependencies, and environments so that analyses run reliably across different computing platforms.

How can Docker help?

The FAIR (Findable, Accessible, Interoperable, and Reusable) principles provide guidelines for maximizing the value of research data. Docker aligns bioinformatics workflows with these principles by ensuring that software and environments are portable and reproducible.

Getting Started

Start by installing Docker on your system. Docker is available for all major operating systems, and the installers can be downloaded from the official website. For Windows and macOS users, the recommended approach is to install the Docker Desktop application, which creates a Linux virtual machine (VM) to run containers. Linux users can install Docker natively, without the need for a VM, for a more lightweight setup.

To test whether Docker is installed correctly, run the following commands in your terminal:

# Check Docker version
docker --version

# Test with a simple hello-world container
docker run hello-world
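
The two checks above can also be wrapped in a small function for use at the top of a pipeline script. This is only a sketch: check_docker is a hypothetical helper name, and it assumes that a failing `docker info` means the daemon is not reachable (for example, Docker Desktop is not running).

```shell
# Report whether the Docker CLI is installed and the daemon is reachable.
# check_docker is a hypothetical helper name, not part of Docker itself.
check_docker() {
    # `command -v` succeeds only if a docker command is found
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker CLI not found on PATH"
        return 1
    fi
    # `docker info` queries the daemon, so it fails when the daemon is not reachable
    if ! docker info >/dev/null 2>&1; then
        echo "docker CLI found, but the daemon is not reachable"
        return 2
    fi
    echo "docker is installed and the daemon is running"
}
```

Calling check_docker before any docker run command lets a script stop with a clear message instead of failing halfway through an analysis.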

There are other ways to run containers, and you can experiment with these as you get more comfortable with Docker.

Running Bioinformatics Tools with Docker

We will use the popular tool samtools as an example to demonstrate how to run bioinformatics tools using Docker. samtools is a widely used tool for working with Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) files.

  1. Pull a Docker Image

    Here, we will pull a Docker image for samtools from Docker Hub.

    # Pull the samtools image from Docker Hub
    docker pull biocontainers/samtools
    

    This command will download the samtools image and its dependencies to your local machine. We can then use the image to create containers.

  2. Run a single command non-interactively

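    We will use samtools to print the first few alignment records of a BAM file. Replace /data_dir with the path to the folder containing your BAM file (align.bam). Note that Docker options such as -v must come before the image name; anything after the image name is passed to the container as the command.

    # Run samtools view on a BAM file
    docker run --rm -v /data_dir:/data biocontainers/samtools samtools view /data/align.bam | head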
    
    • --rm: Removes the container after execution
    • -v /data_dir:/data: Mounts the local directory /data_dir to the container directory /data

    This command is useful for running single commands without needing an interactive shell. The --rm flag ensures that the container is removed after the command finishes.

  3. Run a command interactively

    We can also run an interactive shell within the container to execute multiple commands.

    # Start an interactive shell in the samtools container
    docker run -it --name samtools_container -v /data_dir:/data biocontainers/samtools /bin/bash
    
    • -it: Starts an interactive terminal session
    • /bin/bash: Launches the bash shell in the container

    Here, we also used the --name flag to give the container a name (samtools_container) for easy reference.

    We can now run multiple samtools commands within the container:

    # Check samtools version
    samtools --version
    
    # Index a reference genome
    samtools faidx /data/ref.fa
    

    We can add --rm to the interactive docker run command to remove the container after exiting the shell.
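
The two ways of running a container shown in the steps above can be captured in a pair of small shell functions. This is a minimal sketch rather than a recommended wrapper: run_in_container, open_shell and the DRY_RUN variable are hypothetical names introduced for illustration. The first function always places the Docker options before the image name. The second covers a case that is easy to trip over: because a named container created without --rm still exists after you exit, a second docker run with the same --name fails with a name conflict, so the sketch restarts the existing container instead.

```shell
# Run one command in a throwaway container, mounting host_dir at /data.
# Usage: run_in_container <host_dir> <image> <command...>
run_in_container() {
    host_dir=$1
    image=$2
    shift 2
    # Options (--rm, -v) go before the image; everything after it is the command
    set -- docker run --rm -v "${host_dir}:/data" "${image}" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it (hypothetical debugging switch)
        echo "$*"
    else
        "$@"
    fi
}

# Open an interactive shell in a named container, reusing it if it already exists.
# Usage: open_shell <container_name> <host_dir> <image>
open_shell() {
    name=$1
    host_dir=$2
    image=$3
    if docker container inspect "$name" >/dev/null 2>&1; then
        # The container exists: restart it and reattach (its original mounts are kept)
        docker start -ai "$name"
    else
        docker run -it --name "$name" -v "${host_dir}:/data" "$image" /bin/bash
    fi
}
```

For example, `run_in_container /data_dir biocontainers/samtools samtools view /data/align.bam | head` is equivalent to the command in step 2, and `docker rm samtools_container` removes the named container once it is no longer needed.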

Finding Bioinformatics Tool Containers

Public container registries such as Docker Hub, used in the examples above, host a large number of pre-built Docker images for bioinformatics tools.
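
When pulling from a registry, it helps reproducibility to name an explicit version tag rather than relying on the default latest tag, which can change over time and which some repositories do not publish at all. The sketch below enforces that habit. pull_pinned is a hypothetical helper name, and the tag in the usage note is an assumption: check the image's registry page for the tags that actually exist.

```shell
# Pull an image only when an explicit, non-"latest" tag is given.
# Usage: pull_pinned <image> <tag>
pull_pinned() {
    image=$1
    tag=${2:-}
    if [ -z "$tag" ] || [ "$tag" = "latest" ]; then
        echo "refusing to pull ${image} without an explicit version tag" >&2
        return 1
    fi
    # image:tag identifies one specific published version of the image
    docker pull "${image}:${tag}"
}
```

A call might look like `pull_pinned biocontainers/samtools v1.9-4-deb_cv1`, where the tag is only an example. Recording the tag alongside your results makes it possible to rerun the analysis later with the same tool version.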

Composing Docker Workflows

Docker’s real power shines when we use it to compose complex workflows with multiple tools. By chaining together containers, we can create reproducible pipelines that can be easily shared and run on different systems.

Bioinformatics Pipeline using bash scripts

Here’s an example of a pipeline that aligns reads to a reference genome using bwa and processes the output using samtools. These commands can be saved in a bash script for easy execution.

# Pull the bwa and samtools images
docker pull biocontainers/bwa
docker pull biocontainers/samtools

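# Run the bwa aligner (assumes ref.fa has already been indexed with `bwa index`).
# The redirect (>) is performed by the host shell, so the output path is a host path.
docker run --rm -v /data_dir:/data biocontainers/bwa bwa mem /data/ref.fa /data/reads.fq > /data_dir/align.sam

# Run samtools to convert the SAM file to BAM
docker run --rm -v /data_dir:/data biocontainers/samtools samtools view -bS /data/align.sam > /data_dir/align.bam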
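
Saved as a script, the same two steps benefit from a little defensive structure. The following is a sketch under stated assumptions: run_pipeline is a hypothetical function name, the data directory is passed as an absolute path and holds ref.fa (already indexed with bwa index) and reads.fq, and each step stops the pipeline if it fails so that samtools never runs on a partial SAM file.

```shell
# Align reads with bwa and convert the result to BAM.
# Usage: run_pipeline <absolute_path_to_data_dir>
run_pipeline() {
    data_dir=$1
    # Fail early if an input is missing, before any container starts
    for f in ref.fa reads.fq; do
        if [ ! -f "${data_dir}/${f}" ]; then
            echo "missing input: ${data_dir}/${f}" >&2
            return 1
        fi
    done
    # Align reads; the redirect runs on the host, so it targets a host path
    docker run --rm -v "${data_dir}:/data" biocontainers/bwa \
        bwa mem /data/ref.fa /data/reads.fq > "${data_dir}/align.sam" || return 1
    # Convert SAM to BAM only if the alignment step succeeded
    docker run --rm -v "${data_dir}:/data" biocontainers/samtools \
        samtools view -b /data/align.sam > "${data_dir}/align.bam" || return 1
}
```

Passing the directory as an argument, as in `run_pipeline /data_dir`, keeps the script free of hard-coded paths, which makes it easier to share with collaborators.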

Bioinformatics Pipeline using docker compose

For more complex workflows, Docker Compose provides a convenient way to define and run multi-container steps with built-in dependency management. The images and commands for the bioinformatics tools can be defined in a docker-compose.yml file, making the workflow easier to manage and reproduce.

Consider the same bioinformatics task from the bash script example above: aligning sequencing reads and converting the resulting alignment to BAM format. The following docker-compose.yml file captures this workflow:

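services:
  bwa:
    image: biocontainers/bwa
    volumes:
      - /data_dir:/data
    # Compose does not run commands through a shell, so wrap the redirect in sh -c
    command: sh -c "bwa mem /data/ref.fa /data/reads.fq > /data/align.sam"
  samtools:
    image: biocontainers/samtools
    volumes:
      - /data_dir:/data
    command: sh -c "samtools view -bS /data/align.sam > /data/align.bam"
    depends_on:
      bwa:
        # Wait for bwa to finish successfully, not merely to start
        condition: service_completed_successfully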

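This docker-compose.yml file defines the following key sections:

• services: The containers that make up the pipeline, here bwa and samtools
• image: The Docker image that each service runs
• volumes: Mounts the local directory /data_dir to the container directory /data
• command: The command executed inside the container
• depends_on: Declares that the samtools service depends on the bwa service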

To run the pipeline, execute the following command in the same directory as the docker-compose.yml file:

# Compose V2 syntax; older standalone installations use: docker-compose up
docker compose up
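
A pipeline run usually involves a little more than a single up command: checking the file first and cleaning up afterwards. The sketch below strings these together, assuming Compose V2 (the `docker compose` subcommand) and a docker-compose.yml in the given directory. run_compose_pipeline is a hypothetical helper name.

```shell
# Validate, run, and clean up a Compose-defined pipeline.
# Usage: run_compose_pipeline <directory_containing_docker-compose.yml>
run_compose_pipeline() {
    project_dir=$1
    (
        # Work in a subshell so the caller's working directory is unchanged
        cd "$project_dir" || exit 1
        # Validate docker-compose.yml without starting any containers
        docker compose config --quiet || exit 1
        status=0
        docker compose up || status=$?
        # Remove the stopped containers and the default network
        docker compose down
        # Report the exit status of `docker compose up` to the caller
        exit "$status"
    )
}
```

Because the output files are written to the mounted /data_dir directory, they remain on the host after docker compose down removes the containers.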

Other Resources

If you’d like to read more about reproducible research practices, check out the following resources:

Docker is a game-changer for bioinformatics, making workflows more reproducible, scalable, and shareable. Whether you’re running a single tool or a complex pipeline, Docker ensures that your research remains reliable and accessible across different environments.

Happy Dockering!