Implementing Docker-in-Docker in Kubernetes

As part of my role, I build bioinformatics and data pipelines. I envisioned a data platform that integrates third-party open-source tools into data workflows and runs them on auto-scaling managed compute such as AWS EKS. This setup would let us pick tools à la carte.

In this post, I'll discuss the setup without going deep into how workflow orchestration works. Briefly, we have a workflow orchestration server that pulls from the code repositories defining each workflow. Workflows are triggered on a schedule, by a sensor, or manually through an API or a web GUI. Each job runs in a new Pod on the Kubernetes cluster, with the cluster adding a node when more capacity is needed, and that Pod holds the workflow code and the software packages it requires.

This approach is adequate for simple workflows. However, many bioinformatics pipelines rely on third-party tools that are distributed as Docker images, such as minimap2, FLASH merge, and the Oxford Nanopore tools. To run a specific command, the workflow has to start an additional container from a separate image. We could in theory build every tool and workflow into one comprehensive Docker image, but combining all the tools, workflow code, packages, and orchestration libraries in a single image is cumbersome and complicates repository management. I found it more streamlined for each workflow operation to start its own Docker container as needed.
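To make the idea of a workflow operation starting its own container concrete, here is a minimal Python sketch. It assumes the workflow code can shell out to the docker CLI; the function names and the image name example/minimap2:latest are hypothetical and not taken from our actual workflows.

```python
import shlex
import subprocess
from typing import Sequence


def docker_run_command(image: str, tool_args: Sequence[str]) -> list[str]:
    """Build the argv that runs one tool in a throwaway container."""
    # --rm removes the container once the tool exits.
    return ["docker", "run", "--rm", image, *tool_args]


def run_step(image: str, tool_args: Sequence[str]) -> str:
    """Run the tool and return its stdout.

    Raises subprocess.CalledProcessError if the container exits non-zero.
    """
    result = subprocess.run(
        docker_run_command(image, tool_args),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical image name; substitute whichever image packages the tool.
    cmd = docker_run_command("example/minimap2:latest", ["minimap2", "--version"])
    print(shlex.join(cmd))
```

Keeping the command builder separate from the subprocess call makes each step easy to inspect and unit test without a Docker daemon.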

As you might infer from the title, each workflow already runs inside a container in a Pod, so starting further containers from it means running Docker-in-Docker (DinD). There is a DinD Docker image (docker:dind) that provides this functionality. However, using it as our base image would mean rebuilding our workflow image on top of it with all our dependencies included.

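Instead, we opted for the Kubernetes sidecar pattern: alongside our primary container, we run a DinD sidecar container. The sidecar's Docker daemon listens on a TCP port on localhost, and the primary container reaches it through the DOCKER_HOST environment variable, because containers in the same Pod share a network namespace. The sidecar keeps /var/lib/docker on a volume, and both containers mount a working directory at /opt/worker. Together, this gives us Docker-in-Docker functionality within our Pods.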
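Since the primary container and the sidecar start at roughly the same time, the workflow code may need to wait until the daemon accepts connections. The following is a minimal sketch of such a wait, assuming DOCKER_HOST holds a tcp:// address; the helper names parse_docker_host and wait_for_daemon are hypothetical.

```python
import os
import socket
import time
from urllib.parse import urlparse


def parse_docker_host(value: str) -> tuple[str, int]:
    """Split a tcp://host:port DOCKER_HOST value into (host, port)."""
    parsed = urlparse(value)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {value!r}")
    return parsed.hostname, parsed.port


def wait_for_daemon(host: str, port: int, timeout: float = 60.0,
                    interval: float = 1.0) -> bool:
    """Poll until a TCP connection succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Daemon not listening yet; retry after a short pause.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    host, port = parse_docker_host(
        os.environ.get("DOCKER_HOST", "tcp://localhost:2375")
    )
    print("daemon reachable:", wait_for_daemon(host, port, timeout=5.0))
```

A successful TCP connection only shows that the port is open. A stricter check could run `docker info` and inspect its exit code.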

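apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 4
  # apps/v1 requires a selector matching the Pod template labels.
  # The label "app: worker" is illustrative.
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: launcher
          # Stand-in image; the real one needs the workflow code and a Docker client.
          image: "busybox"
          imagePullPolicy: Always
          env:
            # Containers in a Pod share localhost, so this reaches the sidecar.
            - name: DOCKER_HOST
              value: tcp://localhost:2375
          volumeMounts:
            - name: worker-storage
              mountPath: /opt/worker
              subPath: worker
        - name: dind
          image: "docker:dind"
          imagePullPolicy: Always
          # Loopback only and without TLS: reachable only from inside this Pod.
          command: ["dockerd", "--host", "tcp://127.0.0.1:2375"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: worker-storage
              mountPath: /var/lib/docker
              subPath: docker
            # Same volume and subPath as the launcher, so /opt/worker is shared.
            - name: worker-storage
              mountPath: /opt/worker
              subPath: worker
      volumes:
        - name: worker-storage
          emptyDir: {}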
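
The manifest is meant to expose the same working directory at /opt/worker in both containers. That matters because the Docker daemon resolves bind-mount source paths in its own filesystem (the dind container), not in the container that issued the docker command. The sketch below builds a docker run command that hands a job directory under /opt/worker to a tool container. The function name shared_run_command and the job directory layout are assumptions made for illustration.

```python
import shlex
from pathlib import PurePosixPath
from typing import Sequence

# Mount path used by both containers in the manifest.
SHARED_DIR = PurePosixPath("/opt/worker")


def shared_run_command(image: str, tool_args: Sequence[str],
                       job_dir: str) -> list[str]:
    """Build a docker run argv that bind-mounts a job directory.

    The directory is mounted at the same path inside the tool container,
    so file paths mean the same thing in the launcher and in the tool.
    """
    rel = PurePosixPath(job_dir)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"job_dir must stay inside {SHARED_DIR}: {job_dir!r}")
    job_path = SHARED_DIR / rel
    return [
        "docker", "run", "--rm",
        "-v", f"{job_path}:{job_path}",  # source path is resolved by the daemon
        "-w", str(job_path),             # run the tool inside the job directory
        image, *tool_args,
    ]


if __name__ == "__main__":
    # Hypothetical image and file names.
    cmd = shared_run_command(
        "example/minimap2:latest",
        ["minimap2", "-a", "ref.fa", "reads.fq"],
        "job-1",
    )
    print(shlex.join(cmd))
```

Files the launcher writes under /opt/worker/job-1 before the call are visible to the tool, and the tool's outputs in that directory are visible to the launcher afterwards, provided both containers really do share the volume.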