The Python Platform Team at Wayfair maintains a collection of base Docker images used by all deployed Python applications. Maintaining these base images provides us with the ability to:

  • Standardize our environment (all of our images are based on CentOS)
  • Simplify deployment for application developers (images ship pre-configured to connect to our internal PyPI and RPM mirrors)
  • Improve security (we take steps to harden images that will be deployed to production)

We recently refactored our images to make them more efficient and were able to reduce their size by over 50%. These improvements are not just developer-friendly optimizations: in addition to decreasing image build time and storage requirements, many of them make images more secure, since reducing image complexity reduces the overall attack surface. Read on to see how we achieved this result.

Optimization Tools

We use and recommend Dive to profile images. It provides a graphical interface for quickly exploring image layers and calculates an “efficiency score” that highlights images with wasted space.
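A typical invocation looks like the following (the second form runs Dive from its published container image, so no local install is required; the image tag analyzed here is just an example):

```shell
# Analyze a local image with the dive CLI
dive centos:centos7

# Alternatively, run Dive from its own container by mounting the
# Docker socket
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive:latest centos:centos7
```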

Below is an example Dive report for the image centos:centos7, which is similar to the foundational CentOS image used at Wayfair.

In comparison, here is the Dive output for one of our un-optimized base images running Python 3.8 on CentOS 7.5: a whopping 931MB (for comparison, the total size of the open-source python:3.8.5 image is 882MB).

The layer inherited from the CentOS base image is highlighted in yellow, while the largest layers introduced in the Python image are highlighted in red. Dive shows the command that generated the selected layer (highlighted in white).

To optimize our images, we stepped through each of the largest layers to identify those that could be eliminated or trimmed. Below is the Dive report for the optimized version of the image.

Following optimization, the final image is 414MB, a reduction of over 50%! We achieved this reduction using the following strategies.

Optimization Strategies

When installing packages, clean up in the same layer

The largest layer in the un-optimized image above was 324MB and resulted from the command yum -y install libcurl. The layer is so large because yum commands generate a cache, introducing significant cruft.

Therefore, when running yum install, prefer installing all packages in a single command and always delete the cache in the same step.

FROM centos:centos7

# Avoid this
RUN yum install -y foo
RUN yum install -y bar

# Better, but still bad
RUN yum install -y \
    foo \
    bar

# Prefer this
RUN yum install -y \
    foo \
    bar \
    && yum clean all \
    && rm -rf /var/cache/yum

After we updated the command to yum -y install libcurl && yum clean all && rm -rf /var/cache/yum, the layer shrank from 324MB to 23MB, a 93% reduction.

The same principle applies to package managers for other languages; most are optimized for local development over size-conscious Docker images. Always attempt to install all packages in a single step and disable the package manager’s cache. For example:
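With pip, the --no-cache-dir flag disables its download cache so fetched wheels are not baked into the layer (the requirements file name here is whatever your project uses):

```dockerfile
# Avoid this: pip caches downloaded wheels inside the layer
RUN pip install -r requirements.txt

# Prefer this: skip the cache entirely
RUN pip install --no-cache-dir -r requirements.txt
```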

Use multistage builds when building or compiling code

We compile the Python distribution included in our base images from source, which requires a number of build dependencies. In total, these dependencies are over 200MB and are only used during compilation. Multistage builds prevent these dependencies from shipping in the final image, reducing image size and shrinking the attack surface introduced by shipping the GNU toolchain and other binaries in deployed images.

Instead of this:

FROM python-base-image:3.8.5

# Install some system dependencies necessary for building python packages
RUN yum install -y \
   gcc \
   && yum clean all \
   && rm -rf /var/cache/yum

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY application-code/ .

ENTRYPOINT [ "python" ]
CMD [ "script.py" ]

Prefer this:

FROM python-base-image:3.8.5 AS builder

# Install some system dependencies necessary for building python packages
RUN yum install -y \
   gcc \
   && yum clean all \
   && rm -rf /var/cache/yum

# Create a virtualenv to keep dependencies together
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install dependencies into virtual environment
COPY requirements.txt .
RUN pip install -r requirements.txt


FROM python-base-image:3.8.5

# Copy over only virtualenv
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY application-code/ .

ENTRYPOINT [ "python" ]
CMD [ "script.py" ]

Using a virtualenv in a Docker image as in the example above may seem counterintuitive, given that containers are already isolated, but it makes copying all Python dependencies between build stages easy. Be careful with this approach when pip installing packages with dynamically linked dependencies (e.g. where a package depends on a C library that is not bundled with the package itself), as those libraries must also be installed in the final image, not just in the intermediate build stage.
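As an illustrative sketch (psycopg2 and its libpq dependency are assumptions for the example, not packages from our actual images), the final stage would install the runtime shared library alongside the copied virtualenv:

```dockerfile
FROM python-base-image:3.8.5

# psycopg2 links against libpq at runtime, so the shared library must
# exist in the final image even though gcc and the -devel headers do not
RUN yum install -y postgresql-libs \
    && yum clean all \
    && rm -rf /var/cache/yum

COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```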

Avoid RUN chmod and RUN chown steps

When a command that changes file permissions or ownership runs in its own layer, Docker stores a full copy of every affected file or directory in that layer. This can result in significant bloat. Avoid it by chowning in the same step that copies content into the image:

FROM centos:centos7

# Don't do this
COPY ./ /app
RUN chown -R 1001:100 /app

# Prefer this
COPY --chown=1001:100 ./ /app

Docker has recently implemented a --chmod option for COPY, but it is only available when the BuildKit backend is enabled. When using the legacy build backend, either set the permission mode before the file is copied into the image or use a multistage build in which chmod commands run only in the intermediate image.
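With BuildKit enabled (e.g. DOCKER_BUILDKIT=1), the permission fix collapses into the copy step itself; the mode and ownership values below are illustrative:

```dockerfile
# syntax=docker/dockerfile:1
FROM centos:centos7

# --chmod on COPY requires the BuildKit backend
COPY --chown=1001:100 --chmod=755 ./ /app
```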

We shaved 185MB from our images by removing standalone chmod steps.

Think critically before copying application data into images

Many Dockerfiles we encountered include a COPY ./ ./ step that copies all application data into the image as one of the final steps. While convenient, this can pull unnecessary cruft into production images (test suites, documentation, etc.).

Consider including a .dockerignore file in the project to purposefully exclude unnecessary files from being included in the final image.
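A minimal .dockerignore might look like this (the entries are illustrative; tailor them to your project):

```
.git
tests/
docs/
*.md
__pycache__/
*.pyc
```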

Conclusion

If you found this interesting, you may enjoy reading about Wayfair’s innovative approach to building a Python platform team, or listening to this Podcast.__init__ episode about Python at Wayfair. If you’d like to learn more, please reach out! We are currently hiring and are always looking for senior Pythonistas to join our team.