Last week I posed this question on Twitter:
Docker experts. How do you deal with non-os deps e.g. pip/npm/bower and avoiding cache-busting the dep install layer on dep changes? #docker
— Stuart Colville (@muffinresearch) January 30, 2015
Unfortunately Twitter isn't the best medium to provide the necessary context so here's a bit more info on what I meant.
One of the problems we've seen using Docker to build out a development environment is that it has required a lot of working around the environment's limitations. Dependencies aren't the only problem, but they've proven to be something of a pain.
Dep changes require the dependency installation to be run from scratch.
The FFOS marketplace as an app comprises lots of services that are loosely coupled. This means we can run separate containers and use links to tie things together. We also have a lot of modular code and libs that are not OS packages.
We specify our non-OS packaged deps using requirements.txt files for Python and package.json for Node.js modules. We then add those files via the Dockerfile using an ADD instruction, e.g.:
ADD requirements /pip/requirements
Next the deps are installed:
RUN pip install -r /pip/requirements/dev.txt
Now, due to how Docker's build cache works, the ADD (and the RUN that follows it) is cached, so if the requirements files stay the same, re-building that part is a no-op. Great!
But when we do bump a package version or add a new dependency, the cache is invalidated and the entire installation of deps is re-run from scratch.
As we have quite a large collection of deps, this can a) take a long time and b) break if the package repository is busy/down/having a bad day.
We'd like a new dep or a dep change to not require all the network access and time it takes to re-install all the dependencies from scratch.
After all, this is a development environment, and having to build the world after some dep changes/version bumps gets in the way.
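To picture why a single version bump re-runs the whole install, here's a toy model of the build cache in Python. It's a deliberate simplification and not Docker's actual implementation: each step's key depends on the previous step's key, the instruction text and, for ADD, the contents of the added files. The base image and the pinned packages are made up for illustration.

```python
import hashlib


def layer_keys(steps, file_contents):
    """Return one cache key per build step (simplified model).

    A key is derived from the parent key, the instruction text and,
    for ADD steps, the contents of the source being added.
    """
    keys = []
    parent = ""
    for step in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(step.encode())
        if step.startswith("ADD "):
            src = step.split()[1]
            h.update(file_contents[src].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


STEPS = [
    "FROM python:2.7",  # hypothetical base image
    "ADD requirements /pip/requirements",
    "RUN pip install -r /pip/requirements/dev.txt",
]

# Hypothetical requirements contents before and after a version bump.
before = layer_keys(STEPS, {"requirements": "Django==1.7\nrequests==2.5.1\n"})
after = layer_keys(STEPS, {"requirements": "Django==1.7\nrequests==2.5.2\n"})

# True means the step could be served from cache.
reused = [b == a for b, a in zip(before, after)]
print(reused)  # [True, False, False]
```

Only one pin changed, but because the RUN step's key chains off the ADD step's key, the whole pip install is a cache miss.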
I've been thinking about some alternatives but nothing yet has seemed particularly attractive.
Proxying npm/pypi etc
One possibility would be to use a proxy to cache the package downloads to minimise the network overhead. This would help, but there are a few nits. It's another moving part. You still need to pull the packages and install them. So this would alleviate the network part of the problem, but that's really only half the problem.
Caching the deps on the filesystem
I looked to see if caching the deps would make a difference. Unfortunately persistent storage via volumes isn't any use at build time. As volumes are only added at run-time you can't utilise the cache when the deps installation is re-run.
Externalizing the deps
One more recent thought I had was whether we could move the non-OS packaged deps out of the build step and into run-time. This way we could use a data-only container for the deps, and if the deps change you'll only pull in what's changed.
The downside is that this feels like it's starting to defeat the point of a container-based approach.
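To make "only pull in what's changed" a bit more concrete, here's a rough Python sketch of the delta a run-time install would have to deal with when the deps live somewhere persistent. It only understands simple name==version pins (an assumption on my part; real requirements files can contain much more), and the function names and package pins are hypothetical.

```python
def parse_pins(text):
    """Parse 'name==version' lines, skipping blanks and comments.

    A simplification: anything that isn't a plain pin is ignored.
    """
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(installed_text, wanted_text):
    """Return the pins that are new or whose version differs."""
    installed = parse_pins(installed_text)
    wanted = parse_pins(wanted_text)
    return {n: v for n, v in wanted.items() if installed.get(n) != v}


# Hypothetical requirements before and after a change.
old = "Django==1.7\nrequests==2.5.1\n"
new = "Django==1.7\nrequests==2.5.2\nsix==1.9.0\n"

print(changed_pins(old, new))  # {'requests': '2.5.2', 'six': '1.9.0'}
```

With a build-time install, all three packages get reinstalled; with the deps kept outside the image, only the two changed pins need any work.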
Andy McKay has been trying out using automated hub builds as the trees are updated. This works quite well, but there are still a few nits. Bumps in deps require downloading new layers to images. Larger files instead of lots of packages is a step in the right direction, but it still feels like it gets in the way. We also need a good way to keep builds aligned with the src revision, otherwise you might have the wrong deps for the branch's src revision.
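One idea for keeping images and source aligned, which we haven't tried, is to derive an image tag from the contents of the dependency files, so that a checkout can work out which pre-built image matches it. A minimal sketch in Python, where the "deps-" prefix, the function name and the file paths are all made up for illustration:

```python
import hashlib


def deps_tag(requirement_files):
    """Derive a short, stable tag from dependency file names and contents.

    requirement_files maps a file path to that file's text.
    """
    h = hashlib.sha256()
    for name in sorted(requirement_files):
        h.update(name.encode())
        h.update(b"\0")
        h.update(requirement_files[name].encode())
        h.update(b"\0")
    return "deps-" + h.hexdigest()[:12]


# Hypothetical file contents on two branches.
master = {"requirements/dev.txt": "Django==1.7\n"}
feature = {"requirements/dev.txt": "Django==1.8\n"}

# Same deps give the same tag; changed deps give a different one.
print(deps_tag(master) == deps_tag(dict(master)))
print(deps_tag(master) == deps_tag(feature))
```

Branches that share the same requirements would resolve to the same tag, and a branch that bumps a dep would ask for a different image instead of silently running with the wrong deps.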
What's everyone else doing?
I'm sure we're not the only people trying to do this. How do you overcome these issues in your use of Docker? Or do you just accept the hit in terms of builds taking a long time? Have you found some clever way to manage deps so devs can get up to date quickly without getting buried in a ton of docker-shaped yaks?
Could Docker solve this?
Maybe there could be a way to allow certain paths to not be cache-busted? E.g. could we have a way to allow the pip cache to remain intact so that it can be used when the deps installation is run when the requirements.txt changes?