Digging through their repos is interesting. They also proposed a Docker driver that lazily downloads image bits from CernVM-FS as the files in the overlay are accessed [0]; they claim a significant drop in process startup time.
The startup time is clearly faster since we don't download the whole image up front (especially compared to downloading all the layers before starting the container).
The main "trick" is that Docker images usually include a lot of files that are never actually accessed during standard operation, so pulling them is unnecessary most of the time.
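The lazy-fetch idea can be sketched with a toy, purely illustrative model (the class and the fake remote store below are hypothetical, not the actual driver's API): files are looked up in a catalog of content hashes and their bytes are transferred only on first access, so the unused bulk of an image never costs any bandwidth.

```python
import hashlib

class LazyImage:
    """Hypothetical sketch of lazy, content-addressed image access."""

    def __init__(self, catalog, fetch_chunk):
        # catalog: path -> content hash; fetch_chunk: hash -> bytes (e.g. an HTTP GET)
        self.catalog = catalog
        self.fetch_chunk = fetch_chunk
        self.cache = {}    # locally cached chunks
        self.fetched = 0   # bytes actually transferred so far

    def read(self, path):
        digest = self.catalog[path]
        if digest not in self.cache:
            data = self.fetch_chunk(digest)
            # Content addressing makes every fetch self-verifying.
            assert hashlib.sha1(data).hexdigest() == digest
            self.cache[digest] = data
            self.fetched += len(data)
        return self.cache[digest]

# Fake "remote" store standing in for a registry or stratum 1 server.
store = {}
def put(data):
    h = hashlib.sha1(data).hexdigest()
    store[h] = data
    return h

catalog = {"/bin/app": put(b"tiny executable"),
           "/usr/share/samples": put(b"x" * 10_000)}  # present but never read

img = LazyImage(catalog, store.__getitem__)
img.read("/bin/app")
print(img.fetched)  # → 15: only the accessed file's bytes were transferred
```

The never-read `/usr/share/samples` blob contributes nothing to the transfer, which is exactly why startup gets faster for images full of unused files.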
Yep, I figured as much. I suppose your images are made up of large datasets to crunch, for the most part, plus a smallish part with the R/Python/whatever code to execute.
The data is not part of the images. It's only the software. In the vast majority of cases, any particular data processing job requires only a tiny fraction of the available software. For instance, a few hundred MB out of a few tens of GB for a typical LHC application software release.
In practice that sounds like an excellent optimization, but in theory it annoys me that we're doing that rather than figuring out how to build better binaries.
I work on a platform that handles fleets of edge devices running a linux-based OS, where applications are distributed as container images. Nvidia in particular are rather awful to support, as any users with their hardware inevitably build 10+ GB images, largely composed of libraries and samples they'll never use. Plenty of other users are unaware that they can improve the speed and reliability of their deployments by trimming the fat from their images.
A lot of work goes into properly handling and optimizing the download and distribution of excessively large application images, often on slow and unreliable networks, when smaller is always faster and more reliable.
Mind sharing what your in-house solution is? I have been working on something similar with extracted layers on AFS and using Podman’s additional layer store.
A key difference is that the file system contents are preprocessed into content-addressable storage (somewhat similar to the format in a .git folder). There are also a number of features and optimizations that make it work well as a shared software area, which is characterized by many small, often memory-mapped files and a metadata-heavy access pattern.
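For readers unfamiliar with the .git analogy, here is a minimal sketch of content-addressable storage (the function names are mine, not CVMFS's): each blob lives under a path derived from the hash of its contents, so identical files dedupe automatically and any corruption is detectable by rehashing.

```python
import hashlib
import os
import tempfile

def store_blob(root, data):
    """Store bytes under a path derived from their SHA-1, git-style."""
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, digest[:2], digest[2:])  # e.g. objects/ab/cdef...
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return digest

def load_blob(root, digest):
    """Read a blob back and verify it still matches its address."""
    path = os.path.join(root, digest[:2], digest[2:])
    with open(path, "rb") as f:
        data = f.read()
    assert hashlib.sha1(data).hexdigest() == digest  # self-verifying store
    return data

root = tempfile.mkdtemp()
h1 = store_blob(root, b"libphysics.so contents")
h2 = store_blob(root, b"libphysics.so contents")  # same content, same address
assert h1 == h2                                   # deduplicated for free
assert load_blob(root, h1) == b"libphysics.so contents"
```

Deduplication like this matters a lot for software areas where many releases share most of their files.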
At first I was wondering how this was different from something like Ceph or GlusterFS.
Looks like this isn't meant to be used on a LAN to squeeze out every bit of performance, but rather to work across hosts on different local networks over the whole internet (even passing through NATs) in a somewhat distributed fashion. Cool!
Sounds like a read-only version of AFS [1], which had a global namespace and was used for software distribution as well as data sharing/collaboration. If you narrow the use case to software distribution, I suppose read only is a reasonable trade off if it enables higher performance.
The project started around 2007/2008 and produced a virtual machine image to run LHC experiment applications in the cloud. In order to keep image sizes manageable, we looked into network file systems that could distribute the application software (including AFS, Coda, HTTP-FUSE [1], Igor-FS [2], GROW-FS [3]). None of them ticked all the boxes, so we developed CernVM-FS (CVMFS), which then developed a life beyond CernVM.
Interesting concept, although in this day and age, at least for software distribution, something more robust against nefarious actors, like TUF [1], would be my own preference.
Obviously I realise CernVM is not the same thing per se, and that CernVM is more focused on delivering the storage/distribution side rather than the security side. I suppose it's understandable, as they have their own distributed internal network and so perhaps don't have quite the same extent of trust concerns as others might.
no, CVMFS is secure. clients have to somehow be provided a repo's key, but content is validated by the client against catalogs signed with that key. so intermediate repos (stratum 1 mirrors) are not security concerns, for instance.
CVMFS specifically exists for the purpose of delivering data securely to isolated VMs - that is, without assuming any site infrastructure like shared filesystems in a cluster. since it tries to minimize the need for hosting-site cooperation, it also normally uses http (not https, etc).
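That trust model - untrusted transport (plain HTTP, untrusted mirrors) with end-to-end validation on the client - can be sketched roughly as follows. Real CVMFS uses X.509/RSA signatures over catalogs of content hashes; in this toy sketch an HMAC with a shared key stands in for the signature purely to stay within the standard library, and all function names are hypothetical.

```python
import hashlib
import hmac

REPO_KEY = b"repo-signing-key"  # stand-in for the repo's real key pair

def sign_catalog(chunks):
    """Publisher side: hash every chunk, then 'sign' the catalog of hashes."""
    catalog = {name: hashlib.sha256(data).hexdigest()
               for name, data in chunks.items()}
    blob = "\n".join(f"{n} {h}" for n, h in sorted(catalog.items())).encode()
    signature = hmac.new(REPO_KEY, blob, hashlib.sha256).hexdigest()
    return catalog, signature

def verify_chunk(catalog, signature, name, data):
    """Client side: check the catalog signature, then the chunk's hash."""
    blob = "\n".join(f"{n} {h}" for n, h in sorted(catalog.items())).encode()
    expected = hmac.new(REPO_KEY, blob, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False  # catalog was tampered with in transit
    return catalog.get(name) == hashlib.sha256(data).hexdigest()

catalog, sig = sign_catalog({"/sw/app": b"trusted bits"})
assert verify_chunk(catalog, sig, "/sw/app", b"trusted bits")
# A mirror can serve the bytes but cannot substitute its own content:
assert not verify_chunk(catalog, sig, "/sw/app", b"evil bits")
```

Since the client verifies everything end to end, the transport and the mirrors add no trust requirements, which is why plain HTTP is acceptable.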
Compute Canada (now the Digital Research Alliance of Canada, http://alliancecan.ca/) has been using CVMFS in production for nearly 7 years now. We use it to distribute all of the software environment used by our researchers on all of our national clusters, as well as to make it available for researchers themselves and university owned computers, and virtual machines or virtual clusters in the cloud. Documentation on using it on any Linux computer here: https://docs.alliancecan.ca/wiki/Accessing_CVMFS.
CVMFS has been essential in making a uniform user-facing software environment available throughout our distributed infrastructure (we have 5 different clusters located across the country, totaling over 250k cores, close to 2k GPUs, and something like 200PB of disk storage).
It is a proven, geo-distributed, redundant, and high-performance (through local caching) filesystem which enables an install-once-run-everywhere strategy. Our software stack currently hosts over 1,150 different scientific software packages in over 2,800 versions, built and optimized for multiple CPU architectures, totaling nearly 10k different builds. This is in addition to the nearly 10k Python packages which we make available in the form of Python wheels.
Our software stack now is close to 8TB in size, and contains tens of millions of files.
Yeah I saw a (mind-blowing) talk on Neurodesk (and Brainhack Cloud) last night (NeuroTechX Hacknight at Noisebridge in SF https://www.meetup.com/neurotechx-sf/events/288358663/ ) and that's what led me to CVMFS. Neurodesk is nuts (in a good way!) It successfully abstracts away so much of the gnarly details. You need different versions of that obscure program? No problem, have them all, run them concurrently, whatever.
Even if you're not a researcher it's worth checking out for the technical accomplishments.
It's more generic and 9p-ish in that it is used to distribute binaries. So you'd use /cvmfs/bin/tar, for example, with support for your specific archive format.
Clients fetch content from stratum 1 servers as needed. Multiple layers of cache are possible/recommended, including
1) a local SQUID proxy (in the case of large infrastructure)
2) the client's local disk
3) a networked/parallel filesystem
Cache invalidation happens through checksums and catalogs: each revision has one or more nested catalogs, content is checksummed, and clients can tell which catalogs are outdated by comparing the checksums they hold against the upstream ones.
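The layered lookup described above can be sketched like this (the class is a hypothetical illustration, not the client's actual code): each cache layer is tried in order before falling back to a stratum 1, and because chunks are addressed by content hash a cached chunk can never be stale - only the catalog mapping paths to hashes needs a revision check.

```python
import hashlib

class LayeredCache:
    """Toy model of the cache hierarchy: local disk, site proxy, origin."""

    def __init__(self, layers, origin):
        self.layers = layers  # ordered caches, e.g. [local_disk, squid, cluster_fs]
        self.origin = origin  # stratum 1 server (here just a dict)

    def get(self, digest):
        for layer in self.layers:
            if digest in layer:            # hit in a nearby cache
                return layer[digest]
        data = self.origin[digest]         # miss everywhere: go to stratum 1
        for layer in self.layers:          # populate caches on the way back
            layer[digest] = data
        return data

def digest(data):
    return hashlib.sha256(data).hexdigest()

origin = {digest(b"v2 of tool"): b"v2 of tool"}
local, squid = {}, {}
cache = LayeredCache([local, squid], origin)

# A new repo revision only changes the catalog entry; the new content hash
# misses in every layer and is fetched exactly once from the origin.
h = digest(b"v2 of tool")
assert cache.get(h) == b"v2 of tool"
assert h in local and h in squid  # subsequent reads never leave the site
```

Immutable, content-addressed chunks are what make this hierarchy safe: there is no cache coherence problem to solve, only catalog freshness.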
[0] https://indico.cern.ch/event/567550/contributions/2627182/at...