CernVM File System (cern.ch)
147 points by carapace on Sept 9, 2022 | hide | past | favorite | 38 comments


Digging through their repos is interesting. They also proposed a docker driver that lazily downloads image bits from CernVM-FS as the files in the overlay are accessed [0]; they claim a significant drop in process start-up time.

[0] https://indico.cern.ch/event/567550/contributions/2627182/at...


Yeah I work on that.

The startup time is clearly faster since we don't download the whole image up front (especially compared to downloading all the layers and then starting the docker image).

The main "trick" is that docker images usually include a lot of files that are not actually accessed during standard operation, so pulling them is not needed most of the time.
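The lazy-pull idea can be sketched in a few lines: fetch a file over the network only the first time it is actually read, then serve later reads from a local cache. (All names, paths, and the fetch stub here are illustrative, not the real driver's code.)

```python
import hashlib
import os
import tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "lazy-image-cache")

def fetch_remote(path):
    # Stand-in for an HTTP GET against the image repository.
    return f"contents of {path}".encode()

def read_file(path):
    """Fetch on first access; serve every later read from the local cache."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, hashlib.sha1(path.encode()).hexdigest())
    if not os.path.exists(cached):          # first access: download
        with open(cached, "wb") as f:
            f.write(fetch_remote(path))
    with open(cached, "rb") as f:           # later accesses: cache hit
        return f.read()
```

Files that are never read are never downloaded, which is where the startup-time win comes from.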


Yep, I figured. I suppose your images are made, for the most part, of large datasets to crunch, plus a smallish part with the R/Python/whatever code to execute.


The data is not part of the images. It's only the software. In the vast majority of cases, any particular data processing job requires only a tiny fraction of the available software. For instance, a few hundred MB out of a few tens of GB for a typical LHC application software release.


In practice that sounds like an excellent optimization, but in theory it annoys me that we're doing that rather than figuring out how to build better binaries.


I work on a platform that handles fleets of edge devices running a linux-based OS, where applications are distributed as container images. Nvidia in particular are rather awful to support, as any users with their hardware inevitably build 10+ GB images, largely composed of libraries and samples they'll never use. Plenty of other users are unaware that they can improve the speed and reliability of their deployments by trimming the fat from their images.

A lot of work goes into properly handling and optimizing the download and distribution of excessively large application images, often on slow and unreliable networks, when smaller is always faster and more reliable.


I'd love that for rescue media: just load what you need and mirror the rest of the image to RAM in the background.


AppFS is similar and I already have a Docker container called "rkeene/appfs" on DockerHub.


We developed something similar in-house. For most images it's a notable startup speedup.


Mind sharing what your in-house solution is? I have been working on something similar with extracted layers on AFS and using Podman’s additional layer store.


A FUSE filesystem that can mount remote HTTP storage and cache requests locally?

I've been using this like 10 years ago to do just that: http://lftpfs.sourceforge.net/


Yes, I was wondering what the benefit of the CernVM filesystem is vs. other HTTP-based FUSE mounts (HTTP, Swift/S3)?


* the file catalog and associated metadata come from SQLite files, so listing directories and stat-ing files is fast

* data is chunked and de-duplicated

* catalogues are signed so you can use untrusted HTTP proxies and still ensure integrity
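The chunking and de-duplication points can be sketched roughly like this: content is split into chunks, and each chunk is stored under its own content hash, so identical chunks across files or revisions are stored only once. (Fixed-size chunks and these names are illustrative; the real CVMFS object store is more sophisticated.)

```python
import hashlib

CHUNK_SIZE = 4          # toy size; real chunking is content-defined/larger

store = {}              # content hash -> chunk bytes

def put(data):
    """Split data into chunks; store each chunk under its content hash."""
    hashes = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha1(chunk).hexdigest()
        store[h] = chunk            # identical chunks stored only once
        hashes.append(h)
    return hashes

def get(hashes):
    """Reassemble a file from its ordered list of chunk hashes."""
    return b"".join(store[h] for h in hashes)
```

Two files sharing a chunk cost only one stored copy of that chunk, and a new revision only adds the chunks that actually changed.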


A key difference is that the file system contents are preprocessed into content-addressable storage (somewhat similar to the format of a .git folder). There are also a number of features and optimizations to make it work well as a shared software area, which is characterized by many small, often memory-mapped files and a metadata-heavy access pattern.


At first I was wondering how this was different from something like Ceph or GlusterFS.

Looks like this isn't meant to be used on a LAN to squeeze out every bit of performance, but rather to work across hosts on different local networks over the whole internet (even passing through NATs) in a somewhat distributed fashion. Cool!


Exactly. This is the software delivery mechanism for the LHC experiments. It’s a read only globally distributed (sort of) POSIX filesystem.

We use it in ATLAS to deploy both tagged releases and nightly builds. These can be about 100GB in size each.

The releases are used for production jobs across the worldwide grid, while the nightlies are used as development targets for individual developers.


Sounds like a read-only version of AFS [1], which had a global namespace and was used for software distribution as well as data sharing/collaboration. If you narrow the use case to software distribution, I suppose read only is a reasonable trade off if it enables higher performance.

[1] https://en.m.wikipedia.org/wiki/Andrew_File_System


The biggest thing is that it’s read-only[1] and aims for high performance massively distributed reads (millions of clients).

[1] Though for publishing files you can have many concurrent transactions open from different machines, so long as they lock unique paths.


Erm, it's not even a read-write filesystem.

It appears to simply be a read-only POSIX-fs interface to Merkle-encapsulated bits distributed via http.

Ceph and Gluster are completely different animals.
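A minimal sketch of what "Merkle-encapsulated" means: each directory's identifier is a hash over its entries' hashes, so a single root hash pins the entire tree while every object remains independently fetchable and verifiable. (Illustrative only.)

```python
import hashlib

def merkle_hash(node):
    """node is bytes (a file) or a dict name -> node (a directory)."""
    if isinstance(node, bytes):
        return hashlib.sha256(node).hexdigest()
    # A directory hashes the sorted list of (name, child hash) entries.
    entries = "".join(f"{name}:{merkle_hash(child)};"
                      for name, child in sorted(node.items()))
    return hashlib.sha256(entries.encode()).hexdigest()
```

Any change to any file anywhere in the tree changes the root hash, which is what lets a client trust an entire snapshot from a single verified value.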


It sounds more like IPFS


Many years ago, CERN had a (very heavily used) IBM mainframe called CERNVM, so named because it ran VM/CMS.

Is CernVM’s similar naming homage or coincidence?


The name is a play on the old CERNVM mainframe.

The project started around 2007/2008 and produced a virtual machine image to run LHC experiment applications in the cloud. In order to keep image sizes manageable, we looked into network file systems that could distribute the application software (including AFS/Coda, HTTP-FUSE [1], Igor-FS [2], GROW-FS [3]). None of them ticked all the boxes, so we developed CernVM-FS (CVMFS), which then took on a life beyond CernVM.

[1] https://kernel.org/doc/ols/2006/ols2006v2-pages-387-400.pdf [2] https://indico.cern.ch/event/28823/contributions/658268/atta... [3] https://iopscience.iop.org/article/10.1088/1742-6596/219/6/0...


On the cloud?!? So what happened to grid computing? :)

By the way, is the AFS from HLT TDAQ still used anywhere?


Interesting concept, although in this day and age, at least for software distribution, something more robust against nefarious actors, like TUF [1], would be my own preference.

Obviously I realise CernVM is not the same thing per se, and that CernVM is more focused on the storage/distribution side rather than the security side. I suppose it's understandable, as they have their own distributed internal network and so perhaps don't have quite the same extent of trust concerns as others might.

[1] https://theupdateframework.io/


No, CVMFS is secure. Clients have to somehow be provided with a repo's key, but every chunk is signed by that key and validated by the client, so intermediate repos (stratum 1) are not a security concern, for instance.

CVMFS specifically exists for the purpose of delivering data securely to isolated VMs, that is, without assuming any site infrastructure such as shared filesystems in a cluster. Since it tries to minimize the need for hosting-site cooperation, it also normally uses HTTP (not HTTPS, etc.).
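The trust chain described above can be sketched as follows: the client is provisioned with one repository key out of band, the catalog (path -> content hash) is signed with that key, and every chunk must match the hash recorded in the catalog, so untrusted HTTP proxies and mirrors cannot tamper with content. (HMAC stands in here for the real public-key signature; all names are illustrative.)

```python
import hashlib
import hmac
import json

REPO_KEY = b"provisioned-out-of-band"   # the one secret the client must trust

def sign_catalog(catalog):
    """Serialize the path->hash catalog and sign it with the repo key."""
    blob = json.dumps(catalog, sort_keys=True).encode()
    return blob, hmac.new(REPO_KEY, blob, hashlib.sha256).hexdigest()

def verify_chunk(blob, signature, path, data):
    # 1. Check the catalog signature against the trusted repo key.
    expected = hmac.new(REPO_KEY, blob, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    # 2. Check the chunk against the hash recorded in the catalog.
    catalog = json.loads(blob)
    return hashlib.sha256(data).hexdigest() == catalog.get(path)
```

Because the only trust anchor is the key, any host along the way can serve the bits without being trusted.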


Has anyone outside of CERN used it on a large scale successfully?


Compute Canada (now the Digital Research Alliance of Canada, http://alliancecan.ca/) has been using CVMFS in production for nearly 7 years now. We use it to distribute the entire software environment used by our researchers on all of our national clusters, as well as to make it available to researchers themselves on university-owned computers and on virtual machines or virtual clusters in the cloud. Documentation on using it on any Linux computer is here: https://docs.alliancecan.ca/wiki/Accessing_CVMFS.

CVMFS has been essential in making a uniform user-facing software environment available throughout our distributed infrastructure (we have 5 different clusters located across the country, totaling over 250k cores, close to 2k GPUs, and something like 200PB of disk storage).

It is a proven, geo-distributed, redundant, and high-performance (through local caching) filesystem which enables an install-once-available-everywhere strategy. Our software stack currently hosts over 1150 different scientific software packages in over 2800 versions, built and optimized for multiple CPU architectures, totaling nearly 10k different builds. This is in addition to the nearly 10k Python packages which we make available in the form of Python wheels.

Our software stack now is close to 8TB in size, and contains tens of millions of files.


It’s very widely used outside of CERN in research/academic contexts. There are definitely commercial users (though I know less about their status).

Next week there is a workshop with talks from some of the users: https://indico.cern.ch/event/1079490/timetable/#20220912.det...


Cool! Do you know, will there be a recording?


Project owner here. Most likely yes, pending speakers' consent.


Thank you!


The Neurodesk Project[1] uses CVMFS to fetch Singularity container images dynamically on-the-fly from a central repository.

[1] https://neurodesk.org


Yeah, I saw a (mind-blowing) talk on Neurodesk (and Brainhack Cloud) last night (NeuroTechX Hacknight at Noisebridge in SF https://www.meetup.com/neurotechx-sf/events/288358663/ ) and that's what led me to CVMFS. Neurodesk is nuts (in a good way!). It successfully abstracts away so much of the gnarly detail. You need different versions of that obscure program? No problem, have them all, run them concurrently, whatever.

Even if you're not a researcher it's worth checking out for the technical accomplishments.


The design reminds me of Plan 9's Fossil file system, just HTTP (fitting for a project from Cern, I guess) instead of 9P.


It's more generic 9p-ish in that it is used to distribute binaries. So you'd use /cvmfs/bin/tar, for example, with support for your specific archive format.


Just from the list of distros and versions, it looks like this is nothing new. Someone really needs to update that landing page.


How/when do changes propagate to clients? Do clients cache the contents they read forever?


Clients fetch content from stratum 1 servers as needed. Multiple layers of cache are possible/recommended, including:

* a local Squid proxy (in the case of large infrastructure)

* the client's local disk

* a networked/parallel filesystem

Cache invalidation happens through checksums and catalogs: each revision has one or more nested catalogs, content is checksummed, and clients can tell whether one or more catalogs are outdated by comparing the checksums they have against the upstream checksums.
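That comparison step can be sketched as a simple diff of catalog checksums: a catalog is refetched only when the checksum the client holds differs from the upstream one. (Illustrative only, not the real client code.)

```python
def stale_catalogs(local, upstream):
    """Return catalog paths whose checksum changed or that are new upstream.

    local and upstream map a catalog path to its checksum.
    """
    return [path for path, checksum in upstream.items()
            if local.get(path) != checksum]
```

Everything whose checksum is unchanged stays served from cache indefinitely, which is why the thundering-herd cost of a new revision is limited to the catalogs and chunks that actually changed.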



