A Brief History of Xvc

Posted on 2024-01-22 :: Tags: xvc, history, dvc

In the first months of 2021, I decided to return to life after a long legal battle for divorce. Covid was still raging. I wasn’t keen to start a company or work in my country due to my half-deaf ears. I decided to find some open source projects and contribute, maybe get recognition, maybe get hired.

In those days I saw an ad on Stack Overflow Jobs about employment by contributing to open source projects. I applied. A few weeks later, the CTO of [iterative.ai] got in touch and I started working on DVC documentation, first on a per-hour basis and, after May 2021, as a full-time employee.

Initially, I liked the tool we were building very much. The team was awesome. (They still are.) It was one of the best periods of my life, especially given the pressures of my turbulent, still-ongoing divorce. I know I will always miss them.

My job was learning DVC, documenting it, and helping newcomers grasp it easily. It was a fun job. Until then, I didn’t see myself as a technical writer. English is not my native tongue and I have never lived in an English-speaking country. Nevertheless, I think I wasn’t too bad at it.

When I was first learning the tool, I began to use it everywhere. I was an avid user of Git Annex once. DVC looked better. I don’t remember why I lost interest in Git Annex after many years, but it was probably related to symbolic links not working on Windows (or on Termux). DVC had multiple ways of connecting the cache and the files in the workspace, including hardlinks and copies, so it was a breath of fresh air for me.

I began to use it for my large collections. Keeping track of my binary files in Git was something I always desired. Git is the least sucking version control system among the ones I used previously (SVN, hg, darcs…) and I’d rather keep using it everywhere than learn new tools for binary files.

After some time I began to use the tool for my personal file collections. I noticed its performance was becoming a burden. I was tracking maybe a few gigabytes of files with it, and basic file operations became slower as I added more. I noticed I was getting distracted after typing a dvc command. It took some time to admit that the tool I once liked, and was earning my salary with, was not a tool that I liked to use.

I don’t know what real professionals would do at this point. I never had a good LinkedIn profile. When I met a similar problem with the example repository that’s supposed to contain 70000 small files, I brought the issue forward. I wrote a shell script that basically did the same thing as dvc add, and it worked much faster than the actual command. The shell script was naïve and I thought DVC must have at least that level of speed. It didn’t. Simply calling md5sum on files and copying them to the appropriate location in .dvc/cache was way faster. How could this be?
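The original shell script isn't reproduced here, but the idea can be sketched in a few lines of Python using only the standard library. This is an illustration, not DVC's implementation: the function names `md5_of` and `naive_add` are hypothetical, and the cache layout (a directory named after the first two hex characters of the digest) is an assumption about what "the appropriate location" looks like.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def naive_add(workspace: Path, cache: Path) -> dict[str, str]:
    """Copy every file under `workspace` into a content-addressed `cache`.

    Assumes `cache` lives outside `workspace`, so cached copies are not
    picked up again. Returns a mapping of relative path -> digest.
    """
    tracked: dict[str, str] = {}
    for path in sorted(workspace.rglob("*")):
        if not path.is_file():
            continue
        digest = md5_of(path)
        # Assumed layout: <cache>/<first 2 hex chars>/<remaining chars>
        target = cache / digest[:2] / digest[2:]
        if not target.exists():  # identical content is stored only once
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
        tracked[str(path.relative_to(workspace))] = digest
    return tracked
```

Calling `naive_add(Path("data"), Path("cache"))` hashes and copies each file exactly once, which is roughly the baseline I expected the real command to match.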

I have only cursory observations on the code base. I know of some decisions (like a large central class that connects everything, and separate .dvc files for each tracked file) that may lead to performance degradation. Although I don’t see it as the problem, Python is also not helpful. These are rough observations.

It was September 2021. I was also teaching myself Rust. I wrote an email to the CTO and CEO of the company to request a sabbatical to work on DVC. My plan was to rewrite certain portions (or commands) in Rust and wrap them with PyO3. It could fail. So, to have skin in the game, I said I’d work for free during this time, and if I failed to make DVC faster for some reason, I’d return to my writing position.

They didn’t accept. I didn’t try to persuade them. The decision was rational, and although in their shoes I’d have said go ahead and see what happens, just to make my employee happy, they aren’t crazy managers as I once was. Probably there were many factors that I wasn’t aware of. I returned to my post and continued to write documentation for another 9 months. In the meantime I studied Rust and thought about how I could architect a similar tool. Where does DVC go wrong?

In April 2022, I informed the CTO that I’d like to take a sabbatical for my book. I have a political-SF book, and after the Ukrainian war started, with an (albeit minor) probability of nuclear attack on the other shore of the Black Sea, I thought it was not a time to work on something I had stopped liking. My performance in the last quarter was also not something I was proud of. I didn’t feel good.

When I went on sabbatical in July, though, while writing the book, I realized that writing the software I wanted to see was also something I had in mind to do before a nuclear war. I had notes about the architecture I was planning. I wanted to see if I could apply an Entity-Component System to this basic problem, without any Object-Oriented conceptions. I believe it looks cool. I’m still simplifying and testing the idea, and it looks better to my mind than mixing data and functions for no reason.

After I made the repository public, I resigned from Iterative.

In a sense, Xvc owes its existence to DVC, and the name is a tribute to this. I hope they squash their bugs, improve their user experience, and become a long-term player in the crowded market they are in. I don’t intend to be a “competitor”, because I prefer being a developer/architect rather than a “technical steward to VC money”, and the license of Xvc is GPL-3 to signal this.

The Xvc command line interface, however, is as different as it can be from DVC. The command names are different. DVC has commands similar to Git’s (push, fetch, pull, commit), while Xvc tries to be different from Git to reduce the user’s mental load. For example, as a writer, I noticed that “Git remotes” and “DVC remotes” were confusing, so I called them “Xvc storages”. DVC calls the units of a pipeline stages; the same concept is called steps in Xvc, because stage in Git is something completely different.

Internally, the architecture is also very different. Xvc uses serialization (with serde) instead of YAML. It can export/import pipelines to and from YAML (or JSON), but YAML is not as central as in DVC. (I believe YAML is overused in our industry, and it’s an employment guarantee for another generation of developers, but there is better work to do than keeping up a half-baked configuration format.) Xvc doesn’t keep its artifacts in the user’s workspace (except .xvcignore files). They are all stored in the .xvc/ directory. The DVC way of doing things makes merging .dvc files easier. To overcome the problems caused by merging large metadata files, Xvc keeps track of events and replays them to get the final state of the repository. All metadata storage and retrieval operations revolve around the XvcStore<T> struct in Xvc. Typically, if the user runs an xvc command, only the updated store events (added files, changed pipelines, etc.) are stored. There are optimizations to be made on this front, but I profile first and optimize later.
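The replay idea can be illustrated with a toy event log. This is a simplified sketch in Python, not the actual XvcStore<T> code (which is Rust): the class name `EventStore`, its methods, and the two event kinds are all hypothetical, chosen only to show how a final state falls out of replaying appended events.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """Hypothetical append-only store: state is derived, never edited in place."""

    events: list = field(default_factory=list)

    def add(self, entity: int, value) -> None:
        # Record that `entity` now has `value`; earlier events are untouched.
        self.events.append(("add", entity, value))

    def remove(self, entity: int) -> None:
        # Record a removal instead of deleting earlier events.
        self.events.append(("remove", entity, None))

    def replay(self) -> dict:
        """Rebuild the current state by applying events in order."""
        state = {}
        for kind, entity, value in self.events:
            if kind == "add":
                state[entity] = value
            else:
                state.pop(entity, None)
        return state


store = EventStore()
store.add(1, "photos/a.jpg")
store.add(2, "photos/b.jpg")
store.remove(1)
store.add(2, "photos/b-edited.jpg")
current = store.replay()  # only entity 2 remains, with its latest value
```

In a scheme like this, each command only has to append its new events, so one large metadata file never has to be rewritten or merged as a whole.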

Algorithms for data digests are configurable. By default Xvc uses Blake3, but it can be configured to use SHA2-256, SHA3-256, or Blake2s. It can be modified quickly to use any 256-bit digest. There are some features that are not found in DVC, and more will come. So, although I’m solving a similar problem, Xvc is not “DVC rewritten in Rust”; it’s a completely different tool.
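To show why any 256-bit digest can be swapped in, here is a small Python sketch with a hypothetical `content_digest` function and `ALGORITHMS` table. It is not Xvc's code or configuration format. Blake3 is left out only because it is not in Python's standard library; the other three algorithms are available in `hashlib`.

```python
import hashlib

# Every entry produces a 32-byte (256-bit) digest, so callers never need to
# know which algorithm is configured.
ALGORITHMS = {
    "sha2-256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
    "blake2s": hashlib.blake2s,  # default digest size is 32 bytes
}


def content_digest(data: bytes, algorithm: str = "blake2s") -> bytes:
    """Return the 32-byte digest of `data` using the named algorithm."""
    try:
        constructor = ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unsupported digest algorithm: {algorithm}") from None
    return constructor(data).digest()
```

Because the output size is fixed, adding another algorithm in this sketch means adding one entry to the table.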

Currently it doesn’t have as much eye candy as DVC. In time, I plan to add Python, Julia, and R APIs, notebook integration, experiment tracking (without relying on Git internals), data labeling and filtering, and other MLOps features. I’m building with the goal of making these features available without making the rest of the software slower.

I’ve found the tool I was looking for to track my kids’ photos and Ottoman OCR datasets in a Git repository. I’m tracking more than 1TB of files in a single repository with Xvc and adding another 10TB looks feasible now.