In the first months of 2021, I decided to return to life after a long legal battle for divorce. Covid was still raging. I wasn’t keen to start a company or work in my country because of my half-deaf ears. I decided to find some open source projects and contribute to them, maybe get some recognition, maybe get hired.

Around that time, I saw an ad on Stack Overflow Jobs about getting hired by contributing to open source projects. I applied. A few weeks later, the CTO of [iterative.ai] got in touch and I started working on the DVC documentation, initially on a per-hour basis and, after May 2021, as a full-time employee.

Initially, I liked the tool we were building very much. The team was awesome. (They still are.) It was one of the best periods of my life, especially given the pressures of my turbulent, still-ongoing divorce. I know I will always miss them.

My job was learning DVC, documenting it, and helping newcomers grasp it easily. It was a fun job. Until then, I hadn’t seen myself as a technical writer. English is not my native tongue and I have never lived in an English-speaking country. Nevertheless, I think I wasn’t too bad at it.

When I was first learning the tool, I began to use it everywhere. I was an avid user of Git Annex once. DVC looked better. I don’t remember why I lost interest in Git Annex after many years, but it was probably related to symbolic links not working on Windows (or on Termux). DVC had multiple ways of connecting the cache and the files in the workspace, including hardlinks and copy, so it was a breath of fresh air for me.
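To make the idea of link types concrete, here is a minimal Python sketch of connecting a cached file to a workspace path with a hardlink and falling back to a plain copy when hardlinks are not available. It is only an illustration of the general technique, not DVC's code; the function name link_or_copy and its arguments are hypothetical.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(cached: Path, workspace_path: Path) -> str:
    """Make `workspace_path` point at the content of `cached`.

    Tries a hardlink first; if the filesystem refuses (for example,
    across devices or on filesystems without hardlink support),
    falls back to copying the file. Returns the method that was used.
    """
    workspace_path.parent.mkdir(parents=True, exist_ok=True)
    # Remove a stale workspace file so that linking does not fail on it.
    if workspace_path.exists():
        workspace_path.unlink()
    try:
        os.link(cached, workspace_path)
        return "hardlink"
    except OSError:
        shutil.copy2(cached, workspace_path)
        return "copy"
```

A hardlink costs no extra disk space, but editing the workspace file also changes the cached copy, which is why a copy fallback (or a read-only cache) is usually part of such a design.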

I began to use it for my large collections. Keeping track of my binary files in Git was something I had always wanted. Git is the least sucking version control system among the ones I used previously (SVN, hg, darcs…) and I’d rather keep using it everywhere than learn new tools for binary files.

After some time using the tool for my personal file collections, I noticed that its performance had become a burden. I was tracking maybe a few gigabytes of files with it, and basic file operations became slower as I added more. I noticed I was getting distracted while waiting for a dvc command to finish. It took some time to confess that the tool I once liked, and was earning my salary with, was not a tool that I liked to use.

I don’t know what real professionals would do at this point. I never had a good LinkedIn profile. When I ran into a similar problem with the example repository that’s supposed to contain 70,000 small files, I brought the issue forward. I wrote a shell script that did basically the same thing as dvc add, and it worked much faster than the actual command. The shell script was naïve, and I thought DVC must have at least that level of speed. It didn’t. Simply calling md5sum on the files and copying them to the appropriate location in .dvc/cache was way faster. How could this be?
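For readers who want to picture what such a naive approach amounts to, here is a minimal sketch in Python. It is not the original shell script, which isn't reproduced in this post. The cache layout (a two-character prefix directory followed by the rest of the checksum) is an assumption for illustration, and the names md5_of and naive_add are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def naive_add(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into `cache_dir` under a name derived from its MD5.

    Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining 30 chars>.
    Files with identical content map to the same cache entry, so the
    copy is skipped when the entry already exists.
    """
    checksum = md5_of(path)
    target = cache_dir / checksum[:2] / checksum[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return target
```

A real tool also has to record which workspace path maps to which checksum and keep the workspace file linked to the cache; the sketch covers only the hashing and copying step that the comparison above is about.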

I made only cursory observations of the codebase. I know of some decisions (like a large central class that connects everything, and separate .dvc files for each tracked file) that may lead to performance degradation. Although I don’t see it as the problem, Python didn’t help either. These are rough observations.

It was September 2021. I was also teaching myself Rust. I wrote an email to the CTO and CEO of the company to request a sabbatical to work on DVC. My plan was to rewrite certain portions (or commands) in Rust and wrap them with PyO3. It could fail. So, to have skin in the game, I said I would work for free during this time, and if I failed to make DVC faster for some reason, I would return to my writing position.

They didn’t accept. I didn’t try to persuade them. The decision was rational. If I were in their shoes, I’d have said “go ahead and see what happens” just to make my employee happy, but they aren’t crazy managers as I once was. There are probably many factors that I’m not aware of. I returned to my post and continued to write documentation for another 9 months. In the meantime, I studied Rust and thought about how I could architect a similar tool. Where does DVC go wrong?

In April 2022, I informed the CTO that I’d like to take a sabbatical for my book. I have a political-SF book, and after the war in Ukraine started, with an (albeit minor) probability of a nuclear attack on the other shore of the Black Sea, I thought it was not the time to work on something I had stopped liking. My performance in the last quarter was also not something I was proud of. I didn’t feel good.

When I started my sabbatical in July, though, while writing the book, I realized that writing the software I wanted to see was also something I had in mind to do before a nuclear war. I had notes about the architecture I was planning. I wanted to see if I could apply an Entity-Component System to this basic problem, without any Object-Oriented conceptions. I believe it looks cool. I’m still simplifying and testing the idea, and it looks better to my mind than mixing data and functions for no reason.

After I made the repository public, I resigned from Iterative.

In a sense, Xvc owes its existence to DVC, and the name is a tribute to this. I hope they squash their bugs, improve their user experience, and become a long-term player in the crowded market they are in. I don’t intend to be a “competitor”, because I prefer being a developer/architect rather than a “technical steward to VC money”, and the license of Xvc is GPL-3 to signal this.

The Xvc command line interface, however, is as different as it can be from DVC. The command names are different; DVC has commands similar to Git (push, fetch, pull, commit), while Xvc tries to be different from Git to reduce the user’s mental load. For example, as a writer, I noticed that having both “Git remotes” and “DVC remotes” was confusing, so I called them “Xvc storages”. DVC calls the units of a pipeline stages; the same concept is called steps in Xvc, because stage in Git is something completely different.

Internally, the architecture is also very different. Xvc uses serialization (with serde) instead of YAML. It can export pipelines to and import them from YAML (or JSON), but YAML is not as central as in DVC. (I believe YAML is overused in our industry, and it’s an employment guarantee for another generation of developers, but there is better work than keeping up a half-baked configuration format.) Xvc doesn’t keep its artifacts in the user’s workspace (except .xvcignore files). They are all stored in the .xvc/ directory. The DVC way of doing things makes merging .dvc files easier. To overcome the problems caused by merging large metadata files, Xvc keeps track of events and replays them to get the final state of the repository. All metadata storage and retrieval operations revolve around the XvcStore<T> struct in Xvc. Typically, if the user runs an xvc command, only the updated store events (added files, changed pipelines, etc.) are stored. There is room for optimization on this front, but I profile first and optimize later.
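The event-replay idea can be sketched in a few lines. The following Python example is a conceptual illustration only: it is not how XvcStore<T> is implemented, and the Event type, the event kinds "add" and "remove", and the replay function are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    """One recorded change: an entity gets a value or is removed."""
    kind: str                 # hypothetical kinds: "add" or "remove"
    entity: int               # identifier of the tracked entity
    value: Optional[Any] = None


def replay(events: Iterable[Event]) -> Dict[int, Any]:
    """Rebuild the current state by applying events in order."""
    state: Dict[int, Any] = {}
    for event in events:
        if event.kind == "add":
            # A later "add" for the same entity overrides the earlier one.
            state[event.entity] = event.value
        elif event.kind == "remove":
            state.pop(event.entity, None)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return state


# Each command invocation would append only its own small batch of events.
first_run = [Event("add", 1, "photo.jpg"), Event("add", 2, "scan.png")]
second_run = [Event("remove", 1), Event("add", 2, "scan-v2.png")]

current = replay(first_run + second_run)
```

Because each run only appends new events, the stored history grows in small pieces instead of one large metadata file being rewritten and merged, which is the motivation described above.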

The digest algorithm is configurable: Xvc uses Blake3 by default and can be set to SHA2-256, SHA3-256, or Blake2s instead. It can quickly be modified to use any 256-bit digest. There are some features that are not found in DVC, and more will come. So, although I’m solving a similar problem, Xvc is not “DVC rewritten in Rust”; it’s a completely different tool.
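As a rough illustration of why any 256-bit digest can be swapped in, here is a small Python sketch that selects the algorithm by name. It is not Xvc's code or configuration format: the algorithm keys and the content_digest function are hypothetical, and Blake3 is left out only because it is not part of Python's standard library.

```python
import hashlib

# All three produce 32-byte (256-bit) digests, so the rest of a tool can
# treat the digest as an opaque 32-byte value regardless of the choice.
ALGORITHMS = {
    "sha2-256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
    "blake2s": hashlib.blake2s,  # default digest size is 32 bytes
}


def content_digest(data: bytes, algorithm: str = "sha2-256") -> bytes:
    """Return the 32-byte digest of `data` using the named algorithm."""
    try:
        constructor = ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unknown digest algorithm: {algorithm}") from None
    return constructor(data).digest()
```

Adding another 256-bit algorithm under this assumption is a one-line change to the table, since nothing else depends on which function produced the bytes.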

Currently, it doesn’t have as much eye candy as DVC. In time, I plan to add Python, Julia, and R APIs, notebook integration, experiment tracking (without relying on Git internals), data labeling and filtering, and other MLOps features. My goal is to make these features available without making the rest of the software slower.

I’ve found the tool I was looking for to track my kids’ photos and Ottoman OCR datasets in a Git repository. I’m tracking more than 1TB of files in a single repository with Xvc and adding another 10TB looks feasible now.