Differences between DVC and Xvc

Posted on 2024-07-17 :: Tags: xvc, dvc, mlops, data-versioning, rclone, s5cmd

I wrote this on Reddit; let’s put it here too.

The full list of similarities and differences is rather long, so let me summarize it.

Xvc has different commands: xvc file track is used instead of dvc add. Xvc doesn’t add tracking files (like DVC’s .dvc files) to your repository; it keeps all tracking metadata under the .xvc directory. The checkout method is set per file, not configured globally, so you can track your data directory as symlinks and your model directory as copies. Xvc uses BLAKE3 as the default hashing algorithm, and you can configure it to use BLAKE2, SHA-2 or SHA-3 instead.
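To make the per-file checkout idea concrete, here is a minimal Python sketch of a content-addressed cache where each tracked file picks its own checkout method and hash algorithm. It is an illustration only, not Xvc's implementation: the function names, the cache layout and the use of blake2s for "BLAKE2" are all assumptions, and BLAKE3 is left out because it is not in Python's standard library.

```python
import hashlib
import os
import shutil
from pathlib import Path

# Algorithms available in Python's standard library. BLAKE3 (Xvc's default)
# is not in hashlib, so it is omitted. Mapping "blake2" to blake2s is an
# assumption made for this sketch.
ALGORITHMS = {
    "blake2": hashlib.blake2s,
    "sha2": hashlib.sha256,
    "sha3": hashlib.sha3_256,
}


def content_digest(path, algorithm="sha2"):
    """Return the hex digest of a file's content, read in chunks."""
    hasher = ALGORITHMS[algorithm]()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track(path, cache_dir, method="copy", algorithm="sha2"):
    """Move a file into a content-addressed cache, then check it out again.

    `method` is passed per call, so every file can use a different one.
    The cache layout (cache_dir/algorithm/digest) is hypothetical.
    """
    path = Path(path)
    digest = content_digest(path, algorithm)
    cached = Path(cache_dir) / algorithm / digest
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(cached))
    if method == "symlink":
        os.symlink(cached.resolve(), path)
    elif method == "copy":
        shutil.copy2(cached, path)
    else:
        raise ValueError(f"unknown checkout method: {method}")
    return digest
```

With this, a call like track("data/train.csv", ".cache", method="symlink") leaves a symlink in the workspace, while track("models/model.bin", ".cache", method="copy") leaves an editable copy. Both paths are made-up examples.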

Pipelines are not defined in YAML files. You define them first, either with a shell script that calls xvc pipeline step ... or in Python with xvc.pipeline().step().dependency(step_name="preprocess", param="hyperparams.yaml::batch_size"). Then you can use xvc pipeline export and xvc pipeline import to modify the pipeline in YAML, JSON or TOML.

There are more dependency options. For example, a pipeline step may depend on only part of a text file, selected by a regex or by a range of lines. There is also a generic dependency option: the output of a shell command can be used as a dependency of a step.
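The idea behind partial and generic dependencies can be sketched in a few lines of Python: instead of hashing the whole file, hash only the selected part, so the step is invalidated only when that part changes. This is a sketch of the concept under my own assumptions, not Xvc's code; the function names and the choice of SHA-256 are hypothetical.

```python
import hashlib
import re
import subprocess


def _digest(pieces):
    """Hash an iterable of strings with SHA-256 (an arbitrary choice here)."""
    hasher = hashlib.sha256()
    for piece in pieces:
        hasher.update(piece.encode("utf-8"))
    return hasher.hexdigest()


def regex_digest(text, pattern):
    """Digest only the lines of `text` that match `pattern`."""
    rx = re.compile(pattern)
    lines = text.splitlines(keepends=True)
    return _digest(line for line in lines if rx.search(line))


def lines_digest(text, start, end):
    """Digest the lines in the half-open, 0-indexed range [start, end)."""
    return _digest(text.splitlines(keepends=True)[start:end])


def command_digest(command):
    """Digest the standard output of a shell command (generic dependency)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return _digest([result.stdout])
```

A step that stores regex_digest(text, r"^batch_size") from its last run only needs to rerun when a matching line changes; edits elsewhere in the file leave the digest untouched.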

Remote storage options are rather limited for Xvc: local, ssh+rsync and S3-compatible storages are supported for now. There is also a generic storage option, where you define upload and download commands for the tool you’re using, e.g. rclone or s5cmd, and Xvc runs them. I’ll add Azure and rclone as natively supported storage options eventually, but I don’t like the idea of keeping credentials, so there won’t be any storage options that require OAuth, e.g., Google Drive. (You’ll be able to use these through rclone, though.) All Xvc storages use environment variables for authentication.
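A generic storage boils down to filling a command template and handing it to an external tool. The Python sketch below shows that idea with rclone copyto. The placeholder names ({local_path}, {remote_path}), the MY_STORAGE_REMOTE variable and the function names are all hypothetical and are not Xvc's actual template syntax. Nothing secret is stored: the child process inherits the environment, so any credentials the tool reads from environment variables reach it directly.

```python
import os
import shlex
import subprocess

# Hypothetical templates; Xvc's real placeholders may differ.
UPLOAD_TEMPLATE = "rclone copyto {local_path} {remote_path}"
DOWNLOAD_TEMPLATE = "rclone copyto {remote_path} {local_path}"


def build_command(template, local_path, cache_path):
    """Fill a template and return an argument list for subprocess.

    The remote prefix comes from the (made-up) MY_STORAGE_REMOTE variable,
    e.g. "myremote:bucket".
    """
    remote = os.environ["MY_STORAGE_REMOTE"]
    filled = template.format(
        local_path=shlex.quote(local_path),
        remote_path=shlex.quote(f"{remote}/{cache_path}"),
    )
    return shlex.split(filled)


def transfer(template, local_path, cache_path):
    """Run the filled command; the tool inherits this process's environment."""
    subprocess.run(build_command(template, local_path, cache_path), check=True)
```

Swapping rclone for s5cmd or any other tool only means changing the two template strings.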

Xvc doesn’t have experiment tracking yet. You can use the --from-ref and --to-branch options to store artifacts from the pipeline in different branches. I’ll eventually add features to run pipelines and commands quickly and compare them (I need this too), but it may take some time.

Xvc doesn’t track anything about the user. It shouldn’t make any network connections unless you specifically ask it to. I’m planning to add binaries that only do file operations or only pipeline operations, so someone who doesn’t need the pipeline features can simply use xvc-file.