This is a dialogue between 🔺 and 🔹. I mostly write these dialogues while thinking through topics I’m working on.
🔺 So, who is the typical user of Xvc? Who is the probable user?
🔹 I’m thinking of a user with large amounts of data who will be training machine learning models. They need to version this data.
🔺 What do you mean by data in this context?
🔹 The data in this context consists of files: image files, audio files, text files. Lots and lots of files. They may be organized into directories; they may be split into training, validation, and test sets; and they may be stored in multiple locations.
🔺 So, not database files? Are you saying we can’t version the data in databases?
🔹 If the database you’re thinking of is something like SQLite, then it is possible to version it. It’s just another file. You can track the file with Xvc and move back and forth between versions to go back in time. But I doubt that’s the best way.
🔺 Why?
🔹 A database is an alternative to the file system for storing data. You can version the data by adding a timestamp or a version description and track individual records. In this case, you can go back and forth between versions by selecting a subset of the records. To my mind, it doesn’t make sense to replace the database file itself just to track the data version.
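As an aside, the record-level versioning 🔹 describes can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `samples` table, the `data_version` column, and the `records_at` helper are hypothetical names, not part of Xvc.

```python
import sqlite3

# Hypothetical schema: each record carries the data version that introduced it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, label TEXT, data_version INTEGER)"
)
conn.executemany(
    "INSERT INTO samples (label, data_version) VALUES (?, ?)",
    [("cat", 1), ("dog", 1), ("bird", 2)],
)


def records_at(conn, version):
    """Return the labels of all records added at or before `version`."""
    rows = conn.execute(
        "SELECT label FROM samples WHERE data_version <= ? ORDER BY id",
        (version,),
    )
    return [label for (label,) in rows]


# Moving between versions is a query, not a file swap.
v1 = records_at(conn, 1)
v2 = records_at(conn, 2)
```

Here moving "back in time" means selecting a smaller subset of rows, so the database file itself never has to be replaced.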
🔺 For tracking, yes, there may be other ways to version the data. Databases themselves often provide this functionality. Except for SQLite, replacing database files doesn’t work seamlessly. Databases maintain extra indices and auxiliary files for performance, and simply replacing the database files would likely cause them to fail. However, these databases can be sources of data in data pipelines. When you have, for example, a database table as a source, you should be able to define it as a dependency in the pipelines.
🔹 That sounds worthwhile, yes.
🔺 So, the user could define a database table as a dependency. Suppose you have a table that contains a lot of data, and you want to update your ML model whenever this data changes. This must be a very common use case.
🔹 It seems so, yes. It’s more difficult than file system dependencies, though. At the very least, it must be customized per database engine. We need different connections for different databases.
🔺 This is something that can be abstracted. All databases have tables and queries. We can have a database connection layer to retrieve a table, check whether it has changed, and invalidate it as a dependency if it has. The user is responsible for getting the actual data and doing whatever they need to do with it anyway.
🔹 It looks like a nice model. We can also adapt it to SQL queries. When queries are run, they produce a result, and if this result has changed, we can consider the dependency invalidated. It would be useful to run a simple query to check whether a complex join has changed and update the models accordingly.
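As an aside, one way to picture the "has this query result changed?" check is to hash the rows a query returns and compare the digest with the one stored from the previous run. The sketch below uses SQLite through Python's standard library. The function names `query_digest` and `has_changed` and the `images` table are assumptions made for illustration, not Xvc's actual design.

```python
import hashlib
import sqlite3


def query_digest(conn, query):
    """Hash every row returned by `query`, in order, into one hex digest."""
    digest = hashlib.sha256()
    for row in conn.execute(query):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def has_changed(conn, query, previous_digest):
    """Return (changed, current_digest) for the given query."""
    current = query_digest(conn, query)
    return current != previous_digest, current


# Hypothetical table standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (path TEXT, label TEXT)")
conn.execute("INSERT INTO images VALUES ('a.png', 'cat')")

# ORDER BY keeps the row order, and so the digest, deterministic.
query = "SELECT path, label FROM images ORDER BY path"

first = query_digest(conn, query)
changed_before, _ = has_changed(conn, query, first)

conn.execute("INSERT INTO images VALUES ('b.png', 'dog')")
changed_after, second = has_changed(conn, query, first)
```

Only the digest needs to be stored between runs. The query itself still has to be executed every time, and that execution is the cost to keep in mind.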
🔺 What would be the performance implications? A query would need to be run in each execution of the pipeline, I think.
🔹 Yep. If we define a database table dependency as xvc pipeline step dependency --database-table, then it should connect to the database during each xvc pipeline run and check whether the table has changed. A table change can be detected by selecting all records and comparing them with the previous run, or via database internals. It’s not that big a deal as long as the table or query doesn’t return too many rows.
🔺 Maybe it’s possible to generalize this behavior to the command line as well. Xvc could run a command to check whether something has changed—like ls -R my-data—and if the result differs from the previous run, it can trigger a command. This could be the generic way to handle pipeline steps.
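As an aside, the generic idea 🔺 describes can be sketched as: run a check command, hash its output, and trigger an action only when the hash differs from the previous run. The helper names `command_digest` and `run_if_changed` are hypothetical, and the check commands below are stand-ins for something like ls -R my-data, chosen so the sketch runs on any platform.

```python
import hashlib
import subprocess
import sys


def command_digest(args):
    """Run a command and return the SHA-256 hex digest of its stdout."""
    result = subprocess.run(args, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def run_if_changed(check_args, previous_digest, action):
    """Call `action` if the check command's output differs from last time.

    Returns the current digest so the caller can store it for the next run.
    """
    current = command_digest(check_args)
    if current != previous_digest:
        action()
    return current


# Stand-in check commands; a real step might use ["ls", "-R", "my-data"].
check_v1 = [sys.executable, "-c", "print('listing v1')"]
check_v2 = [sys.executable, "-c", "print('listing v2')"]

triggered = []
d1 = run_if_changed(check_v1, None, lambda: triggered.append("first run"))
d2 = run_if_changed(check_v1, d1, lambda: triggered.append("unchanged"))
d3 = run_if_changed(check_v2, d2, lambda: triggered.append("changed"))
```

With no stored digest, the first run always triggers the action. After that, only a change in the command's output does.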
🔹 That sounds like a great idea. We can talk about this tomorrow.