Dialogue 9/29

This is a dialogue between 🔺 and 🔹. These are mostly produced while thinking about topics I’m working on.

🔺 So what’s the typical user for Xvc? Who’s the probable user?

🔹 I’m thinking of a user who has lots of data and will be training machine learning models. They need to version this data.

🔺 What do you mean by data in this context?

🔹 The data in this context are files. Image files, audio files, text files. Bunch of files. Lots and lots of files. They may be organized into directories, they may be reserved for training, verification and testing, they may be stored in multiple locations.

🔺 So, not the database files? We can’t version the data in databases, you say.

🔹 If the database you think of is something like SQLite, then it’s possible to version it. It’s just another file. You can track the file with Xvc and move back and forth between files to go back in time. But I doubt it’s the best way.

🔺 Why?

🔹 The database for data is an alternative to the file system. You can version the data by adding a timestamp or a version description, and track individual records. In this case, you can go back and forth with views by selecting a subset of the records. To my mind, it doesn’t make sense to replace the database file itself to track the data version.
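
To make the record-level idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name samples, the column added_in, and the helper records_as_of are all made up for illustration, and the sketch assumes records are only ever appended, never updated or deleted.

```python
import sqlite3


def records_as_of(conn, version):
    """Return the rows that were present at the given data version.

    Assumes an append-only table where `added_in` stores the version
    number in which each record first appeared (a hypothetical schema).
    """
    cur = conn.execute(
        "SELECT id, label FROM samples WHERE added_in <= ? ORDER BY id",
        (version,),
    )
    return cur.fetchall()


# Build a small in-memory database to try it out.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (id INTEGER PRIMARY KEY, label TEXT, added_in INTEGER)"
)
conn.executemany(
    "INSERT INTO samples (label, added_in) VALUES (?, ?)",
    [("cat", 1), ("dog", 1), ("bird", 2)],
)

print(records_as_of(conn, 1))  # records that belong to version 1
print(records_as_of(conn, 2))  # records that belong to version 2
```

Going back in time is then a matter of choosing a smaller version number, with no database file being swapped out.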

🔺 For tracking, yes, there may be other ways to version the data. Databases themselves provide this as well. Except for SQLite, replacing the database files doesn’t work seamlessly. Databases keep extra indices and auxiliary files for performance, and just replacing the database files will probably make them fail. However, these databases may be sources of data in data pipelines. When you have, for example, a database table as a source, you should be able to define it as a dependency in the pipelines.

🔹 This looks worthwhile, right.

🔺 So, the user can define a database table as a dependency. Suppose you have a table that contains a bunch of data and you want to update your ML model when this data changes. This must be a very common use case.

🔹 It looks so, yes. It’s more difficult than file system dependencies though. At least, it must be customized per database engine. We need different connections to different databases.

🔺 This is something that can be abstracted. All databases have tables and queries. We can have a database connection layer to get a table, check whether it has changed, and invalidate it as a dependency if it has. The user is responsible for getting the actual data and doing whatever they do with it anyway.

🔹 It looks like a nice model. We can also adapt it to SQL queries. When queries are run, they produce a result and if this result has changed, we can consider it invalidated. Nice to run a simple query to check whether a complex join has changed and update the models if so.
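
One way this check could look, as a rough sketch rather than anything Xvc does today: hash the rows a query returns and compare the digest with the one stored from the previous run. The function names query_digest and has_changed and the labels table are hypothetical, SQLite stands in for any database engine, and the query is assumed to have an ORDER BY so that the row order is stable.

```python
import hashlib
import sqlite3


def query_digest(conn, query):
    """Hash every row the query returns, in order, into one hex digest."""
    h = hashlib.sha256()
    for row in conn.execute(query):
        h.update(repr(row).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def has_changed(conn, query, previous_digest):
    """Return (changed, current_digest) for the given query."""
    current = query_digest(conn, query)
    return current != previous_digest, current


# Try it on a small in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO labels (name) VALUES ('cat')")

query = "SELECT id, name FROM labels ORDER BY id"
baseline = query_digest(conn, query)  # digest saved from a previous run
```

A step depending on this query would be invalidated whenever has_changed reports a different digest, and the new digest would be stored for the next run.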

🔺 What would be the performance implications? A query should be run in each run of the pipeline, I think.

🔹 Yep. If we define a database table dependency as xvc pipeline step dependency --database-table, then it should connect to the database at each xvc pipeline run and check whether the table has changed. The table change can be detected by selecting all records or via database internals. It’s not that big a deal if the table or query doesn’t produce many results.

🔺 Maybe it’s possible to generalize this behavior to the command line as well. Xvc could run a command to check whether something has changed, like ls -R my-data, and if the output is different from the previous run, it can run another command. This could be the generic form of pipeline steps.
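
The generic version could be sketched in the same way: run the checking command, hash its standard output, and compare it with the digest from the previous run. The helper names command_digest and output_changed are invented for this sketch, and it assumes the command's output is deterministic when nothing has changed.

```python
import hashlib
import subprocess


def command_digest(args):
    """Run a command and return the SHA-256 hex digest of its stdout.

    `args` is a list such as ["ls", "-R", "my-data"]. A non-zero exit
    status raises subprocess.CalledProcessError because of check=True.
    """
    result = subprocess.run(args, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def output_changed(args, previous_digest):
    """Return (changed, current_digest) for the command's output."""
    current = command_digest(args)
    return current != previous_digest, current
```

With this, a step whose dependency is ls -R my-data would call output_changed(["ls", "-R", "my-data"], stored_digest) and rerun only when the first element of the result is True.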

🔹 Looks like a nice idea, yes. We can talk about this tomorrow.