I began to write Xvc's pipeline and dependency handling. The best way to handle the dependency states seems through a state machine. The state machine is a simple abstraction that shows state changes with respect to inputs. It can also have outputs for these state changes. There are some varieties on this but Xvc's SM is a simple one.

I first tried to use rust-fsm library, but it began apparent that XVC pipeline steps' state are tied to XvcRoot, that is if we are to check the presence of a file, or value of a parameter, we have to do it relative to the repository root. The root directory should be taken into consideration in every transition.

I checked the code and I noticed that the FSM is actually a very simple one. I copied it, added &XvcRoot to transition and output functions in the trait definition and implemented it for XvcOutput, XvcDependency and XvcStep.

These are the constituents of a pipeline. A pipeline is composed of XvcStep that defines a command, and each step can have multiple XvcDependency and XvcOutput definitions. For each of these structs, I've added states that show their state.

For example, an XvcOutput can be Missing or Found, Old or Ok. An XvcStep that produces this XvcOutput doesn't check the dependency content hash if an output is missing. However, if XvcOutput is Found, the XvcStepStateMachine checks the XvcDependency states and their modification times to see if they are changed after the output. An XvcStep is invalidated when XvcOutput is Missing or XvcDependency has changed after the last command run.

Unlike DVC, I added the ability for XvcStep to depend on other XvcSteps. They communicate through outputs. I've added XvcDependency::Step(XvcStep) to XvcDependency definition. I'm also planning XvcDependency::Pipeline(XvcPipeline) to make steps to depend on other pipelines, so that pipelines can be run in order.

Currently, the following are included as XvcDependency:

  • File: A (binary or text) file in the repository. If the metadata (size and modification time), or the content changes, the dependent step becomes invalidated.
  • Directory: A directory that contains files. If a file added or removed from the directory, or any of the files are changed, the associated step becomes invalidated.
  • Glob: A glob in the sense my-data/*.png. If the list of files change, or their content has changed, the associated step becomes invalidated.
  • Parameter: XVC can parse YAML, TOML and JSON files, and get the values of variables. It's possible to define these (hyper)parameters as dependencies.
  • URL: An HTTPS URL, which is checked firstly by metadata and then by content to see whether it's been changed.
  • Step: A previously defined step, if it's invalidated, the depending step also becomes invalidated.

Additionally, I'm planning to add the following items to XVC as dependencies:

  • Lines {path, begin, end}: Lines in a text file. It can be used for general purpose input tracking. If the given lines in a file are changed, the dependent step becomes invalidated.
  • Regex {path, regex}: If the regular expression result on the file changes, the dependent stage becomes invalidated.
  • Pipeline { name }: If any of the steps in a pipeline is invalidated, the pipeline is also invalidated, or the step that depends on this pipeline becomes invalidated.

Each of these dependencies are checked minimally, that is when their size on disk is detected to be changed, they are considered changed without checking the content hash. It needs a very detailed state machine to track the changes without bugs.

I've noticed that, if I can write such a state machine, most of the IO operations can be done in parallel. If two steps are not depending to each other in the dependency graph, they can be run in parallel. State machine's granularity allows this.