XVC State Machine

Posted on 2022-12-10 :: Tags: xvc, rust, finite state machines, state machines, pipeline, programming

I have begun writing Xvc’s pipeline and dependency handling. The best way to handle dependency states seems to be a state machine: a simple abstraction that describes how a state changes in response to inputs, optionally producing an output for each change. State machines come in several varieties, but Xvc’s state machine (SM) is a simple one.

I first tried to use the rust-fsm library, but it became apparent that Xvc pipeline steps’ states are tied to XvcRoot. That is, if we are to check the presence of a file or the value of a parameter, we have to do it relative to the repository root. The root directory should be taken into consideration in every transition.

I checked the code and noticed that the FSM is actually very simple. I copied it, added &XvcRoot to the transition and output functions in the trait definition, and implemented it for XvcOutput, XvcDependency, and XvcStep.
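As a rough illustration of what threading the root through the trait might look like, here is a minimal sketch. The trait shape, the `XvcRoot` field, and the toy `FilePresence` machine are all assumptions made for this example; they are not Xvc’s actual definitions.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for Xvc's repository root; the real type may differ.
pub struct XvcRoot {
    pub absolute_path: PathBuf,
}

/// Assumed trait shape: a plain FSM trait where every call also receives the root.
pub trait StateMachine {
    type Input;
    type State;
    type Output;

    fn transition(root: &XvcRoot, state: &Self::State, input: &Self::Input) -> Option<Self::State>;
    fn output(root: &XvcRoot, state: &Self::State, input: &Self::Input) -> Option<Self::Output>;
}

/// Toy states for a file that is looked up relative to the root.
#[derive(Debug, Clone, PartialEq)]
pub enum Presence {
    Unknown,
    Found,
    Missing,
}

/// Toy machine (hypothetical): its input is a repository-relative path.
pub struct FilePresence;

impl StateMachine for FilePresence {
    type Input = PathBuf;
    type State = Presence;
    type Output = String;

    fn transition(root: &XvcRoot, _state: &Presence, input: &PathBuf) -> Option<Presence> {
        // The path is resolved against the root, never the current directory.
        if root.absolute_path.join(input).exists() {
            Some(Presence::Found)
        } else {
            Some(Presence::Missing)
        }
    }

    fn output(root: &XvcRoot, _state: &Presence, input: &PathBuf) -> Option<String> {
        // Emit a message only when the file is missing.
        if root.absolute_path.join(input).exists() {
            None
        } else {
            Some(format!("missing: {}", input.display()))
        }
    }
}
```

Passing `&XvcRoot` as an explicit argument keeps the states themselves free of paths, so the same state values stay meaningful wherever the repository lives.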

These are the constituents of a pipeline. A pipeline is composed of XvcSteps; each step defines a command and can have multiple XvcDependency and XvcOutput definitions. For each of these structs, I’ve added fields that represent their current state.

For example, an XvcOutput can be Missing, Found, Old, or Ok. An XvcStep that produces this XvcOutput doesn’t check the dependency content hash if an output is missing. However, if an XvcOutput is Found, the XvcStepStateMachine checks the XvcDependency states and their modification times to see if they have changed since the output was generated. An XvcStep is invalidated when an XvcOutput is Missing or an XvcDependency has changed after the last command run.
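The output check described above could be sketched roughly as follows. The four state names come from the text; the struct fields, the function names, and the reading of Old as "a dependency is newer than the output" and Ok as "no dependency is newer" are assumptions for illustration only.

```rust
use std::time::SystemTime;

/// The four output states named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputState {
    Missing,
    Found,
    Old,
    Ok,
}

/// Hypothetical dependency record: only the modification time matters here.
pub struct Dependency {
    pub modified: SystemTime,
}

/// Hypothetical output record: `generated` is None when the output is missing.
pub struct Output {
    pub state: OutputState,
    pub generated: Option<SystemTime>,
}

/// Resolve a `Found` output into `Old` or `Ok` by comparing timestamps.
/// Any other state is returned as-is, so a `Missing` output never
/// triggers a look at the dependencies.
pub fn resolve_output(output: &Output, deps: &[Dependency]) -> OutputState {
    match (output.state, output.generated) {
        (OutputState::Found, Some(generated)) => {
            if deps.iter().any(|d| d.modified > generated) {
                OutputState::Old
            } else {
                OutputState::Ok
            }
        }
        (state, _) => state,
    }
}

/// A step is invalidated when any output is missing or older than a dependency.
pub fn step_is_invalidated(outputs: &[Output], deps: &[Dependency]) -> bool {
    outputs.iter().any(|o| {
        matches!(
            resolve_output(o, deps),
            OutputState::Missing | OutputState::Old
        )
    })
}
```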

Unlike DVC, Xvc allows an XvcStep to depend on other XvcSteps. They communicate through outputs. I’ve added XvcDependency::Step(XvcStep) to the XvcDependency definition. I’m also planning XvcDependency::Pipeline(XvcPipeline) to allow steps to depend on other pipelines, so that pipelines can be run in order.

Currently, the following are included as XvcDependency:

File: A (binary or text) file in the repository. If the metadata (size and modification time) or the content changes, the dependent step becomes invalidated.
Directory: A directory that contains files. If a file is added to or removed from the directory, or any of the files are changed, the associated step becomes invalidated.
Glob: A glob such as my-data/*.png. If the list of files changes or their content has changed, the associated step becomes invalidated.
Parameter: Xvc can parse YAML, TOML, and JSON files and get the values of variables. It’s possible to define these (hyper)parameters as dependencies.
URL: An HTTPS URL, which is checked first by metadata and then by content to see whether it has changed.
Step: A previously defined step; if it’s invalidated, the depending step also becomes invalidated.
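The list above maps naturally onto an enum. The sketch below uses hypothetical field names and a simplified `Dependency` type (it is not Xvc’s actual XvcDependency definition), and shows one way invalidation could propagate along Step dependencies.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the dependency kinds listed above.
/// All field names are assumptions for this sketch.
pub enum Dependency {
    File { path: String },
    Directory { path: String },
    Glob { pattern: String },
    Parameter { path: String, key: String },
    Url { url: String },
    Step { name: String },
}

/// Given each step's dependencies and the set of steps already known to be
/// invalidated, return every step that is invalidated directly or through a
/// chain of `Step` dependencies.
pub fn invalidated_steps(
    deps: &HashMap<String, Vec<Dependency>>,
    directly_invalidated: &HashSet<String>,
) -> HashSet<String> {
    let mut invalid = directly_invalidated.clone();
    // Repeat until no new step is added (a simple fixed-point iteration).
    loop {
        let mut grew = false;
        for (step, step_deps) in deps {
            if invalid.contains(step) {
                continue;
            }
            let depends_on_invalid = step_deps
                .iter()
                .any(|d| matches!(d, Dependency::Step { name } if invalid.contains(name)));
            if depends_on_invalid {
                invalid.insert(step.clone());
                grew = true;
            }
        }
        if !grew {
            return invalid;
        }
    }
}
```

For example, if a step named train depends on preprocess and a step named report depends on train, invalidating preprocess also invalidates train and report, while unrelated steps are left alone.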

Additionally, I’m planning to add the following items to Xvc as dependencies:

Lines {path, begin, end}: Lines in a text file. This can be used for general-purpose input tracking. If the given lines in a file are changed, the dependent step becomes invalidated.
Regex {path, regex}: If the result of applying the regular expression to the file changes, the dependent step becomes invalidated.
Pipeline { name }: If any of the steps in the named pipeline is invalidated, the pipeline is invalidated, and so is any step that depends on it.
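Since these dependency types are only planned, the following is a speculative sketch of how a Lines dependency might be compared. It assumes zero-based line numbers with an exclusive end; the function names are hypothetical.

```rust
/// Return the lines in the half-open range `begin..end` (zero-based, assumed).
pub fn tracked_lines(text: &str, begin: usize, end: usize) -> Vec<&str> {
    text.lines()
        .skip(begin)
        .take(end.saturating_sub(begin))
        .collect()
}

/// The dependency counts as changed only if the tracked range differs.
/// Edits outside the range are ignored as long as they do not shift lines.
pub fn lines_changed(old: &str, new: &str, begin: usize, end: usize) -> bool {
    tracked_lines(old, begin, end) != tracked_lines(new, begin, end)
}
```

A real implementation would more likely store a hash of the tracked lines than the previous file content; comparing the text directly keeps the sketch short.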

Each of these dependencies is checked minimally: when the size on disk has changed, the dependency is considered changed without computing its content hash. Tracking these changes without bugs requires a very detailed state machine.
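A minimal sketch of such a tiered check for a file is given below. It assumes that a size change is decisive, that identical size and modification time mean "unchanged", and that the content is hashed only when the modification time alone differs. The type names are hypothetical, and `DefaultHasher` is used only as a stand-in for a real content hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// What was recorded for a file at the last run (hypothetical layout).
pub struct Recorded {
    pub size: u64,
    pub modified: SystemTime,
    pub content_hash: u64,
}

#[derive(Debug, PartialEq)]
pub enum Change {
    Unchanged,
    SizeChanged,
    ContentChanged,
}

/// Stand-in content hash; not suitable for persisting across Rust versions.
fn hash_content(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish())
}

/// Record the current metadata and content hash of a file.
pub fn record(path: &Path) -> io::Result<Recorded> {
    let meta = fs::metadata(path)?;
    Ok(Recorded {
        size: meta.len(),
        modified: meta.modified()?,
        content_hash: hash_content(path)?,
    })
}

/// Cheapest checks first; the file is read only when metadata is inconclusive.
pub fn check(path: &Path, recorded: &Recorded) -> io::Result<Change> {
    let meta = fs::metadata(path)?;
    if meta.len() != recorded.size {
        return Ok(Change::SizeChanged); // no hashing needed
    }
    if meta.modified()? == recorded.modified {
        return Ok(Change::Unchanged); // same size and time: assume unchanged
    }
    if hash_content(path)? == recorded.content_hash {
        Ok(Change::Unchanged)
    } else {
        Ok(Change::ContentChanged)
    }
}
```

The trade-off in this sketch is that an edit preserving both size and modification time goes unnoticed, in exchange for skipping the read of every unchanged file.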

I’ve noticed that if I can write such a state machine, most of the I/O operations can be done in parallel. If two steps do not depend on each other in the dependency graph, they can be run in parallel. The state machine’s granularity allows this.
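To illustrate the idea, the sketch below groups steps into levels so that no step in a level depends on another step in the same level, then runs each level on scoped threads. This is one possible scheduling scheme and not Xvc’s implementation; the function names and the map-based graph representation are assumptions.

```rust
use std::collections::{HashMap, HashSet};
use std::thread;

/// Group steps into levels. `deps` maps each step to the steps it depends on.
/// Every step in a level depends only on steps from earlier levels.
pub fn parallel_levels(deps: &HashMap<String, Vec<String>>) -> Vec<Vec<String>> {
    let mut done: HashSet<String> = HashSet::new();
    let mut levels = Vec::new();
    while done.len() < deps.len() {
        let mut ready: Vec<String> = deps
            .iter()
            .filter(|(step, needs)| {
                !done.contains(*step) && needs.iter().all(|n| done.contains(n))
            })
            .map(|(step, _)| step.clone())
            .collect();
        if ready.is_empty() {
            break; // a cycle or an unknown dependency: stop instead of looping
        }
        ready.sort(); // deterministic order within a level
        done.extend(ready.iter().cloned());
        levels.push(ready);
    }
    levels
}

/// Run the levels in order; steps inside one level run on parallel threads.
pub fn run_levels<F>(levels: &[Vec<String>], run_step: F)
where
    F: Fn(&str) + Sync,
{
    for level in levels {
        // All threads spawned in the scope are joined before the next level.
        thread::scope(|scope| {
            for step in level {
                let run_step = &run_step;
                scope.spawn(move || run_step(step.as_str()));
            }
        });
    }
}
```

Level-by-level execution is simpler than a full scheduler but waits for the slowest step in each level before it starts the next one.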