XVC State Machine

2022-12-10

I began to write Xvc's pipeline and dependency handling. The best way to handle the dependency states seems through a state machine. The state machine is a simple abstraction that shows state changes with respect to inputs. It can also have outputs for these state changes. There are some varieties on this but Xvc's SM is a simple one.

I first tried to use rust-fsm library, but it began apparent that XVC pipeline steps' state are tied to XvcRoot, that is if we are to check the presence of a file, or value of a parameter, we have to do it relative to the repository root. The root directory should be taken into consideration in every transition.

I checked the code and I noticed that the FSM is actually a very simple one. I copied it, added &XvcRoot to transition and output functions in the trait definition and implemented it for XvcOutput, XvcDependency and XvcStep.

These are the constituents of a pipeline. A pipeline is composed of XvcStep that defines a command, and each step can have multiple XvcDependency and XvcOutput definitions. For each of these structs, I've added states that show their state.

For example, an XvcOutput can be Missing or Found, Old or Ok. An XvcStep that produces this XvcOutput doesn't check the dependency content hash if an output is missing. However, if XvcOutput is Found, the XvcStepStateMachine checks the XvcDependency states and their modification times to see if they are changed after the output. An XvcStep is invalidated when XvcOutput is Missing or XvcDependency has changed after the last command run.

Unlike DVC, I added the ability for XvcStep to depend on other XvcSteps. They communicate through outputs. I've added XvcDependency::Step(XvcStep) to XvcDependency definition. I'm also planning XvcDependency::Pipeline(XvcPipeline) to make steps to depend on other pipelines, so that pipelines can be run in order.

Currently, the following are included as XvcDependency:

Additionally, I'm planning to add the following items to XVC as dependencies:

Each of these dependencies are checked minimally, that is when their size on disk is detected to be changed, they are considered changed without checking the content hash. It needs a very detailed state machine to track the changes without bugs.

I've noticed that, if I can write such a state machine, most of the IO operations can be done in parallel. If two steps are not depending to each other in the dependency graph, they can be run in parallel. State machine's granularity allows this.