Developing a gitignore crate

2022-06-09

I needed a file system ignore library for a utility I write. This is similar to Git's .gitignore, but the files containing the rules can have different names, and ignore rules may be defined programmatically.

I was using burntsushi's ignore crate, which in turn uses globset by the same author. In a directory hierarchy, I was first collecting all .ignore files, then trying to decide which files are ignored. Rg Gitignore's from parameter led me to believe this.

However, it looks, it has a bug that doesn't consider the source directory of ignore file. Git has different semantics when a/.gitignore and a/b/.gitignore has a line mydir/. In the second case, Git doesn't ignore a/mydir/ but Ripgrep seems to consider all elements from all files as if they are from the root directory.

In practice, most people may be using ignore files at their root directory but there are valid use cases (like moving around directories with ignore patterns in it) that is affected by such behavior.

I decided to tackle with this. Ripgrep's globset seems nice and fast, so better to base the solution on top of that. My basic use case is walking a directory hierarchy. There are solutions to this like walkdir or jwalk but they use ripgrep's gitignore, hence they have the same bug.

I usually begin to set up data structures to understand the problem. This allows me to understand the domain formally.

First, what are we trying to accomplish? Basically, we want to decide if a path should be ignored (a) in the results and (b) in the directory traversal. Gitignore specification not only allows ignores, but also has a way to add paths to a whitelist to unignore. Hence deciding whether to ignore a path is like:

let include = |path| {
    if whitelist_globs.match(path) {
       true 
    } else if ignore_globs.match(path) {
       false 
    } else { 
       true }
}

We also need to decide whether to traverse to a directory:

let traverse = |path| {
    if path.is_dir() && !include(path) { false } else { true }
}

We have the following possible MatchResults between the ignore rules and a path:

pub enum MatchResult {
    NoMatch,
    Ignore,
    Whitelist,
}

We need a comparison function to get a MatchResult. One side of the comparison is path, and the other is a set of GlobSets from globset crate. We'll have two globsets, one for ignores and the other for whitelists.

pub struct IgnoreRules {
    root: PathBuf,

    patterns: Arc<Vec<Pattern<Glob>>>,

    whitelist_set: Arc<GlobSet>,
    ignore_set: Arc<GlobSet>,
}

The data types for the sets are Arc<GlobSet>, as I plan to make the traversal concurrent for each directory. However, as only one thread will be modifying the rules (one .gitignore or .ignore file per directory), I felt no need to wrap the sets with Mutex.

We also keep track of each pattern separately. patterns is a Pattern vector that contain all globs individually. One reason is to rebuild the GlobSet as we traverse directories: We cannot add any more rules to the sets once we have built, we are keeping the patterns separately to add them to sets of the child directories.

Another reason is to keep track of the source, as this can be used for debugging or reporting, similar to git check-ignore.

root is the point where this traversal began.

Patterns are defined in a generic way to instantiate with globs, strings or regexes or any other type.

pub struct Pattern<T> {
    pattern: T,
    original: String,
    source: Source,
    effect: PatternEffect,
    relativity: PatternRelativity,
    path_kind: PathKind,
}

The rationale for this generic approach is that the patterns need some kind of transformation. They can be read from files as strings, then compiled into globs or regexes.

flowchart LR 
     String-->Ok(Glob)-->Glob-->Regex

source field shows where this pattern is retrieved. It's an enum with two possible values for the time being:

enum Source {
    File { path: PathBuf, line: usize },
    Global,
}

Gitignore specification says:

An optional prefix "!" which negates the pattern; any matching file excluded by a previous pattern will become included again.

We use effect field is to tag the pattern either as Ignore or a Whitelist pattern.

pub enum PatternEffect {
    Ignore,
    Whitelist,
}

The spec also makes a distinction between files and directories:

If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.

So we use path_kind to reflect this distinction:

pub enum PathKind {
    Any,
    Directory,
}

Some patterns can be relative to the directory they are found:

If there is a separator at the beginning or middle (or both) of the pattern, then the pattern is relative to the directory level of the particular .gitignore file itself. Otherwise the pattern may also match at any level below the .gitignore level.

and we show this with relativity field:

pub enum PatternRelativity {
    Anywhere,
    RelativeTo { directory: String },
}

Note that we are converting from the specification to Rust types. In the future, these enums can be extended to cover the cases where these patterns are used in other places, or we can merge some of them to reduce complexity. Currently, this level of abstraction seems adequate.

To convert Pattern<String> to Pattern<Glob> (or Pattern<Result<Glob, Error>>), or any other type, we can use a map function similar to Option::map

    fn map<U, F>(self, f: F) -> Pattern<U>
    where
        F: FnOnce(T) -> U,
    {
        let pat = Pattern::<U> {
            pattern: f(self.pattern),
            original: self.original,
            source: self.source,
            effect: self.effect,
            relativity: self.relativity,
            path_kind: self.path_kind,
        };
        pat
    }

map will make easy to convert between various types of patterns.

Now, we change the perspective to a higher level. Let's begin by writing a file tree walker.

pub fn walk_serial(
    given: IgnoreRules,
    dir: &Path,
    walk_options: &WalkOptions,
    sender: &Sender<WalkerResult<PathMetadata>>,
) -> WalkerResult<()> { todo!() }

This function uses the given IgnoreRules and returns files and directories via the channel. PathMetadata is just a bundle of PathBuf and its Metadata.

pub struct PathMetadata {
    path: PathBuf,
    metadata: Metadata,
}

WalkOptions will contain various options about this traversal, now it only has ignore_filename and include_dirs that sets whether only files are sent through the channel, or dirs should be included too.

pub struct WalkOptions<'a> {
    ignore_filename: Option<&'a str>,
    include_dirs: bool,
}

Initially, IgnoreRules contain only the files that should be ignored (or whitelisted) globally. When a directory contains a .gitignore file (or any file name that was set in WalkOptions), it checks new ignore rules and sends all files through the channel. If a directory is not ignored, it then calls walk_serial recursively with it.

The final lines of the function are as follows:

    for child_dir in child_dirs {
        walk_serial(dir_with_ignores.clone(), &child_dir, walk_options, sender)?;
    }

As it calls itself as a final operation after collecting and sending all child files, this is an example of tail recursion.

The output is sent through a channel because our final goal is to make all this concurrently. Each new directory will create a new thread, and all files will be collected from the channel. At this point, we are creating a serial version of the walker with the identical data structures, to test it easily.

One gotcha using the channels in serial version is that the channel size must be enough to contain all file names. Effectively this means it will be crossbeam_channel::unbounded for walk_serial. We will (ab)use channel's Sender<PathMetadata> as a Vec<PathMetadata> in no parallel mode. In parallel settings using unbounded channels are not recommended, if the consumers of channel don't work as quickly you may fill up the memory, especially for longer running processes.

walk_serial will basically do four things:

Read the directory

The first is done using read_dir of Path. It returns a set of elements. Then the list is processed and elements' metadata is retrieved. This requires a bit of error handling: read_dir may return Error, each element in the result can be Error and getting metadata may also result in Error.

    let elements = dir
        .read_dir()
        .map_err(|e| anyhow!("Error reading directory: {:?}, {:?}", dir, e))?;
    let mut child_dirs = Vec::<PathBuf>::new();
    let mut child_paths = Vec::<PathMetadata>::new();

    for entry in elements {
        match entry {
            Err(err) => sender.send(Err(WalkerError::from(anyhow!(
                "Error reading entry in dir {:?} {:?}",
                dir,
                err
            ))))?,
            Ok(entry) => match entry.metadata() {
                Err(err) => sender.send(Err(WalkerError::from(anyhow!(
                    "Error getting metadata {:?} {}",
                    entry,
                    err
                ))))?,
                Ok(md) => {
                    child_paths.push(PathMetadata {
                        path: entry.path(),
                        metadata: md.clone(),
                    });
                    if md.is_dir() {
                        child_dirs.push(entry.path());
                    }
                }
            },
        }
    }

Note that, we collect PathMetadata objects in child_paths vector. If there happens an error while retrieving, it's sent to the caller to ignore or to report to the user.

I usually handle errors first in match expressions, because they are usually more straightforward.

Check if there are new ignore rules

After reading all the files in the directory, it checks there actually is an ignore filename in given and that a file exists with that name in the list we just created. If there is no such file, we can use given rules to decide the ignored elements.

If there is a file that may change the ignore rules, it reads it and updates patterns. Then, if there are new ignore or whitelist rules, the respective globsets are recreated. It checks whether we have new rules because compiling globs is possibly an expensive operation.

Filtering directory elements

Filtering files and directories from the list checks two globsets. If an element matches a whitelist rule, it's included in the results without checking it also matches to ignore rules. Hence whitelisting has a priority. Otherwise it checks the ignore rules and if they don't match, the element is included in the results.

If there are directories matching the ignore rules, they are not traversed and any possible whitelisting rule that was specified within that directory are not taken into consideration.

Listing child directories

We saw how walk_serial calls itself for child directories above. After sending all results to the channel, it calls itself for the child directories and repeats.

Converting patterns from String to Glob

The interesting part about ignoring files is that we should modify patterns into globs to add to Globset. We cannot add the patterns directly to Globset because some of the rules make them context dependent. We parsed the patterns to get PatternRelativity and PathKind, and these also affect how we add the patterns to Globset.

The heart of the conversion is a function that receives a String, filters it and transforms it to a list of Pattern<Glob>s.

fn content_to_patterns(
    ignore_root: &Path,
    source: Option<&Path>,
    contents: &str,
) -> Vec<Pattern<WalkerResult<Glob>>> {

As we also need to report the line number for patterns, we use enumerate on the iterator:

    let patterns: Vec<Pattern<WalkerResult<Glob>>> = content
        .lines()
        .enumerate()
        // A line starting with # serves as a comment. Put a backslash ("\") in front of the first hash for patterns that begin with a hash.
        .filter(|(_, line)| !(line.trim().is_empty() || line.starts_with("#")))

The filter line is used to check blank and comment lines. Then we trim the trailing space unless the pattern ends with \

        .map(|(i, line)| {
            if !line.ends_with("\\ ") {
                (i, line.trim_end())
            } else {
                (i, line)
            }
        })

The function receives the source file name as a parameter. We create the Source using it and the line nummber:

        .map(|(i, line)| {
            (
                line,
                match source {
                    Some(p) => Source::File {
                        path: p
                            .strip_prefix(ignore_root)
                            .expect("path must be within ignore_root")
                            .to_path_buf(),
                        line: (i + 1).into(),
                    },
                    None => Source::Global,
                },
            )
        })

Then we build the pattern object, update the pattern to add before globset and build the globs. Building the glob may cause errors if it's not well-formed. They are handled as well.

        .map(|(line, source)| build_pattern(source, line))
        .map(transform_pattern_for_glob)
        .map(|pc| pc.map(|s| Glob::new(&s).map_err(WalkerError::from)))

Two functions left that may need attention. One is build_pattern that converts a pattern line to a Pattern<String> object. The other is transform_pattern_for_glob function that modifies Pattern<String> before building the glob.

After writing walk_serial, we'll use all the machinery to write a parallel version of it.

Parsing pattern strings for pattern objects

We create pattern objects from pattern strings following the spec. There are three basic rules that we will follow:

We'll also check whether the line starts with / and remove that to prevent double-slashes // in patterns. The resulting pattern in the object won't contain any of the artifacts that changes its semantics and we'll add them when building the globsets.

Let's start by checking whether the line starts with !.

    let begin_exclamation = original.starts_with("!");
    let line = if begin_exclamation || original.starts_with("\\!") {
        original[1..].to_owned()
    } else {
        original.to_owned()
    };

Checking the presence of / at the beginning, at the end, and in intermediate positions is also straightforward:

    let end_slash = line.ends_with("/");
    let line = if end_slash {
        &line[..line.len() - 1]
    } else {
        line
    };

    let begin_slash = line.starts_with("/");
    let non_final_slash = if line.len() > 0 {
        line[..line.len() - 1].chars().find(|c| *c == '/').is_some()
    } else {
        false
    };

We check non-final slash separately from the beginning slash, because we'll remove the initial slash if it exists.

    let line = if begin_slash { &line[1..] } else { line };

Although there is almost 1-1 correspondence between the boolean conditions and the enums, I create enums in separate statements for self-documenting code.

    let effect = if begin_exclamation {
        PatternEffect::Whitelist
    } else {
        PatternEffect::Ignore
    };

    let path_kind = if end_slash {
        PathKind::Directory
    } else {
        PathKind::Any
    };

    let relativity = if non_final_slash {
        PatternRelativity::RelativeTo {
            directory: current_dir.to_owned(),
        }
    } else {
        PatternRelativity::Anywhere
    };

After building these attributes of the pattern, and strip the pattern from initial and final /, and other semantic-changing elements, we build a pattern:


    let pattern = Pattern::<String> {
        pattern: line.to_owned(),
        original: original.to_owned(),
        source,
        effect,
        relativity,
        path_kind,
    };

Transforming pattern string to a suitable glob

Now, another transformation is needed to convert Pattern<String> to Pattern<Glob>. In this case, we'll need to recreate the pattern based on the rules.

The transformation considers relativity and path_kind to generate a glob that can be added to a GlobSet. Note that the function is returns not a compiled Pattern<Glob>, but a Pattern<String>, because after this transformation building the glob requires Glob::new.

fn transform_pattern_for_glob(pattern: Pattern<String>) -> Pattern<String> {
    let anything_anywhere = |p| format!("**/{p}");
    let anything_relative = |p, directory| format!("{directory}/**/{p}");
    let directory_anywhere = |p| format!("**{p}/");
    let directory_relative = |p, directory| format!("{directory}/**/{p}/");

    let transformed_pattern = match (&pattern.path_kind, &pattern.relativity) {
        (PathKind::Any, PatternRelativity::Anywhere) => anything_anywhere(pattern.pattern),
        (PathKind::Any, PatternRelativity::RelativeTo { directory }) => {
            anything_relative(pattern.pattern, directory)
        }
        (PathKind::Directory, PatternRelativity::Anywhere) => directory_anywhere(pattern.pattern),
        (PathKind::Directory, PatternRelativity::RelativeTo { directory }) => {
            directory_relative(pattern.pattern, directory)
        }
    };

    Pattern {
        pattern: transformed_pattern,
        ..pattern
    }
}

I could merge the closures to match and lower the number of lines, but adding the closures increased self-documenting. Instead of multiple if/else branches, I tend to use match with tuples to have a clearer picture. The corresponding if/else code without closures would be like:


let transformed_pattern = if pattern.path_kind == PathKind::Any {
    if pattern.relativity == PatternRelativity::Anywhere {
        format!("**/{p}")
        } else if ...
        }
...

and in my opinion this latter format is more error prone and less readable.

Combining it all together

At this point we can revisit content_to_patterns again:

fn content_to_patterns(
    ignore_root: &Path,
    source: Option<&Path>,
    content: &str,
) -> Vec<Pattern<Result<Glob>>> {
    watch!(source);
    let patterns: Vec<Pattern<Result<Glob>>> = content
        .lines()
        .enumerate()
        ...
        .map(|(line, source)| build_pattern(source, line))
        .map(transform_pattern_for_glob)
        .map(|pc| pc.map(|s| 
            Glob::new(&s).map_err(Error::from)))
        .collect();

    patterns
}

The last step in the transformation is creating the Glob object as in the highlighted lines above. It runs Pattern<String>.map with Glob::new to get a Pattern<Result<Glob>>. Error and Result are custom types defined within the crate using thiserror crate.

After we have a vector of Pattern<Result<Glob>>, we report the errors in clear_glob_errors to get Vec<Pattern<Glob>> from this vector. The function is defined as:

fn clear_glob_errors(
    sender: &Sender<Result<PathMetadata>>,
    new_patterns: Vec<Pattern<Result<Glob>>>,
) -> Vec<Pattern<Glob>> {
    let new_glob_patterns: Vec<Pattern<Glob>> = new_patterns
        .into_iter()
        .filter_map(|p| match p.transpose() {
            Ok(p) => Some(p),
            Err(e) => {
                sender
                    .send(Err(Error::from(anyhow!("Error in glob pattern: {:?}", e))))
                    .expect("Error in channel");
                None
            }
        })
        .collect();
    new_glob_patterns
}

sender is the same channel that we use to send the results. This communication channel is used to report only the errors, not PathMetadata, hence when we meet a Err(e) in new_patterns, it's reported to the same channel that the results are expected. Handling the errors first seems like a common theme here as well.

This is also one of the reasons walk_serial receives a channel instead of a vector. Our usual use case is traversing all the directories in parallel, and without channel based communication in serial version, we would need to duplicate this function.

At this point I completed walk_serial functionality. I write tests. I'm using test-case crate for writing many tests encompassing different use cases. I'm omitting them here for brevity.

Writing a Parallel Walker

Converting walk_serial to a parallel one is quite trivial. As we already used channels in the serial version, we have all the components of a parallel walker. For the sake of simplicity, I won't use a thread pool and spawn a new thread for each task. One question is what's the unit task that we'll use to create tasks.

jwalk uses directories as points to spawn new tasks. That's a sensible approach and I'll use this as well. This means a single directory will be processed by a single thread, even if it contains one million files, and 10 directories will be processed by 10 threads even if each contain one file. In practice, I think we can assume directories are natural chunks for this problem.

The interface for parallel walker is identical with the serial version. This is to replace one with the other when needs arise.1

1

Update: I decided to change the walk_serial interface to use Vec instead of Sender for simplicity.

The only difference is in the final loop. walk_serial calls itself recursively at the end, walk_parallel calls itself in separate threads.


pub fn walk_parallel(
    given: IgnoreRules,
    dir: &Path,
    walk_options: WalkOptions,
    sender: Sender<Result<PathMetadata>>,
) -> Result<()> {
    ....
    crossbeam::scope(|s| {
        for child_dir in child_dirs {
            let dwi = dir_with_ignores.clone();
            let walk_options = walk_options.clone();
            let sender = sender.clone();
            s.spawn(move |_| walk_parallel(dwi, &child_dir.path, walk_options, sender));
        }
    })
    .expect("Error in crossbeam scope in walk_parallel");

    Ok(())
}


Code

The code is soon to be released, and I'll add a link to it. You can check my Github profile if I forget.