Developing a gitignore crate
I needed a file system ignore library for a utility I'm writing. It is similar to Git's `.gitignore`, but the files containing the rules can have different names, and ignore rules may also be defined programmatically.
I was using burntsushi's ignore crate, which in turn uses globset by the same author. In a directory hierarchy, I was first collecting all `.ignore` files, then trying to decide which files are ignored. Ripgrep's Gitignore `from` parameter led me to believe this would work.
However, it looks like it has a bug: it doesn't consider the source directory of an ignore file. Git has different semantics when `a/.gitignore` and `a/b/.gitignore` each have a line `mydir/`. In the second case, Git doesn't ignore `a/mydir/`, but ripgrep seems to treat the patterns from all files as if they came from the root directory.
In practice, most people may be using ignore files only at their root directory, but there are valid use cases (like moving around directories that contain ignore patterns) that are affected by this behavior.
I decided to tackle this. Ripgrep's globset seems nice and fast, so it's better to base the solution on top of it. My basic use case is walking a directory hierarchy. There are solutions for this like walkdir or jwalk, but they use ripgrep's gitignore handling, hence they have the same bug.
I usually begin by setting up data structures to understand the problem. This allows me to understand the domain formally.
First, what are we trying to accomplish? Basically, we want to decide whether a path should be ignored (a) in the results and (b) in the directory traversal. The gitignore specification not only allows ignoring paths, it also has a way to whitelist paths to unignore them. Hence, deciding whether to include a path looks like:
let include = |path| {
    if whitelist_globs.is_match(path) {
        true
    } else if ignore_globs.is_match(path) {
        false
    } else {
        true
    }
};
We also need to decide whether to traverse to a directory:
let traverse = |path| {
    if path.is_dir() && !include(path) {
        false
    } else {
        true
    }
};
We have the following possible `MatchResult`s between the ignore rules and a path:
pub enum MatchResult {
NoMatch,
Ignore,
Whitelist,
}
We need a comparison function to get a `MatchResult`. One side of the comparison is the `path`, and the other is a set of `GlobSet`s from the globset crate. We'll have two globsets, one for ignores and the other for whitelists.
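A minimal sketch of such a function might look like this, assuming the two compiled sets are at hand (the name `check_path` is illustrative):

```rust
use globset::GlobSet;
use std::path::Path;

// Illustrative sketch: a whitelist match takes priority over an ignore match.
fn check_path(whitelist_set: &GlobSet, ignore_set: &GlobSet, path: &Path) -> MatchResult {
    if whitelist_set.is_match(path) {
        MatchResult::Whitelist
    } else if ignore_set.is_match(path) {
        MatchResult::Ignore
    } else {
        MatchResult::NoMatch
    }
}
```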
pub struct IgnoreRules {
root: PathBuf,
patterns: Arc<Vec<Pattern<Glob>>>,
whitelist_set: Arc<GlobSet>,
ignore_set: Arc<GlobSet>,
}
The data type for the sets is `Arc<GlobSet>`, as I plan to make the traversal concurrent for each directory. However, as only one thread will be modifying the rules (one `.gitignore` or `.ignore` file per directory), I felt no need to wrap the sets in a `Mutex`.
We also keep track of each pattern separately: `patterns` is a `Pattern` vector that contains all globs individually. One reason is to rebuild the `GlobSet` as we traverse directories: we cannot add more rules to a set once it's built, so we keep the patterns separately to add them to the sets of child directories. Another reason is to keep track of each pattern's source, which can be used for debugging or reporting, similar to `git check-ignore`.
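A sketch of how a set could be rebuilt from the individual patterns (assuming `PatternEffect` derives `PartialEq`; the helper name is illustrative):

```rust
use globset::{Glob, GlobSet, GlobSetBuilder};

// Sketch: a GlobSet cannot be extended once built, so we rebuild it
// from the individual patterns whenever new rules appear.
fn build_globset(patterns: &[Pattern<Glob>], effect: PatternEffect) -> Result<GlobSet, globset::Error> {
    let mut builder = GlobSetBuilder::new();
    for p in patterns.iter().filter(|p| p.effect == effect) {
        builder.add(p.pattern.clone());
    }
    builder.build()
}
```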
`root` is the point where this traversal began.
Patterns are defined in a generic way so they can be instantiated with globs, strings, regexes, or any other type.
pub struct Pattern<T> {
pattern: T,
original: String,
source: Source,
effect: PatternEffect,
relativity: PatternRelativity,
path_kind: PathKind,
}
The rationale for this generic approach is that the patterns need several kinds of transformation: they can be read from files as strings, then compiled into globs or regexes.
flowchart LR
String-->Ok(Glob)-->Glob-->Regex
The `source` field shows where this pattern was retrieved from. It's an enum with two possible values for the time being:
enum Source {
File { path: PathBuf, line: usize },
Global,
}
The gitignore specification says:
An optional prefix “!” which negates the pattern; any matching file excluded by a previous pattern will become included again.
We use the `effect` field to tag the pattern as either an `Ignore` or a `Whitelist` pattern.
pub enum PatternEffect {
Ignore,
Whitelist,
}
The spec also makes a distinction between files and directories:
If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.
So we use `path_kind` to reflect this distinction:
pub enum PathKind {
Any,
Directory,
}
Some patterns are relative to the directory they are found in:
If there is a separator at the beginning or middle (or both) of the pattern, then the pattern is relative to the directory level of the particular .gitignore file itself. Otherwise the pattern may also match at any level below the .gitignore level.
and we show this with the `relativity` field:
pub enum PatternRelativity {
Anywhere,
RelativeTo { directory: String },
}
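To make these concrete, here is how a few sample lines would be classified (illustrative; assume the ignore file lives in `a/b`):

```rust
// Illustrative classifications for patterns read from a/b/.gitignore:
// "target/"    -> Ignore,    PathKind::Directory, PatternRelativity::Anywhere
// "/target"    -> Ignore,    PathKind::Any,       RelativeTo { directory: "a/b" }
// "docs/*.md"  -> Ignore,    PathKind::Any,       RelativeTo { directory: "a/b" }
// "!notes.txt" -> Whitelist, PathKind::Any,       PatternRelativity::Anywhere
```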
Note that we are converting from the specification to Rust types. In the future, these enums can be extended to cover the cases where these patterns are used in other places, or we can merge some of them to reduce complexity. Currently, this level of abstraction seems adequate.
To convert `Pattern<String>` to `Pattern<Glob>` (or `Pattern<Result<Glob, Error>>`), or to any other type, we can use a `map` function similar to `Option::map`:
impl<T> Pattern<T> {
    fn map<U, F>(self, f: F) -> Pattern<U>
    where
        F: FnOnce(T) -> U,
    {
        Pattern::<U> {
            pattern: f(self.pattern),
            original: self.original,
            source: self.source,
            effect: self.effect,
            relativity: self.relativity,
            path_kind: self.path_kind,
        }
    }
}
`map` makes it easy to convert between various types of patterns.
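For example, compiling a `Pattern<String>` into a `Pattern<Result<Glob, globset::Error>>` becomes a one-liner (a hypothetical snippet, assuming `pattern` is a `Pattern<String>`):

```rust
use globset::Glob;

// Hypothetical usage: compile the string pattern into a glob,
// keeping all the metadata fields intact.
let glob_pattern = pattern.map(|s| Glob::new(&s));
```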
Now let's change perspective to a higher level and begin by writing a file tree walker.
pub fn walk_serial(
given: IgnoreRules,
dir: &Path,
walk_options: &WalkOptions,
sender: &Sender<WalkerResult<PathMetadata>>,
) -> WalkerResult<()> { todo!() }
This function uses the given `IgnoreRules` and returns files and directories via the channel. `PathMetadata` is just a bundle of a `PathBuf` and its `Metadata`.
pub struct PathMetadata {
path: PathBuf,
metadata: Metadata,
}
`WalkOptions` will contain various options about the traversal. For now it only has `ignore_filename` and `include_dirs`, which controls whether only files are sent through the channel or directories are included too.
pub struct WalkOptions<'a> {
ignore_filename: Option<&'a str>,
include_dirs: bool,
}
Initially, `IgnoreRules` contains only the patterns that should be ignored (or whitelisted) globally. When a directory contains a `.gitignore` file (or whatever file name was set in `WalkOptions`), the walker checks it for new ignore rules and sends all files through the channel. If a directory is not ignored, it then calls `walk_serial` recursively on it.
The final lines of the function are as follows:
for child_dir in child_dirs {
walk_serial(dir_with_ignores.clone(), &child_dir, walk_options, sender)?;
}
As it calls itself as a final operation after collecting and sending all child files, this is an example of tail recursion.
The output is sent through a channel because our final goal is to make all of this concurrent. Each new directory will create a new thread, and all files will be collected from the channel. At this point, we are creating a serial version of the walker with identical data structures, to test it easily.
One gotcha with using channels in the serial version is that the channel must be large enough to contain all file names. Effectively, this means a `crossbeam_channel::unbounded` channel for `walk_serial`. We will (ab)use the channel's `Sender<PathMetadata>` as a `Vec<PathMetadata>` in non-parallel mode. In parallel settings, using unbounded channels is not recommended: if the consumers of the channel don't keep up, you may fill up the memory, especially in longer-running processes.
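Putting this together, a caller might drive the serial walker like this (a sketch; `IgnoreRules::empty` is a hypothetical constructor for a rule set with no patterns, and the option values are illustrative):

```rust
use crossbeam_channel::unbounded;
use std::path::PathBuf;

// Sketch of driving the serial walker.
let root = PathBuf::from(".");
let (sender, receiver) = unbounded();
let options = WalkOptions { ignore_filename: Some(".gitignore"), include_dirs: false };
walk_serial(IgnoreRules::empty(&root), &root, &options, &sender)?;
drop(sender); // close the channel so iteration over the receiver terminates
let paths: Vec<_> = receiver.iter().collect(); // the channel doubles as a Vec here
```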
`walk_serial` basically does four things (sketched in the skeleton below):

- List all the elements in the given directory
- Check if there is a new ignore file and update the rules if necessary
- Filter the elements by the ignore rules and send them to the channel
- Run `walk_serial` recursively for child directories if they are not ignored
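A skeleton of the function following these four steps might look like this (the helpers `read_dir_entries`, `update_rules`, and `filter_and_send` are illustrative stand-ins for the steps detailed in the following sections):

```rust
// Skeleton of walk_serial; the helpers are illustrative stand-ins.
pub fn walk_serial(
    given: IgnoreRules,
    dir: &Path,
    walk_options: &WalkOptions,
    sender: &Sender<WalkerResult<PathMetadata>>,
) -> WalkerResult<()> {
    // 1. List all the elements in the given directory.
    let (child_paths, child_dirs) = read_dir_entries(dir, sender)?;
    // 2. Update the rules if the directory contains a new ignore file.
    let dir_with_ignores = update_rules(given, dir, walk_options)?;
    // 3. Filter the elements, send them to the channel, keep non-ignored dirs.
    let child_dirs = filter_and_send(&dir_with_ignores, child_paths, child_dirs, walk_options, sender)?;
    // 4. Recurse into the remaining child directories.
    for child_dir in child_dirs {
        walk_serial(dir_with_ignores.clone(), &child_dir, walk_options, sender)?;
    }
    Ok(())
}
```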
Reading the directory
The first step is done using `Path`'s `read_dir`, which returns an iterator over the directory entries. The list is then processed and each element's metadata is retrieved. This requires a bit of error handling: `read_dir` may return an `Error`, each entry in the result can be an `Error`, and getting the `metadata` may also result in an `Error`.
let elements = dir
.read_dir()
.map_err(|e| anyhow!("Error reading directory: {:?}, {:?}", dir, e))?;
let mut child_dirs = Vec::<PathBuf>::new();
let mut child_paths = Vec::<PathMetadata>::new();
for entry in elements {
match entry {
Err(err) => sender.send(Err(WalkerError::from(anyhow!(
"Error reading entry in dir {:?} {:?}",
dir,
err
))))?,
Ok(entry) => match entry.metadata() {
Err(err) => sender.send(Err(WalkerError::from(anyhow!(
"Error getting metadata {:?} {}",
entry,
err
))))?,
Ok(md) => {
child_paths.push(PathMetadata {
path: entry.path(),
metadata: md.clone(),
});
if md.is_dir() {
child_dirs.push(entry.path());
}
}
},
}
}
Note that we collect `PathMetadata` objects in the `child_paths` vector. If an error happens while retrieving an entry, it's sent to the caller, which can ignore it or report it to the user. I usually handle errors first in `match` expressions, because they are usually more straightforward.
Checking for new ignore rules
After reading all the files in the directory, the walker checks that an ignore filename was actually given and that a file with that name exists in the list we just created. If there is no such file, we can use the `given` rules to decide which elements are ignored. If there is a file that may change the ignore rules, the walker reads it and updates `patterns`. Then, if there are new ignore or whitelist rules, the respective globsets are recreated. It checks whether there are new rules first, because compiling globs is a potentially expensive operation.
Filtering directory elements
Filtering files and directories from the list checks the two globsets. If an element matches a whitelist rule, it's included in the results without checking whether it also matches an ignore rule; hence whitelisting has priority. Otherwise, the ignore rules are checked, and if they don't match, the element is included in the results. If a directory matches the ignore rules, it is not traversed, and any whitelisting rules specified within that directory are not taken into consideration.
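Using the `check_path` sketch from earlier, the filter step could look like this (illustrative; `rules` stands for the possibly updated `IgnoreRules`):

```rust
// Illustrative filter: whitelist has priority, then ignore, then include by default.
for child in child_paths {
    match check_path(&rules.whitelist_set, &rules.ignore_set, &child.path) {
        MatchResult::Ignore => {} // skipped; ignored directories are not traversed either
        MatchResult::Whitelist | MatchResult::NoMatch => {
            if walk_options.include_dirs || !child.metadata.is_dir() {
                sender.send(Ok(child))?;
            }
        }
    }
}
```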
Listing child directories
We saw above how `walk_serial` calls itself for child directories. After sending all results to the channel, it calls itself for each child directory and repeats.
Converting patterns from String to Glob
The interesting part about ignoring files is that we must modify the patterns before adding them to the globset, because some of the rules make them context dependent. We parsed the patterns to get `PatternRelativity` and `PathKind`, and these also affect how we add the patterns to the globset.
The heart of the conversion is a function that receives a `String`, filters it, and transforms it into a list of `Pattern<Glob>`s.
fn content_to_patterns(
    ignore_root: &Path,
    source: Option<&Path>,
    content: &str,
) -> Vec<Pattern<WalkerResult<Glob>>> {
As we also need to report the line number for each pattern, we use `enumerate` on the iterator:
let patterns: Vec<Pattern<WalkerResult<Glob>>> = content
.lines()
.enumerate()
// A line starting with # serves as a comment. Put a backslash ("\") in front of the first hash for patterns that begin with a hash.
.filter(|(_, line)| !(line.trim().is_empty() || line.starts_with("#")))
The `filter` line skips blank and comment lines. Then we trim trailing whitespace unless the pattern ends with `\`:
.map(|(i, line)| {
if !line.ends_with("\\ ") {
(i, line.trim_end())
} else {
(i, line)
}
})
The function receives the source file name as a parameter. We create the `Source` using it and the line number:
.map(|(i, line)| {
(
line,
match source {
Some(p) => Source::File {
path: p
.strip_prefix(ignore_root)
.expect("path must be within ignore_root")
.to_path_buf(),
line: (i + 1).into(),
},
None => Source::Global,
},
)
})
Then we build the pattern object, transform the pattern before adding it to the globset, and build the globs. Building a glob may cause errors if the pattern is not well-formed; these are handled as well.
.map(|(line, source)| build_pattern(source, line))
.map(transform_pattern_for_glob)
.map(|pc| pc.map(|s| Glob::new(&s).map_err(WalkerError::from)))
Two functions remain that need attention. One is `build_pattern`, which converts a pattern line to a `Pattern<String>` object. The other is `transform_pattern_for_glob`, which modifies the `Pattern<String>` before building the glob.
After writing `walk_serial`, we'll use all this machinery to write a parallel version of it.
Parsing pattern strings for pattern objects
We create pattern objects from pattern strings following the spec. There are three basic rules that we will follow:

- If the pattern starts with `!`, it's a whitelist rule.
- If the pattern ends with `/`, it matches only directories.
- If the pattern contains a non-final `/`, it's relative to the directory.
We'll also check whether the line starts with `/` and remove it to prevent double slashes (`//`) in patterns. The resulting pattern in the object won't contain any of the artifacts that change its semantics; we'll add them back when building the globsets.
Let's start by checking whether the line starts with `!`:
let begin_exclamation = original.starts_with("!");
let line = if begin_exclamation || original.starts_with("\\!") {
original[1..].to_owned()
} else {
original.to_owned()
};
Checking the presence of /
at the beginning, at the end, and in intermediate positions is also straightforward:
let end_slash = line.ends_with('/');
let line = if end_slash {
    &line[..line.len() - 1]
} else {
    &line[..]
};

let begin_slash = line.starts_with('/');
let non_final_slash = if !line.is_empty() {
    line[..line.len() - 1].contains('/')
} else {
    false
};
We check non-final slash separately from the beginning slash, because we’ll remove the initial slash if it exists.
let line = if begin_slash { &line[1..] } else { line };
Although there is an almost 1-to-1 correspondence between the boolean conditions and the enums, I create the enums in separate statements for self-documenting code.
let effect = if begin_exclamation {
PatternEffect::Whitelist
} else {
PatternEffect::Ignore
};
let path_kind = if end_slash {
PathKind::Directory
} else {
PathKind::Any
};
let relativity = if non_final_slash {
PatternRelativity::RelativeTo {
directory: current_dir.to_owned(),
}
} else {
PatternRelativity::Anywhere
};
After building these attributes and stripping the pattern of the initial and final `/` and other semantic-changing elements, we build the pattern:
let pattern = Pattern::<String> {
pattern: line.to_owned(),
original: original.to_owned(),
source,
effect,
relativity,
path_kind,
};
Transforming the pattern string into a suitable glob
Now another transformation is needed to convert `Pattern<String>` to `Pattern<Glob>`. In this case, we'll need to recreate the pattern string based on the rules. The transformation considers `relativity` and `path_kind` to generate a glob that can be added to a `GlobSet`. Note that the function returns not a compiled `Pattern<Glob>` but another `Pattern<String>`, because after this transformation, building the glob still requires `Glob::new`.
fn transform_pattern_for_glob(pattern: Pattern<String>) -> Pattern<String> {
let anything_anywhere = |p| format!("**/{p}");
let anything_relative = |p, directory| format!("{directory}/**/{p}");
let directory_anywhere = |p| format!("**/{p}/");
let directory_relative = |p, directory| format!("{directory}/**/{p}/");
let transformed_pattern = match (&pattern.path_kind, &pattern.relativity) {
(PathKind::Any, PatternRelativity::Anywhere) => anything_anywhere(pattern.pattern),
(PathKind::Any, PatternRelativity::RelativeTo { directory }) => {
anything_relative(pattern.pattern, directory)
}
(PathKind::Directory, PatternRelativity::Anywhere) => directory_anywhere(pattern.pattern),
(PathKind::Directory, PatternRelativity::RelativeTo { directory }) => {
directory_relative(pattern.pattern, directory)
}
};
Pattern {
pattern: transformed_pattern,
..pattern
}
}
I could merge the closures into the `match` and lower the number of lines, but adding the closures increases self-documentation. Instead of multiple `if/else` branches, I tend to use `match` with tuples to get a clearer picture. The corresponding `if/else` code without closures would look like:
let transformed_pattern = if pattern.path_kind == PathKind::Any {
if pattern.relativity == PatternRelativity::Anywhere {
format!("**/{p}")
} else if ...
}
...
and in my opinion this latter format is more error prone and less readable.
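To make the transformation concrete, here are a few input/output pairs it would produce, per the closures above (illustrative):

```rust
// Illustrative outputs of transform_pattern_for_glob for patterns found in "a/b":
// ("target", Any,       Anywhere)           -> "**/target"
// ("target", Directory, Anywhere)           -> "**/target/"
// ("docs",   Any,       RelativeTo("a/b"))  -> "a/b/**/docs"
// ("build",  Directory, RelativeTo("a/b"))  -> "a/b/**/build/"
```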
Combining it all together
At this point, we can revisit `content_to_patterns`:
fn content_to_patterns(
    ignore_root: &Path,
    source: Option<&Path>,
    content: &str,
) -> Vec<Pattern<Result<Glob>>> {
    watch!(source);
    let patterns: Vec<Pattern<Result<Glob>>> = content
        .lines()
        .enumerate()
        ...
        .map(|(line, source)| build_pattern(source, line))
        .map(transform_pattern_for_glob)
        .map(|pc| pc.map(|s| Glob::new(&s).map_err(Error::from)))
        .collect();

    patterns
}
The last step in the transformation is creating the `Glob` object, in the final `map` above. It runs `Pattern<String>::map` with `Glob::new` to get a `Pattern<Result<Glob>>`. `Error` and `Result` are custom types defined within the crate using the thiserror crate.
After we have a vector of `Pattern<Result<Glob>>`, we report the errors in `clear_glob_errors` to get a `Vec<Pattern<Glob>>`. The function is defined as:
fn clear_glob_errors(
    sender: &Sender<Result<PathMetadata>>,
    new_patterns: Vec<Pattern<Result<Glob>>>,
) -> Vec<Pattern<Glob>> {
    let new_glob_patterns: Vec<Pattern<Glob>> = new_patterns
        .into_iter()
        .filter_map(|p| match p.transpose() {
            Ok(p) => Some(p),
            Err(e) => {
                sender
                    .send(Err(Error::from(anyhow!("Error in glob pattern: {:?}", e))))
                    .expect("Error in channel");
                None
            }
        })
        .collect();
    new_glob_patterns
}
`sender` is the same channel that we use to send the results. It is used here to report only the errors, not `PathMetadata`; when we encounter an `Err(e)` in `new_patterns`, it's reported to the same channel where the results are expected. Handling the errors first seems to be a common theme here as well. This is also one of the reasons `walk_serial` receives a channel instead of a vector: our usual use case is traversing all directories in parallel, and without channel-based communication in the serial version, we would need to duplicate this function.
At this point, the `walk_serial` functionality is complete and I write tests. I'm using the test-case crate to write many tests covering different use cases. I'm omitting them here for brevity.
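To give a flavor of them, such a test might look like this (a hypothetical sketch; `walk_collect` is an imaginary helper that writes `ignore_content` into a temporary tree, runs the walker, and returns the surviving paths):

```rust
use std::path::PathBuf;
use test_case::test_case;

// Hypothetical test sketch using the test-case crate.
#[test_case("mydir/", "a/mydir/file.txt" => false; "dir pattern ignores contents")]
#[test_case("*.txt\n!keep.txt", "keep.txt" => true; "whitelist wins over ignore")]
fn test_ignore(ignore_content: &str, path: &str) -> bool {
    walk_collect(ignore_content).contains(&PathBuf::from(path))
}
```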
Writing a Parallel Walker
Converting `walk_serial` to a parallel version is quite trivial. As we already used channels in the serial version, we have all the components of a parallel walker. For the sake of simplicity, I won't use a thread pool; I'll spawn a new thread for each task. One question is what unit of work we'll use to create tasks.
jwalk uses directories as the points to spawn new tasks. That's a sensible approach, and I'll use it as well. This means a single directory will be processed by a single thread even if it contains one million files, and 10 directories will be processed by 10 threads even if each contains one file. In practice, I think we can assume directories are natural chunks for this problem.
The interface of the parallel walker is identical to the serial version's, so one can replace the other when the need arises.
Update: I decided to change the `walk_serial` interface to use a `Vec` instead of a `Sender`, for simplicity.
The only difference is in the final loop: `walk_serial` calls itself recursively at the end, while `walk_parallel` calls itself in separate threads.
pub fn walk_parallel(
given: IgnoreRules,
dir: &Path,
walk_options: WalkOptions,
sender: Sender<Result<PathMetadata>>,
) -> Result<()> {
    // ...
crossbeam::scope(|s| {
for child_dir in child_dirs {
let dwi = dir_with_ignores.clone();
let walk_options = walk_options.clone();
let sender = sender.clone();
s.spawn(move |_| walk_parallel(dwi, &child_dir.path, walk_options, sender));
}
})
.expect("Error in crossbeam scope in walk_parallel");
Ok(())
}
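A typical caller runs the walker on a worker thread and drains the channel on the current one (a sketch; `rules`, `root`, and `options` are assumed to be built as in the serial example, and the channel bound is arbitrary):

```rust
use crossbeam_channel::bounded;
use std::thread;

// Sketch: run the parallel walker on one thread and consume results on another.
let (sender, receiver) = bounded(1000);
let worker = thread::spawn(move || walk_parallel(rules, &root, options, sender));
for result in receiver {
    match result {
        Ok(pm) => println!("{}", pm.path.display()),
        Err(e) => eprintln!("error: {}", e),
    }
}
worker.join().expect("walker thread panicked")?;
```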
Code
The code will be released soon, and I'll add a link to it here. You can check my GitHub profile if I forget.