Source: Day 19: Linen Layout
Full solution for today (spoilers!).
Part 1
Given a comma delimited list of substrings and a list of strings, count how many of the latter strings can be made up of any (repeating) combination of the former.
Full solution for today (spoilers!).
Given a comma delimited list of substrings and a list of strings, count how many of the latter strings can be made up of any (repeating) combination of the former.
Hey, I said that I would follow up on my post about Building Myself a Grep… well here it is!
And I’m actually surprised with myself in how far I actually made it!
You can see the current state of my code on Github. You can install it from that repo (checked out) with cargo install --path .
I mostly worked off the MDN documentation:
Assertions:
^
and $
for entire patternsCharacter classes
[abc]
[a-z]
[^abc]
.
\d
/\D
for digits, \w
/\W
for ‘words’, and \s
/\S
for whitespace\t\r\n\v\f
\cX
(I’ve never used these)\hXX
and \uXXXX
|
(both in capture groups and not)Groups and back references
(abc)
(?<name>abc)
(?:abc)
(?ims-ims:abc)
i
and s
but not m
\n
\k<name>
Quantifiers
*
for zero or more+
for one or more?
for zero or one*?
, +?
, and ??
for lazy / non-greedy matchesabc{n}
exactly n matchesabc{n,}
at least n matchesabc{,m}
up to m matchesabc{n,m}
at least n and up to m matches (inclusive)Most of those were fairly straight forward extensions of previous code. In think the most interesting ones were handling the parsing of all the different things that can go in groups (including flags).
For each of them, you can check my git commit history to see how I implemented specific things. It’s mostly one commit per feature, but not always.
Assertions:
\b
and \B
)Character classes
[\b]
for backspace characters\u{XXXXX}
\p{...}
/\P{...}
Groups and back references:
m
flag / mode: multiline matchesThe look ahead/behind is the one I’m most interested in supporting. I don’t even think it will be that hard, I just honestly missed it.
The more interesting one will be the m
flag. Currently, I only match lines, so that will be a decently large restructuring. We’ll see.
I’ve made an awful lot of progress on this one too!
$ jp-grep --help
A custom grep implementation; always behaves as egrep
Usage: jp-grep [OPTIONS] [PATTERN] [PATHS]...
Arguments:
[PATTERN] The regular expression to evaluate; may also be specified with -e
[PATHS]... Paths to search for matches; if none are provided read from stdin
Options:
-A, --after-context <AFTER_CONTEXT>
Lines of context to print after each match
-B, --before-context <BEFORE_CONTEXT>
Lines of context to print before each match
-C, --context <CONTEXT>
Lines to print both before and after
-c, --count
Only print the matching count
-E, --extended-regexp
Extended regex mode (egrep); this option is ignored (always true)
-e, --regexp <ADDITIONAL_PATTERNS>
Additional patterns, will return a line if any match
-h, --no-filename
Never print filenames
--help
Display this help message
-i, --ignore-case
Default to case insensitive match
-n, --line-number
Print line numbers before matches and context
-r, --recursive
Recursively add any directories (-R also works)
-v, --invert-match
Invert the match; only print lines that don't match any pattern
-V, --version
Print version
Of those, the context flags (-A
, -B
, and -C
) were probably the most tricky, since I basically had to implement a circular buffer for them. I could have just read the entire file into memory, but from the beginning, I didn’t want to do that.
-E
is a little silly, since that’s the only grep
pattern I support (and the only one I actually use in grep
, so that’s fair).
So far as supporting multiple files, recursive search, and stdin, read the section on collecting files later.
So far as printing (handling line numbers and file names), read the section on printing lines.
Overall, pretty fun code.
So far, there are a bunch of flags that I don’t support for grep. Of those, there are a bunch that I don’t intend to support (like built in compression support and properly dealing with symlinks).
The things that I would still like to support though are:
Input options:
-f file
/--file=file
- Read patterns from fileOutput options:
-a
/--text
- Currently I always have this set; I don’t treat binary files differently-L
/--files-without-match
- only print files that don’t match-o
/--only-matching
- only print the matching groups; I have the groups for backreferences, use them!File filtering - files to include/exclude (useful with recursive matches):
--exclude pattern
--exclude-dir pattern
--include pattern
--include-dir pattern
That’s not too bad, all things consider.
One thing that I actually played a bit with this time around was custom error handling in the parser. Rather than just returning &str
all over the place for Err
types, I made my own:
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum ParserError {
RemainingInput,
UnexpectedEnd,
InvalidCharacter(char, &'static str),
InvalidUnicodeCodePoint(u32),
InvalidRange(char, char),
InvalidRepeatRange(u32, u32),
}
impl std::fmt::Display for ParserError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ParserError::RemainingInput => write!(f, "Unexpected input after parsing"),
ParserError::UnexpectedEnd => write!(f, "Unexpected end of input"),
ParserError::InvalidCharacter(c, expected) => {
write!(f, "Invalid character '{}', expected {}", c, expected)
}
ParserError::InvalidUnicodeCodePoint(code_point) => {
write!(f, "Invalid unicode code point: {}", code_point)
}
ParserError::InvalidRange(start, end) => {
write!(f, "Invalid range: {}-{}", start, end)
}
ParserError::InvalidRepeatRange(start, end) => {
write!(f, "Invalid range: {}-{}", start, end)
}
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) struct ParserErrorWithPosition {
pub position: usize,
pub error: ParserError,
}
The WithPosition
type also lets me pinpoint exactly where in a pattern I failed:
jp-grep 'this is some long complicated pattern, \hXX see?'
Error parsing regex: Invalid character 'X', expected hex digit
| this is some long complicated pattern, \hXX see?
| ^
That’s pretty neat and I hope helpful! 😄
Overall, I’m pretty happy with this project. It’s got a pretty decent chunk of code, including…
$ jp-grep -c -v -e '//' -e '^\s*$' **/*.rs
1241
…over 1000 lines of Rust code, including tests but not blank lines or comments. 😄
I’ll probably pick this up at least once more.
Now… will I actually use this? Probably not. But it was certainly interesting to write.
Other than that, was CodeCrafters actually helpful for this? Middling. It was the kick I needed to actually do it (I’ve been meaning to write this for years at this point) and once I was started, I could finish it. On the other hand, the output format they require was a bit annoying at times, I’ve mostly moved away from that.
Still, worth I think. I’ll probably continue to do their free programs. Kafka is up next. Whee servers!
I recently stumbled across CodeCrafters again1. In a nutshell, they give a number of ‘Build Your Own…’ courses, each of which will automatically create a repo for you, guide you through solving the program step by step, and provide some feedback on the way.
On one hand, it’s a freemium (one problem a month is free) / paid service. I wish they had tiers. I really think their monthly fee is a bit steep for what they offer (we’ll come back to that). But on the other hand, it’s a neat tool and I’ve been wanting some more larger programming projects to learn more Rust on, so away we go!
First up, grep!
Part 1: Given a list of passphrases, count how many contain no duplicate words.
Part 1: The input is a list of strings, potentially containing sequences in square brackets. Find all strings that have an ABBA sequence (two characters followed by the same two in reverse order) outside of any square brackets, but no ABBA sequences in square brackets.
Part 1: A room is described as a name, a sector ID, and a checksum as follows:
aaaaa-bbb-z-y-x-123[abxyz]
name: aaaaa-bbb-z-y-x sector ID: 123 checksum: abxyz
Today’s five minute post brought to you via Programming Praxis / Career Cup:
Given a positive integer, return all the ways that the integer can be represented by letters using the mapping 1 -> A, 2 -> B, …, 26 -> Z. For instance, the number 1234 can be represented by the words ABCD, AWD and LCD.
In one of my current research projects involving large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects–each of which has been written out using ASCII characters. But not all of the Tweets (or even a majority in this case) can be represented with only ASCII. So what happens?
Well, it turns out that they encode the data as JSON strings with Unicode escape characters. So if we had the Russian hashtag #победазанами (victory is ours), that would be encoded as such:
"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"