Import named regex captures as fish variables#7459
Conversation
krobelus
left a comment
There was a problem hiding this comment.
Sweet, this looks really powerful!
I'm not sure why the MacOS tests are failing.
Can we enable this whenever a named capture group occurs in the regex?
This way we don't need the --import flag.
If we ever want to support unnamed captures, we'd probably need the flag again.
($1, $2, ... are open though a single array variable seems more idiomatic).
| // We don't have a way of having empty values in the middle of a multi-entry fish array, | ||
| // so we compromise by leaving the value unset in the case of !opts.all but assign an | ||
| // empty value otherwise. (opts.all is always true if !first_time) | ||
| if (!value_found && opts.all) { |
There was a problem hiding this comment.
I wonder what the behavior with multiple arguments should be.
If we support that we should probably clear the capture variables before any argument is processed.
Currently the behavior is consistent, but always resets to the match of the last argument:
$ build/fish -c '
string match -rqI "(?<word>[^. ]+) ?(?<punct>[.])?" "dot1.nodot1" "nodot2 dot2"
set -S word punct
'
$word: set in global scope, unexported, with 1 elements
$word[1]: |nodot2|
$punct: set in global scope, unexported, with 0 elementsIf I remove this part
if (first_time) {
// Make sure we clear out any existing values matching the named groups, because
// we'll be appending instead of overwriting from here on out.
vars.set_empty(name_entry.name, ENV_DEFAULT);
}then I get something that looks inconsistent, because I'm matching against multiple strings but without --all to trigger this insertion of emtpy strings for missing captures.
$ build/fish -c '
set word; set punct
string match -rqI "(?<word>[^. ]+) ?(?<punct>[.])?" "dot1.nodot1" "nodot2 dot2"
set -S word punct
'
$word: set in global scope, unexported, with 2 elements
$word[1]: |dot1|
$word[2]: |nodot2|
$punct: set in global scope, unexported, with 1 elements
$punct[1]: |.|I guess we could extend this compromise to cases where multiple arguments are passed,
conceptually (opts.all || aiter.size() > 1).
| public: | ||
| pcre2_matcher_t(const wchar_t *argv0_, const wcstring &pattern, const options_t &opts, | ||
| io_streams_t &streams) | ||
| io_streams_t &streams, parser_t &parser_) |
There was a problem hiding this comment.
Usually member variables have the underscore. So the parameter would be parser and the member variable would be parser_.
There was a problem hiding this comment.
@ridiculousfish I agree, but was sticking to the local convention (see parameter argv0_). Would you like me to change them all?
|
This is awesome! I can see a lot of scripts being simplified by this! I agree with krobelus that we do not need the Please update the An unresolved question is what scope the variables get set in. I don't think anybody wants a bunch of additional scope flags like we have for |
|
Thanks, guys. I was thinking of things that could be done to improve the correctness of shell scripts, and struck upon this as low-hanging fruit that could provide some limited safety by at least reducing manual indexing and string manipulation, if not also guaranteeing the shape of the resulting variables. I'm cautiously happy to try making it import by default. I think it could really go some ways to make regex in fish nicer, but with some caveats pertaining to scope and specific variables (more below). I did not think of doing In addition to the odd behavior that @krobelus spotted (thanks, I had focused on the behavior of single matching vs non-matching and subsequent --all matching and non-matching, but completely missed --all not matching even the first time around -- my intention is for the same behavior as the single non-matching), there are a few other issues which (in addition to updating the docs and release notes) are part of the reason I opened a PR:
I intentionally did not introduce scoping controls and of course @ridiculousfish caught that immediately :) I think we're both in agreement about not wanting to add too many knobs here; I have been giving scopes some thought, and I think something very similar to the default I don't think hard-coding # consider both with and without the following two lines
set word1
set word2
if test $version -gt 2
foo | string match -rq "(?<word1>hello) (?<word2>world)"
else
foo | string match -rq "(?<word1>adios), (?<word2>amigo)"
end
echo $word1 $word2With a hard-coded local scope, the function-local declarations on line 2 would not help hoist the scope of the variables, right? (Also, |
45d07db to
8d0a98e
Compare
|
The CI is failing on Linux/Clang and macOS. I've tested both Linux/Clang and macOS/gcc and can't reproduce, they run fine for me. Are those environments using a different PCRE2 or perhaps the system PCRE2? |
The second Travis build uses the vendored PCRE2, the rest, best as I can tell, uses system PCRE2. Presumably the system pcre should be the same version across the linux builds at least, so I don't think that's the issue (as there are two successes and one failure with it) |
Right this currently allows overriding
On a second thought, set matches (foo | string match -rq "(?<word1>hello) (?<word2>world)")
set word1 $matches[1]
set word2 $matches[2]The scoping issue is tricky. Probably the current default does the right thing most of the time:
Agreed, I would also prefer explicit writes for universals and globals - when inside a function. Outside functions, writing globals by default seems fine. Python does this well IMO. |
Replace with a class-local `enum class` instance.
The new commandline switch `string match --regex --import` will import as fish variables any named capture groups with the matched captures as the value(s).
* Optimize regex variable imports by scanning the name table only once per regex. * Unify the resulting variables in case of unmatching regex both with and without `--all` (always empty in case of no matches).
|
The little-check tests do not report stderr? It turns out I was running into an edge case of I know I've seen ASAN errors reported in the CI previously, but it seems they're silently dropped when I've pushed out a refactored version of the code that should address the performance concerns I had as well as fix the divergent behavior of importing variables with and without |
| set foo foo | ||
| echo "foo" | string match -rqI -- '^(?<foo>bar)$' | ||
| echo $foo | ||
| # CHECK: |
There was a problem hiding this comment.
I'd expect string match to not set any variable when nothing was matched.
I don't know in which situations that would make a difference.
They should, I get an error if I run |
|
Hmmm. What if little check errors out on an stdout mismatch first, would it also report any "incidental" stderr output that was generated simultaneously? |
|
Let's test! #RUN: %fish %s
echo foo
# CHECK: foo
echo stderr >&2
# CHECKERR: stderr
echo stderr2 >&2
echo right
# CHECK: wrong
echo stderr3 >&2
Run So it tells you that there is output on stderr, but only gives you the first line of it. You can also see this on Travis: (and yes, asan's first line of output is entirely useless) We could probably let it show more (all?) of it, or we write it to a logfile and let you check it? |
In #7459, asan printed error output. However, because we had a failure on stdout already, littlecheck would only print the first unmatched line from stderr, leading to output like ``` additional output on stderr:1: ================================================================= ``` Which is of course entirely useless. So in that case we just let it print *all* unmatched stderr lines, so you'd get the full asan output, which presumably is of more use. This upgrades littlecheck to 5f7deafcea4e58dd3d369eae069a3781bb6ce75e.
Drop need for explicit --import
Block import of read-only variables.
Add tests verifying read-only variables cannot be overwritten by regex import.
|
Thanks for clarifying the behavior, @faho. I think the problem is the use of line-based string comparison rather than diffing, which means there could (theoretically) be thousands of lines waiting to be matched you wouldn't want spewed to the console just because one stderr line failed to match, whereas with a diff-based approach you could be more certain that all the mismatch is relevant. The little-check diff patches will fix this, right? I pushed out the (final?) code updates, only documentation and changelog updates remaining I believe. |
Looks good, I don't think there's another blocker left. If multiple arguments are specified, a named capture will be extracted from the first argument that matches. |
|
Note: That's because it can't be wrapped correctly anymore after this PR - even with This is in line with |
|
The only thing I'm struggling with here is the description of the variables being "imported". To me that sounds like the value of the variable is brought into the match, rather than the results of the match being set externally. Is there a better way of phrasing it? Maybe as " |
|
"import" sort of makes sense as opposite of "exported" environment variables. So I'm in favor of rebranding, maybe something like this (for string-match.rst and the changelog):
|
Add support for importing named regex matches
The new commandline switch
string match --regex --importwill importas fish variables any named capture groups with the matched captures as
the value(s).
Example: