PowerShell regular expressions
Regular Expressions
A regular expression is a sequence of logically combined characters and meta characters (characters with special meaning) that, according to parsing rules in the regexp engine, describes text which matches a certain pattern. I will assume you are familiar with both PowerShell and regular expressions and want to know how to use the latter in PowerShell. I have realized a lot of people don’t know regexp syntax/logic. There are quite a few resources on this on the web, but I will look into writing a generic tutorial that will cover some common cases and pitfalls. I have a Perl background myself, so I will make a few comparisons.
PowerShell uses the .NET regexp engine. Basically I’ve found its syntax to overlap with Perl’s in most respects.
Recommended Regular Expressions Book
I’ll go with the default regexp bible: Mastering Regular Expressions – often just referred to as "MRE". It’s published by O’Reilly. I read most of it some years ago and learned quite a bit (actually more than you’re likely ever to need about certain things).
Built-in Operators and cmdlets
The built-in operators -match and -replace can be used for quite a few tasks, both simple and complex. They are superficially similar to Perl's m// and s///g operators, respectively. Both -match and -replace are case-insensitive by default. To match case-sensitively, use -cmatch and -creplace. Many recommend using the explicitly case-insensitive names, -imatch and -ireplace, for case-insensitive matching, to make it obvious that you made a conscious decision to match case-insensitively. If you pass in the "in-line" mode modifier "(?i)" or "(?-i)", it overrides the operator's case behaviour, as I demonstrate later.
Examples
A few examples.
Example – The -match Operator
This first example uses the regular expression (\d+) and runs it against a string. The captures, indicated by the parentheses in the regexp, are stored in the hashtable $Matches. The first element, indexed by $Matches[0], contains the complete match (which will be the same as the capture in this case), while $Matches[n] contains the content of the nth capture group, corresponding to $1, $2, $3 and so forth in Perl. So $Matches[1] contains the content of the first capture group, $Matches[2] the content of the second capture group, and so forth.
This will extract the first sequence of digits in a string. Rather than "\d", you could also have used the character class "[0-9]", but the two are not strictly equivalent: in .NET, "\d" matches any Unicode decimal digit (category Nd), such as the Arabic-Indic digits, unless the ECMAScript regexp option is set, whereas "[0-9]" only matches the ASCII digits.
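A minimal sketch of the kind of match described above, plus a quick check of how "\d" and "[0-9]" treat a non-ASCII digit. The input string and the variable names are only illustrative.

```powershell
# Extract the first run of digits; $Matches is only populated on a successful match.
$text = 'abc123def456'
if ($text -match '(\d+)') {
    $firstNumber = $Matches[1]
}

# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
$arabicThree = [string][char]0x0663
$matchesBackslashD = $arabicThree -match '\d'
$matchesAsciiClass = $arabicThree -match '[0-9]'

# With the ECMAScript option, \d is restricted to the ASCII digits.
$matchesEcmaScript = [regex]::IsMatch($arabicThree, '\d', 'ECMAScript')
```

Here $firstNumber ends up as '123', and only $matchesBackslashD is true for the Arabic-Indic digit.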
If you have "surrounding" regexp parts around the captured part, they will be included in $Matches[0] which holds the complete regexp match. Below you will see that "abc" is now included in $Matches[0], because I added "[a-z]+", which means "match one or more characters from a to z, inclusively, match as many as you can". Read more about regexp character classes here. $Matches[1] is still "123".
PS E:\> 'abc123def456' -match '[a-z]+(\d+)' | Out-Null; $Matches[0,1]
abc123
123
The result of -match will normally either be True or False, but this is suppressed here by piping to Out-Null. Note that if you use -match in an if statement, you should not pipe to Out-Null, as demonstrated below. This one bit me the first time around.
The expression in an if statement normally isn’t output, and if you pipe to Out-Null, the expression that will be evaluated is apparently indeed $null (which is considered false by the if statement):
PS E:\> if ((1 | out-null) -eq $null) { 'yes' }
yes
PS E:\> if ((1 | out-null) -ne $null) { 'yes' }
PS E:\>
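As a rough sketch of the two situations: inside an if statement, use the boolean result of -match directly, and outside one, assigning the result to $null is one common alternative to piping to Out-Null. The variable names here are made up for the example.

```powershell
$text = 'abc123def456'

# Inside an if statement, use the result of -match directly.
if ($text -match '[a-z]+(\d+)') {
    $digits = $Matches[1]
}

# Outside an if statement, discard the True/False output by assigning to $null.
$null = $text -match '(\d+)$'
$trailingDigits = $Matches[1]
```

After this runs, $digits holds '123' and $trailingDigits holds '456'.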
If you use a double-quoted string around the regexp part, variables will be interpolated. So if you, say, wanted to get an integer ("\d+" and "[0-9]+" are normally equivalent), after a space, after a string you have stored in a variable, you could do it like I demonstrate below. I would normally use "\s+" or "\s*" rather than a literal space, as it’s more robust, but sometimes you want to be precise, so it depends on your requirements. In this case, it’s just for readability.
PS C:\> $GetThis = 'SecretCode'
PS C:\> 'foo bar baz SecretCode 42 boo bam' -match "$GetThis ([0-9]+)"
True
PS C:\> $matches

Name                           Value
----                           -----
1                              42
0                              SecretCode 42
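One caveat with interpolation is that the variable's content becomes part of the regexp, so any meta characters in it are interpreted. A small sketch, assuming a made-up value that contains parentheses, of wrapping the variable in [regex]::Escape() first:

```powershell
# Hypothetical value containing regexp meta characters (the parentheses).
$GetThis = 'Price (USD)'
$text = 'Total Price (USD) 42 paid'

# Escape the variable before building the pattern.
$pattern = '{0} ([0-9]+)' -f [regex]::Escape($GetThis)
$escapedResult = $text -match $pattern
$value = $Matches[1]

# Without escaping, "(USD)" is treated as a capture group and the match fails.
$unescapedResult = $text -match "$GetThis ([0-9]+)"
```

With the escaped pattern, $value is '42'. The unescaped pattern looks for "Price USD 42" and therefore does not match this text.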
Example – The -NotMatch Operator
The -NotMatch operator works exactly like the -match operator, except it returns false for a match, and true if it does not match the provided regular expression:
PS C:\> 'fo', 'fooooo', 'foob', 'fooo', 'ding' | Where { $_ -NotMatch '^fo+$' }
foob
ding
Read more about the PowerShell Where-Object cmdlet here .
Example – The -replace Operator
I have not found a way to make -replace not replace all occurrences, so it’s equivalent to s///g in Perl by default. You can anchor to target the first and last match as demonstrated below. Also see the regex class method "replace" description for, among other things, a way to replace only once (the specified number of times).
Note: The regexp meta character "\A" means "really the start of the string", and matches only once, even if you specify the MultiLine regexp option, whereas "^" will match at the start of every line with the MultiLine option. The same respectively applies to "\z" and "$", except they mean the end of the string. In the cases where you only have one line, they are equivalent. So you could have used "\A" instead of "^" in the example below, and "\z" instead of "$". I used the most commonly known variants.
PS E:\> 'aaaa' -replace 'a', 'b'
bbbb
PS E:\> 'aaaa' -replace '^a', 'b'
baaa
PS E:\> 'aaaa' -replace 'a$', 'b'
aaab
# A little trick to target the last "a". More on this later in the capture/replace parts:
PS E:\> 'aaaa' -replace '(.*)a', '${1}b'
aaab
"$" also matches before a newline at the very end of the string, whereas "\z" does not, as demonstrated here:
PS C:\> "bergen`n" -imatch 'gen$'
True
PS C:\> "bergen`n" -imatch 'gen\z'
False
PS C:\> 'bergen' -imatch 'gen\z'
True
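To illustrate the difference between "^" with the MultiLine option and "\A", here is a small sketch that prefixes lines in a made-up three-line string. The in-line modifier "(?m)" is used to turn on MultiLine with the -replace operator.

```powershell
$text = "one`ntwo`nthree"

# With (?m), ^ matches at the start of every line, so every line gets a prefix.
$everyLine = $text -replace '(?m)^', '> '

# \A only ever matches at the very start of the string, even with (?m).
$firstOnly = $text -replace '(?m)\A', '> '
```

$everyLine becomes "> one", "> two" and "> three" on separate lines, while $firstOnly only gets the prefix on the first line.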
Example – Replace With Captures
When using -replace, you can use capture groups to alter content in text. The capture group(s) are indicated with parentheses, where you enclose the part you want to capture in "(" and ")". The so-called back-referenced (captured) parts are available in $1, $2, $3 and so on, just like in Perl. It’s important to use single quotes around the replacement argument, because otherwise the variables will be interpolated/expanded too soon, by the shell. You can also escape them between double quotes.
Here are a few quick examples:
PS E:\> 'Doe, John' -ireplace '([a-z]+)\s*,\s+([a-z]+)', '$2 $1'
John Doe
PS E:\> 'Doe, John' -ireplace '(\w+), (\w+)', "`$2 `$1"
John Doe
PS E:\> 'Doe, John' -ireplace '(\w+), (\w+)', "$2 $1"

PS E:\>
If you find yourself thinking you’d like to process $1, $2, $3 and so on, "on the fly", you might want to check out the match evaluator part below.
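A few more replacement-string details, sketched with made-up inputs: "$0" refers to the whole match, "${1}" keeps a group reference unambiguous when a digit follows it, and "$$" produces a literal dollar sign.

```powershell
# $0 is the complete match.
$wrapped = 'abc' -replace 'b', '[$0]'

# ${1} followed by a literal "1"; a bare '$11' would be read as a reference to group 11.
$explicit = 'abc' -replace '(b)', '${1}1'

# $$ is a literal dollar sign in the replacement string.
$withDollar = 'cost: 5' -replace '(\d+)', '$$$1'
```

These give 'a[b]c', 'ab1c' and 'cost: $5' respectively.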
Example – Named Captures
.NET supports named captures, which means you can use a name for a captured group rather than a number. You can mix and match and any non-named capture group will be numbered from left to right. To name a capture, use either the (?<name>regex) or (?'name'regex) syntax where name is the name of the capture group. To reference it, use a single-quoted string and the syntax ${name}.
If you are using a double-quoted string for the regular expression, you can use either syntax as described above, but if you are using a single-quoted string, like I usually do, you have to use doubled-up single quotes if you want to use the (?'name'regex) syntax, like this:
PS C:\> 'foo bar baz' -match '(?''test''\w+)' | Out-Null; $matches['test']
foo
Here follow some examples.
Notice how the content of the first capture group is placed in $matches[1], the named capture is in $matches['named'] and the second unnamed capture group, although being the third group, is in $matches[2] since the named capture is stored using a name.
PS C:\> 'first namedcaptureword second' -match '(\w+) (?<named>\w+) (\w+)'; $matches
True

Name                           Value
----                           -----
named                          namedcaptureword
2                              second
1                              first
0                              first namedcaptureword second
Index a single named capture by name like this:
PS C:\> $matches['named']
namedcaptureword
Shuffle two occurrences of non-whitespace sequences separated by a space:
PS C:\> [regex]$regex = '\A(?<putlast>\S+) (?<putfirst>\S+)\z'
PS C:\> 'second first' -ireplace $regex, '${putfirst} ${putlast}'
first second
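As a slightly larger sketch of named captures, the snippet below pulls the parts out of an ISO-style date and also reorders them with -replace. The date, the group names and the property names are all made up for the illustration.

```powershell
$date = '2024-05-17'
$pattern = '\A(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\z'

# Read the named captures from $Matches and convert them to integers.
if ($date -match $pattern) {
    $parts = New-Object PSObject -Property @{
        Year  = [int]$Matches['year']
        Month = [int]$Matches['month']
        Day   = [int]$Matches['day']
    }
}

# Reference the named captures with ${name} in a single-quoted replacement string.
$reordered = $date -replace $pattern, '${day}.${month}.${year}'
```

$reordered ends up as '17.05.2024', and $parts carries the three numbers as properties.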
Example – The -split Operator
To read a more thorough article about the -split operator, click here . There’s also some information about regexp character classes in that article. Among other things, it explains how to split on an arbitrary amount of whitespace.
The -split operator also takes a regular expression, so to, say, split on a number, you could do it like this:
PS E:\> 'first1second2third3fourth' -split '\d'
first
second
third
fourth
With Perl, you could do it like this (read more about Perl from PowerShell here ):
PS E:\> perl -wle "print join qq(\n), split /\d/, 'first1second2third3fourth';"
first
second
third
fourth
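Two further -split behaviours can be handy, sketched here on the same string: a capture group in the pattern keeps the delimiters in the result, and an optional second argument limits the number of substrings returned.

```powershell
$text = 'first1second2third3fourth'

# A capture group around the delimiter keeps the delimiters in the output.
$withDelimiters = $text -split '(\d)'

# A maximum of 2 substrings: split on the first digit only.
$limited = $text -split '\d', 2
```

$withDelimiters has seven elements (first, 1, second, 2, third, 3, fourth), and $limited has two (first and second2third3fourth).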
Example – Select-String
The cmdlet Select-String takes a regular expression and works much like grep or egrep in UNIX/Linux. The -Context parameter is like a combined "-A" and "-B" for *nix's grep: it adds the specified number of lines above and below the match.
In PowerShell v3, Get-Content also has a -Tail parameter for getting the last specified number of lines. With PowerShell v1 and v2, you can pipe to Select-Object and use either the -First or -Last parameter, and specify the desired number of lines to display. Example: Get-Content usernames.txt | Select-Object -last 15. You can also just type Select rather than Select-Object, as it's aliased (type "alias select" at the prompt).
Here are a few quick examples, based on a file which you can guess the contents of.
PS E:\> Select-String '[aeiouy]' .\alphabet.txt
alphabet.txt:1:a
alphabet.txt:5:e
alphabet.txt:9:i
alphabet.txt:15:o
alphabet.txt:21:u
alphabet.txt:25:y
PS E:\> Get-Content .\alphabet.txt | Select-String '[aeiouy]'
a
e
i
o
u
y
PS E:\> Get-Content .\alphabet.txt | Select-String 'i' -context 3
  f
  g
  h
> i
  j
  k
  l
If you want a boolean value (true or false) for whether there’s a match, you can use the -Quiet parameter:
PS C:\> 'hello' | Select-String -Quiet 'HellO'
True
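Select-String also exposes the underlying .NET match objects through the Matches property of its output, and the -AllMatches switch (available from PowerShell v2) collects every match on a line rather than just the first. A minimal sketch with a made-up input string:

```powershell
# Collect every run of digits on the line, not just the first one.
$numbers = 'abc123def456' |
    Select-String '\d+' -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }
```

$numbers then contains '123' and '456'.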
Example – Log Parsing
This is slightly more advanced and obscure. Let’s say you want to parse some irregularly structured error log and get the actual error strings as well as the numeric code in some preferred format. I made up a dummy log file format as demonstrated below.
001 | ERROR: This is the first error ::: Code: 400
002 | WARNING: Excess cheese. ::: Code: 200
003 | ErrOR: This is the second error. ::: Code: 401
004 | INFO: LoL ::: Code: 5
There are many ways to attack this. Here I demonstrate something halfway sane, where the aim is to have a highly flexible and broadly matching regexp. The regexp logic for the regexp in the code field below is as described in the table below that field.
^\s*\d+\s*\|\s*ERROR:\s*(.+):::\s*Code:\s*(\d+)
What to match | Regexp part | Explanation |
Anchor at the beginning of the string | ^ | Beginning of string, matching after every newline with the MultiLine regexp option. |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
A sequence of digits | \d+ | One or more digits, match as many as you can, greedily; so-called greedy matching. |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
A literal pipe character | \| | Escaped pipe character, making it a literal pipe – otherwise it’s effectively an OR. Escape using a backslash, not the PowerShell escape backtick character. |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
The string "ERROR:" | ERROR: | The literal string "ERROR:", including the literal ":". |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
Capture all non-newline characters | (.+)::: | Capture the error message by getting everything until you hit the literal string ":::". Technically, .+ goes to the end of the line and then "backtracks" until ":::" matches. I assume you want this greedy version, which stops at the last ":::" if the error message itself contains an instance of ":::". Use .+? for non-greedy matching, which stops at the first ":::". |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
The string "Code:" | Code: | The literal string "Code:" including the literal ":". |
Possibly whitespace | \s* | Zero or more whitespace characters, match as many as you can; always matches! |
Capture a sequence of digits making up the error code | (\d+) | One or more digits, match as many as you can, greedily; so-called greedy matching. |
The first capture is stripped of trailing whitespace after the match, handily using -replace. This is difficult to avoid without using "\s+" before ":::" and thus making the regexp slightly less flexible.
Most regexp engines, including the .NET and Perl engines, will match the first occurrence even if it is not at the beginning of the string (you don't need a complete match), so you could omit the part before the escaped pipe. However, it adds some data integrity validation, by making sure the strings matched are in an expected format. The escape character in PowerShell is a backtick (`), not a backslash (\), but in regular expressions, you escape using a backslash.
You could also have added something like "\s*\z" or "\s*$" at the end (they will be entirely equivalent in that case, because \s matches newlines), but it allows you to capture some potentially malformed data without it. It depends on the requirements when you design the regular expression; sometimes you want maximum data validation, sometimes you want it as broadly matching or inclusive as possible, and sometimes something in between.
PS E:\> Get-Content .\logfile.log | %{ if ($_ -imatch '^\s*\d+\s*\|\s*ERROR:\s*(.+):::\s*Code:\s*(\d+)') { 'Error string: ' + ($matches[1] -replace '\s+$', '') + ' | Code: ' + $matches[2] } }
Error string: This is the first error | Code: 400
Error string: This is the second error. | Code: 401
With Perl you could do it like this, using the same regular expression. Notice that I added the /i flag to the regexp for case-insensitivity. The "m" in "m//" can be omitted when using slashes as delimiters, as they are the default in Perl and this has been special-cased. Read more about Perl from PowerShell here .
PS E:\> perl -nwle 'if (m/^\s*\d+\s*\|\s*ERROR:\s*(.+):::\s*Code:\s*(\d+)/i) { ($errmsg, $code) = ($1, $2); $errmsg =~ s/\s+$//; print qq(Error string: $errmsg | Code: $code) }' .\logfile.log
Error string: This is the first error | Code: 400
Error string: This is the second error. | Code: 401
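If you would rather have objects than formatted strings, a variation along these lines might work. It is only a sketch: the log lines are inlined in a hypothetical $logLines array instead of being read from a file, the property names Message and Code are made up, and the message is captured with the non-greedy ".+?" followed by "\s*", which avoids the trailing-whitespace cleanup but stops at the first ":::" rather than the last.

```powershell
# The dummy log lines from above, inlined so the example is self-contained.
$logLines = @(
    '001 | ERROR: This is the first error ::: Code: 400'
    '002 | WARNING: Excess cheese. ::: Code: 200'
    '003 | ErrOR: This is the second error. ::: Code: 401'
    '004 | INFO: LoL ::: Code: 5'
)

# Named captures; .+? is non-greedy, so \s* can absorb the whitespace before ":::".
$pattern = '^\s*\d+\s*\|\s*ERROR:\s*(?<message>.+?)\s*:::\s*Code:\s*(?<code>\d+)'

# -match is case-insensitive by default, so "ErrOR:" is matched as well.
$parsed = @(foreach ($line in $logLines) {
    if ($line -match $pattern) {
        New-Object PSObject -Property @{
            Message = $Matches['message']
            Code    = [int]$Matches['code']
        }
    }
})
```

$parsed then holds two objects, with the codes 400 and 401, that can be piped on to cmdlets such as Where-Object or Sort-Object.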
Example – The -like Operator
It doesn’t technically support a regex, but the -like operator seems like it belongs in this article, and it does implement a sort of subset.
The special characters the -like operator recognizes are described in the following table:
? | One instance of any character. |
* | Zero or more instances of any character. |
[<Start>-<End>(<Start>-<End>)] | Range of numbers or characters. Multiple ranges are supported. |
So to see if a string starts with a hex digit (followed by anything or nothing), you could use this:
PS C:\> 'deadbeef' -like '[a-f0-9]*'
True
PS C:\> 'xdeadbeef' -like '[a-f0-9]*'
False
To make sure a string starts with a letter between a-z and is followed by exactly one other character that can be anything, you could use:
PS C:\> 'v2' -like '[a-z]?'
True
PS C:\> 'v' -like '[a-z]?'
False
PS C:\> 'v22' -like '[a-z]?'
False
Example – The -NotLike Operator
The -NotLike operator works exactly the same as the -like operator, except it returns false for a match and true when it does not match.
PS C:\> 'fo', 'dinner', 'foob', 'fooo', 'ding' | Where { $_ -NotLike 'fo*' }
dinner
ding
Read more about the PowerShell Where-Object cmdlet here .
Mode Modifiers
You might know mode modifiers from other languages, like "i" (case-insensitivity), "m" (multi-line, makes "^" and "$" match respectively at the beginning and end of every line) and "s" (makes "." also match newlines). "g" for global matching, such as used in Perl, is gloriously missing, and instead you have varying default behaviour that may or may not be desirable.
With the [regex]::match() and [regex]::matches() methods you can match once or globally. -match matches once by default. Most of the time, you can just use -match rather than [regex]::match().
I also describe a way of only replacing the specified number of times below, in the replace section. So you could use "1" for replacing only once, which is the same as not using global matching in other languages. So, actually, with .NET there is a way to have even more control than by using the optional "g" flag, where its presence means "replace all occurrences" and its absence means "replace first occurrence only" (a trick to match the last occurrence instead, is to add ".*" in front of the part to replace, because that will go to the end of the string and then "backtrack" until it finds a successful match for the remainder of the regular expression). Anchoring can also be used, as mentioned in the part about the -replace operator.
The modifiers, or regexp options, are of the type System.Text.RegularExpressions.RegexOptions as described in this Microsoft article . You can also inspect them with this command:
[System.Text.RegularExpressions.RegexOptions] | gm -static -type property
Mode modifiers override other passed options. The effect on case-sensitivity when using the operators -imatch or -cmatch is overridden by the modifiers "(?i)" and "(?-i)", which accordingly enable and disable case-insensitivity. You can use mode modifiers "inline" in the regular expression, which you compile, store in a variable and subsequently use – or you can specify it directly.
Here is a quick demonstration:
PS E:\> $regex = [regex]'(?-i)A'
PS E:\> 'A' -imatch $regex
True
PS E:\> 'a' -imatch $regex
False
PS E:\> 'a' -cmatch '(?i)A'
True
PS E:\> 'a' -imatch '(?-i)A'
False
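The other in-line modifiers work the same way with the operators. A small sketch, with made-up inputs, of "(?s)" (makes "." match newlines) and "(?x)" (ignores unescaped whitespace in the pattern and allows "#" comments):

```powershell
$text = "a`nb"

# By default "." does not match a newline; with (?s) it does.
$dotDefault = $text -match 'a.b'
$dotSingleLine = $text -match '(?s)a.b'

# (?x) lets you spread a pattern over several lines with comments.
$pattern = '(?x)
    [a-z]+   # one or more letters
    (\d+)    # capture the digits that follow
'
$verboseResult = 'abc123' -match $pattern
$verboseDigits = $Matches[1]
```

Only the "(?s)" version matches across the newline, and the commented pattern captures '123'.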
Regex Class
Using the [regex] class (a shorthand / type accelerator for [System.Text.RegularExpressions.Regex]) allows you to do some interesting stuff, such as calling class methods like [regex]::matches, [regex]::match and [regex]::replace. It also allows you to use regexp mode modifiers passed in as regexp options (an array of strings).
Class Methods
Matches
Using [regex]::matches allows you to specify modifiers as parameters. Again, they will be overridden by "in-line" modifiers.
Quick demonstration of the MultiLine flag:
PS E:\> $str = "a123`nb456`nc789"
PS E:\> [regex]::matches($str, '^(\w)') | % { $_.Value }
a
PS E:\> [regex]::matches($str, '^(\w)', 'MultiLine') | % { $_.Value }
a
b
c
PS E:\> [regex]::matches($str, '(?-m)^(\w)', 'MultiLine') | % { $_.Value }
a
PS E:\>
Unlike the -match operator, [regex]::matches matches "globally" by default, like you would with m//g in Perl. As demonstrated below. For a single match, you can use [regex]::match, or just use the -match operator.
PS E:\> [regex]::matches('abc', '(\w)') | % { $_.Value }
a
b
c
Multiple regexp options need to be specified as an array, as demonstrated below. The syntax is @('Option1', 'Option2', [...]).
PS C:\> [regex]::matches("a1`nb2`nc3", '^([A-Z])\d', @('MultiLine', 'Ignorecase')) | % { $_.Value }
a1
b2
c3
PS C:\> [regex]::matches("a1`nb2`nc3", '^([A-Z])\d', @('Ignorecase')) | % { $_.Value }
a1
PS C:\> [regex]::matches("a1`nb2`nc3", '^([A-Z])\d') | % { $_.Value }
PS C:\>
To get the captured group, you need to reference the Groups property of the returned System.Text.RegularExpressions.Match object and index into the collection. The first element will be the whole match, the second will be the first capture group, the third will be the second capture group, and so on. Basically, you can index into the collection like using $1, $2, etc., meaning $_.Groups[1].Value is equivalent to $1. $_.Value will be the same as $_.Groups[0].Value.
Below is a demonstration of getting the first and second capture group.
PS E:\> [regex]::matches("a1`nb2`nc3", '^([A-Z])(\d)', @('MultiLine', 'Ignorecase')) | % { $_.Groups[1].Value }
a
b
c
PS E:\> [regex]::matches("a1`nb2`nc3", '^([A-Z])(\d)', @('MultiLine', 'Ignorecase')) | % { $_.Groups[2].Value }
1
2
3
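The Groups collection can also be indexed by name when the pattern uses named captures. A minimal sketch on the same input, with the made-up group names "letter" and "digit":

```powershell
$text = "a1`nb2`nc3"

# Index the Groups collection by capture name instead of by number.
$pairs = [regex]::Matches($text, '^(?<letter>[A-Z])(?<digit>\d)', @('MultiLine', 'IgnoreCase')) |
    ForEach-Object { '{0}={1}' -f $_.Groups['letter'].Value, $_.Groups['digit'].Value }
```

$pairs then contains 'a=1', 'b=2' and 'c=3'.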
Replace
The functionality in [regex]::replace overlaps with the functionality in the -replace operator.
Here’s a capture and replace.
PS E:\> [regex]::replace('second first', '([a-z]+) ([a-z]+)', '$2 $1')
first second
Here is a demonstration of how it replaces globally (as many times as it matches), like s///g in Perl.
PS E:\> [regex]::replace('second first', '(?<vowel>[aeiouy])', '-${vowel}')
s-ec-ond f-irst
To replace only once (or the specified number of times), you need to instantiate a regex object and call the Replace() method on it. Example that replaces three times:
PS E:\> [regex] $regex = '[a-z]'
PS E:\> $regex.Replace('abcde', 'X', 3)
XXXde
Here "abcde" is the input string, "X" is the replacement string and "3" is the number of times to replace. The regexp is "[a-z]".
To run it as a one-liner, you could do it like this:
PS C:\> ([regex]'\d').Replace('12345', '#', 2)
##345
Notice that unlike -replace and -ireplace, [regex]::replace is case-sensitive by default. You can add the same regexp options as for [regex]::matches, including the "IgnoreCase" flag.
PS E:\> [regex]::replace('sEcond fIrst', '(?<vowel>[aeiouy])', '-${vowel}')
sEc-ond fIrst
PS E:\> [regex]::replace('sEcond fIrst', '(?<vowel>[aeiouy])', '-${vowel}', 'IgnoreCase')
s-Ec-ond f-Irst
As with [regex]::matches you specify multiple regexp options in an array in the format @('Option1', 'Option2') and so on. Here's a demonstration where we see both the options are used:
PS C:\> [regex]::replace("a1`nb2`nc3", '^([A-Z])\d', '${1}X', @('MultiLine', 'IgnoreCase'))
aX
bX
cX
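If you need both regexp options and a limit on the number of replacements, one approach is to pass the options to the Regex constructor and then call the instance Replace() method with a count. A sketch, reusing the 'sEcond fIrst' string from above with a made-up replacement character:

```powershell
# Build a case-insensitive regex object; the options go to the constructor.
$options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
$regex = New-Object System.Text.RegularExpressions.Regex -ArgumentList '[aeiouy]', $options

# Replace only the first vowel, regardless of case.
$result = $regex.Replace('sEcond fIrst', '*', 1)
```

$result is 's*cond fIrst': the upper-case "E" is replaced because of IgnoreCase, and the remaining vowels are left alone because of the count of 1.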
Escaping and Unescaping Regexp Meta-characters
The regex class also has the methods Escape() and Unescape(). Escape() escapes regexp meta characters in a string with a backslash, which is the regexp escape character, not the PowerShell escape backtick character (Escape() also escapes spaces). Unescape() reverses the process.
In this example, I have an array of users, $Users, which contains users in the form <DOMAIN>\<USERNAME>. Now, if you wanted to see if another value in an array you’re looping over exists in this array, you could probably use the -Contains operator, but if you need or want a regexp for some reason, you will probably find yourself constructing something like this:
^(?:option1|option2|option3)$
The problem you will run into in this case, is that the domain\user structure contains a backslash, which means that when you pass it to the right-hand side of the -match (or -replace) operator, it will be interpreted as a special sequence. If the username starts with "s", you will get "\s", which is whitespace, and not what you want. If the escape sequence isn’t recognized/valid, you will get errors. To work around this, you can use the [regex]::Escape() method.
PS C:\> $Users = @('domain\foo', 'domain\bar')
PS C:\> $Users
domain\foo
domain\bar
PS C:\> [regex] $UserRegex = '^(?:' + ( ($Users | %{ [regex]::Escape($_) }) -join '|') + ')$'
# And the finished regexp looks like this:
PS C:\> $UserRegex.ToString()
^(?:domain\\foo|domain\\bar)$
# Now match
PS C:\> 'domain\hello' -imatch $UserRegex
False
PS C:\> 'domain\foo' -imatch $UserRegex
True
PS C:\> 'domain\bar' -imatch $UserRegex
True
PS C:\> 'domain\barista' -imatch $UserRegex
False
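To see the "\s" pitfall and the round trip in isolation, here is a small sketch with a made-up account name that contains a backslash followed by "s", a space and parentheses:

```powershell
$raw = 'domain\sam (admin)'

# Escape() puts a backslash in front of the backslash, the space and the parentheses.
$escaped = [regex]::Escape($raw)

# Unescape() turns the escaped string back into the original.
$roundTrip = [regex]::Unescape($escaped)

# The escaped pattern matches the literal string ...
$hit = $raw -match ('^' + $escaped + '$')

# ... whereas the raw string is read as "\s" (whitespace) plus a capture group, and fails.
$miss = $raw -match ('^' + $raw + '$')
```

$hit is true and $miss is false, and $roundTrip is identical to $raw.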
Match Evaluator
Here’s a demonstration of the MatchEvaluator feature, where you pass in a script block as the third argument. It translates vowels in the alphabet to uppercase, prepends a space and appends a number that increases by one per match/vowel, followed by a space for readability. This is pretty much equivalent to the /e flag in Perl (evaluate). From what I can tell, what is equivalent to $matches[0] is passed in as the Value property of the first argument to the script block, available in $args. You can also use the param keyword like with functions and scripts ( { param($whatever) $whatever.Value.ToUpper() } ).
I can add that $args[1] will be empty (at least it has been during my tests) and that $args[0] contains an object of the type [System.Text.RegularExpressions.Match] as described in this Microsoft article. You can assign it to a variable in the script block for later inspection with Get-Member (alias gm) and the like.
PS E:\> $alphabet = 'abcdefghijklmnopqrstuvwxyz'
PS E:\> $counter = 0
PS E:\> [regex]::replace($alphabet, '[aeiouy]', { ' ' + $args[0].Value.ToUpper() + ++$counter + ' ' })
 A1 bcd E2 fgh I3 jklmn O4 pqrst U5 vwx Y6 z
You pass in (multiple) regexp options after the script block, as for the matches() and replace() methods. Demonstrated here:
PS E:\temp> $counter = 0
PS E:\temp> [regex]::replace("ab`nef`nij", '^[AEIOUY]', { $args[0].Value.ToUpper() + ++$counter + ' ' }, @('IgnoreCase', 'MultiLine'))
A1 b
E2 f
I3 j
For the heck of it, I’m adding a Perl version that you can run from PowerShell (does not work from cmd.exe due to quoting differences, and of course $alphabet isn’t very useful in cmd.exe):
PS E:\> $alphabet | perl -pe 's/([aeiouy])/q( ) . uc $1 . ++$counter . q( )/ige'
 A1 bcd E2 fgh I3 jklmn O4 pqrst U5 vwx Y6 z
Read more about Perl from PowerShell here . And here’s a cmd.exe version anyway:
C:\>perl -e "print a..z" | perl -pe "s/([aeiouy])/q( ) . uc $1 . ++$counter . q( )/ige"
 A1 bcd E2 fgh I3 jklmn O4 pqrst U5 vwx Y6 z
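The script block also has access to the capture groups through the Match object. As a sketch, the first call below uses the param keyword and Groups[1] to convert a made-up snake_case name to camelCase, and the second keeps a running count in a hashtable (the $state variable and its "Seen" key are hypothetical names), which works regardless of the scope the script block runs in because the hashtable object itself is shared.

```powershell
# Upper-case the letter after each underscore and drop the underscore.
$camel = [regex]::Replace('some_snake_case_name', '_([a-z])', {
    param($match)
    $match.Groups[1].Value.ToUpper()
})

# Number each match, keeping the counter in a hashtable.
$state = @{ Seen = 0 }
$numbered = [regex]::Replace('a-b-c', '[a-c]', {
    param($match)
    $state.Seen++
    $match.Value + $state.Seen
})
```

$camel becomes 'someSnakeCaseName' and $numbered becomes 'a1-b2-c3'.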
I came across a scenario where this feature seems tremendously useful, namely in decoding HTML entities to characters in a string. I had to add a space between the "&" and the "#" in order for the character not to be displayed as a quote here in the wiki. In the real string, there are no such spaces...
PS C:\> Add-Type -AssemblyName System.Web
PS C:\> [System.Web.HttpUtility]::HtmlDecode('& #x22;')
"
PS C:\> $String = "Here's a & #x22;quoted& #x22; word"
PS C:\> $DecodedString = [regex]::Replace($String, '&[^;]+;', { [System.Web.HttpUtility]::HtmlDecode($args[0].Value) })
PS C:\> $DecodedString
Here's a "quoted" word
PS C:\>