Using Named Regex Matches to Build PSCustomObjects

Regex is often considered something of a black art, and not without reason. It is, arguably, the antithesis of PowerShell in terms of syntax: terse, unforgiving, and difficult to get meaningful debug data out of. However, sometimes you just have to parse text, and there is often no better tool than some well-applied regex.

Text Parsing is Messy; Objects Are Not

There's no way around it, really. At some point when parsing text, the code gets messy. Personally, I like to constrain the awful bits to regex, and make the most of it. Its terseness becomes an advantage here, as it contains the mess in one small spot, rather than resulting in large blocks of crude, messy parsing code.

There are a lot of ways to cram otherwise messy text into an object in PowerShell. You can parse it manually, with or without regex, extracting data one painful piece at a time to build your object. You can use ConvertFrom-String (available in Windows PowerShell 5.1 only) or ConvertFrom-StringData (a personal favourite of mine).
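As a small illustration of the ConvertFrom-StringData route, the sketch below turns a block of key=value text into an object. The sample text and property names are made up for the example; the cmdlet expects one key=value pair per line.

```powershell
# Hypothetical sample data: one key=value pair per line
$Text = @'
Protocol=TCP
LocalPort=5354
State=ESTABLISHED
'@

# ConvertFrom-StringData returns a hashtable of string keys and string values
$Data = $Text | ConvertFrom-StringData

# A hashtable casts straight to a custom object
$Object = [PSCustomObject]$Data
$Object.State   # ESTABLISHED
```

This only helps when the text is already in key=value form, which netstat output is not.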

But using the features built into PowerShell's regex engine, which comes from the .NET libraries, is perhaps the simplest and most effective way to go.

The Setup

Let's say we're trying to parse the output from the Windows netstat command, which looks like this:

PS ~\> netstat

Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    127.0.0.1:2002         WS-JOEL:51464          ESTABLISHED
  TCP    127.0.0.1:5354         WS-JOEL:49695          ESTABLISHED
  TCP    127.0.0.1:5354         WS-JOEL:49696          ESTABLISHED
  TCP    127.0.0.1:27015        WS-JOEL:51470          ESTABLISHED

We could parse this with a whole bunch of $string -split '\s+' operations and a while loop, but cramming all that into a custom object would leave us with messy code that is difficult to read and review.
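To make the comparison concrete, here is a rough sketch of the manual approach for a single line. It assumes the columns are separated by runs of spaces, as in the sample output above, and the variable names are illustrative only.

```powershell
# One line copied from the sample netstat output
$Line = '  TCP    127.0.0.1:2002         WS-JOEL:51464          ESTABLISHED'

# Split into columns on runs of whitespace, then split each endpoint on the colon
$Fields = $Line.Trim() -split '\s+'
$Local  = $Fields[1] -split ':'
$Remote = $Fields[2] -split ':'

$Connection = [PSCustomObject]@{
    Protocol      = $Fields[0]
    LocalAddress  = $Local[0]
    LocalPort     = $Local[1]
    RemoteAddress = $Remote[0]
    RemotePort    = $Remote[1]
    State         = $Fields[3]
}
$Connection.RemotePort   # 51464
```

This works for one well-formed line, but it validates nothing: the header line, or an IPv6 address containing extra colons, would produce nonsense objects unless more checks are added.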

Parsing with Regex

The ultimate goal here is to end up with an array of custom objects that we can emit to the pipeline, and to remove any unusable munged data. A basic regex pattern capable of parsing the netstat output would look something like this:

(\w+)\s+(([0-9]+\.){3}[0-9]+):([0-9]+)\s+([\w\d_-]+):([0-9]+)\s+(\w+)

Okay, wow, what a mess. We could stop this pattern here and deem it "good enough," but it would likely need at least five lines of comments to properly document what that pattern is doing, so that it's recognisable at a glance. Let's improve this with a few named match groups:

Named Matches

$MatchPattern = @(
    '(?<Protocol>\w+)'
    '(?<LocalAddress>(?:[0-9]+\.){3}[0-9]+):(?<LocalPort>[0-9]+)'
    '(?<RemoteAddress>[\w\d_-]+):(?<RemotePort>[0-9]+)'
    '(?<State>\w+)'
) -join '\s+'

Notice that, because of the length of the pattern, I have split it into pieces and joined them programmatically into a single match string. The -join supplies the \s+ (one or more whitespace characters) between the pieces, which is also a necessary part of the pattern. This step is optional, but it lends the match pattern some extra readability.
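To see what the -join actually produces, here is a cut-down version with only two of the pieces. The variable names are illustrative.

```powershell
# Two pieces only, to keep the joined result short
$Pieces = @(
    '(?<Protocol>\w+)'
    '(?<State>\w+)'
)
$Joined = $Pieces -join '\s+'

$Joined
# (?<Protocol>\w+)\s+(?<State>\w+)

$Result = 'TCP   ESTABLISHED' -match $Joined   # True
```

The joined string is an ordinary regex pattern; the array form only changes how it reads in the script.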

Try Before You Buy

It's always a good idea to check your pattern against the string you want to match, to see what happens.

# String copied from NETSTAT output
$String = '  TCP    192.168.22.144:51546   vs-in-f188:5228        ESTABLISHED'
$String -match $MatchPattern

# Output
True

Okay, great! Now let's check the $Matches variable. This is automatically populated whenever a -match operation against a single string succeeds.

$Matches

# Output
Name                           Value
----                           -----
Protocol                       TCP
RemotePort                     5228
LocalPort                      51546
State                          ESTABLISHED
LocalAddress                   192.168.22.144
RemoteAddress                  vs-in-f188
0                              TCP    192.168.22.144:51546   vs-in-f188:5228        ESTABLISHED

Interesting. You can see that all our requested match groups are there, plus one extra, which is the full matched string. We're halfway there.
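Because $Matches is a hashtable, individual groups can be read by name. It is also worth knowing that a failed -match does not clear $Matches, which is why checking the result of -match before using it matters. A small sketch with a shortened pattern and made-up input strings:

```powershell
$Pattern = '(?<Protocol>\w+)\s+(?<State>\w+)'

# A successful match populates $Matches; named groups are ordinary keys
if ('TCP   ESTABLISHED' -match $Pattern) {
    $Protocol = $Matches.Protocol      # dot access
    $State    = $Matches['State']      # index access
}

# A failed match returns False and leaves the previous $Matches in place
$Failed = 'no-match-here' -match $Pattern
$Stale  = $Matches.Protocol            # still 'TCP' from the earlier match
```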

Let's Get Down to Business

$Matches is a [hashtable], and in PowerShell we can convert this directly to [PSCustomObject]. However, in a case like this we're not particularly interested in the full string that gets matched, since that's basically just our original data. Instead, we'd much rather trim out the extra values and just convert the result.

Making use of output from netstat itself, this is one possible method of making it happen:

$Pattern = @(
    '(?<Protocol>\w+)'
    '(?<LocalAddress>(?:[0-9]+\.){3}[0-9]+):(?<LocalPort>[0-9]+)'
    '(?<RemoteAddress>[\w\d_-]+):(?<RemotePort>[0-9]+)'
    '(?<State>\w+)'
) -join '\s+'

$Connections = netstat | ForEach-Object {
    if ($_ -match $Pattern) {
        $Matches.Remove(0)
        [PSCustomObject]$Matches
    }
} | Select-Object -First 5

$Connections | Format-Table

# Output
RemotePort LocalPort State       LocalAddress RemoteAddress Protocol
---------- --------- -----       ------------ ------------- --------
51464      2002      ESTABLISHED 127.0.0.1    WS-JOEL       TCP
49695      5354      ESTABLISHED 127.0.0.1    WS-JOEL       TCP
49696      5354      ESTABLISHED 127.0.0.1    WS-JOEL       TCP
51470      27015     ESTABLISHED 127.0.0.1    WS-JOEL       TCP
5354       49695     ESTABLISHED 127.0.0.1    WS-JOEL       TCP

The Format-Table call is purely for display here, as custom objects with more than four properties are shown in list format by default.
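One detail worth keeping in mind: every value in $Matches is a string, so properties such as LocalPort end up as strings on the custom object, and sorting on them sorts as text. If that matters, one option is to convert the values before casting. A minimal sketch, using a shortened pattern for brevity:

```powershell
$Pattern = '(?<LocalAddress>(?:[0-9]+\.){3}[0-9]+):(?<LocalPort>[0-9]+)'
$Line    = '  TCP    127.0.0.1:5354         WS-JOEL:49695          ESTABLISHED'

if ($Line -match $Pattern) {
    $Matches.Remove(0)
    # Regex captures are always strings; convert the port so it sorts numerically
    $Matches['LocalPort'] = [int]$Matches['LocalPort']
    $Connection = [PSCustomObject]$Matches
}

$Connection.LocalPort.GetType().Name   # Int32
```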

Caveats

As always, each approach has its share of potentially undesirable results. The most immediately obvious one is that the order of the properties is not preserved, because $Matches is a hashtable. If we want to define a specific display order, we have two fairly simple options.

One is to insert a PSTypeName property and add a formatting hint for that type name in order to specify the order the properties are displayed in. More on that can be found in this post by Kevin Marquette.

The other option is to define a class with these properties and cast the hashtable to that class type instead of [PSCustomObject].

In both cases you can add other formatting hints, such as properties to avoid displaying by default.
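The first option, a type name plus a formatting hint, might look something like the sketch below. The type name Demo.NetstatConnection is invented for this example, the pattern is shortened to four groups, and Update-TypeData with -DefaultDisplayPropertySet is only one of several ways to supply the hint.

```powershell
# 'Demo.NetstatConnection' is a made-up type name for this example
$TypeName = 'Demo.NetstatConnection'

# Formatting hint: which properties to show by default, and in which order
$TypeData = @{
    TypeName                  = $TypeName
    DefaultDisplayPropertySet = 'Protocol', 'LocalAddress', 'LocalPort', 'State'
}
Update-TypeData @TypeData -Force

$Pattern = '(?<Protocol>\w+)\s+(?<LocalAddress>(?:[0-9]+\.){3}[0-9]+):(?<LocalPort>[0-9]+)\s+\S+\s+(?<State>\w+)'
$Line    = '  TCP    127.0.0.1:2002         WS-JOEL:51464          ESTABLISHED'

if ($Line -match $Pattern) {
    $Matches.Remove(0)
    $Connection = [PSCustomObject]$Matches
    # Tag the object so the formatting hint above applies to it
    $Connection.PSObject.TypeNames.Insert(0, $TypeName)
}

$Connection
```

Any property left out of the default display set is still present on the object; it is just not shown unless asked for.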

Using Classes

class Connection {
    [datetime] $Timestamp
    [string] $Protocol

    [string] $LocalAddress
    [int] $LocalPort

    [string] $RemoteAddress
    [int] $RemotePort

    [string] $State

    Connection() {
        $this.Timestamp = Get-Date
    }
}

$Pattern = @(
    '(?<Protocol>\w+)'
    '(?<LocalAddress>(?:[0-9]+\.){3}[0-9]+):(?<LocalPort>[0-9]+)'
    '(?<RemoteAddress>[\w\d_-]+):(?<RemotePort>[0-9]+)'
    '(?<State>\w+)'
) -join '\s+'

$Connections = netstat | ForEach-Object {
    if ($_ -match $Pattern) {
        $Matches.Remove(0)
        [Connection]$Matches
    }
} | Select-Object -First 5

$Connections | Format-Table

As you can see, it's much the same as working with a [PSCustomObject]. In essence, as long as the properties you're trying to set are publicly settable, and the type you're casting to has a public default constructor (i.e., one that takes no parameters), you can cast a hashtable to it.

If there is an appropriate .NET type that fits the bill, you can even cast to that, should you see the need to do so.
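As one example of casting to an existing .NET type, System.UriBuilder has a parameterless constructor and settable Scheme, Host, and Port properties, so named groups with those exact names can be cast to it. The pattern and sample URL below are illustrative assumptions and are not part of the netstat example.

```powershell
$Pattern = '^(?<Scheme>\w+)://(?<Host>[\w.-]+):(?<Port>[0-9]+)$'

if ('https://example.com:8443' -match $Pattern) {
    $Matches.Remove(0)
    # Group names must match the target type's property names
    $Builder = [System.UriBuilder]$Matches
}

$Builder.Host                  # example.com
$Builder.Port.GetType().Name   # Int32
```

A group name that does not correspond to a settable property on the target type makes the cast fail, so the pattern and the type have to be kept in step.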

Not all data is consistent enough for regex to yield meaningful results, but most data sources can be regexed. That said, in a lot of cases there are quicker and more effective ways to get the data in PowerShell.

Thanks for reading!