Well, it's certainly been a while since I posted on here. Trying to get back into the swing of things for 2020 now that I'm mostly settled in at my new job. So, let's take a look at a classic problem that never fails to confuse folks just getting into PowerShell. I've also seen plenty of more experienced folks trip over it quite a lot, so here we are.
Before we begin, many thanks to Chris Dent for asking me to put this together and for providing the basis for the recommendations at the end!
Background Information
The term collection
is often bandied about, so let's define it concretely before we get started, shall we?
In general parlance, "collection" is a pretty broad term.
In its simplest usage, it just refers to a set of individual items.
Oftentimes, these items may be related to one another in some way.
Background - Collections in .NET
In the .NET ecosystem, the term collection
has a pretty well-defined meaning.
A "collection" is any type of object that implements the ICollection
interface.
Seems a bit circular, right?
Let's look at what that actually means and take a quick refresher on what an interface
actually is in .NET for those of you who're not familiar.
From the C# Programming Guide's page on interfaces, we get this definition:
An interface contains definitions for a group of related functionalities that a non-abstract class or a struct must implement.
By using interfaces, you can, for example, include behavior from multiple sources in a class.
That's a bit of a wordy definition, but it does the job. In practical terms, an interface is a way of defining a required set of properties or methods that must be defined by any other type that wants to inherit from or implement this interface. It's a way of being able to define standard ways of interacting with more than one kind of object.
ICollection
ICollection
is a fairly simple interface that has a few demands:
- Methods
  - CopyTo(Array array, int index)
- Properties
  - Count
  - IsSynchronized
  - SyncRoot
Not a whole lot, huh?
So for a type to be an ICollection, it needs to guarantee these things:
- It has a CopyTo() method that you can use to copy its items into an array of your choosing, starting at the requested index in that target array.
- It has a Count property that will give you the current number of items in that collection.
- It has the IsSynchronized and SyncRoot properties. IsSynchronized reports whether access to the collection is synchronised (thread-safe), and SyncRoot provides an object that callers can lock on to synchronise access themselves. For an example of a collection that lets you create synchronised versions, check out Hashtable and its static method Synchronized().
Fairly straightforward on the whole, I suppose.
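To make those members a little more concrete, here's a minimal sketch that pokes at them from PowerShell. The variable names are just for illustration, and it assumes PowerShell 5.0 or later for the ::new() call.

```powershell
# Arrays implement ICollection, so the members described above are available on them.
$source = 1, 2, 3
$source -is [System.Collections.ICollection]

# CopyTo() copies the items into an existing array, starting at the given index of the target.
$target = [object[]]::new(5)
$source.CopyTo($target, 2)
$target.Count

# Hashtable.Synchronized() wraps an existing Hashtable in a thread-safe version.
$plain = @{}
$synced = [hashtable]::Synchronized($plain)
$plain.IsSynchronized
$synced.IsSynchronized
```

After CopyTo() runs, the first two slots of $target are still empty and the copied items sit at indexes 2 through 4.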
Interfaces can get quite complex, and there are two main extensions of ICollection in use in .NET Core: IList and IDictionary.
IList types represent a "flat" collection, where it's simply an ordered list of items.
On the other side of things, IDictionary represents a less-strictly-ordered set of key/value pairs where a given key (often a string such as a name) is used to access the relevant item directly.
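As a rough illustration of the difference, the sketch below uses a generic List as the IList example and a plain hashtable as the IDictionary example. The names and values are made up for the demonstration.

```powershell
# An IList: items are stored in order and accessed by position.
$names = [System.Collections.Generic.List[string]]::new()
$names.Add('first')
$names.Add('second')
$names[0]

# An IDictionary: items are accessed directly by key.
$ages = @{ Alice = 30; Bob = 25 }
$ages['Alice']

# Both are still collections underneath.
$names -is [System.Collections.IList]
$ages -is [System.Collections.IDictionary]
$names -is [System.Collections.ICollection]
$ages -is [System.Collections.ICollection]
```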
Collections in PowerShell
The main type of collection in use in PowerShell is the humble Array
, at least in terms of what is often available directly from a script.
Behind the murky curtain, PowerShell actually utilises the System.Collections.ArrayList
type quite heavily in its pipeline processor.
Partially as a result of that, arrays (and more specifically object[]
) are the backbone of PowerShell scripting.
In the vast majority of cases, you're going to be handling arrays unless you deliberately choose to use another type of collection to handle output.
Let's take a look at some fairly common methods of handling collections of items in PowerShell, shall we?
Common Patterns
Use of += to Build Arrays
$array = @()
foreach ($value in 1..5) {
    $array += $value
}
Seems pretty simple and sound, at least at first glance. So… spot anything amiss?
The pattern here is "create the collection object, and then add each item to it one at a time".
This works quite well for some collection types, but not so well for others.
In fact, for arrays specifically it's technically not possible to do this — arrays are actually fixed-size collections.
If you call $array.Add(10)
you'll get an error stating as much.
But this still works, somehow, and it's all thanks to some ✨ PowerShell magic! ✨
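If you'd like to see the fixed-size behaviour for yourself, a small sketch like this should do it. I've wrapped the call in try/catch so the error doesn't stop the rest of the script; the exact wording of the message may vary between versions.

```powershell
$array = @(1, 2, 3)

# Arrays report themselves as fixed-size collections.
$array.IsFixedSize

$addFailed = $false
try {
    # Add() exists on arrays (via IList), but it always throws for a fixed-size collection.
    $array.Add(10)
}
catch {
    $addFailed = $true
    $_.Exception.Message
}

# The array is unchanged.
$array.Count
```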
ℹ Behind the Scenes
PowerShell's + and += operators are designed to work with arrays in a relatively unusual way. When you try to add items to an array like this, what actually happens goes something like this:
- PowerShell checks the size of the collection in $array and the number of items being added to it (in this case, just one each time).
- PowerShell creates a completely different array of the correct size.
- The original array is copied into this new array, along with the new item(s).
This is also why it's perfectly possible to join two arrays together with the + or += operators.
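One way to convince yourself that a brand new array really is created each time is to keep a second reference to the original and compare the two afterwards. This is only a sketch, and $original and $joined are names I've picked for the demonstration.

```powershell
$array = @(1, 2, 3)
$original = $array          # second reference to the same array object

$array += 4

# $array now points at a new, larger array; $original still points at the old one.
[object]::ReferenceEquals($original, $array)
$original.Count
$array.Count

# The same mechanism is what lets you join two arrays together.
$joined = @(1, 2) + @(3, 4)
$joined.Count
```

The original three-item array is untouched, which is exactly why it hangs around in memory until nothing refers to it any longer.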
What's Wrong with this Picture
The missing piece here, and "what's wrong" in some ways, is that this operation can become prohibitively expensive. For smaller arrays, it's not a concern at all. The main problem is that we won't always know ahead of time how big our collection might become.
If $array
is quite large (say somewhere in the realm of 5,000-10,000 items), then generating new arrays every single time you add just a couple of items becomes very expensive to do.
A new array, even if it's empty, requires allocation of memory so that the array has enough space to exist in a single region of memory.
Adding a single item to a 10,000 item array means reserving a new block of memory large enough to hold all 10,001 items, then copying everything across.
Notice that I didn't mention anything about deleting the original array. This is because PowerShell can't necessarily assume you're definitely wanting to completely remove the original array. You might have another variable storing a reference to that same array in a completely different scope, after all. Checking for that every time you want to add things to an array would also be prohibitively expensive.
So what happens, then? The original array will sit around in memory until the .NET garbage collection routines realize that it's not needed by anything anymore. In some cases, that can mean it sits around for quite a while before that memory is freed up again.
For smaller collections, of course, this isn't really an issue at all. However, this is a very easy pattern to fall into, so it can come back to bite you when you least expect it. Let's look at a few other ways to approach this task.
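If you want a feel for how the cost grows, you could time the += pattern at a couple of different sizes. Measure-ArrayAppend below is a hypothetical helper I've made up for this sketch, and the sizes are arbitrary. The actual numbers will depend heavily on your machine and PowerShell version, so I won't quote any here.

```powershell
# Hypothetical helper: times how long it takes to build an array of $Count items with +=.
function Measure-ArrayAppend {
    param([int]$Count)

    $elapsed = Measure-Command {
        $array = @()
        foreach ($value in 1..$Count) {
            # Each += allocates a new array and copies every existing item into it.
            $array += $value
        }
    }

    $elapsed.TotalMilliseconds
}

foreach ($size in 1000, 10000) {
    '{0,6} items: {1:N1} ms' -f $size, (Measure-ArrayAppend -Count $size)
}
```

Because every addition copies the whole array so far, ten times the items generally costs a good deal more than ten times the work.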
Using an Expandable List
.NET contains quite a few types of lists.
If you look at the types that implement IList
you'll see there are quite a few available options.
Oftentimes older blog posts or code examples will have ArrayList
as their collection of choice.
Personally, I've never liked using ArrayList directly.
As I mentioned above, PowerShell uses it behind the scenes extensively — but I'd be willing to bet that if they had the time available, more than a few people on the PowerShell team would love to replace it with something more modern.
I know I would; ArrayList comes from the very old days of .NET, hailing all the way back to .NET v1 if I recall correctly.
In fact, the documentation page for ArrayList warns against its continued use:
ℹ Important
We don't recommend that you use the ArrayList class for new development. Instead, we recommend that you use the generic List<T> class. The ArrayList class is designed to hold heterogeneous collections of objects. However, it does not always offer the best performance. Instead, we recommend the following:
- For a heterogeneous collection of objects, use the List<Object> (in C#) or List(Of Object) (in Visual Basic) type.
- For a homogeneous collection of objects, use the List<T> class. See Performance Considerations in the List<T> reference topic for a discussion of the relative performance of these classes.
See Non-generic collections shouldn't be used on GitHub for general information on the use of generic instead of non-generic collection types.
So, for completeness' sake, I'll include an example of what using an ArrayList looks like — but I wouldn't recommend ever using it.
# There are multiple ways to create the original ArrayList, but this is one of the simplest and easiest to use.
[System.Collections.ArrayList]$list = @()
foreach ($value in 1..5) {
    # Redirection to null is necessary as ArrayList outputs an index number for each added item.
    $list.Add($value) > $null
}
So, let's say we follow that recommendation from the ArrayList documentation, and want to use a generic List instead.
What even is List<T>
anyway?
List<T>
is a generic type, which in simple terms just means it has a bit of a fluid definition.
You can define a List<T>
in terms of what type of object you want it to work with or (in the case of List, at least) store, where the T
is the type of object you're storing in it.
For a list comprised entirely of numbers, we might use List<int>
to store them.
Why would we use this over ArrayList? In PowerShell there isn't a massive difference here, but for me personally it comes down to a few things.
- You can create a List that only stores a specific type, which means you can catch mistakes earlier. Adding other types of items to that list will either have them converted to the List's designated type, or will throw an error stating that it was the wrong type.
- Using the Add() method on List<T> doesn't create incidental output like ArrayList's Add() method does, so you have less chance to miss things and pollute your output stream.
- Given the assertion in the documentation, no less, that ArrayList is recommended against, I would assume two things:
  - There is the possibility at any point that the .NET / .NET Core team will eventually deprecate it completely and possibly even remove it from .NET at that point or afterwards.
  - No further work is being done on ArrayList, and if there are any improvements to be had you will likely only find them in List<T>.
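The first two points are easy enough to see in action. Here's a small sketch; the values are arbitrary, and $conversionFailed is just a flag I've added so the failure can be checked afterwards.

```powershell
$numbers = [System.Collections.Generic.List[int]]::new()

# A string that looks like a number is converted to the list's type.
$numbers.Add('42')

# Something that can't be converted is rejected with an error.
$conversionFailed = $false
try {
    $numbers.Add('not a number')
}
catch {
    $conversionFailed = $true
}
$numbers.Count

# ArrayList.Add() returns the index of the new item, which leaks into output if not captured.
$arrayList = [System.Collections.ArrayList]::new()
$index = $arrayList.Add('x')
$index
```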
Given that, I'd choose to use List<T>
instead of ArrayList in every single case.
There's one question you need to ask before using it:
What kind of objects am I storing in this list?
If you know ahead of time that you plan on storing more than one type of item in the list, you have two choices:
- Use a common parent type or interface. For example, System.Exception is the parent type of any Exception-typed object in .NET, so using that base type allows you to store any other Exception object.
- Use the object type, which is the parent type for everything in .NET.
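Both choices might look something like the sketch below. The exception types and messages are just examples I've picked; any types sharing a common parent would do.

```powershell
# Option 1: a common parent type. Any exception type can be stored in a List[Exception].
$errors = [System.Collections.Generic.List[System.Exception]]::new()
$errors.Add([System.ArgumentException]::new('bad argument'))
$errors.Add([System.InvalidOperationException]::new('bad state'))
$errors.Count

# Option 2: object, which accepts anything at all.
$anything = [System.Collections.Generic.List[object]]::new()
$anything.Add(1)
$anything.Add('two')
$anything.Add((Get-Date))
$anything.Count
```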
In our previous examples, we're simply storing numbers. Let's see how that looks with a generic List, shall we?
[System.Collections.Generic.List[int]]$list = @()
foreach ($value in 1..5) {
    # No redirection needed here, List<T> does not emit data from Add()
    $list.Add($value)
}
That certainly looks a little tidier.
The long type name is a bit much at times, but in PowerShell 5.0 and up, you can use using namespace System.Collections.Generic to let you just use [List[int]] if you prefer.
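Here's roughly what the shortened form looks like. One thing to keep in mind: in a script file, using statements have to come before any other statements, so this sketch assumes it sits at the very top of the file.

```powershell
using namespace System.Collections.Generic

# With the namespace imported, the short name resolves to the same .NET type as before.
[List[int]]$list = @()
foreach ($value in 1..5) {
    $list.Add($value)
}

$list.Count
```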
📝 Note
While the .NET documentation frequently uses the C# syntax List<T>, when we get to PowerShell we actually use square brackets to indicate the generic type parameter instead. List<T> instead becomes [List[T]], and in practical usage T is replaced with another type name. This is purely a syntactical difference between C# and PowerShell; in both cases we're referring to the same .NET type. In F# you'd write List<'T> and in Visual Basic it looks instead like List(Of T).
Why Use a List
Lists are fantastic for cases where you don't know the collection size ahead of time. If you know the size ahead of time, you can simply pre-allocate an array of the correct type and size anyway:
$array = [int[]]::new(5)
for ($index = 0; $index -lt 5; $index++) {
    $array[$index] = $index * 2
}
In PowerShell, they're best used in cases where you need to be adding to the collection from multiple places in your code, and you want the order of items to stay the same as the order you put them into the collection. They're also the most effective way to build multiple collections in a single loop statement, if that's something you need to do.
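As a quick sketch of the "multiple collections in a single loop" case, here's one way you might split a set of numbers into two lists in a single pass. The even/odd split is just a stand-in for whatever sorting logic you actually need.

```powershell
$even = [System.Collections.Generic.List[int]]::new()
$odd = [System.Collections.Generic.List[int]]::new()

foreach ($value in 1..10) {
    # Each item goes to exactly one list, and each list keeps its insertion order.
    if ($value % 2 -eq 0) {
        $even.Add($value)
    }
    else {
        $odd.Add($value)
    }
}

$even -join ', '
$odd -join ', '
```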
Using the Pipeline
I mentioned earlier that PowerShell already heavily uses ArrayList
in its pipeline processor.
We can actually take advantage of this without ever needing to use ArrayList ourselves directly.
This style of building collections looks a lot more like coding in a functional language than the other options, so it may confuse some folks who aren't accustomed to coding this way.
However, it's perfectly valid and quite flexible in nature.
The above examples coded this way would look a little more like this:
$array = foreach ($value in 1..5) {
    # This value is not stored anywhere and becomes part of the "output" from the loop statement.
    $value + 4
}
If you're not familiar with this kind of pattern, you're probably wondering how on Earth that even works.
PowerShell, unlike most .NET languages (and indeed many other programming languages in general), allows you to assign the "result" of a statement to a variable, regardless of whether that statement is a keyword-based loop or a complete pipeline.
Any uncaptured output will then be funneled into that variable. When there is more than one item, it is stored as [object[]] (an array of objects); a single item is stored as-is, and no output at all leaves the variable as $null.
Because this utilises the built-in PowerShell pipeline (which, remember, uses ArrayList to handle output), there is no additional overhead of creating another type of collection here. We're just using what PowerShell has always been ready to provide for us. This is also one of the simplest ways to keep your code highly compatible across PowerShell versions with minimal effort.
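One thing worth knowing about this pattern is what happens when the statement only emits a single item. The sketch below shows the difference, along with the array subexpression operator @() as one way to guarantee you always get an array back.

```powershell
# Several items come back as an object array.
$many = foreach ($value in 1..5) { $value + 4 }
$many.GetType().FullName     # System.Object[]

# A single item is not wrapped; you get the item itself.
$one = foreach ($value in 1..1) { $value + 4 }
$one.GetType().FullName      # System.Int32

# Wrapping the statement in @() ensures an array regardless of the item count.
$always = @(foreach ($value in 1..1) { $value + 4 })
$always.GetType().FullName   # System.Object[]
```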
ℹ Compatibility
This pattern works with all released versions of PowerShell, even back to version 1. For compatibility, it's the absolute best option. And thankfully, it's one of the most effective options as well!
This pattern works with all PowerShell control-flow statements, which include switch, foreach, for, do..until, do..while, while, and try..catch..finally (and there are probably a couple I forgot to mention that will still work as well!)
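For instance, here's a quick sketch of the same idea with switch, try..catch, and if. The values are arbitrary and only there to show that the output of each statement lands in the variable.

```powershell
# switch iterates over each item in the array and emits one string per item.
$labels = switch (1..4) {
    { $_ % 2 -eq 0 } { "$_ is even" }
    default { "$_ is odd" }
}
$labels

# try..catch: whichever block ends up emitting output supplies the value.
$parsed = try { [int]'123' } catch { -1 }
$parsed

# if/else works the same way.
$size = if ($parsed -gt 100) { 'large' } else { 'small' }
$size
```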
It also works seamlessly with pipelines, no matter how lengthy they get:
$pictures = Get-ChildItem -Filter *.png -Recurse -File |
    Where-Object Length -gt 1MB |
    Select-Object -Property Name, Length, FullName
Working With Collections
New-Object vs. ::new() vs. Casting
This is more of a general point to make about creating new objects in PowerShell, but if you're not aware it bears mention here too.
New-Object
is, not to put too fine a point on it, quite slow.
At least compared to alternatives.
Syntax like $list = [System.Collections.Generic.List[int]]::new() is fairly new, first available in PowerShell 5.0.
If compatibility is your main concern, ::new()
is not the first choice, to be sure.
However, New-Object
should still be the last choice, generally being reserved for COM objects and similar otherwise difficult to construct items.
Syntax $list = [System.Collections.Generic.List[int]]@()
or [System.Collections.Generic.List[int]] $list = @()
will work all the way back to version 2 of PowerShell.
For a compromise between compatibility and speed I'd go for one of those.
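Side by side, the options look something like this. The variable names are only there to tell the four apart; each one ends up holding an empty List of int.

```powershell
# Slowest option, but available everywhere.
$viaNewObject = New-Object -TypeName 'System.Collections.Generic.List[int]'

# Static constructor syntax; needs PowerShell 5.0 or later.
$viaNew = [System.Collections.Generic.List[int]]::new()

# Casting an empty array to the list type.
$viaCast = [System.Collections.Generic.List[int]]@()

# Type-constraining the variable, then assigning an empty array.
[System.Collections.Generic.List[int]]$viaConstraint = @()

$viaNewObject.GetType() -eq $viaNew.GetType()
$viaCast.GetType() -eq $viaConstraint.GetType()
```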
Adding Many Items to an Existing Collection
If you have a List collection and want to add multiple items to it quickly, rather than using any kind of loop you can actually add them in directly.
List<T>
has an AddRange()
method that allows you to add multiple items at once:
$list = [System.Collections.Generic.List[object]]@(1..5)
$list.AddRange(6..10)
# Note - for strongly-typed lists you will need to cast the array you're adding to the relevant array type explicitly:
$list = [System.Collections.Generic.List[int]]@(1..5)
$list.AddRange([int[]](6..10))
General Recommendations
Taking all the above into account, we can establish some broad recommendations when handling collections of items in PowerShell.
- If you can simply assign the statement to a variable to build your collection, do so. It's the most compatible approach and one of the simplest ways to handle it.
- If for some reason you cannot assign the statement or need to build multiple collections in the same loop, use a List<T>.
- If you have a scenario where you need to add multiple items to a collection, use a List<T> and make short work of it with AddRange().
I went into this post thinking it'd be short and sweet, but I really can't help but explain things to the nth degree, can I? 😂
That's all I have for now, I think. If you have any questions or comments, feel free to leave a comment below.
I'm also keeping an eye out for some more content to cover to get back into regular blog posts, so if there's something I haven't already covered that you're needing some explanation on and think I can help, give me a shout on Twitter or one of the community channels. Have a great day! 😊