On code optimisation
As organisations move towards more automation, those that use Windows-based environments encourage learning PowerShell. Our organisation is one of these. This morning, we received a challenge:
Before you leave for the holidays, the Elves in Accounting just need you to fix your expense report (your puzzle input); apparently, something isn’t quite adding up.
Specifically, they need you to find the two entries that sum to 2020 and then multiply those two numbers together.
For example, suppose your expense report contained the following:
1721 979 366 299 675 1456
In this list, the two entries that sum to 2020 are 1721 and 299. Multiplying them together produces 1721 * 299 = 514579, so the correct answer is 514579.
Of course, your expense report is much larger. Find the two entries that sum to 2020; what do you get if you multiply them together?
The list had 200 numbers. In my head, the process was simple:
- Iterate through each of those numbers
- Create an iteration within this iteration to iterate through the numbers
- Get the sum of the elements in the current iteration
- If the sum is 2020, get the product of the two elements
- Output the product
That means the code would be something like this:
$First = Get-Content "$PsScriptRoot\input.txt"
foreach ($FirstNumber in $First) {
    foreach ($SecondNumber in $First) {
        if ($FirstNumber + $SecondNumber -eq 2020) {
            $FirstNumber * $SecondNumber
        }
    }
}
But you would receive no output, because Get-Content reads each line of the file as a string, and when its operands are strings, the + operator concatenates them; we want it to add two integers.
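A quick check at the prompt illustrates the difference (a minimal sketch, using two numbers from the example):

```powershell
# With strings, '+' concatenates
'1721' + '299'              # → 1721299

# With integers, '+' adds
1721 + 299                  # → 2020

# Casting the strings to [int] restores arithmetic
[int]'1721' + [int]'299'    # → 2020
```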
All you need to do is make a minor change in the read operation:
[int[]]$First = Get-Content "$PsScriptRoot\input.txt"
This explicitly tells PowerShell to read each line in the text file as an integer, instead of a string. The [int[]] notation is a cast: you “cast” the strings you read as an array of integers.
When you run it now, you will get two products (assuming that there is precisely one right answer in the list), because the loops find the same pair twice, once in each order.
Given that both output products are the same number, this script solves your problem. But there is one thing that most will overlook, as I did:
Efficiency.
While absent-mindedly scrolling through my social feed after work, I realised that what I had written could not be efficient code. Not merely because the script displayed the product twice, but because the script pairs every one of the two hundred elements with every element, so the number of iterations to go through would be 200 ^ 2, which is 40,000. To verify, you could add a variable, increment it while processing each pair and output its value at the end:
[int[]]$First = Get-Content "$PsScriptRoot\input.txt"
$PairsProcessed = 0
foreach ($FirstNumber in $First) {
    foreach ($SecondNumber in $First) {
        Write-Host "$FirstNumber | $SecondNumber"
        if ($FirstNumber + $SecondNumber -eq 2020) {
            $FirstNumber * $SecondNumber
        }
        $PairsProcessed++
    }
}
$PairsProcessed
40,000 iterations is perhaps not much. But this approach has two problems:
- Each element gets added to itself, which could lead to an issue—what if 1010 were a number in the list? Unless there is another 1010, the result would be wrong.
- This approach introduces redundant combinations.
For example, if we picked four elements, 1664, 1909, 1904 and 1638, the iterations using the script above would yield:
1664 | 1664
1664 | 1909
1664 | 1904
1664 | 1638
1909 | 1664
1909 | 1909
1909 | 1904
1909 | 1638
1904 | 1664
1904 | 1909
1904 | 1904
1904 | 1638
1638 | 1664
1638 | 1909
1638 | 1904
1638 | 1638
The only genuinely useful combinations are:
1664 | 1909
1664 | 1904
1664 | 1638
1909 | 1904
1909 | 1638
1904 | 1638
From 16 pairs, all the way down to 6.
The idea is to avoid reprocessing an element that we have already processed. This way, we create combinations—combinations as in “Permutations and Combinations”.
To achieve this, let us add another array variable, called ProcessedElements, to keep track of the processed elements. We update this list as soon as we enter the outer loop, so that we can exclude these elements in the loop within it:
[int[]]$First = Get-Content "$PsScriptRoot\input.txt"
$TotalPairsProcessed = 0
$ProcessedElements = @()
foreach ($FirstNumber in $First) {
    $ProcessedElements += $FirstNumber
    foreach ($SecondNumber in ($First | Where-Object { $PSItem -notin $ProcessedElements })) {
        if ($FirstNumber + $SecondNumber -eq 2020) {
            $FirstNumber * $SecondNumber
        }
        $TotalPairsProcessed++
    }
}
$TotalPairsProcessed
The total pairs processed came down to 19,900. Perfect. That is the number of combinations, as the mathematical formula confirms for this data:
200! / (2! × (200 – 2)!) = 19,900
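The factorials cancel, so C(n, 2) reduces to n × (n − 1) / 2, which a one-liner can confirm:

```powershell
# C(200, 2) = 200! / (2! × 198!) = 200 × 199 / 2
$n = 200
$n * ($n - 1) / 2    # → 19900
```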
Great. This time, I thought I would measure the time the script needs to execute: 2010.3611 milliseconds. A long time, I thought. But I could reduce it further. What if the thirteenth element and the seventy-fifth combined to give 2020? Once I had the answer, I could save myself thousands of iterations by skipping the rest of the set. And so, I told the script to exit once it found the answer; the assumption was that there is exactly one right answer: no more, no less.
[int[]]$First = Get-Content "$PsScriptRoot\input.txt"
$ProcessedElements = @()
foreach ($FirstNumber in $First) {
    $ProcessedElements += $FirstNumber
    foreach ($SecondNumber in ($First | Where-Object { $PSItem -notin $ProcessedElements })) {
        if ($FirstNumber + $SecondNumber -eq 2020) {
            return $FirstNumber * $SecondNumber
        }
    }
}
When I optimised the code to do no more than meet my requirements, the processing time came down all the way to 292.6665 milliseconds—a reduction of a whopping 85.44%. Smiling, I thought of a blog post that I had read long ago: Software disenchantment by Nikita Prokopov.
Then I thought, ‘How do these numbers compare to the original solution?’ I ran the Measure-Command cmdlet on the initial script. I was in for a shock:
92.5045 milliseconds.
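For reference, the measurement itself looks something like this; Measure-Command takes a script block and returns a TimeSpan (the script filename here is illustrative):

```powershell
# Time an arbitrary script block; the result is a System.TimeSpan
$Elapsed = Measure-Command {
    & "$PsScriptRoot\solution.ps1"
}
$Elapsed.TotalMilliseconds
```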
How did that happen—200 milliseconds less than the optimised code?
Because the optimised code was not optimised after all: in the iteration for the second operand, we perform a Where-Object operation on every pass, which needs more computing. An inefficient approach. To reduce computing, we ought to remove the elements that we have already processed from the array, instead of filtering the array.
An array object has a built-in method called Remove(). But if you use it in this context, you would get an error saying, Exception calling "Remove" with "1" argument(s): "Collection was of a fixed size." To remove elements from the array and avoid this error, we use the System.Collections.ArrayList type instead of System.Array.
To make PowerShell see the numbers as integers and, at the same time, treat the list as an ArrayList, we cast the integer array as an ArrayList, like so:
[System.Collections.ArrayList][int[]]$First = Get-Content "$PsScriptRoot\input.txt"
To make sure we have the right object type, and that the array is no longer of a fixed size, we test using the GetType method on an element and the IsFixedSize property of the ArrayList object:
# Get the object type of any one element
$First[0].GetType()
# See if the array is of fixed size
$First.IsFixedSize
Which gives you Int32 and False. Perfect.
But now, we would need to change the way we iterate, because if you use a foreach loop while removing elements as you go, you would get an error saying, Collection was modified; enumeration operation may not execute. This is because a foreach loop enumerates the collection, and the enumerator does not allow the collection to change while the enumeration is in progress. We change the looping construct to the following and measure how long the script takes to run:
[System.Collections.ArrayList][int[]]$First = Get-Content "$PsScriptRoot\input.txt"
do {
    for ($i = 1; $i -lt $First.Count; $i++) {
        if ($First[0] + $First[$i] -eq 2020) {
            return $First[0] * $First[$i]
        }
    }
    $First.Remove($First[0])
} until ($First.Count -eq 0)
We start with the first element in the array and add it to each element from the second onwards. Once done, we remove the first element, and the second becomes the new first. We continue this way until the array empties.
Of course, using two for loops is more graceful: the outer loop runs from the first element to the second-to-last, while the inner loop starts at the element just after the one the outer loop has selected and runs until the last element.
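That variant would look something like this (a sketch that keeps the early return; indices replace the ArrayList bookkeeping):

```powershell
[int[]]$First = Get-Content "$PsScriptRoot\input.txt"
# The outer loop stops at the second-to-last element; the inner loop
# starts just past the outer index, so each pair is seen exactly once
for ($i = 0; $i -lt $First.Count - 1; $i++) {
    for ($j = $i + 1; $j -lt $First.Count; $j++) {
        if ($First[$i] + $First[$j] -eq 2020) {
            return $First[$i] * $First[$j]
        }
    }
}
```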
54.5861 milliseconds. A reduction of 40.99%. This is as far as I could get in one sitting.
Although, remember that results may vary across environments. On Windows PowerShell, processing all 19,900 pairs took 1465.76 milliseconds, the script with a return took 256.6379 milliseconds, the script with no optimisation took 33.792 milliseconds, and the script that uses the Remove method of ArrayList took 51.145 milliseconds.
Notice that the script with no optimisation took the least time in Windows PowerShell, on a PC with the same configuration as my Linux machine. Languages behave differently; their methods are optimised based on what the makers of each language decide its use case is. Our endeavour as engineers should be to optimise our code for performance based on the behaviour of the language, the task at hand, and the environment.
Of course, most of the time, the amount of time spent on trying out different approaches to optimise the code outweighs the amount of time you would save with any approach—even with a million rows in the input file in this situation. But that is no excuse for sloppy scripting. It comes down to user experience. You might bring down the running time from 1.34 seconds to 0.34 seconds, by spending four hours optimising the code, but the end user would see it differently—a blink of the eye versus a deep breath.
Make the code do no more than what is necessary.