LINQ Extensions 3 – Batch into sub-sequences
After last week's post on extracting elements from a list by minimum or maximum keys, Ody Mbegbu mentioned on Google+ that something he feels LINQ is missing is the functionality to batch, page, or divide a sequence into sub-sequences of a given size.
That is what we are going to look at today!
In fact, Ody already posted a method doing what we want:
static IEnumerable<IEnumerable<T>> Batch<T>(
    this IEnumerable<T> query, int batchSize)
{
    int count = query.Count();
    int batchCount = count <= batchSize
        ? 1
        : (count % batchSize == 0
            ? (count / batchSize)
            : (count / batchSize + 1)
        );

    for (int i = 0; i < batchCount; i++)
        yield return query
            .Skip(i * batchSize)
            .Take(batchSize);
}
We will take this method as a starting point, and explore if and how it can be improved.
A slight modification
First, note how the code calculating the number of batches is quite complicated. It handles three different cases:
- less than one full batch;
- last batch is exactly full;
- last batch is only partly full.
The method will always return the minimum number of batches needed to contain all elements, with the exception of an empty input. In that case, a sequence containing an empty sequence will be returned.
While that may be desired behaviour, it is something I will change. Doing so will allow us to simplify both this calculation and the rest of the code. Of course, we could always re-add the special case if we like.
In my version, the method will simply return an empty sequence for an empty input.
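To make the intended behaviour concrete, here is a small usage sketch (hypothetical values, assuming the extension method is in scope):

var some = new[] { 1, 2, 3, 4, 5 }.Batch(2);
// three batches: { 1, 2 }, { 3, 4 }, { 5 }

var none = Enumerable.Empty<int>().Batch(2);
// an empty sequence - no batches at all, rather than
// a single empty batch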
Calculating number of batches
With that out of the way, note how the number of batches will always be the length of the input, divided by the batch size, rounded up.
Unfortunately, there is no such thing as rounding up in integer division. We can easily simulate it, however. All we need to see is that rounding up means adding 1 to the result, except when the division has no remainder. Instead of deciding between the two cases with a conditional statement, we can add (batchSize - 1) before dividing, which results in exactly the desired behaviour:
var batchCount = (count + batchSize - 1) / batchSize;
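A quick check with a batch size of 3 confirms that this covers all three cases from above, as well as our new empty-input behaviour:

// batchSize = 3
// count = 6: (6 + 2) / 3 == 2  - last batch exactly full
// count = 7: (7 + 2) / 3 == 3  - last batch partly full
// count = 2: (2 + 2) / 3 == 1  - less than one full batch
// count = 0: (0 + 2) / 3 == 0  - empty input, no batches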
Analyzing runtime
The above is really just a minor quibble however.
The real meat of this post is looking at the runtime of the presented method.
Note how it enumerates the input once to determine its count, and then once more for each batch.
That means that the method’s runtime is O(n + nk), with n the size of the input and k the number of batches. The latter is itself a linear function of n (for any fixed batch size), so the worst case performance of the method is O(n²).
This seems less than optimal. After all, in the end we return each item of the input only a single time – albeit differently organized.
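To see where the quadratic behaviour comes from, consider that returning batch i means skipping over the i * batchSize elements of all previous batches before taking any. Summed over all k batches, that skipped work alone amounts to

batchSize * (0 + 1 + ... + (k - 1)) = batchSize * k * (k - 1) / 2

which is roughly n * k / 2 – quadratic in n for any fixed batch size.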
Optimizing runtime – from square to linear
The former intuition is in fact correct. We can make the method run in linear time easily enough. All we have to do is make sure we do not enumerate the entire list every time we Skip() the previous batches.
As a first step, we can convert our input into a List<T> using LINQ’s ToList(), which we can then access directly by index.
Unfortunately, as it turns out, the problematic Skip() is not actually optimized to take advantage of the collection type. We can easily write our own optimized version, however, if we like:
static IEnumerable<IEnumerable<T>> Batch<T>(
    this IEnumerable<T> query, int batchSize)
{
    var asList = query.ToList();
    var batchCount = (asList.Count + batchSize - 1)
        / batchSize;

    for (int i = 0; i < batchCount; i++)
        yield return asList
            .SkipOnList(i * batchSize)
            .Take(batchSize);
}

static IEnumerable<T> SkipOnList<T>(
    this IList<T> list, int count)
{
    for (int i = count; i < list.Count; i++)
        yield return list[i];
}
This method now does in fact run in guaranteed linear time.
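As a quick sanity check of the helper (hypothetical values):

var list = new List<int> { 1, 2, 3, 4, 5 };

var tail = list.SkipOnList(2);
// yields 3, 4, 5 - by indexing directly into the list,
// without ever touching the first two elements

Combined with Take(batchSize), each batch performs exactly batchSize constant-time index accesses, so all batches together touch each element only once, on top of the single pass ToList() makes.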
Examining deferred execution
Another topic to consider is whether our method should execute immediately, or defer until we actually need the result.
The original method was entirely deferred – except for one thing: it determined the number of batches when it was first enumerated. That means that if the collection changes before we are done with our query, we might get unexpected results. We might miss elements that fall into additional batches, or we might get a number of empty batches at the end. Even worse, we might skip elements, or get them more than once, if elements are inserted or removed.
Our improved implementation does not have these problems anymore. As soon as we start enumerating the result, ToList() takes a snapshot of the input, and the result will contain exactly those elements that were in the sequence at that point. In that sense it is executed immediately.
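For example (hypothetical values):

var numbers = new List<int> { 1, 2, 3, 4 };

var batches = numbers.Batch(3).ToList();  // snapshot is taken here
numbers.Add(5);

// batches still holds { 1, 2, 3 } and { 4 };
// the element added afterwards does not show up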
The result of the method is deferred however. Doubly so, in fact. Both the outer for loop and the skip-take combination use yield return to defer execution.
Implementing total deferred execution
While the previous implementation is entirely adequate, I thought it would be nice to see the method implemented in a completely deferred way.
To make sure we do not end up with more code than we need, let us start completely fresh by specifying in pseudo code what we want our method to do:
while (query is not empty)
{
    return up to batchSize items from query as IEnumerable
}
That is really all we want to do. No calculations or other complications are required.
The way we can implement this is by using only a single IEnumerator<T>, which we share between the logic that splits the sequence into batches and the logic that enumerates each batch.
The first step is to get the enumerator, and loop over it, until there are no more elements left:
using (var enumerator = query.GetEnumerator())
{
    while (enumerator.MoveNext())
    {
        // return batchSize items here!
    }
}
Note how we use the using statement to make sure the enumerator is appropriately disposed, even if an exception is thrown.
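For reference, the using statement expands to roughly the following try-finally construct:

var enumerator = query.GetEnumerator();
try
{
    while (enumerator.MoveNext())
    {
        // return batchSize items here!
    }
}
finally
{
    if (enumerator != null)
        enumerator.Dispose();
}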
We have multiple options for implementing the loop’s body. For example, we could create a List<T> which we fill with our items, return it, and switch it out for a new list every time it reaches the maximum batch size, as sketched below.
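A sketch of that list-based approach, filling in the loop from above, might look like this:

using (var enumerator = query.GetEnumerator())
{
    var batch = new List<T>(batchSize);

    while (enumerator.MoveNext())
    {
        batch.Add(enumerator.Current);

        if (batch.Count == batchSize)
        {
            // the batch is fully materialized before we return it
            yield return batch;
            batch = new List<T>(batchSize);
        }
    }

    // return the last, partly filled batch, if any
    if (batch.Count > 0)
        yield return batch;
}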
This would again result in a not fully deferred solution. So instead we will write a helper method which allows us to return a deferred IEnumerable<T> from an enumerator, with a maximum number of elements:
static IEnumerable<T> enumerateSome<T>(
    this IEnumerator<T> enumerator, int count)
{
    // the calling method has already advanced the
    // enumerator to the first element of this batch
    yield return enumerator.Current;

    for (int i = 1; i < count; i++)
    {
        if (!enumerator.MoveNext())
            yield break;

        yield return enumerator.Current;
    }
}
Note how we yield the first element of the batch before calling MoveNext() at all. The reason for this is that we already advanced the enumerator once in our calling method, to make sure there are any elements to batch in the first place. Note also that we do not advance past the last element of a full batch: that way, the outer loop’s next MoveNext() call lands exactly on the first element of the following batch, so no element is ever skipped. Other than that, the method is rather straightforward.
All in all, we end up with the following solution:
public static IEnumerable<IEnumerable<T>> Batch<T>(
    this IEnumerable<T> query, int batchSize)
{
    using (var enumerator = query.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            yield return enumerator.enumerateSome(batchSize);
        }
    }
}

static IEnumerable<T> enumerateSome<T>(
    this IEnumerator<T> enumerator, int count)
{
    yield return enumerator.Current;

    for (int i = 1; i < count; i++)
    {
        if (!enumerator.MoveNext())
            yield break;

        yield return enumerator.Current;
    }
}
This solution is completely deferred – on both levels – and has linear runtime as well.
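As a final check, the method behaves as expected in straightforward usage (hypothetical values):

foreach (var batch in Enumerable.Range(1, 7).Batch(3))
    Console.WriteLine(string.Join(", ", batch));

// prints:
// 1, 2, 3
// 4, 5, 6
// 7

One caveat worth keeping in mind: since all batches share a single enumerator, they should be enumerated in order, and each at most once. Skipping ahead to a later batch without enumerating the earlier ones will not give the expected elements.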
Conclusion
We started out with a quick implementation of a method to batch or page a sequence into parts of a given size, and I hope this exploration of how to analyze and optimize such a method has been interesting.
We ended up with what I consider a rather elegant – even pretty – implementation that I hope will come in handy for some.
Enjoy the pixels!
Reference: LINQ Extensions 3 – Batch into sub-sequences from our NCG partner Paul Scharf at the GameDev<T> blog.