Module arrow2::compute::merge_sort
source · [−]compute_merge_sort
only.Expand description
Functions to perform merge-sorts.
The goal of merge-sort is to merge two sorted arrays, [a0, a1]
, merge_sort(a0, a1)
,
so that the resulting array is sorted, i.e. the following invariant upholds:
sort(merge_sort(a0, a1)) == merge_sort(a0, a1)
for any two sorted arrays a0
and a1
.
Given that two sorted arrays are more likely to be partially sorted within each other,
and that the resulting array is built by taking elements from each array, it is
advantageous to take
slices of items, not items, from each array.
As such, this module’s main data representation is (i: usize, start: usize, len: usize)
,
which represents a slice of array i
.
In this context, merge_sort
is composed by two main operations:
- compute the array of slices
v
that construct a new sorted array froma0
anda1
. take_arrays
froma0
anda1
, creating the sorted array.
In the extreme case where the two arrays are already sorted between then (e.g. [0, 2]
, [3, 4]
),
we need two slices, v = vec![(0, 0, a0.len()), (1, 0, a1.len())]
. The higher the
inter-leave between the two arrays, the more slices will be needed, and
generally the more expensive the take
operation will be.
Merge-sort multiple arrays
The main advantage of merge-sort over sort
is that it can be parallelized.
For example, given a set of arrays [a0, a1, a2, a3]
representing the same field,
e.g. over 4 batches of arrays, they can be sorted in parallel as follows (pseudo-code):
// in parallel
let a0 = sort(a0);
let a1 = sort(a1);
let a2 = sort(a2);
let a3 = sort(a3);
// in parallel and recursively
let slices1 = merge_sort_slices(a0, a1);
let slices2 = merge_sort_slices(a2, a3);
let slices = merge_sort_slices(slices1, slices2);
let array = take_arrays(&[a0, a1, a2, a3], slices, None);
A common operation in query engines is to merge multiple fields based on the
same sorting field (e.g. merge-sort multiple batches of arrays).
To perform this, use the same idea as above, but use take_arrays
over
each independent field (which can again be parallelized):
// `slices` computed before-hand
// in parallel
let array1 = take_arrays(&[a0, a1, a2, a3], slices, None);
let array2 = take_arrays(&[b0, b1, b2, b3], slices, None);
To serialize slices, e.g. for checkpointing or transfer via Arrow’s IPC, you can store
them as 3 non-null primitive arrays (e.g. PrimitiveArray<i64>
).
Re-exports
pub use crate::compute::sort::SortOptions;
Structs
An iterator adapter that merge-sorts two iterators of MergeSlice
into a single MergeSlice
such that the resulting MergeSlice
s are ordered according to comparator
.
Functions
returns a comparison function between any two arrays of each pair of arrays, according to SortOptions
.
returns a comparison function between any two arrays of each pair of arrays, according to SortOptions
.
Implementing custom build_compare_fn
for unsupportd data types.
Combines two sorted Arrays of the same crate::datatypes::DataType
into a single sorted array.
If the arrays are not sorted (which this function does not check), the result is wrong.
Returns a vector of slices from different sorted arrays that can be used to create sorted arrays.
pairs
is an array representing multiple sorted array sets. The expected format is
Takes N arrays together through slices
under the assumption that the slices have
a total coverage of the arrays.
I.e. they are such that all elements on all arrays are picked (which is the case in sorting).
Type Definitions
A slice denoting (array_index, start, len)
representing a slice from one of N arrays.
This is used to keep track of contiguous blocks of slots.
An array of MergeSlice, [MergeSlice]
, represents inter-leaved array slices.
For example, [(0, 0, 2), (1, 0, 1), (0, 2, 3)]
represents 2 arrays (a0 and a1) arranged as follows:
[a0[0..2], a1[0..1], a0[2..3]]
This representation is useful when building arrays in memory as it allows to memcopy slices of arrays.
This is particularly useful in merge-sort because sorted arrays (passed to the merge-sort) are more likely
to have contiguous blocks of sorted elements (than by random).