Module arrow2::compute::merge_sort

Available on crate feature compute_merge_sort only.

Expand description

Functions to perform merge-sorts.

The goal of merge-sort is to merge two sorted arrays, [a0, a1], merge_sort(a0, a1), so that the resulting array is sorted, i.e. the following invariant upholds: sort(merge_sort(a0, a1)) == merge_sort(a0, a1) for any two sorted arrays a0 and a1.

Given that two sorted arrays are more likely to be partially sorted within each other, and that the resulting array is built by taking elements from each array, it is advantageous to take slices of items, not items, from each array. As such, this module’s main data representation is (i: usize, start: usize, len: usize), which represents a slice of array i.

In this context, merge_sort is composed by two main operations:

compute the array of slices v that construct a new sorted array from a0 and a1.
take_arrays from a0 and a1, creating the sorted array.

In the extreme case where the two arrays are already sorted between then (e.g. [0, 2], [3, 4]), we need two slices, v = vec![(0, 0, a0.len()), (1, 0, a1.len())]. The higher the inter-leave between the two arrays, the more slices will be needed, and generally the more expensive the take operation will be.

Merge-sort multiple arrays

The main advantage of merge-sort over sort is that it can be parallelized. For example, given a set of arrays [a0, a1, a2, a3] representing the same field, e.g. over 4 batches of arrays, they can be sorted in parallel as follows (pseudo-code):

// in parallel
let a0 = sort(a0);
let a1 = sort(a1);
let a2 = sort(a2);
let a3 = sort(a3);

// in parallel and recursively
let slices1 = merge_sort_slices(a0, a1);
let slices2 = merge_sort_slices(a2, a3);
let slices = merge_sort_slices(slices1, slices2);

let array = take_arrays(&[a0, a1, a2, a3], slices, None);

A common operation in query engines is to merge multiple fields based on the same sorting field (e.g. merge-sort multiple batches of arrays). To perform this, use the same idea as above, but use take_arrays over each independent field (which can again be parallelized):

// `slices` computed before-hand
// in parallel
let array1 = take_arrays(&[a0, a1, a2, a3], slices, None);
let array2 = take_arrays(&[b0, b1, b2, b3], slices, None);

To serialize slices, e.g. for checkpointing or transfer via Arrow’s IPC, you can store them as 3 non-null primitive arrays (e.g. PrimitiveArray<i64>).

Re-exports

pub use crate::compute::sort::SortOptions;

Structs

MergeSortSlices

An iterator adapter that merge-sorts two iterators of MergeSlice into a single MergeSlice such that the resulting MergeSlices are ordered according to comparator.

Functions

build_comparator

returns a comparison function between any two arrays of each pair of arrays, according to SortOptions.

build_comparator_impl

returns a comparison function between any two arrays of each pair of arrays, according to SortOptions. Implementing custom build_compare_fn for unsupportd data types.

merge_sort

Combines two sorted Arrays of the same crate::datatypes::DataType into a single sorted array. If the arrays are not sorted (which this function does not check), the result is wrong.

merge_sort_slices

Given two iterators of slices representing two sets of sorted Arrays, and a comparator bound to those Arrays, returns a new iterator of slices denoting how to take slices from each of the arrays such that the resulting array is sorted according to comparator

slices

Returns a vector of slices from different sorted arrays that can be used to create sorted arrays. pairs is an array representing multiple sorted array sets. The expected format is

take_arrays

Takes N arrays together through slices under the assumption that the slices have a total coverage of the arrays. I.e. they are such that all elements on all arrays are picked (which is the case in sorting).

Type Definitions

MergeSlice

A slice denoting (array_index, start, len) representing a slice from one of N arrays. This is used to keep track of contiguous blocks of slots. An array of MergeSlice, [MergeSlice], represents inter-leaved array slices. For example, [(0, 0, 2), (1, 0, 1), (0, 2, 3)] represents 2 arrays (a0 and a1) arranged as follows: [a0[0..2], a1[0..1], a0[2..3]] This representation is useful when building arrays in memory as it allows to memcopy slices of arrays. This is particularly useful in merge-sort because sorted arrays (passed to the merge-sort) are more likely to have contiguous blocks of sorted elements (than by random).