High-level API
Arrow's core trait is Array, which you can think of as representing Arc<Vec<Option<T>>> with associated metadata (see metadata).
Unlike Arc<Vec<Option<T>>>, arrays in this crate are represented in such a way that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).
Probably the simplest Array in this crate is PrimitiveArray<T>. It can be constructed from a slice of option values, from a slice of values, or from an iterator.
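To make the three construction paths concrete, here is a minimal std-only sketch. ToyPrimitive and its methods are hypothetical stand-ins invented for illustration, not this crate's types; the mask uses one bool per slot, whereas a real validity bitmap packs bits.

```rust
use std::iter::FromIterator;

/// Simplified stand-in for a primitive array: a contiguous values buffer
/// plus an optional validity mask (one flag per slot, not bit-packed).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPrimitive<T> {
    values: Vec<T>,
    validity: Option<Vec<bool>>,
}

impl<T: Copy + Default> ToyPrimitive<T> {
    /// From a slice of optional values: null slots get a placeholder value.
    pub fn from_options(slice: &[Option<T>]) -> Self {
        slice.iter().copied().collect()
    }

    /// From a slice of values: there are no nulls, so no mask is allocated.
    pub fn from_values(slice: &[T]) -> Self {
        Self { values: slice.to_vec(), validity: None }
    }

    /// Logical view of slot `i`, taking validity into account.
    pub fn get(&self, i: usize) -> Option<T> {
        match &self.validity {
            Some(mask) if !mask[i] => None,
            _ => Some(self.values[i]),
        }
    }
}

impl<T: Copy + Default> FromIterator<Option<T>> for ToyPrimitive<T> {
    /// From an iterator of optional values.
    fn from_iter<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut mask = Vec::new();
        for item in iter {
            mask.push(item.is_some());
            values.push(item.unwrap_or_default());
        }
        Self { values, validity: Some(mask) }
    }
}
```

In this sketch a null still occupies a slot in the values buffer, which is what keeps the values contiguous regardless of validity.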
A PrimitiveArray (and every Array implemented in this crate) has 3 components:
- A physical type (e.g. i32)
- A logical type (e.g. DataType::Int32)
- Data
The main differences from an Arc<Vec<Option<T>>> are:
- Its data is laid out in memory as a Buffer<T> and an Option<Bitmap> (see [../low_level.md])
- It has an associated logical type (DataType).
The first allows interoperability with Arrow's ecosystem and efficient SIMD operations (we will revisit this below); the second gives semantic meaning to the array. For example, an array of i32 integers and an array of dates stored as i32 have the same in-memory representation but different logical representations (e.g. dates are usually printed to users as "yyyy-mm-dd").
All physical types (e.g. i32) have a "natural" logical DataType (e.g. DataType::Int32), which is assigned when allocating arrays from iterators, slices, etc. Arrays can be cheaply (O(1)) converted to another logical type backed by the same physical type via .to(DataType).
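The split between physical and logical type can be sketched with std only. Everything below (Logical, ToyInts, civil_from_days) is a hypothetical model, not this crate's API; it assumes a date type that counts days since 1970-01-01, as Arrow's Date32 does.

```rust
/// Hypothetical logical types that share the physical type `i32`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Logical {
    Int32,
    /// Days since 1970-01-01.
    Date32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToyInts {
    logical: Logical,
    values: Vec<i32>,
}

impl ToyInts {
    /// Allocating from values assigns the "natural" logical type.
    pub fn from_values(values: Vec<i32>) -> Self {
        Self { logical: Logical::Int32, values }
    }

    /// Re-tags the array. The values buffer is moved, never copied or rewritten.
    pub fn to(self, logical: Logical) -> Self {
        Self { logical, values: self.values }
    }

    /// Renders slot `i` according to the logical type.
    pub fn display(&self, i: usize) -> String {
        let v = self.values[i];
        match self.logical {
            Logical::Int32 => v.to_string(),
            Logical::Date32 => {
                let (y, m, d) = civil_from_days(i64::from(v));
                format!("{:04}-{:02}-{:02}", y, m, d)
            }
        }
    }
}

/// Days since the Unix epoch to (year, month, day), proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}
```

With this model, ToyInts::from_values(vec![18262]) displays slot 0 as "18262", while the same values after .to(Logical::Date32) display as "2020-01-01"; the stored integers are identical in both cases.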
The following arrays are supported:
- NullArray (just holds nulls)
- BooleanArray (booleans)
- PrimitiveArray<T> (for ints, floats)
- Utf8Array<i32> and Utf8Array<i64> (for strings)
- BinaryArray<i32> and BinaryArray<i64> (for opaque binaries)
- FixedSizeBinaryArray (like BinaryArray, but fixed size)
- ListArray<i32> and ListArray<i64> (array of arrays)
- FixedSizeListArray (array of arrays of a fixed size)
- StructArray (multiple named arrays where each row has one element from each array)
- UnionArray (every row has a different logical type)
- DictionaryArray<K> (nested array with encoded values)
Array as a trait object
Array is object safe, and all implementations of Array can be cast to &dyn Array, which enables dynamic casting and run-time nesting.
Downcast and as_any
Given a trait object array: &dyn Array, we know its physical type via array.data_type().to_physical_type(), which returns a PhysicalType. We use it to downcast the array to its concrete type through as_any, e.g. array.as_any().downcast_ref::<PrimitiveArray<i32>>().
There is a one-to-one relationship between each variant of PhysicalType (an enum) and each implementation of Array (a struct):
| PhysicalType | Array |
| --- | --- |
| Primitive(_) | PrimitiveArray<_> |
| Binary | BinaryArray<i32> |
| LargeBinary | BinaryArray<i64> |
| Utf8 | Utf8Array<i32> |
| LargeUtf8 | Utf8Array<i64> |
| List | ListArray<i32> |
| LargeList | ListArray<i64> |
| FixedSizeBinary | FixedSizeBinaryArray |
| FixedSizeList | FixedSizeListArray |
| Struct | StructArray |
| Union | UnionArray |
| Map | MapArray |
| Dictionary(_) | DictionaryArray<_> |
where _ represents each of the variants (e.g. PrimitiveType::Int32 <-> i32).
In this context, a common idiom when using Array as a trait object is to match on its physical type and, in each branch, downcast to the corresponding concrete array.
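The match-then-downcast idiom can be sketched with std::any::Any alone. The trait ToyArray, the enum Physical and the structs Ints and Floats are hypothetical and heavily reduced; they only mirror the shape of the pattern, not this crate's definitions.

```rust
use std::any::Any;

/// Hypothetical, heavily reduced counterpart of a physical-type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Physical {
    Int32,
    Float32,
}

/// Toy object-safe trait that exposes `as_any` for downcasting.
pub trait ToyArray {
    fn physical_type(&self) -> Physical;
    fn as_any(&self) -> &dyn Any;
}

pub struct Ints(pub Vec<i32>);
pub struct Floats(pub Vec<f32>);

impl ToyArray for Ints {
    fn physical_type(&self) -> Physical {
        Physical::Int32
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl ToyArray for Floats {
    fn physical_type(&self) -> Physical {
        Physical::Float32
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Matches on the physical type, then downcasts to the struct paired with it.
/// The `unwrap` relies on the one-to-one mapping between variant and struct.
pub fn sum_as_f64(array: &dyn ToyArray) -> f64 {
    match array.physical_type() {
        Physical::Int32 => {
            let ints = array.as_any().downcast_ref::<Ints>().unwrap();
            ints.0.iter().map(|v| f64::from(*v)).sum()
        }
        Physical::Float32 => {
            let floats = array.as_any().downcast_ref::<Floats>().unwrap();
            floats.0.iter().map(|v| f64::from(*v)).sum()
        }
    }
}
```

Because the match is exhaustive over the enum, adding a new physical type forces every such function to handle it at compile time.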
From Iterator
In the examples above, we've introduced how to create an array from an iterator. These APIs are available for all Arrays, and they are an efficient way to create them. In this section we will go into a bit more detail about these operations, and how to make them even more efficient.
This crate's APIs are generally split into two patterns: whether an operation leverages contiguous memory regions or whether it does not.
What this means is that certain operations can be performed irrespective of whether a value is "null" or not (e.g. PrimitiveArray<i32> + i32 can be applied to all values via SIMD, and the validity bitmap only needs to be copied independently).
When an operation benefits from such an arrangement, it is advantageous to use Vec and Into<Buffer>. If not, then use the MutableArray API, such as MutablePrimitiveArray<T>, MutableUtf8Array<O> or MutableListArray.
The examples so far used the latter API. The section on vectorized operations at the end of this page covers the former, for SIMD.
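The difference between the two patterns can be sketched as follows. ToyMutable and from_vec are hypothetical names for illustration; allocating the validity mask lazily on the first null is a design choice of this sketch, not a statement about how this crate's mutable arrays work.

```rust
/// Toy growable builder, for data that arrives one optional value at a time.
#[derive(Debug, Default)]
pub struct ToyMutable<T> {
    values: Vec<T>,
    validity: Option<Vec<bool>>,
}

impl<T: Default> ToyMutable<T> {
    pub fn new() -> Self {
        Self { values: Vec::new(), validity: None }
    }

    /// Appends one slot. The mask is only allocated once a null shows up.
    pub fn push(&mut self, value: Option<T>) {
        match value {
            Some(v) => {
                self.values.push(v);
                if let Some(mask) = self.validity.as_mut() {
                    mask.push(true);
                }
            }
            None => {
                // First null: back-fill the mask with `true` for earlier slots.
                let len = self.values.len();
                self.validity
                    .get_or_insert_with(|| vec![true; len])
                    .push(false);
                self.values.push(T::default());
            }
        }
    }

    /// Hands over the finished buffers without copying them.
    pub fn into_parts(self) -> (Vec<T>, Option<Vec<bool>>) {
        (self.values, self.validity)
    }
}

/// Contiguous path: with no nulls, an existing `Vec` is wrapped as-is.
pub fn from_vec<T>(values: Vec<T>) -> (Vec<T>, Option<Vec<bool>>) {
    (values, None)
}
```

The builder pays a branch per element, while from_vec takes ownership of memory that is already contiguous, which is why the contiguous path is preferable whenever nulls do not affect the operation.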
Into Iterator
We've already seen how to create an array from an iterator. Most arrays also implement IntoIterator.
Like FromIterator, this crate contains two sets of APIs to iterate over data. Given an array array: &PrimitiveArray<T>, the following applies:
- If you need to iterate over Option<&T>, use array.iter()
- If you can operate over the values and validity independently, use array.values() -> &Buffer<T> and array.validity() -> Option<&Bitmap>
Note that case 1 is useful when e.g. you want to perform an operation that depends on both validity and values, while case 2 is suitable for SIMD and copies, as it returns contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
This idea holds more generally in this crate's arrays: values() returns something that has a contiguous in-memory representation, while iter() returns items taking validity into account. To get an iterator over contiguous values, use array.values().iter().
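The two iteration styles can be contrasted in a small std-only sketch. ToyArray is a hypothetical model with public fields standing in for values() and validity(); it is not this crate's PrimitiveArray.

```rust
/// Toy array: a values buffer plus an optional validity mask.
pub struct ToyArray {
    pub values: Vec<i32>,
    pub validity: Option<Vec<bool>>,
}

impl ToyArray {
    /// Case 1: iterate over `Option<&i32>`, i.e. values and validity together.
    pub fn iter<'a>(&'a self) -> impl Iterator<Item = Option<&'a i32>> + 'a {
        self.values.iter().enumerate().map(move |(i, v)| {
            match &self.validity {
                Some(mask) if !mask[i] => None,
                _ => Some(v),
            }
        })
    }
}

/// Depends on validity, so it uses `iter()`: sums only the non-null slots.
pub fn sum_valid(array: &ToyArray) -> i32 {
    array.iter().flatten().sum()
}

/// Case 2: independent of validity, so it maps over the contiguous values
/// and copies the mask separately.
pub fn add_scalar(array: &ToyArray, rhs: i32) -> ToyArray {
    ToyArray {
        values: array.values.iter().map(|v| v + rhs).collect(),
        validity: array.validity.clone(),
    }
}
```

Note that add_scalar also updates the placeholder stored in null slots; that is harmless because the copied mask still marks those slots as null.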
There is one last API that is worth mentioning, and that is Bitmap::chunks. When performing bitwise operations, it is often more performant to operate on chunks of bits instead of single bits. chunks offers a TrustedLen of u64 with the bits, plus an extra remainder. We expose two functions, unary(Bitmap, Fn) -> Bitmap and binary(Bitmap, Bitmap, Fn) -> Bitmap, that use this API to efficiently perform bitmap operations.
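The benefit of chunked bit operations can be shown with a toy bitmap that stores its bits in u64 words. ToyBitmap and the free functions unary and binary below are hypothetical and only borrow the names from the text; unlike the signatures above they take references, and the remainder is handled by masking the last word rather than by a separate value.

```rust
/// Toy bitmap: `len` bits packed into `u64` words, least-significant bit first.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyBitmap {
    words: Vec<u64>,
    len: usize,
}

impl ToyBitmap {
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &bit) in bits.iter().enumerate() {
            if bit {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Zeroes the bits past `len` in the last word so they never leak into results.
    fn mask_tail(&mut self) {
        let used = self.len % 64;
        if used != 0 {
            if let Some(last) = self.words.last_mut() {
                *last &= (1u64 << used) - 1;
            }
        }
    }
}

/// Applies `op` to 64 bits at a time instead of bit by bit.
pub fn unary<F: Fn(u64) -> u64>(a: &ToyBitmap, op: F) -> ToyBitmap {
    let words = a.words.iter().map(|&w| op(w)).collect();
    let mut out = ToyBitmap { words, len: a.len };
    out.mask_tail();
    out
}

/// Combines two bitmaps of equal length word by word.
pub fn binary<F: Fn(u64, u64) -> u64>(a: &ToyBitmap, b: &ToyBitmap, op: F) -> ToyBitmap {
    assert_eq!(a.len, b.len);
    let words = a.words.iter().zip(&b.words).map(|(&x, &y)| op(x, y)).collect();
    let mut out = ToyBitmap { words, len: a.len };
    out.mask_tail();
    out
}
```

For example, binary(&a, &b, |x, y| x & y) intersects two validity masks, and a 70-bit bitmap is processed in two word operations instead of 70 single-bit ones.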
Vectorized operations
One of the main advantages of the Arrow format and its memory layout is that it often enables SIMD. For example, a unary operation op on a PrimitiveArray likely emits SIMD instructions when it is written as a plain map over array.values(), with the validity cloned into the resulting array.
Some notes on this approach:
- It uses array.values(), as described above: the operation leverages a contiguous memory region.
- It relies on normal Rust iterators for the operation.
- It applies op to the array's values irrespective of their validity, and clones the validity. This is suitable for operations where branching is more expensive than operating over all values. If the operation is expensive, then using PrimitiveArray::<O>::from_trusted_len_iter is likely faster.
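A minimal std-only sketch of this trade-off, assuming a hypothetical ToyPrimitive with a bool-per-slot mask. Whether the compiler actually vectorizes the first loop depends on op, the target and the optimization level; the code only shows why it has the opportunity to.

```rust
/// Toy array: a values buffer plus an optional validity mask.
pub struct ToyPrimitive<T> {
    pub values: Vec<T>,
    pub validity: Option<Vec<bool>>,
}

/// Applies `op` to _all_ values, nulls included, in a branch-free loop
/// over contiguous memory; the validity is cloned unchanged.
pub fn unary<I, O, F>(array: &ToyPrimitive<I>, op: F) -> ToyPrimitive<O>
where
    I: Copy,
    F: Fn(I) -> O,
{
    let values: Vec<O> = array.values.iter().map(|v| op(*v)).collect();
    ToyPrimitive { values, validity: array.validity.clone() }
}

/// Alternative for an expensive `op`: skips null slots at the cost of a
/// branch per element, which usually prevents vectorization.
pub fn unary_skip_nulls<I, O, F>(array: &ToyPrimitive<I>, op: F) -> ToyPrimitive<O>
where
    I: Copy,
    O: Default,
    F: Fn(I) -> O,
{
    let values: Vec<O> = match &array.validity {
        None => array.values.iter().map(|v| op(*v)).collect(),
        Some(mask) => array
            .values
            .iter()
            .zip(mask)
            .map(|(v, &valid)| if valid { op(*v) } else { O::default() })
            .collect(),
    };
    ToyPrimitive { values, validity: array.validity.clone() }
}
```

Both functions return arrays that are logically equal; they differ only in what ends up in the placeholder of null slots and in how much work is spent on them.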
Clone on write semantics
We support the mutation of arrays in-place via clone-on-write semantics. Essentially, all data is under an Arc, but it can be taken via Arc::get_mut and operated on in place.
Below is a complete example of how to operate on a Box<dyn Array> without extra allocations.
    // This example demos how to operate on arrays in-place.
    use arrow2::array::{Array, PrimitiveArray};
    use arrow2::compute::arity_assign;

    fn main() {
        // say we have received an `Array`
        let mut array: Box<dyn Array> = PrimitiveArray::from_vec(vec![1i32, 2]).boxed();

        // we can apply a transformation to its values without allocating a new array as follows:
        // 1. downcast it to the correct type (known via `array.data_type().to_physical_type()`)
        let array = array
            .as_any_mut()
            .downcast_mut::<PrimitiveArray<i32>>()
            .unwrap();

        // 2. call `unary` with the function to apply to each value
        arity_assign::unary(array, |x| x * 10);

        // confirm that it gives the right result :)
        assert_eq!(array, &PrimitiveArray::from_vec(vec![10i32, 20]));

        // alternatively, you can use `get_mut_values`. Unwrap works because it is single owned
        let values = array.get_mut_values().unwrap();
        values[0] = 0;

        assert_eq!(array, &PrimitiveArray::from_vec(vec![0, 20]));

        // you can also modify the validity:
        array.set_validity(Some([true, false].into()));
        array.apply_validity(|bitmap| {
            // `right()` is the mutable variant; it is available because the bitmap is single owned
            let mut mut_bitmap = bitmap.into_mut().right().unwrap();
            mut_bitmap.set(1, true);
            mut_bitmap.into()
        });

        assert_eq!(array, &PrimitiveArray::from_vec(vec![0, 20]));
    }
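The underlying mechanism can be isolated with std only. The function scale_in_place below is a hypothetical helper operating on a plain Arc<Vec<i32>>, not on this crate's buffers; it shows when Arc::get_mut succeeds and what happens when the data is shared.

```rust
use std::sync::Arc;

/// Multiplies the values in place when `data` is uniquely owned; otherwise
/// the buffer is cloned first. Returns `true` if no clone was needed.
pub fn scale_in_place(data: &mut Arc<Vec<i32>>, factor: i32) -> bool {
    // `get_mut` returns `Some` only when there are no other owners.
    let unique = Arc::get_mut(data).is_some();
    // `make_mut` clones the inner `Vec` only if other owners exist.
    for v in Arc::make_mut(data).iter_mut() {
        *v *= factor;
    }
    unique
}
```

If a second handle to the same Arc is alive, the call leaves that handle's data untouched and mutates a fresh copy, which is the "clone" half of clone-on-write.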