GroupBy: split-apply-combine
xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
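The three steps above can be sketched in plain NumPy. This is only an illustration of the strategy, not how xarray implements it; the `labels` and `values` arrays are hypothetical example data.

```python
import numpy as np

labels = np.array(["a", "b", "b", "a"])  # hypothetical group labels
values = np.array([1.0, 2.0, 4.0, 8.0])  # hypothetical data

# Split: one sub-array per unique label.
groups = {str(key): values[labels == key] for key in np.unique(labels)}

# Apply: reduce each group independently.
means = {key: float(group.mean()) for key, group in groups.items()}

# Combine: gather the per-group results into one array, ordered by label.
combined = np.array([means[key] for key in sorted(means)])
print(combined)  # [4.5 3. ]
```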
Group by operations work on both Dataset and DataArray objects. Most of the
examples focus on grouping by a single one-dimensional variable, although
grouping over a multi-dimensional variable is also supported. Note that for
one-dimensional data, it is usually faster to rely on pandas’ implementation of
the same pipeline.
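For one-dimensional data, the pandas version of the pipeline might look like the following sketch. The series values and index labels are hypothetical example data.

```python
import pandas as pd

# Hypothetical 1-D data whose index holds the group labels.
series = pd.Series([1.0, 2.0, 4.0, 8.0], index=list("abba"))

# Group by the index labels and reduce each group with a mean.
group_means = series.groupby(level=0).mean()
print(group_means)
```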
Split
Let’s create a simple example dataset (the examples assume numpy is imported as np and xarray as xr):
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
   ...:                 coords={'x': [10, 20, 30, 40],
   ...:                         'letters': ('x', list('abba'))})
   ...:
In [2]: arr = ds['foo']
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
letters (x) <U1 'a' 'b' 'b' 'a'
* x (x) int64 10 20 30 40
Dimensions without coordinates: y
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
If we group by the name of a variable or coordinate in a dataset (we can also
use a DataArray directly), we get back a GroupBy object:
In [4]: ds.groupby('letters')
Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7f5fe8b0a550>
This object works very similarly to a pandas GroupBy object. You can view
the group indices with the groups attribute:
In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}
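To make the meaning of this mapping concrete, here is one way such label-to-position indices could be computed with NumPy. It is a sketch of the idea only and is not claimed to be xarray's internal implementation.

```python
import numpy as np

letters = np.array(list("abba"))

# np.unique returns the sorted unique labels plus, for every element,
# the integer code of the label it belongs to.
unique_labels, codes = np.unique(letters, return_inverse=True)

# For each label, collect the positions along x where its code occurs.
group_indices = {
    str(label): np.flatnonzero(codes == code).tolist()
    for code, label in enumerate(unique_labels)
}
print(group_indices)  # {'a': [0, 3], 'b': [1, 2]}
```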
You can also iterate over groups in (label, group) pairs:
In [6]: list(ds.groupby('letters'))
Out[6]:
[('a', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
letters (x) <U1 'a' 'a'
* x (x) int64 10 40
Dimensions without coordinates: y
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
('b', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
letters (x) <U1 'b' 'b'
* x (x) int64 20 30
Dimensions without coordinates: y
Data variables:
foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Binning
Sometimes you don’t want to use all the unique values to determine the groups
but instead want to “bin” the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the groupby_bins()
method.
In [7]: x_bins = [0,25,50]
In [8]: ds.groupby_bins('x', x_bins).groups
Out[8]:
{Interval(0, 25, closed='right'): [0, 1],
Interval(25, 50, closed='right'): [2, 3]}
The binning is implemented via pandas.cut, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with pandas Interval objects that precisely identify the bin limits, including which edge is closed. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:
In [9]: x_bin_labels = [12.5,37.5]
In [10]: ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
Out[10]: {12.5: [0, 1], 37.5: [2, 3]}
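Because the binning relies on pandas.cut, calling it directly on the same coordinate values shows how bins and labels are assigned. The sketch below reuses the `x_bins` edges from the example; the `edges` check with the values 0, 25 and 26 is an added illustration of the default right-closed intervals.

```python
import pandas as pd

x = [10, 20, 30, 40]
x_bins = [0, 25, 50]

# Default labels are right-closed intervals: (0, 25] and (25, 50].
binned = pd.cut(x, x_bins)
print(binned.codes.tolist())  # [0, 0, 1, 1]

# Explicit labels replace the intervals, here with the bin centers.
labeled = pd.cut(x, x_bins, labels=[12.5, 37.5])
print(list(labeled))  # [12.5, 12.5, 37.5, 37.5]

# Edge behavior: 0 lies outside (0, 25] and gets code -1 (missing),
# while 25 falls in the first bin and 26 in the second.
edges = pd.cut([0, 25, 26], x_bins)
print(edges.codes.tolist())  # [-1, 0, 1]
```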
Apply
To apply a function to each group, you can use the flexible apply() method.
The resulting objects are automatically concatenated back together along the
group axis:
In [11]: def standardize(x):
   ....:     return (x - x.mean()) / x.std()
   ....:
In [12]: arr.groupby('letters').apply(standardize)
Out[12]:
<xarray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.229778, 1.93741 , -0.726247],
[ 1.419796, -0.460192, -0.606579],
[-0.190642, 1.21398 , -1.376362],
[ 0.339417, -0.301806, -0.018995]])
Coordinates:
letters (x) <U1 'a' 'b' 'b' 'a'
* x (x) int64 10 20 30 40
Dimensions without coordinates: y
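A self-contained variant of this example with deterministic data makes the effect easy to check: after standardizing, each group should have mean 0 and standard deviation 1. The data values are hypothetical. As an assumption about the installed version, newer xarray releases expose this operation as map(), so the sketch uses whichever of the two methods is available.

```python
import numpy as np
import xarray as xr

# Hypothetical deterministic data with the same layout as the example above.
arr = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

def standardize(group):
    # The mean and std are taken over every element of the group.
    return (group - group.mean()) / group.std()

grouped = arr.groupby("letters")
# Assumption: prefer map() when present, otherwise fall back to apply().
apply_each = getattr(grouped, "map", None) or grouped.apply
result = apply_each(standardize)

# Check the per-group statistics of the combined result.
stats = {}
for label in ("a", "b"):
    member = result.where(result["letters"] == label, drop=True)
    stats[label] = (float(member.mean()), float(member.std()))
print(stats)
```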
GroupBy objects also have a reduce() method and methods like mean() as
shortcuts for applying an aggregation function:
In [13]: arr.groupby('letters').mean(dim='x')
Out[13]:
<xarray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.334998, 0.669865, 0.354236],
[ 0.674306, 0.608502, 0.229662]])
Coordinates:
* letters (letters) object 'a' 'b'
Dimensions without coordinates: y
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
In [14]: ds.groupby('x').std()
Out[14]:
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) <U1 'a' 'b' 'b' 'a'
Data variables:
foo (x) float64 0.3684 0.2554 0.2931 0.06957
First and last
There are two special aggregation operations that are currently only found on groupby objects: first and last. These provide the first or last value of each group along the grouped dimension:
In [15]: ds.groupby('letters').first()
Out[15]:
<xarray.Dataset>
Dimensions: (letters: 2, y: 3)
Coordinates:
* letters (letters) object 'a' 'b'
Dimensions without coordinates: y
Data variables:
foo (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362
By default, they skip missing values (control this with skipna).
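The effect of skipna can be seen with a small array that has a missing value at the start of one group. This is a minimal sketch with hypothetical data; the first element of group 'a' is NaN.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-D data; the first element of group 'a' is missing.
da = xr.DataArray(
    [np.nan, 2.0, 3.0, 4.0],
    dims="x",
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)
grouped = da.groupby("letters")

# Default: the missing value is skipped, so 'a' yields its next valid entry.
skipping = grouped.first()

# skipna=False: the literal first element of each group is returned.
keeping = grouped.first(skipna=False)

print(skipping.values)
print(keeping.values)
```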
Grouped arithmetic
GroupBy objects also support a limited set of binary arithmetic operations, as
a shortcut for mapping over all unique labels. Binary arithmetic is supported
for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset
or data array uses the unique grouped values as one of its index coordinates.
For example:
In [16]: alt = arr.groupby('letters').mean()
In [17]: alt
Out[17]:
<xarray.DataArray 'foo' (letters: 2)>
array([ 0.453033, 0.504157])
Coordinates:
* letters (letters) object 'a' 'b'
In [18]: ds.groupby('letters') - alt