Subsetting and Filtering#

In this tutorial, you’ll learn how to select subsets of an xsnowDataset along its dimensions by using standard xarray indexing.

After this tutorial, you will be able to:

  1. Select data by location, time, slope, realization, or layer

  2. Mask and filter data by variable conditions

  3. Explore the xsnow classification extension for solving domain-specific, yet generic tasks

We use a small sample SNOWPACK dataset shipped with xsnow.sample_data.

import xsnow
xs = xsnow.sample_data.snp_gridded_ds()
xs = xs.compute()
print(xs)
[i] xsnow.xsnow_io: Using lazy loading with Dask
[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...
[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user
[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)
<xsnowDataset>
  Locations:  5
  Timestamps: 416 (2024-01-16--2024-02-02)
  Profiles:   10400 total | 10400 valid | 8861 with HS>0

    employing the <xarray.Dataset> Size: 42MB
    Dimensions:                   (location: 5, time: 416, slope: 5,
                                   realization: 1, layer: 33)
    Coordinates:
      * location                  (location) <U5 100B 'VIR1A' 'VIR2A' ... 'VIR5A'
      * time                      (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ....
      * slope                     (slope) int64 40B 0 1 2 3 4
      * realization               (realization) int64 8B 0
      * layer                     (layer) int64 264B 0 1 2 3 4 5 ... 28 29 30 31 32
        altitude                  (location) float64 40B 2.372e+03 ... 2.066e+03
        latitude                  (location) float64 40B 47.15 47.44 ... 47.44 47.37
        longitude                 (location) float64 40B 11.19 11.29 11.3 11.39 11.5
        azimuth                   (slope) int64 40B 0 0 90 180 270
        inclination               (slope) int64 40B 0 38 38 38 38
        z                         (location, time, slope, realization, layer) float32 1MB ...
    Data variables: (12/91)
        ColdContentSnow           (location, time, slope, realization) float64 83kB ...
        DW                        (location, time, slope, realization) float64 83kB ...
        HN12                      (location, time, slope, realization) float64 83kB ...
        HN24                      (location, time, slope, realization) float64 83kB ...
        HN3                       (location, time, slope, realization) float64 83kB ...
        HN6                       (location, time, slope, realization) float64 83kB ...
        ...                        ...
        zS4                       (location, time, slope, realization) float64 83kB ...
        zS5                       (location, time, slope, realization) float64 83kB ...
        zSd                       (location, time, slope, realization) float64 83kB ...
        zSn                       (location, time, slope, realization) float64 83kB ...
        zSs                       (location, time, slope, realization) float64 83kB ...
        HS                        (location, time, slope, realization) float32 42kB ...
    Attributes:
        Conventions:  CF-1.8
        crs:          EPSG:4326

1. Overview of dimensions#

Before filtering, check which dimensions and coordinates the dataset contains:

print(xs.sizes)
print(xs.coords)
Frozen({'location': 5, 'time': 416, 'slope': 5, 'realization': 1, 'layer': 33})
Coordinates:
  * location     (location) <U5 100B 'VIR1A' 'VIR2A' 'VIR3A' 'VIR4A' 'VIR5A'
  * time         (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ... 2024-02-02...
  * slope        (slope) int64 40B 0 1 2 3 4
  * realization  (realization) int64 8B 0
  * layer        (layer) int64 264B 0 1 2 3 4 5 6 7 ... 25 26 27 28 29 30 31 32
    altitude     (location) float64 40B 2.372e+03 1.749e+03 ... 2.066e+03
    latitude     (location) float64 40B 47.15 47.44 47.15 47.44 47.37
    longitude    (location) float64 40B 11.19 11.29 11.3 11.39 11.5
    azimuth      (slope) int64 40B 0 0 90 180 270
    inclination  (slope) int64 40B 0 38 38 38 38
    z            (location, time, slope, realization, layer) float32 1MB 0.0 ...

Typical dimensions:

  • location – one or several sites or grid points

  • time – datetime stamps

  • slope – inclination and azimuth combinations

  • realization – simulation variant

  • layer – vertical snow layers

Coordinates are arrays, usually one-dimensional, that label the positions along the dataset dimensions. They define the values you use to index or align data: for example, the actual timestamps along time, the site names and their geographical coordinates along location, or the slope angles along slope. In xarray, coordinates make datasets self-describing, so selections and merges can be done by informative labels instead of hard-to-interpret integer indices.
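For example, printing the slope coordinates of the sample dataset shows which slope index corresponds to which aspect and inclination (values as listed in the dataset printout above):

# Slope 0 is the flat field; slopes 1-4 are 38-degree slopes facing N, E, S and W
print(xs['azimuth'].values)      # [  0   0  90 180 270]
print(xs['inclination'].values)  # [ 0 38 38 38 38]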

2. Subsetting by coordinates#

xarray provides two main selectors:

  • .isel() – index-based (integer positions)

  • .sel() – label-based (coordinate values)

2.1 Selecting by index position#

# Select the last time step, third location, and first slope
subset = xs.isel(time=-1, location=2, slope=0)
print(subset['density'].isel(layer=slice(0,3)))
<xarray.DataArray 'density' (realization: 1, layer: 3)> Size: 12B
array([[246.1, 251. , 210.1]], dtype=float32)
Coordinates:
  * realization  (realization) int64 8B 0
  * layer        (layer) int64 24B 0 1 2
    altitude     float64 8B 1.86e+03
    latitude     float64 8B 47.15
    location     <U5 20B 'VIR3A'
    longitude    float64 8B 11.3
    time         datetime64[ns] 8B 2024-02-02T12:00:00
    azimuth      int64 8B 0
    inclination  int64 8B 0
    slope        int64 8B 0
    z            (realization, layer) float32 12B -8.68 -6.85 -5.33
Attributes:
    standard_name:  mass_density_of_snow
    units:          kg m-3
    long_name:      mass density of snow

Note: When you select a single element along a dimension (for example, subset.isel(realization=0)), xarray automatically squeezes that dimension out of the dataset. The dimension disappears from the list of dimensions (e.g., from (realization: 1, layer: 3) to (layer: 3)), but its coordinate (e.g., realization = 0) is still preserved in the metadata.

If you want to keep the dimension (so it still shows as length 1), wrap the index in a list:

xs.isel(realization=[0])

This can be important when chaining selections — once a dimension is dropped, you can no longer subset along it:

subset = xs.isel(slope=0)
subset.isel(slope=0)  # this will fail, because 'slope' is no longer a dimension
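If a dimension has already been dropped, you can reintroduce a length-1 version of it with xarray's expand_dims; a minimal sketch continuing from the subset above:

restored = subset.expand_dims('slope')  # promote the scalar 'slope' coordinate back to a length-1 dimension
print(restored.sizes['slope'])          # 1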

2.2 Selecting by coordinate values#

# Select by timestamp (label-based)
subset = xs.sel(time="2024-02-02 10:00")

Time selection is flexible—you don’t need to type the full timestamp exactly. For example, you can specify partial strings such as "2024-02-02" (to select all hours of that day) or "2024-02-02T10:00" (to select a specific hour).
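A minimal sketch of partial-string selection on the sample dataset:

# All timestamps on 2024-02-02 (partial string indexing keeps the time dimension)
day = xs.sel(time="2024-02-02")
print(day.sizes['time'])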

You can also select using time representations other than str, such as datetime.datetime or pandas.Timestamp:

import pandas as pd
subset = xs.sel(time=pd.Timestamp("2024-02-02 10:00"))

from datetime import datetime
subset = xs.sel(time=datetime(2024, 2, 2, 10))

You can also select time ranges using Python-style slices (note that label-based slices include both endpoints):

subset = xs.sel(time=slice('2024-01-22T10:00', '2024-01-25T10:00'))

2.3 Subsetting with boolean expressions#

You can also subset a dataset using boolean expressions on coordinate values for flexible, condition-based filtering across one or more dimensions.

subset = xs.sel(slope=(xs.azimuth == 180))
subset = xs.sel(location=(xs.altitude > 2000))
subset = xs.sel(location=(xs.latitude > 47.3) & (xs.longitude < 11.4), 
                time=(xs.time > pd.Timestamp('2024-01-25T10:00')),
                slope=(xs.azimuth == 180))
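Label-based selection also accepts lists of coordinate values, which you can mix freely with the boolean expressions above; a minimal sketch:

# Two named sites, south-facing slopes only
subset = xs.sel(location=['VIR1A', 'VIR3A'], slope=(xs.azimuth == 180))
print(subset.sizes['location'], subset.sizes['slope'])  # 2 1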

3. Masking and filtering by data conditions#

Use .where() to mask or filter by values within variables. For example, retain only the weak layers of a profile, i.e., layers with a stability index sk38 below 0.9:

# First create an illustrative data subset:
sub = xs.isel(location=0, time=-1, slope=0)
# Then filter based on condition:
weak = sub.where(sub['sk38'] < 0.9)
weak['sk38'].values
array([[0.17, 0.12,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]],
      dtype=float32)

Masking via .where() keeps the original shape but fills non-matching elements with NaN. To drop them entirely, pass the argument drop=True or use the method .dropna(). Note, however, that xarray only drops a label if it is all-NaN across the entire dataset. See the orange warning box further below for a more comprehensive discussion of this topic.

weak_drop = weak.dropna(dim='layer', how='all')
weak_drop['sk38'].values
array([[0.17, 0.12]], dtype=float32)
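The same result can be obtained in one step by passing drop=True to .where(); a minimal sketch on the subset from above:

weak_drop2 = sub.where(sub['sk38'] < 0.9, drop=True)
print(weak_drop2['sk38'].values)  # the same two remaining layers as above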

4. More illustrative examples#

Exercise

How many profiles with snow on the ground exist in xs?

Solution:

# no where or sel needed--simply sum over the boolean array:
n_profiles_with_snow = (xs['HS'] > 0).sum()

print(n_profiles_with_snow.values)
8861

Exercise

Extract the smallest value of sk38 for each profile in xs!

Solution:

min_sk38 = xs['sk38'].min(dim='layer')  # no where or sel needed--
                                        # we want it for all profiles!

Exercise

Compute the time series of the median snow height of all profiles with at least one unstable layer!

  • For simplicity, let’s define unstable layers as sk38 < 0.9

  • Use the dataset xs

Solution:

# Profiles (location/slope/realization/time combinations) with at least one unstable layer
unstable_profiles_mask = (xs['sk38'] < 0.9).any(dim='layer')
# Mask HS to those profiles, then take the median over space for each timestamp
median_hs_unstable_timeseries = (
    xs['HS'].where(unstable_profiles_mask)
            .median(dim=['location', 'slope', 'realization'], skipna=True)
)
print(median_hs_unstable_timeseries.sizes)
Frozen({'time': 416})

Exercise

What’s the first time at which all profiles have snow on the ground?

Subset xs from that time onwards.

Solution:

# Timestamps at which every profile has snow on the ground
mask = (xs.HS > 0).all(dim=['location', 'slope', 'realization'])
# Earliest such timestamp
first_all = xs["time"].where(mask, drop=True).min()
# Keep everything from that time onwards
xs = xs.sel(time=slice(first_all, None))

Exercise

Extract the snow surface layer of each profile!

Remember that the surface index can vary over time and space—the surface “moves” to different layer positions as the snowpack evolves.

  1. Start with the dataset ts = xsnow.single_profile_timeseries()

  2. Then do it for the larger dataset xs = xsnow.sample_data.snp_gridded_ds()

Solution (part 1):

ts = xsnow.single_profile_timeseries()

surface_idx = abs(ts['z']).argmin(dim='layer')  # absolute value of z -> minimum -> 0
surface_layers = ts.isel(layer=surface_idx)     # subset by layer index

print(ts.sizes)
print(surface_layers.sizes)
print(surface_layers['z'].to_series().values)
[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 1 workers...
[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1})
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Solution (part 2):

import numpy as np
surface_idx = abs(xs['z']).where(xs['z'].notnull(), np.inf).argmin(dim='layer')  # NaN -> Inf
surface_layers = xs.isel(layer=surface_idx)

check = ((surface_layers['z'].notnull()) & (surface_layers['z'] != 0)).sum() == 0
print(xs.sizes)
print(surface_layers.sizes)
print(f"The remaining surface_layers dataset contains only layers of z == 0|NaN: {check.values}")
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1, 'layer': 33})
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1})
The remaining surface_layers dataset contains only layers of z == 0|NaN: True

You’ll find an explanation of why these two datasets require different approaches in the box below.

Subsetting vs masking (and why NaNs remain)

In xarray, subsetting or dropping data does not always remove every unwanted point. Because xarray datasets represent regular, gridded arrays, each coordinate value must align consistently across all variables and dimensions. Even if a specific cell is missing (NaN) at one time or location, it may still exist elsewhere in the same grid position. Therefore, xarray keeps the full structure intact and marks invalid positions as NaN instead of removing them entirely.

This means you should carefully decide whether your goal is to reduce the dataset (by dropping coordinates or dimensions) or to mask unwanted values (keeping the grid constant). Both approaches are useful:

  • Reducing the dataset is efficient when every coordinate slice of the dropped dimension is invalid.

  • Masking with NaN values preserves shape consistency, which is important for vectorized operations, merging, or time-series alignment.

In the single-profile timeseries example above, extracting the surface layer was straightforward: every profile contained valid snow layers (i.e., z values).

For the larger, multi-profile dataset, however, not every profile had snow on the ground.
Reducing the dataset to only “valid” profiles is not possible here, because at several timestamps profiles with and without snow on the ground coexist. If argmin() receives only NaN values from a snow-free profile, it cannot determine a valid layer index. To handle this, we replaced NaN values with Inf before applying argmin(). For snow-free profiles the computed index then simply points to a layer position that holds no data (NaN), which is safely carried along, and the data shape stays consistent across all locations.
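As a small illustration of this point, compare masking and dropping weak layers across the full dataset; a sketch whose exact layer counts depend on the data:

# Masking preserves the full (location, time, slope, realization, layer) shape
weak_all = xs['sk38'].where(xs['sk38'] < 0.9)
# drop=True removes a layer index only if it is all-NaN for every profile,
# so NaNs generally remain wherever profiles disagree
weak_all_dropped = xs['sk38'].where(xs['sk38'] < 0.9, drop=True)
print(weak_all.sizes['layer'], weak_all_dropped.sizes['layer'])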

5. Classification extension#

xsnow includes a classification extension that builds on simple masking to support criteria-based selection and multi-class classification.

It lets you:

  • create boolean masks from readable expressions (e.g., "density > 300 & grain_size < 0.5"),

  • attach those masks as coordinates or data variables for tidy subsetting, and

  • derive categorical classes (e.g., stability bins) from numeric variables.

See the API reference for details and examples.
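If you want a feel for the underlying mechanics, the same ideas can be prototyped with plain xarray. The sketch below builds a boolean mask from a readable condition and derives a categorical class from density; the thresholds and class names are illustrative assumptions, not xsnow defaults:

import xarray as xr

# Boolean mask from a readable condition (illustrative thresholds)
weak_mask = (xs['sk38'] < 0.9) & (xs['density'] < 250)
xs_flagged = xs.assign(weak_layer=weak_mask)        # attach the mask as a data variable
n_weak = xs_flagged['weak_layer'].sum(dim='layer')  # weak-layer count per profile

# Categorical classes from a numeric variable (illustrative bins)
density_class = xr.where(xs['density'].isnull(), 'no layer',
                xr.where(xs['density'] < 200, 'soft',
                xr.where(xs['density'] < 350, 'medium', 'hard')))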

6. Summary#

  • Use .isel() and .sel() for standard selection and conditional subsetting

  • Use .where() for conditional filtering and masking non-matching targets with NaN

  • Inspect xs.sizes and xs.coords to understand your selection context