Subsetting and Filtering#
In this tutorial, you’ll learn how to select subsets of an xsnowDataset along its dimensions by using standard xarray indexing.
After this tutorial, you will be able to:
Select data by location, time, slope, realization, or layer
Mask and filter data by variable conditions
Explore the
xsnowclassification extension for solving domain-specific, yet generic tasks
We use a small sample SNOWPACK dataset shipped with xsnow.sample_data.
import xsnow
xs = xsnow.sample_data.snp_gridded_ds()
xs = xs.compute()
print(xs)
[i] xsnow.xsnow_io: Using lazy loading with Dask
[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...
[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user
[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)
<xsnowDataset>
Locations: 5
Timestamps: 416 (2024-01-16--2024-02-02)
Profiles: 10400 total | 10400 valid | 8861 with HS>0
employing the <xarray.Dataset> Size: 42MB
Dimensions: (location: 5, time: 416, slope: 5,
realization: 1, layer: 33)
Coordinates:
* location (location) <U5 100B 'VIR1A' 'VIR2A' ... 'VIR5A'
* time (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ....
* slope (slope) int64 40B 0 1 2 3 4
* realization (realization) int64 8B 0
* layer (layer) int64 264B 0 1 2 3 4 5 ... 28 29 30 31 32
altitude (location) float64 40B 2.372e+03 ... 2.066e+03
latitude (location) float64 40B 47.15 47.44 ... 47.44 47.37
longitude (location) float64 40B 11.19 11.29 11.3 11.39 11.5
azimuth (slope) int64 40B 0 0 90 180 270
inclination (slope) int64 40B 0 38 38 38 38
z (location, time, slope, realization, layer) float32 1MB ...
Data variables: (12/91)
ColdContentSnow (location, time, slope, realization) float64 83kB ...
DW (location, time, slope, realization) float64 83kB ...
HN12 (location, time, slope, realization) float64 83kB ...
HN24 (location, time, slope, realization) float64 83kB ...
HN3 (location, time, slope, realization) float64 83kB ...
HN6 (location, time, slope, realization) float64 83kB ...
... ...
zS4 (location, time, slope, realization) float64 83kB ...
zS5 (location, time, slope, realization) float64 83kB ...
zSd (location, time, slope, realization) float64 83kB ...
zSn (location, time, slope, realization) float64 83kB ...
zSs (location, time, slope, realization) float64 83kB ...
HS (location, time, slope, realization) float32 42kB ...
Attributes:
Conventions: CF-1.8
crs: EPSG:4326
1. Overview of dimensions#
Before filtering, check which dimensions and coordinates the dataset contains:
print(xs.sizes)
print(xs.coords)
Frozen({'location': 5, 'time': 416, 'slope': 5, 'realization': 1, 'layer': 33})
Coordinates:
* location (location) <U5 100B 'VIR1A' 'VIR2A' 'VIR3A' 'VIR4A' 'VIR5A'
* time (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ... 2024-02-02...
* slope (slope) int64 40B 0 1 2 3 4
* realization (realization) int64 8B 0
* layer (layer) int64 264B 0 1 2 3 4 5 6 7 ... 25 26 27 28 29 30 31 32
altitude (location) float64 40B 2.372e+03 1.749e+03 ... 2.066e+03
latitude (location) float64 40B 47.15 47.44 47.15 47.44 47.37
longitude (location) float64 40B 11.19 11.29 11.3 11.39 11.5
azimuth (slope) int64 40B 0 0 90 180 270
inclination (slope) int64 40B 0 38 38 38 38
z (location, time, slope, realization, layer) float32 1MB 0.0 ...
Typical dimensions:
location– one or several sites or grid pointstime– datetime stampsslope– inclination and azimuth combinationsrealization– simulation variantlayer– vertical snow layers
Coordinates are one-dimensional arrays that label the positions along each dataset dimension. They define the values you use to index or align data — for example, the actual timestamps along time, the site names and their geographical coordinates along location, or slope angles along slope. In xarray, coordinates make datasets self-describing, so selections and merges can be done by informative labels instead of hard-to-interpret integer indices.
2. Subsetting by coordinates#
xarray provides two main selectors:
.isel()– index-based (integer positions).sel()– label-based (coordinate values)
2.1 Selecting by index position#
# Select the last time step, third location, and first slope
subset = xs.isel(time=-1, location=2, slope=0)
print(subset['density'].isel(layer=slice(0,3)))
<xarray.DataArray 'density' (realization: 1, layer: 3)> Size: 12B
array([[246.1, 251. , 210.1]], dtype=float32)
Coordinates:
* realization (realization) int64 8B 0
* layer (layer) int64 24B 0 1 2
altitude float64 8B 1.86e+03
latitude float64 8B 47.15
location <U5 20B 'VIR3A'
longitude float64 8B 11.3
time datetime64[ns] 8B 2024-02-02T12:00:00
azimuth int64 8B 0
inclination int64 8B 0
slope int64 8B 0
z (realization, layer) float32 12B -8.68 -6.85 -5.33
Attributes:
standard_name: mass_density_of_snow
units: kg m-3
long_name: mass density of snow
Note: When you select a single element along a dimension (for example, subset.isel(realization=0)), xarray automatically squeezes that dimension out of the dataset. The dimension disappears from the list of dimensions (e.g., from (realization: 1, layer: 3) to (layer: 3)), but its coordinate (e.g., realization = 0) is still preserved in the metadata.
If you want to keep the dimension (so it still shows as length 1), wrap the index in a list:
xs.isel(realization=[0])
This can be important when chaining selections — once a dimension is dropped, you can no longer subset along it:
subset = xs.isel(slope=0)
subset.isel(slope=0) # this will fail, because 'slope' is no longer a dimension
2.2 Selecting by coordinate values#
# Select by timestamp (label-based)
subset = xs.sel(time="2024-02-02 10:00")
Time selection is flexible—you don’t need to type the full timestamp exactly. For example, you can specify partial strings such as "2024-02-02" (to select all hours of that day) or "2024-02-02T10:00" (to select a specific hour).
You can also select using other time representations than str, such as datetime.datetime or pandas.Timestamp:
import pandas as pd
subset = xs.sel(time=pd.Timestamp("2024-02-02 10:00"))
from datetime import datetime
subset = xs.sel(time=datetime(2024, 2, 2, 10))
Or you can also select time ranges using Python-style slices:
subset = xs.sel(time=slice('2024-01-22T10:00', '2024-01-25T10:00'))
2.2 Subsetting with boolean expressions#
You can also subset a dataset using boolean expressions on coordinate values for flexible, condition-based filtering across one or more dimensions.
subset = xs.sel(slope=(xs.azimuth == 180))
subset = xs.sel(location=(xs.altitude > 2000))
subset = xs.sel(location=(xs.latitude > 47.3) & (xs.longitude < 11.4),
time=(xs.time > pd.Timestamp('2024-01-25T10:00')),
slope=(xs.azimuth == 180))
3. Masking and filtering by data conditions#
Use .where() to mask or filter by values within variables. For example, retain only layers with density above 200 kg/m³:
# First create an illustrative data subset:
sub = xs.isel(location=0, time=-1, slope=0)
# Then filter based on condition:
weak = sub.where(sub['sk38'] < 0.9)
weak['sk38'].values
array([[0.17, 0.12, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
dtype=float32)
Masking via .where() keeps original shape but fills non-matching elements with NaN. To drop them entirely, add the argument drop=True or use the method .dropna(). Note, however, that xarray only drops a label if it’s all-NaN across the entire dataset. See orange warning box further below for a more comprehensive discussion of this topic.
weak_drop = weak.dropna(dim='layer', how='all')
weak_drop['sk38'].values
array([[0.17, 0.12]], dtype=float32)
4. More illustrative examples#
Exercise
How many profiles with snow on the ground exist in xs?
8861
Exercise
Extract the smallest value of sk38 for each profile in xs!
Exercise
Compute the time series of the median snow height of all profiles with at least one unstable layer!
For simplicity, let’s define unstable layers as sk38 < 0.9
Use the dataset
xs
Frozen({'time': 416})
Exercise
What’s the first time, all profiles have snow on the ground?
Subset xs from that time onwards.
Exercise
Extract the snow surface layer of each profile!
Remember that the surface index can vary over time and space—the surface “moves” to different layer positions as the snowpack evolves.
Start with the dataset
ts = xsnow.single_profile_timeseries()Then do it for the larger dataset
xs = xsnow.sample_data.snp_gridded_ds()
[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 1 workers...
[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1})
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1, 'layer': 33})
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1})
The remaining surface_layers dataset contains only layers of z == 0|NaN: True
You’ll find an explanation of why these two datasets require different approaches in the box below.
Subsetting vs masking (and why NaNs remain)
In xarray, subsetting or dropping data does not always remove every unwanted point.
Because xarray datasets represent regular, gridded arrays, each coordinate value must align consistently across all variables and dimensions.
Even if a specific cell is missing (NaN) at one time or location, it may still exist elsewhere in the same grid position.
Therefore, xarray keeps the full structure intact and marks invalid positions as NaN instead of removing them entirely.
This means you should carefully decide whether your goal is to reduce the dataset (by dropping coordinates or dimensions) or to mask unwanted values (keeping the grid constant). Both approaches are useful:
Reducing the dataset is efficient when every coordinate slice of the dropped dimension is invalid.
Masking with
NaNvalues preserves shape consistency, which is important for vectorized operations, merging, or time-series alignment.
In the single-profile timeseries example above, extracting the surface layer was straightforward: every profile contained valid snow layers (i.e., z values).
For the larger, multi-profile dataset, however, not every profile had snow on the ground.
Attempting to reduce the dataset to only “valid” profiles would have been impossible, because at several timestamps profiles with and without snow on the ground co-exist.
If argmin() get’s all-NaN values from a snow-free profile, it cannot determine a valid layer index.
To handle this, we replaced NaN values with Inf before applying argmin().
This ensures that the computed layer index points to a non-existent (and safely ignored) layer for those snow-free profiles, maintaining a consistent data shape across all locations.
5. Classification extension#
xsnow includes a classification extension that builds on simple masking to support criteria-based selection and multi-class classification.
It lets you:
create boolean masks from readable expressions (e.g.,
"density > 300 & grain_size < 0.5"),attach those masks as coordinates or data variables for tidy subsetting, and
derive categorical classes (e.g., stability bins) from numeric variables.
See the API reference for details and examples.
5. Summary#
Use
.isel()and.sel()for standard selection and conditional subsettingUse
.where()for conditional filtering and masking non-matching targets withNaNInspect
xs.sizesandxs.coordsto understand your selection context