Subsetting and Filtering#
In this tutorial, you’ll learn how to select subsets of an xsnowDataset along its dimensions by using standard xarray indexing.
After this tutorial, you will be able to:
- Select data by location, time, slope, realization, or layer
- Mask and filter data by variable conditions
- Explore the xsnow classification extension for solving domain-specific, yet generic tasks
We use a small sample SNOWPACK dataset shipped with xsnow.sample_data.
import xsnow
xs = xsnow.sample_data.snp_gridded_ds()
xs = xs.compute()
print(xs)
[i] xsnow.xsnow_io: Using lazy loading with Dask
[i] xsnow.xsnow_io: Analyzing file sizes for 25 files...
[i] xsnow.xsnow_io: File size range: 223.9KB - 1620.6KB
[i] xsnow.xsnow_io: Scanning 10 largest files to determine max layer count...
[i] xsnow.xsnow_io: ✅ Smart max_layers determination complete:
[i] xsnow.xsnow_io: - Checked: 25 file sizes
[i] xsnow.xsnow_io: - Scanned: 10 largest files
[i] xsnow.xsnow_io: - Max layers found: 33
[i] xsnow.xsnow_io: - Using max_layers: 36 (with 10% buffer - partial scan)
[i] xsnow.xsnow_io: - Time elapsed: 0.10 seconds
[i] xsnow.xsnow_io: Creating TRULY LAZY datasets using dask.delayed...
[i] xsnow.xsnow_io: Using max_layers dimension: 36
[i] xsnow.xsnow_io: Profile data will NOT be read until .compute() or .load() is called
[i] xsnow.xsnow_io: Building delayed loading graph for 25 files...
[i] xsnow.xsnow_io: ✅ Lazy structure created successfully!
[i] xsnow.xsnow_io: - Files: 25
[i] xsnow.xsnow_io: - Layer dimension: 36
[i] xsnow.xsnow_io: - Data loaded: NONE (truly lazy!)
[i] xsnow.xsnow_io: - Call .compute() or .load() to read files
[i] xsnow.xsnow_io: Computing delayed datasets sequentially...
[i] xsnow.xsnow_io: Created 25 datasets
<xsnowDataset>
Locations: 5
Timestamps: 416 (2024-01-16--2024-02-02)
Profiles: 10400 total | 10400 valid | 8861 with HS>0
employing the
<xarray.Dataset> Size: 46MB
Dimensions: (location: 5, time: 416, slope: 5,
realization: 1, layer: 36)
Coordinates:
* location (location) <U5 100B 'VIR1A' 'VIR2A' ... 'VIR5A'
* time (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ....
* slope (slope) int64 40B 0 1 2 3 4
* realization (realization) int64 8B 0
* layer (layer) int64 288B 0 1 2 3 4 5 ... 31 32 33 34 35
altitude (location) float64 40B 2.372e+03 ... 2.066e+03
latitude (location) float64 40B 47.15 47.44 ... 47.44 47.37
longitude (location) float64 40B 11.19 11.29 11.3 11.39 11.5
azimuth (slope) int64 40B 0 0 90 180 270
inclination (slope) int64 40B 0 38 38 38 38
z (location, time, slope, realization, layer) float32 1MB ...
Data variables: (12/91)
ColdContentSnow (location, time, slope, realization) float64 83kB ...
DW (location, time, slope, realization) float64 83kB ...
HN12 (location, time, slope, realization) float64 83kB ...
HN24 (location, time, slope, realization) float64 83kB ...
HN3 (location, time, slope, realization) float64 83kB ...
HN6 (location, time, slope, realization) float64 83kB ...
... ...
zS4 (location, time, slope, realization) float64 83kB ...
zS5 (location, time, slope, realization) float64 83kB ...
zSd (location, time, slope, realization) float64 83kB ...
zSn (location, time, slope, realization) float64 83kB ...
zSs (location, time, slope, realization) float64 83kB ...
HS (location, time, slope, realization) float32 42kB ...
Attributes:
Conventions: CF-1.8
crs: EPSG:4326
1. Overview of dimensions#
Before filtering, check which dimensions and coordinates the dataset contains:
print(xs.sizes)
print(xs.coords)
Frozen({'location': 5, 'time': 416, 'slope': 5, 'realization': 1, 'layer': 36})
Coordinates:
* location (location) <U5 100B 'VIR1A' 'VIR2A' 'VIR3A' 'VIR4A' 'VIR5A'
* time (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ... 2024-02-02...
* slope (slope) int64 40B 0 1 2 3 4
* realization (realization) int64 8B 0
* layer (layer) int64 288B 0 1 2 3 4 5 6 7 ... 28 29 30 31 32 33 34 35
altitude (location) float64 40B 2.372e+03 1.749e+03 ... 2.066e+03
latitude (location) float64 40B 47.15 47.44 47.15 47.44 47.37
longitude (location) float64 40B 11.19 11.29 11.3 11.39 11.5
azimuth (slope) int64 40B 0 0 90 180 270
inclination (slope) int64 40B 0 38 38 38 38
z (location, time, slope, realization, layer) float32 1MB 0.0 ...
Typical dimensions:
- location – one or several sites or grid points
- time – datetime stamps
- slope – inclination and azimuth combinations
- realization – simulation variant
- layer – vertical snow layers
Coordinates are one-dimensional arrays that label the positions along each dataset dimension. They define the values you use to index or align data — for example, the actual timestamps along time, the site names and their geographical coordinates along location, or slope angles along slope. In xarray, coordinates make datasets self-describing, so selections and merges can be done by informative labels instead of hard-to-interpret integer indices.
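As a minimal sketch of this idea, label-based and position-based selection address the same element. This uses plain xarray with a tiny synthetic dataset mimicking the layout of xs, not the xsnow sample data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic dataset mimicking the (location, time) layout of xs
ds = xr.Dataset(
    {"HS": (("location", "time"), np.array([[0.5, 0.6], [1.2, 1.1]]))},
    coords={
        "location": ["VIR1A", "VIR2A"],
        "time": pd.date_range("2024-01-16", periods=2, freq="h"),
        "altitude": ("location", [2372.0, 1749.0]),  # non-dimension coordinate
    },
)

# Selecting by the informative label ...
by_label = ds.sel(location="VIR2A")
# ... is equivalent to selecting by the integer position:
by_index = ds.isel(location=1)
assert by_label.identical(by_index)
```

Note how the non-dimension coordinate altitude travels along with the selection, so the result stays self-describing.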
2. Subsetting by coordinates#
xarray provides two main selectors:
- .isel() – index-based (integer positions)
- .sel() – label-based (coordinate values)
2.1 Selecting by index position#
# Select the last time step, third location, and first slope
subset = xs.isel(time=-1, location=2, slope=0)
print(subset['density'].isel(layer=slice(0,3)))
<xarray.DataArray 'density' (realization: 1, layer: 3)> Size: 12B
array([[246.1, 251. , 210.1]], dtype=float32)
Coordinates:
* realization (realization) int64 8B 0
* layer (layer) int64 24B 0 1 2
altitude float64 8B 1.86e+03
latitude float64 8B 47.15
location <U5 20B 'VIR3A'
longitude float64 8B 11.3
time datetime64[ns] 8B 2024-02-02T12:00:00
azimuth int64 8B 0
inclination int64 8B 0
slope int64 8B 0
z (realization, layer) float32 12B -8.68 -6.85 -5.33
Attributes:
standard_name: mass_density_of_snow
units: kg m-3
long_name: mass density of snow
Note: When you select a single element along a dimension (for example, subset.isel(realization=0)), xarray automatically squeezes that dimension out of the dataset. The dimension disappears from the list of dimensions (e.g., from (realization: 1, layer: 3) to (layer: 3)), but its coordinate (e.g., realization = 0) is still preserved in the metadata.
If you want to keep the dimension (so it still shows as length 1), wrap the index in a list:
xs.isel(realization=[0])
This can be important when chaining selections — once a dimension is dropped, you can no longer subset along it:
subset = xs.isel(slope=0)
subset.isel(slope=0) # this will fail, because 'slope' is no longer a dimension
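The squeezing behaviour can be reproduced with a small synthetic dataset (plain xarray, toy data for illustration only):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"HS": (("slope", "time"), np.zeros((5, 3)))},
    coords={"slope": [0, 1, 2, 3, 4]},
)

squeezed = ds.isel(slope=0)   # scalar index: 'slope' dimension is dropped
kept = ds.isel(slope=[0])     # list index: 'slope' kept with length 1

assert "slope" not in squeezed.dims
assert kept.sizes["slope"] == 1

# Chaining along the dropped dimension now raises an error
try:
    squeezed.isel(slope=0)
except ValueError:
    pass  # 'slope' is no longer a dimension
```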
2.2 Selecting by coordinate values#
# Select by timestamp (label-based)
subset = xs.sel(time="2024-02-02 10:00")
Time selection is flexible—you don’t need to type the full timestamp exactly. For example, you can specify partial strings such as "2024-02-02" (to select all hours of that day) or "2024-02-02T10:00" (to select a specific hour).
You can also select using other time representations than str, such as datetime.datetime or pandas.Timestamp:
import pandas as pd
subset = xs.sel(time=pd.Timestamp("2024-02-02 10:00"))
from datetime import datetime
subset = xs.sel(time=datetime(2024, 2, 2, 10))
Or you can also select time ranges using Python-style slices:
subset = xs.sel(time=slice('2024-01-22T10:00', '2024-01-25T10:00'))
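These selection styles can be verified on a small synthetic time axis (plain xarray, not the sample dataset). Note that, unlike positional slicing, both endpoints of a .sel() slice are inclusive:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2024-01-16", "2024-01-18 23:00", freq="h")
ds = xr.Dataset({"HS": ("time", np.arange(len(times), dtype=float))},
                coords={"time": times})

# Partial-string selection: all 24 hours of that day
day = ds.sel(time="2024-01-17")
assert day.sizes["time"] == 24

# Slice selection: both endpoints are included
window = ds.sel(time=slice("2024-01-16T10:00", "2024-01-16T12:00"))
assert window.sizes["time"] == 3
```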
2.3 Subsetting with boolean expressions#
You can also subset a dataset using boolean expressions on coordinate values for flexible, condition-based filtering across one or more dimensions.
subset = xs.sel(slope=(xs.azimuth == 180))
subset = xs.sel(location=(xs.altitude > 2000))
subset = xs.sel(location=(xs.latitude > 47.3) & (xs.longitude < 11.4),
time=(xs.time > pd.Timestamp('2024-01-25T10:00')),
slope=(xs.azimuth == 180))
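A toy sketch of the azimuth-based selection above, using a synthetic dataset with the same slope/azimuth layout (plain xarray):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"HS": (("slope", "time"), np.ones((5, 4)))},
    coords={
        "slope": [0, 1, 2, 3, 4],
        "azimuth": ("slope", [0, 0, 90, 180, 270]),
    },
)

# The boolean DataArray over 'slope' keeps only matching positions
south = ds.sel(slope=(ds.azimuth == 180))
assert south.slope.values.tolist() == [3]
assert south.azimuth.values.tolist() == [180]
```

Because azimuth is a coordinate on the slope dimension, the boolean expression selects along slope even though the condition is phrased in terms of azimuth.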
3. Masking and filtering by data conditions#
Use .where() to mask or filter by values within variables. For example, retain only layers whose stability index sk38 is below 0.9:
# First create an illustrative data subset:
sub = xs.isel(location=0, time=-1, slope=0)
# Then filter based on condition:
weak = sub.where(sub['sk38'] < 0.9)
weak['sk38'].values
array([[0.17, 0.12, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan]], dtype=float32)
Masking via .where() keeps the original shape but fills non-matching elements with NaN. To drop them entirely, add the argument drop=True or use the method .dropna(). Note, however, that xarray only drops a label if it is all-NaN across the entire dataset. See the orange warning box further below for a more comprehensive discussion of this topic.
weak_drop = weak.dropna(dim='layer', how='all')
weak_drop['sk38'].values
array([[0.17, 0.12]], dtype=float32)
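The mask-then-drop pattern can be reproduced with a minimal synthetic DataArray (toy sk38 values, for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([0.17, 0.12, 1.3, 2.0], dims="layer",
                  coords={"layer": [0, 1, 2, 3]}, name="sk38")

masked = da.where(da < 0.9)               # shape preserved, NaN elsewhere
dropped = da.where(da < 0.9, drop=True)   # all-NaN labels removed

assert masked.sizes["layer"] == 4
assert int(masked.isnull().sum()) == 2
assert dropped.values.tolist() == [0.17, 0.12]
```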
4. More illustrative examples#
Exercise
How many profiles with snow on the ground exist in xs?
8861
Exercise
Extract the smallest value of sk38 for each profile in xs!
Exercise
Compute the time series of the median snow height of all profiles with at least one unstable layer!
For simplicity, let’s define unstable layers as sk38 < 0.9
Use the dataset xs
Frozen({'time': 416})
Exercise
What’s the first time at which all profiles have snow on the ground?
Subset xs from that time onwards.
Exercise
Extract the snow surface layer of each profile!
Remember that the surface index can vary over time and space—the surface “moves” to different layer positions as the snowpack evolves.
Start with the dataset ts = xsnow.single_profile_timeseries()
Then do it for the larger dataset
xs = xsnow.sample_data.snp_gridded_ds()
[i] xsnow.xsnow_io: Analyzing file sizes for 2 files...
[i] xsnow.xsnow_io: File size range: 4592.3KB - 4816.0KB
[i] xsnow.xsnow_io: Scanning 2 largest files to determine max layer count...
[i] xsnow.xsnow_io: ✅ Smart max_layers determination complete:
[i] xsnow.xsnow_io: - Checked: 2 file sizes
[i] xsnow.xsnow_io: - Scanned: 2 largest files
[i] xsnow.xsnow_io: - Max layers found: 59
[i] xsnow.xsnow_io: - Using max_layers: 59 (exact - scanned all files)
[i] xsnow.xsnow_io: - Time elapsed: 0.04 seconds
[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 1 workers...
[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})
Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1})
[0. 0. 0. ... 0. 0. 0.]  (381 values, all 0.0)
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1, 'layer': 36})
Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1})
The remaining surface_layers dataset contains only layers of z == 0|NaN: True
You’ll find an explanation of why these two datasets require different approaches in the box below.
Subsetting vs masking (and why NaNs remain)
In xarray, subsetting or dropping data does not always remove every unwanted point.
Because xarray datasets represent regular, gridded arrays, each coordinate value must align consistently across all variables and dimensions.
Even if a specific cell is missing (NaN) at one time or location, it may still exist elsewhere in the same grid position.
Therefore, xarray keeps the full structure intact and marks invalid positions as NaN instead of removing them entirely.
This means you should carefully decide whether your goal is to reduce the dataset (by dropping coordinates or dimensions) or to mask unwanted values (keeping the grid constant). Both approaches are useful:

- Reducing the dataset is efficient when every coordinate slice of the dropped dimension is invalid.
- Masking with NaN values preserves shape consistency, which is important for vectorized operations, merging, or time-series alignment.
In the single-profile timeseries example above, extracting the surface layer was straightforward: every profile contained valid snow layers (i.e., z values).
For the larger, multi-profile dataset, however, not every profile had snow on the ground.
Attempting to reduce the dataset to only “valid” profiles would have been impossible, because at several timestamps profiles with and without snow on the ground co-exist.
If argmin() gets all-NaN values from a snow-free profile, it cannot determine a valid layer index.
To handle this, we replaced NaN values with Inf before applying argmin().
This ensures that the computed layer index points to a non-existent (and safely ignored) layer for those snow-free profiles, maintaining a consistent data shape across all locations.
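The NaN-to-Inf trick can be sketched in plain NumPy (a generic illustration of the idea, not xsnow's exact code):

```python
import numpy as np

# Two toy profiles: one with snow layers, one snow-free (all NaN)
z = np.array([
    [5.0, 0.0, 12.5],          # argmin should pick index 1
    [np.nan, np.nan, np.nan],  # argmin over all-NaN is ill-defined
])

# Replacing NaN with +Inf keeps argmin well-defined everywhere:
idx = np.argmin(np.where(np.isnan(z), np.inf, z), axis=1)
assert idx.tolist() == [1, 0]
# For the snow-free profile the resulting index points at a NaN layer,
# which stays NaN after indexing and is safely ignored downstream.
```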
5. Classification extension#
xsnow includes a classification extension that builds on simple masking to support criteria-based selection and multi-class classification.
It lets you:
- create boolean masks from readable expressions (e.g., "density > 300 & grain_size < 0.5") or custom classification functions,
- conveniently attach those masks as coordinates or data variables for tidy subsetting,
- derive categorical classes (e.g., stability bins) from numeric variables, and
- group spatially based on polygons.
See the API reference for details and illustrative examples. Two similar examples are shown here to highlight the functionality:
Classify layers into stability classes#
The following example classifies all layers into different stability classes. For simplicity, we use SK38 alone; for real applications you could superimpose other conditions, such as RTA > 0.8. The example

- performs a clean classification (including best practices for metadata annotations)
- prints a summary table of how many layers exist in each stability class
- filters for very poor layers and masks others with NaN
import numpy as np
from xsnow.extensions.classification import classify_by_bins
# Prepare for classification
sk38_bins = np.array([0.5, 0.95, 1.05])
sk38_mapping = {
    0: "very poor (<=0.5)",
    1: "poor (<0.95)",
    2: "fair (<1.05)",
    3: "rather stable (>1.05)",
}
classifier_attrs = {
    "mapping": sk38_mapping,
    "long_name": "SK38 stability class",
    "classification_function": "classify_by_bins",
    "bin_edges": sk38_bins.tolist(),
    "right_closed": True,
}
# Perform the classification
xs2 = xs.classify(
    classify_by_bins,
    output_name="SK38_class",
    output_kind="coord",
    input_var="sk38",
    attrs=classifier_attrs,
    bins=sk38_bins,
    right=True,
)
# After classification,
# get a summary table for number of layers per stability class
print(
    xs2.SK38_class.to_series().dropna()
        .groupby(xs2.SK38_class.to_series())
        .size()
)
# Or filter for those layers with very poor stability
xs_sk38_very_poor = xs2.where(xs2.SK38_class == 0)
SK38_class
0 10817
1 4613
2 831
3 301439
Name: SK38_class, dtype: int64
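Conceptually, right-closed bin classification maps values to class indices the way np.digitize does. The sketch below illustrates the binning semantics with the bin edges from above; it is not necessarily how classify_by_bins is implemented internally:

```python
import numpy as np

sk38 = np.array([0.3, 0.7, 1.0, 1.5])
bins = np.array([0.5, 0.95, 1.05])

# right=True: values with bins[i-1] < x <= bins[i] map to class i
classes = np.digitize(sk38, bins, right=True)
assert classes.tolist() == [0, 1, 2, 3]  # very poor, poor, fair, rather stable
```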
Classify profiles by spatial polygons#
This example shows how to

- subset to all profiles from a specific region/polygon
- aggregate by region and/or other coordinates (existing or custom ones)
# Create two regions and leave one location outside
regions = {
    "valley": "POLYGON ((11.15 47.12, 11.15 47.17, 11.33 47.17, 11.33 47.12, 11.15 47.12))",
    "ridge":  "POLYGON ((11.25 47.42, 11.25 47.46, 11.43 47.46, 11.43 47.42, 11.25 47.42))",
}
xs_regions = xs.classify_spatially(regions, name="region", kind="coord",
                                   polygons_crs="EPSG:4326")
# Subset to only one region
xs_valley = xs_regions.where(xs_regions.region == "valley", drop=True)
# Group by region and slope and aggregate
xs_stats = xs_regions.groupby(["region", "slope"]).mean()
# Group by region, slope, and elevation band
# there is no band coordinate yet, so we create one with another classification:
xs_regionbands = xs_regions.classify_criteria(
    "altitude >= 2000",
    name="band",
    kind="coord",
    attrs={"mapping": {False: "<2000m", True: ">=2000m"}},
)
xs_regionband_average = xs_regionbands.groupby(["region", "slope", "band"]).mean()
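The aggregation step relies on plain xarray groupby under the hood. A minimal synthetic sketch of grouping by a categorical coordinate (toy names and values, independent of the xsnow extension):

```python
import xarray as xr

ds = xr.Dataset(
    {"HS": ("location", [0.5, 1.5, 1.0, 2.0])},
    coords={
        "location": ["A", "B", "C", "D"],
        # categorical non-dimension coordinate along 'location'
        "band": ("location", ["<2000m", ">=2000m", "<2000m", ">=2000m"]),
    },
)

# Grouping by the coordinate aggregates over each category
means = ds.groupby("band").mean()
assert means["HS"].sel(band="<2000m").item() == 0.75
assert means["HS"].sel(band=">=2000m").item() == 1.75
```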
6. Summary#

- Use .isel() and .sel() for standard selection and conditional subsetting
- Use .where() for conditional filtering and masking non-matching targets with NaN
- Inspect xs.sizes and xs.coords to understand your selection context
- Use the classification extension for solving more complex tasks conveniently