{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Subsetting and Filtering\n", "\n", "In this tutorial, you’ll learn how to select subsets of an `xsnowDataset` along its dimensions by using standard `xarray` indexing.\n", "\n", "After this tutorial, you will be able to:\n", "\n", "1) Select data by **location**, **time**, **slope**, **realization**, or **layer**\n", "2) Mask and filter data by **variable conditions**\n", "3) Explore the `xsnow` classification extension for solving domain-specific, yet generic tasks\n", "\n", "We use a small sample SNOWPACK dataset shipped with `xsnow.sample_data`." ] }, { "cell_type": "code", "execution_count": 1, "id": "e432d592", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Using lazy loading with Dask\n", "[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...\n", "[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user\n", "[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Locations: 5\n", " Timestamps: 416 (2024-01-16--2024-02-02)\n", " Profiles: 10400 total | 10400 valid | 8861 with HS>0\n", "\n", " employing the Size: 42MB\n", " Dimensions: (location: 5, time: 416, slope: 5,\n", " realization: 1, layer: 33)\n", " Coordinates:\n", " altitude (location) float64 40B 2.372e+03 ... 2.066e+03\n", " latitude (location) float64 40B 47.15 47.44 ... 47.44 47.37\n", " * location (location) Size: 12B\n", "array([[246.1, 251. 
, 210.1]], dtype=float32)\n", "Coordinates:\n", " altitude float64 8B 1.86e+03\n", " latitude float64 8B 47.15\n", " location 2000))" ] }, { "cell_type": "code", "execution_count": 9, "id": "e4c5fce1", "metadata": {}, "outputs": [], "source": [ "subset = xs.sel(location=(xs.latitude > 47.3) & (xs.longitude < 11.4), \n", " time=(xs.time > pd.Timestamp('2024-01-25T10:00')),\n", " slope=(xs.azimuth == 180))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Masking and filtering by data conditions\n", "\n", "Use `.where()` to mask or filter by values within variables. For example, retain only layers with density above 200 kg/m³:" ] }, { "cell_type": "code", "execution_count": 10, "id": "9edd4553", "metadata": {}, "outputs": [], "source": [ "# First create an illustrative data subset:\n", "sub = xs.isel(location=0, time=-1, slope=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.17, 0.12, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],\n", " dtype=float32)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Then filter based on condition:\n", "weak = sub.where(sub['sk38'] < 0.9)\n", "weak['sk38'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Masking via `.where()` **keeps original shape** but fills non-matching elements with `NaN`. To drop them entirely, add the argument `drop=True` or use the method `.dropna()`. Note, however, that xarray **only drops a label if it’s all-`NaN` across the entire dataset**. 
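\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this behavior (on a small synthetic array, not the sample dataset):\n", "\n", "```python\n", "import numpy as np\n", "import xarray as xr\n", "\n", "# Hypothetical 2-profile x 3-layer stability array\n", "da = xr.DataArray([[0.5, 1.2, np.nan],\n", "                   [0.7, 0.8, np.nan]], dims=('profile', 'layer'))\n", "\n", "masked = da.where(da < 0.9)              # same shape, NaN where condition fails\n", "dropped = da.where(da < 0.9, drop=True)  # drops only all-NaN labels\n", "\n", "print(masked.shape, dropped.shape)  # (2, 3) (2, 2)\n", "```\n", "\n", "Layer 2 is dropped because the condition fails there for *every* profile; layer 1 survives with a `NaN` in profile 0, because profile 1 still matches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "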
See the orange warning box further below for a more comprehensive discussion of this topic.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.17, 0.12]], dtype=float32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weak_drop = weak.dropna(dim='layer', how='all')\n", "weak_drop['sk38'].values" ] }, { "cell_type": "markdown", "id": "40a31af5", "metadata": {}, "source": [ "## 4. More illustrative examples" ] }, { "cell_type": "markdown", "id": "6468504c", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**How many profiles with snow on the ground exist in `xs`?**\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "id": "30a0623c", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8861\n" ] } ], "source": [ "# No .where() or .sel() needed -- simply sum over the boolean array:\n", "n_profiles_with_snow = (xs['HS'] > 0).sum()\n", "\n", "print(n_profiles_with_snow.values)" ] }, { "cell_type": "markdown", "id": "1aeb0628", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Extract the smallest value of sk38 for each profile in `xs`!**\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c3a48e3", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "min_sk38 = xs['sk38'].min(dim='layer')  # no .where() or .sel() needed --\n", "                                        # we want it for all profiles!"
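 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of what such a reduction does (on a hypothetical two-profile array, not the sample dataset): `.min(dim='layer')` removes the `layer` dimension, while `.argmin(dim='layer')` returns the *position* of the weakest layer instead of its value:\n", "\n", "```python\n", "import xarray as xr\n", "\n", "# Hypothetical stand-in for xs['sk38'] with dims (location, layer)\n", "sk38 = xr.DataArray([[1.2, 0.8, 1.0],\n", "                     [0.6, 1.5, 0.9]], dims=('location', 'layer'))\n", "\n", "min_per_profile = sk38.min(dim='layer')   # layer dimension is reduced away\n", "weakest_idx = sk38.argmin(dim='layer')    # index of the weakest layer\n", "\n", "print(min_per_profile.values)  # [0.8 0.6]\n", "print(weakest_idx.values)      # [1 0]\n", "```\n", "\n", "The `argmin` variant becomes useful further below, when we extract the surface layer of each profile."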
] }, { "cell_type": "markdown", "id": "7a43bb8b", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Compute the time series of the median snow height of all profiles with at least one unstable layer!**\n", "\n", " * For simplicity, let's define unstable layers as sk38 < 0.9\n", " * Use the dataset `xs`\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "id": "f407aa22", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'time': 416})\n" ] } ], "source": [ "unstable_profiles_mask = (xs['sk38'] < 0.9).any(dim='layer')\n", "median_hs_unstable_timeseries = xs['HS'].where(unstable_profiles_mask)\\\n", " .median(dim = ['location', \n", " 'slope', \n", " 'realization'], skipna=True)\n", "print(median_hs_unstable_timeseries.sizes)" ] }, { "cell_type": "markdown", "id": "9edbc1d6", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**What's the first time, *all* profiles have snow on the ground?**\n", "\n", "Subset `xs` from that time onwards.\n", "```" ] }, { "cell_type": "code", "execution_count": 16, "id": "fe606eb0", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "mask = (xs.HS > 0).all(dim=['location', 'slope', 'realization'])\n", "first_all = xs[\"time\"].where(mask, drop=True).min()\n", "xs = xs.sel(time=slice(first_all, None))" ] }, { "cell_type": "markdown", "id": "4d72bcc3", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Extract the snow surface layer of each profile!**\n", "\n", "Remember that the surface index can vary over time and space---the surface “moves” to different layer positions as the snowpack evolves.\n", "\n", " 1. Start with the dataset `ts = xsnow.single_profile_timeseries()`\n", " 2. 
Then do it for the larger dataset `xs = xsnow.sample_data.snp_gridded_ds()`\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "id": "aefa0b54", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 13 workers...\n", "[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})\n", "Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1})\n", "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 
0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "ts = xsnow.single_profile_timeseries()\n", "\n", "surface_idx = abs(ts['z']).argmin(dim='layer') # absolute value of z -> minimum -> 0\n", "surface_layers = ts.isel(layer=surface_idx) # subset by layer index\n", "\n", "print(ts.sizes)\n", "print(surface_layers.sizes)\n", "print(surface_layers['z'].to_series().values)" ] }, { "cell_type": "code", "execution_count": 18, "id": "59b7c413", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1, 'layer': 33})\n", "Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1})\n", "The remaining surface_layers dataset contains only layers of z == 0|NaN: True\n" ] } ], "source": [ "import numpy as np\n", "surface_idx = abs(xs['z']).where(xs['z'].notnull(), np.inf).argmin(dim='layer') # NaN -> Inf\n", "surface_layers = xs.isel(layer=surface_idx)\n", "\n", "check = ((surface_layers['z'].notnull()) & (surface_layers['z'] != 0)).sum() == 0\n", "print(xs.sizes)\n", "print(surface_layers.sizes)\n", "print(f\"The remaining surface_layers dataset contains only layers of z == 0|NaN: {check.values}\")" ] }, { "cell_type": "markdown", "id": "1f575998", "metadata": {}, "source": [ "You'll find an explanation of why these two datasets require different approaches in the box below." 
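 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `np.inf` replacement used above can be sketched in isolation on synthetic data (a hypothetical array, not the sample dataset), where one profile is entirely `NaN`:\n", "\n", "```python\n", "import numpy as np\n", "import xarray as xr\n", "\n", "# Hypothetical layer heights z; profile 1 is snow-free (all NaN)\n", "z = xr.DataArray([[30.0, 10.0, 0.0],\n", "                  [np.nan, np.nan, np.nan]], dims=('profile', 'layer'))\n", "\n", "# Filling NaN with Inf lets argmin always return a valid index\n", "idx = abs(z).where(z.notnull(), np.inf).argmin(dim='layer')\n", "print(idx.values)  # [2 0]\n", "```\n", "\n", "Profile 0’s surface sits at layer 2 (`z == 0`); for the snow-free profile, the index defaults to layer 0, whose values are all `NaN` and therefore harmless downstream."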
] }, { "cell_type": "markdown", "id": "2e4cbcf7", "metadata": {}, "source": [ "```{admonition} Subsetting vs masking (and why NaNs remain)\n", ":class: warning\n", "\n", "In `xarray`, subsetting or dropping data does **not always remove every unwanted point**.\n", "Because `xarray` datasets represent **regular, gridded arrays**, each coordinate value must align consistently across all variables and dimensions.\n", "Even if a specific cell is missing (`NaN`) at one time or location, it may still exist elsewhere in the same grid position.\n", "Therefore, `xarray` keeps the full structure intact and marks invalid positions as `NaN` instead of removing them entirely.\n", "\n", "This means you should carefully decide whether your goal is to **reduce** the dataset (by dropping coordinates or dimensions) or to **mask** unwanted values (keeping the grid constant).\n", "Both approaches are useful:\n", "- **Reducing** the dataset is efficient when every coordinate slice of the dropped dimension is invalid.\n", "- **Masking** with `NaN` values preserves shape consistency, which is important for vectorized operations, merging, or time-series alignment.\n", "\n", "In the single-profile timeseries example above, extracting the surface layer was straightforward: every profile contained valid snow layers (i.e., `z` values). \n", "\n", "For the larger, multi-profile dataset, however, not every profile had snow on the ground. 
\n", "Attempting to reduce the dataset to only “valid” profiles would have been impossible, because at several timestamps profiles with and without snow on the ground co-exist.\n", "If `argmin()` get's all-`NaN` values from a snow-free profile, it cannot determine a valid layer index.\n", "To handle this, we replaced `NaN` values with `Inf` before applying `argmin()`.\n", "This ensures that the computed layer index points to a non-existent (and safely ignored) layer for those snow-free profiles, maintaining a consistent data shape across all locations.\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Classification extension\n", "\n", "`xsnow` includes a [classification extension](../api/_generated/xsnow.extensions.classification) that builds on simple masking to support criteria-based selection and multi-class classification.\n", "\n", "It lets you:\n", "\n", " * create boolean masks from readable expressions (e.g., `\"density > 300 & grain_size < 0.5\"`), \n", " or custom classification functions.\n", " * conveniently attach those masks as coordinates or data variables for tidy subsetting,\n", " * derive categorical classes (e.g., stability bins) from numeric variables, and\n", " * group spatially based on polygons.\n", "\n", "See the API reference for details and illustrative examples. Two similar examples are shown here to highlight the functionality:\n", "\n", "#### Classify layers into stability classes\n", "The following example classifies all layers into different stability classes. For simplicity, we simply use SK38 alone. 
For real applications, you could of course superimpose further conditions such as *RTA > 0.8*.\n", "The example\n", "* performs a clean classification (including best practices for metadata annotations)\n", "* prints a summary table of how many layers exist in each stability class\n", "* filters for very poor layers and masks others with NaN" ] }, { "cell_type": "code", "execution_count": 19, "id": "e7ae4bb7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SK38_class\n", "0 10817\n", "1 4613\n", "2 831\n", "3 274964\n", "Name: SK38_class, dtype: int64\n" ] } ], "source": [ "import numpy as np\n", "from xsnow.extensions.classification import classify_by_bins\n", "\n", "# Prepare for classification\n", "sk38_bins = np.array([0.5, 0.95, 1.05])\n", "sk38_mapping = {\n", " 0: \"very poor (<=0.5)\",\n", " 1: \"poor (<0.95)\",\n", " 2: \"fair (<1.05)\",\n", " 3: \"rather stable (>1.05)\",\n", "}\n", "classifier_attrs = {\n", " \"mapping\": sk38_mapping,\n", " \"long_name\": \"SK38 stability class\",\n", " \"classification_function\": \"classify_by_bins\",\n", " \"bin_edges\": sk38_bins.tolist(),\n", " \"right_closed\": True,\n", "}\n", "# Perform the classification\n", "xs2 = xs.classify(\n", " classify_by_bins,\n", " output_name=\"SK38_class\",\n", " output_kind=\"coord\",\n", " input_var=\"sk38\",\n", " attrs=classifier_attrs,\n", " bins=sk38_bins,\n", " right=True,\n", ")\n", "\n", "# After classification,\n", "# get a summary table of the number of layers per stability class\n", "print(\n", " xs2.SK38_class.to_series().dropna().\\\n", " groupby(xs2.SK38_class.to_series()).size()\n", ")\n", "\n", "# Or filter for those layers with very poor stability\n", "xs_sk38_very_poor = xs2.where(xs2.SK38_class == 0)" ] }, { "cell_type": "markdown", "id": "6cfe1768", "metadata": {}, "source": [ "#### Classify profiles by spatial polygons\n", "This example shows how to\n", "* subset to all profiles from a specific region/polygon\n", "* aggregate by 
region and/or other coordinates (existing or custom ones)" ] }, { "cell_type": "code", "execution_count": 20, "id": "058ff896", "metadata": {}, "outputs": [], "source": [ "# Create two regions and leave one location outside\n", "regions = {\n", " \"valley\": \"POLYGON ((11.15 47.12, 11.15 47.17, 11.33 47.17, 11.33 47.12, 11.15 47.12))\",\n", " \"ridge\": \"POLYGON ((11.25 47.42, 11.25 47.46, 11.43 47.46, 11.43 47.42, 11.25 47.42))\",\n", "}\n", "xs_regions = xs.classify_spatially(regions, name=\"region\", kind=\"coord\",\n", " polygons_crs=\"EPSG:4326\")\n", "\n", "# Subset to only one region\n", "xs_valley = xs_regions.where(xs_regions.region == \"valley\", drop=True)\n", "\n", "# Group by region and slope and aggregate\n", "xs_stats = xs_regions.groupby([\"region\", \"slope\"]).mean()\n", "\n", "# Group by region, slope, and elevation band:\n", "# there is no band coordinate yet, so we create one with another classification\n", "xs_regionbands = xs_regions.classify_criteria(\n", " \"altitude >= 2000\",\n", " name=\"band\",\n", " kind=\"coord\",\n", " attrs={\"mapping\": {False: \"<2000m\", True: \">=2000m\"}},\n", ")\n", "xs_regionband_average = xs_regionbands.groupby([\"region\", \"slope\", \"band\"]).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Summary\n", "\n", "- Use `.isel()` and `.sel()` for standard selection and conditional subsetting \n", "- Use `.where()` for conditional filtering and masking non-matching targets with `NaN`\n", "- Inspect `xs.sizes` and `xs.coords` to understand your selection context\n", "- Use the classification extension for solving more complex tasks conveniently\n" ] } ], "metadata": { "authors": [ { "name": "xsnow Documentation Team" } ], "kernelspec": { "display_name": "xsnow-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }