{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Subsetting and Filtering\n", "\n", "In this tutorial, you’ll learn how to select subsets of an `xsnowDataset` along its dimensions by using standard `xarray` indexing.\n", "\n", "After this tutorial, you will be able to:\n", "\n", "1) Select data by **location**, **time**, **slope**, **realization**, or **layer**\n", "2) Mask and filter data by **variable conditions**\n", "3) Explore the `xsnow` classification extension for solving domain-specific, yet generic tasks\n", "\n", "We use a small sample SNOWPACK dataset shipped with `xsnow.sample_data`." ] }, { "cell_type": "code", "execution_count": 1, "id": "e432d592", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Using lazy loading with Dask\n", "[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...\n", "[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user\n", "[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Locations: 5\n", " Timestamps: 416 (2024-01-16--2024-02-02)\n", " Profiles: 10400 total | 10400 valid | 8861 with HS>0\n", "\n", " employing the Size: 42MB\n", " Dimensions: (location: 5, time: 416, slope: 5,\n", " realization: 1, layer: 33)\n", " Coordinates:\n", " altitude (location) float64 40B 2.372e+03 ... 2.066e+03\n", " latitude (location) float64 40B 47.15 47.44 ... 47.44 47.37\n", " * location (location) Size: 12B\n", "array([[246.1, 251. 
, 210.1]], dtype=float32)\n", "Coordinates:\n", " altitude float64 8B 1.86e+03\n", " latitude float64 8B 47.15\n", " location 2000))" ] }, { "cell_type": "code", "execution_count": 9, "id": "e4c5fce1", "metadata": {}, "outputs": [], "source": [ "subset = xs.sel(location=(xs.latitude > 47.3) & (xs.longitude < 11.4), \n", " time=(xs.time > pd.Timestamp('2024-01-25T10:00')),\n", " slope=(xs.azimuth == 180))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Masking and filtering by data conditions\n", "\n", "Use `.where()` to mask or filter by values within variables. For example, retain only layers with density above 200 kg/m³:" ] }, { "cell_type": "code", "execution_count": 10, "id": "9edd4553", "metadata": {}, "outputs": [], "source": [ "# First create an illustrative data subset:\n", "sub = xs.isel(location=0, time=-1, slope=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.17, 0.12, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],\n", " dtype=float32)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Then filter based on condition:\n", "weak = sub.where(sub['sk38'] < 0.9)\n", "weak['sk38'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Masking via `.where()` **keeps original shape** but fills non-matching elements with `NaN`. To drop them entirely, add the argument `drop=True` or use the method `.dropna()`. Note, however, that xarray **only drops a label if it’s all-`NaN` across the entire dataset**. 
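\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this behavior (on a small synthetic array, not the sample dataset):\n", "\n", "```python\n", "import numpy as np\n", "import xarray as xr\n", "\n", "# Hypothetical 2-profile x 3-layer stability array\n", "da = xr.DataArray([[0.5, 1.2, np.nan],\n", "                   [0.7, 0.8, np.nan]], dims=('profile', 'layer'))\n", "\n", "masked = da.where(da < 0.9)              # same shape, NaN where condition fails\n", "dropped = da.where(da < 0.9, drop=True)  # drops only all-NaN labels\n", "\n", "print(masked.shape, dropped.shape)  # (2, 3) (2, 2)\n", "```\n", "\n", "Layer 2 is dropped because the condition fails there for *every* profile; layer 1 survives with a `NaN` in profile 0, because profile 1 still matches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "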
See the orange warning box further below for a more comprehensive discussion of this topic.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.17, 0.12]], dtype=float32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weak_drop = weak.dropna(dim='layer', how='all')\n", "weak_drop['sk38'].values" ] }, { "cell_type": "markdown", "id": "40a31af5", "metadata": {}, "source": [ "## 4. More illustrative examples" ] }, { "cell_type": "markdown", "id": "6468504c", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**How many profiles with snow on the ground exist in `xs`?**\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "id": "30a0623c", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8861\n" ] } ], "source": [ "# No .where() or .sel() needed -- simply sum over the boolean array:\n", "n_profiles_with_snow = (xs['HS'] > 0).sum()\n", "\n", "print(n_profiles_with_snow.values)" ] }, { "cell_type": "markdown", "id": "1aeb0628", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Extract the smallest value of sk38 for each profile in `xs`!**\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c3a48e3", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "min_sk38 = xs['sk38'].min(dim='layer')  # no .where() or .sel() needed --\n", "                                        # we want it for all profiles!"
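 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of what such a reduction does (on a hypothetical two-profile array, not the sample dataset): `.min(dim='layer')` removes the `layer` dimension, while `.argmin(dim='layer')` returns the *position* of the weakest layer instead of its value:\n", "\n", "```python\n", "import xarray as xr\n", "\n", "# Hypothetical stand-in for xs['sk38'] with dims (location, layer)\n", "sk38 = xr.DataArray([[1.2, 0.8, 1.0],\n", "                     [0.6, 1.5, 0.9]], dims=('location', 'layer'))\n", "\n", "min_per_profile = sk38.min(dim='layer')   # layer dimension is reduced away\n", "weakest_idx = sk38.argmin(dim='layer')    # index of the weakest layer\n", "\n", "print(min_per_profile.values)  # [0.8 0.6]\n", "print(weakest_idx.values)      # [1 0]\n", "```\n", "\n", "The `argmin` variant becomes useful further below, when we extract the surface layer of each profile."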
] }, { "cell_type": "markdown", "id": "7a43bb8b", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Compute the time series of the median snow height of all profiles with at least one unstable layer!**\n", "\n", " * For simplicity, let's define unstable layers as sk38 < 0.9\n", " * Use the dataset `xs`\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "id": "f407aa22", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'time': 416})\n" ] } ], "source": [ "unstable_profiles_mask = (xs['sk38'] < 0.9).any(dim='layer')\n", "median_hs_unstable_timeseries = xs['HS'].where(unstable_profiles_mask)\\\n", " .median(dim = ['location', \n", " 'slope', \n", " 'realization'], skipna=True)\n", "print(median_hs_unstable_timeseries.sizes)" ] }, { "cell_type": "markdown", "id": "9edbc1d6", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**What's the first time, *all* profiles have snow on the ground?**\n", "\n", "Subset `xs` from that time onwards.\n", "```" ] }, { "cell_type": "code", "execution_count": 16, "id": "fe606eb0", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "mask = (xs.HS > 0).all(dim=['location', 'slope', 'realization'])\n", "first_all = xs[\"time\"].where(mask, drop=True).min()\n", "xs = xs.sel(time=slice(first_all, None))" ] }, { "cell_type": "markdown", "id": "4d72bcc3", "metadata": {}, "source": [ "```{admonition} Exercise\n", ":class: tip\n", "\n", "**Extract the snow surface layer of each profile!**\n", "\n", "Remember that the surface index can vary over time and space---the surface “moves” to different layer positions as the snowpack evolves.\n", "\n", " 1. Start with the dataset `ts = xsnow.single_profile_timeseries()`\n", " 2. 
Then do it for the larger dataset `xs = xsnow.sample_data.snp_gridded_ds()`\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "id": "aefa0b54", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 13 workers...\n", "[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})\n", "Frozen({'location': 1, 'time': 381, 'slope': 1, 'realization': 1})\n", "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 
0.\n", " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "ts = xsnow.single_profile_timeseries()\n", "\n", "surface_idx = abs(ts['z']).argmin(dim='layer') # absolute value of z -> minimum -> 0\n", "surface_layers = ts.isel(layer=surface_idx) # subset by layer index\n", "\n", "print(ts.sizes)\n", "print(surface_layers.sizes)\n", "print(surface_layers['z'].to_series().values)" ] }, { "cell_type": "code", "execution_count": 18, "id": "59b7c413", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1, 'layer': 33})\n", "Frozen({'location': 5, 'time': 353, 'slope': 5, 'realization': 1})\n", "The remaining surface_layers dataset contains only layers of z == 0|NaN: True\n" ] } ], "source": [ "import numpy as np\n", "surface_idx = abs(xs['z']).where(xs['z'].notnull(), np.inf).argmin(dim='layer') # NaN -> Inf\n", "surface_layers = xs.isel(layer=surface_idx)\n", "\n", "check = ((surface_layers['z'].notnull()) & (surface_layers['z'] != 0)).sum() == 0\n", "print(xs.sizes)\n", "print(surface_layers.sizes)\n", "print(f\"The remaining surface_layers dataset contains only layers of z == 0|NaN: {check.values}\")" ] }, { "cell_type": "markdown", "id": "1f575998", "metadata": {}, "source": [ "You'll find an explanation of why these two datasets require different approaches in the box below." 
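 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `np.inf` replacement used above can be sketched in isolation on synthetic data (a hypothetical array, not the sample dataset), where one profile is entirely `NaN`:\n", "\n", "```python\n", "import numpy as np\n", "import xarray as xr\n", "\n", "# Hypothetical layer heights z; profile 1 is snow-free (all NaN)\n", "z = xr.DataArray([[30.0, 10.0, 0.0],\n", "                  [np.nan, np.nan, np.nan]], dims=('profile', 'layer'))\n", "\n", "# Filling NaN with Inf lets argmin always return a valid index\n", "idx = abs(z).where(z.notnull(), np.inf).argmin(dim='layer')\n", "print(idx.values)  # [2 0]\n", "```\n", "\n", "Profile 0’s surface sits at layer 2 (`z == 0`); for the snow-free profile, the index defaults to layer 0, whose values are all `NaN` and therefore harmless downstream."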
] }, { "cell_type": "markdown", "id": "2e4cbcf7", "metadata": {}, "source": [ "```{admonition} Subsetting vs masking (and why NaNs remain)\n", ":class: warning\n", "\n", "In `xarray`, subsetting or dropping data does **not always remove every unwanted point**.\n", "Because `xarray` datasets represent **regular, gridded arrays**, each coordinate value must align consistently across all variables and dimensions.\n", "Even if a specific cell is missing (`NaN`) at one time or location, it may still exist elsewhere in the same grid position.\n", "Therefore, `xarray` keeps the full structure intact and marks invalid positions as `NaN` instead of removing them entirely.\n", "\n", "This means you should carefully decide whether your goal is to **reduce** the dataset (by dropping coordinates or dimensions) or to **mask** unwanted values (keeping the grid constant).\n", "Both approaches are useful:\n", "- **Reducing** the dataset is efficient when every coordinate slice of the dropped dimension is invalid.\n", "- **Masking** with `NaN` values preserves shape consistency, which is important for vectorized operations, merging, or time-series alignment.\n", "\n", "In the single-profile timeseries example above, extracting the surface layer was straightforward: every profile contained valid snow layers (i.e., `z` values). \n", "\n", "For the larger, multi-profile dataset, however, not every profile had snow on the ground. 
\n", "Attempting to reduce the dataset to only “valid” profiles would have been impossible, because at several timestamps profiles with and without snow on the ground co-exist.\n", "If `argmin()` get's all-`NaN` values from a snow-free profile, it cannot determine a valid layer index.\n", "To handle this, we replaced `NaN` values with `Inf` before applying `argmin()`.\n", "This ensures that the computed layer index points to a non-existent (and safely ignored) layer for those snow-free profiles, maintaining a consistent data shape across all locations.\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Classification extension\n", "\n", "`xsnow` includes a [classification extension](../api/_generated/xsnow.extensions.classification) that builds on simple masking to support criteria-based selection and multi-class classification.\n", "\n", "It lets you:\n", "\n", " * create boolean masks from readable expressions (e.g., `\"density > 300 & grain_size < 0.5\"`), \n", " or custom classification functions.\n", " * conveniently attach those masks as coordinates or data variables for tidy subsetting,\n", " * derive categorical classes (e.g., stability bins) from numeric variables, and\n", " * group spatially based on polygons.\n", "\n", "See the API reference for details and illustrative examples. Two similar examples are shown here to highlight the functionality:\n", "\n", "#### Classify layers into stability classes\n", "The following example classifies all layers into different stability classes. For simplicity, we simply use SK38 alone. 
For real applications, you could of course superimpose further conditions such as *RTA > 0.8*.\n", "The example\n", "* performs a clean classification (including best practices for metadata annotations)\n", "* prints a summary table of how many layers exist in each stability class\n", "* filters for very poor layers and masks others with NaN" ] }, { "cell_type": "code", "execution_count": 19, "id": "e7ae4bb7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SK38_class\n", "0 10817\n", "1 4613\n", "2 831\n", "3 274964\n", "Name: SK38_class, dtype: int64\n" ] } ], "source": [ "import numpy as np\n", "from xsnow.extensions.classification import classify_by_bins\n", "\n", "# Prepare for classification\n", "sk38_bins = np.array([0.5, 0.95, 1.05])\n", "sk38_mapping = {\n", " 0: \"very poor (<=0.5)\",\n", " 1: \"poor (<0.95)\",\n", " 2: \"fair (<1.05)\",\n", " 3: \"rather stable (>1.05)\",\n", "}\n", "classifier_attrs = {\n", " \"mapping\": sk38_mapping,\n", " \"long_name\": \"SK38 stability class\",\n", " \"classification_function\": \"classify_by_bins\",\n", " \"bin_edges\": sk38_bins.tolist(),\n", " \"right_closed\": True,\n", "}\n", "# Perform the classification\n", "xs2 = xs.classify(\n", " classify_by_bins,\n", " output_name=\"SK38_class\",\n", " output_kind=\"coord\",\n", " input_var=\"sk38\",\n", " attrs=classifier_attrs,\n", " bins=sk38_bins,\n", " right=True,\n", ")\n", "\n", "# After classification,\n", "# get a summary table of the number of layers per stability class\n", "print(\n", " xs2.SK38_class.to_series().dropna().\\\n", " groupby(xs2.SK38_class.to_series()).size()\n", ")\n", "\n", "# Or filter for those layers with very poor stability\n", "xs_sk38_very_poor = xs2.where(xs2.SK38_class == 0)" ] }, { "cell_type": "markdown", "id": "6cfe1768", "metadata": {}, "source": [ "#### Classify profiles by spatial polygons\n", "This example shows how to\n", "* subset to all profiles from a specific region/polygon\n", "* aggregate by 
region and/or other coordinates (existing or custom ones)" ] }, { "cell_type": "code", "execution_count": 20, "id": "058ff896", "metadata": {}, "outputs": [], "source": [ "# Create two regions and leave one location outside\n", "regions = {\n", " \"valley\": \"POLYGON ((11.15 47.12, 11.15 47.17, 11.33 47.17, 11.33 47.12, 11.15 47.12))\",\n", " \"ridge\": \"POLYGON ((11.25 47.42, 11.25 47.46, 11.43 47.46, 11.43 47.42, 11.25 47.42))\",\n", "}\n", "xs_regions = xs.classify_spatially(regions, name=\"region\", kind=\"coord\",\n", " polygons_crs=\"EPSG:4326\")\n", "\n", "# Subset to only one region\n", "xs_valley = xs_regions.where(xs_regions.region == \"valley\", drop=True)\n", "\n", "# Group by region and slope and aggregate\n", "xs_stats = xs_regions.groupby([\"region\", \"slope\"]).mean()\n", "\n", "# Group by region, slope, and elevation band:\n", "# there is no band coordinate yet, so we create one with another classification\n", "xs_regionbands = xs_regions.classify_criteria(\n", " \"altitude >= 2000\",\n", " name=\"band\",\n", " kind=\"coord\",\n", " attrs={\"mapping\": {False: \"<2000m\", True: \">=2000m\"}},\n", ")\n", "xs_regionband_average = xs_regionbands.groupby([\"region\", \"slope\", \"band\"]).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Summary\n", "\n", "- Use `.isel()` and `.sel()` for standard selection and conditional subsetting \n", "- Use `.where()` for conditional filtering and masking non-matching targets with `NaN`\n", "- Inspect `xs.sizes` and `xs.coords` to understand your selection context\n", "- Use the classification extension for solving more complex tasks conveniently\n" ] } ], "metadata": { "authors": [ { "name": "xsnow Documentation Team" } ], "kernelspec": { "display_name": "xsnow-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }