{ "cells": [ { "cell_type": "markdown", "id": "a31f5122", "metadata": {}, "source": [ "# Combining datasets\n", "\n", "Before we start, think of profiles as the core units of our datasets: \n", "One profile consists of all layers and all variables at one coordinate combination of the core dimensions (location, time, slope, realization).\n", "\n", "You can choose to perform the following operations for *\"combining\"* datasets, which will be demonstrated in detail during this tutorial.\n", "\n", " 1. `xsnow.concat`: Stack them!\n", " * **Concatenate** datasets **along a single dimension** that has **no overlapping coordinates**.\n", " - Separate locations or times\n", " - Different realizations\n", " * Profiles originate fully from one specific source.\n", " * Convenience method for special case:\n", " - `.append_latest()`: Append new timesteps from source files\n", " 2. `xsnow.combine`: Choose one profile or the other!\n", " * **Combine** datasets with **overlapping profiles** while **preferring** profiles from one dataset.\n", " - *Filling data gaps*\n", " - *Updating* old datasets with new ones\n", " - *Preferring* one source on overlaps\n", " - *Adding new coordinates* (for non-overlapping coordinates, it behaves like `xsnow.concat`)\n", " * Never mixes sources or stratigraphies, \n", " the full profile always comes from one dataset and remains intact.\n", " * Convenience methods for special cases:\n", " - `.stitch_gaps_with()`: Fill data gaps conveniently without adding new coordinates.\n", " - `.overwrite_with()`: Overwrite existing dataset with new values and coordinates.\n", " 3. `xarray.merge`: Add new variables!\n", " * **Merging** datasets with different data variables\n", " - Both datasets are equally valid, ideally variables are identical or mutually exclusive\n", " - Apply with conservative conflict resolution strategies to avoid silent unexpected behavior\n", " * Data in resulting profiles originates from different datasets!\n", " * Convenience method for special case:\n", " - `.merge_smet()`: Merge SMET files from source into an existing `xsnowDataset`.\n", " 4. Combining stratigraphies: Edge case for experts only!\n", " * Concatenate or combine different layers that originate from different datasets\n", " - Apply `xarray.merge` with caution and heavy testing.\n" ] }, { "cell_type": "markdown", "id": "f3d48ca4", "metadata": {}, "source": [ "```{admonition} xsnow versus xarray semantics\n", ":class: warning\n", "\n", "`xsnowDataset`s have one peculiarity compared to \"standard\" xarray truly *gridded* datasets---snow layers exist on an irregular vertical grid. Therefore, `xsnow` provides its own functionality to *safely* concatenate and combine datasets by ensuring that profiles remain intact---individual layers won't be added or removed, the core unit of a profile always stays intact.\n", "```" ] }, { "cell_type": "markdown", "id": "0b9efc3a", "metadata": {}, "source": [ "This tutorial focuses on the `xsnow` functionality for combining datasets *safely* and only provides minimal background and guidance on using `xarray`-native functionality." 
] }, { "cell_type": "markdown", "id": "9029b6f0", "metadata": {}, "source": [ "```{admonition} Broadcasting & alignment\n", ":class: note\n", "\n", "When working with datasets of different shapes, `xsnow` makes use of `xarray`'s dataset[**alignment**](https://docs.xarray.dev/en/stable/user-guide/terminology.html#term-Aligning) **by coordinate labels** and [**broadcasting**](https://docs.xarray.dev/en/stable/user-guide/terminology.html#term-Broadcasting) to compatible shapes. This is powerful and convenient---but you need to be aware of what aligns with what. So, note that operations align on coordinate labels and not on array order. If labels do not match (e.g., differing time stamps), values are still paired by label and **missing pairs become NaN**.\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "70fed490", "metadata": {}, "outputs": [], "source": [ "import xsnow" ] }, { "cell_type": "markdown", "id": "fe1ef2b1", "metadata": {}, "source": [ "## 1. Concatenating datasets along a single dimension\n", "If you want to combine multiple `xsnowDatasets` along a single dimension, use \n", "`xsnow.concat([...], dim=..., join=...)` to **stack them**. \n", "Common use cases:\n", "\n", "- **Locations**: combine independent sites into a larger domain\n", "- **Time**: stack non-overlapping time stamps or different seasons\n", "- **Realizations**: stack several simulation variants\n", "\n", "Note that there is a specific section on the special case *\"Updating datasets with recent data\"*. See table of contents on the right.\n", "\n", "### Concatenate by `location`\n", "To combine independent sites, you can concatenate along the `location` dimension. `xsnow` will then align the other dimensions (e.g., `time`, etc.). If a specific coordinate from another dimension, such as a timestamp, only exists in one of the two datasets, it will generate a `NaN` entry for the location with the missing timestamp. " ] }, { "cell_type": "code", "execution_count": 2, "id": "78641414", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n", "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 1, 'time': 416, 'slope': 1, 'realization': 1, 'layer': 12})\n", "Frozen({'location': 1, 'time': 416, 'slope': 1, 'realization': 1, 'layer': 10})\n" ] } ], "source": [ "# Read two datasets from independent sites\n", "datapath = xsnow.sample_data.snp_gridded_dir()\n", "xs1 = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.pro\")\n", "xs2 = xsnow.read(f\"{datapath}/pros/gridded/VIR2A.pro\")\n", "print(xs1.sizes)\n", "print(xs2.sizes)" ] }, { "cell_type": "code", "execution_count": 3, "id": "61077b3b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 2, 'time': 416, 'slope': 1, 'realization': 1, 'layer': 12})\n" ] } ], "source": [ "# Concatenate VIR1A and VIR2A along the location dimension\n", "ds_cat = xsnow.concat([xs1, xs2], dim=\"location\")\n", "xs_cat = xsnow.xsnowDataset(ds_cat)\n", "print(xs_cat.sizes)" ] }, { "cell_type": "markdown", "id": "92e56b9d", "metadata": {}, "source": [ "As you see, the two different datasets `xs1` and `xs2` are not only from different locations, but because of that, they have slightly different numbers of timestamps and layers. 
`xsnow.concat` allows you to take deeper control over which variables to concatenate and how to handle potentially conflicting variables between datasets (e.g., duplicates). The example above uses an *\"outer join\"*, the union of all dataset coordinates. Check out the function documentation for more details or continue reading for more examples." ] }, { "cell_type": "markdown", "id": "0fe60942", "metadata": {}, "source": [ "```{admonition} Lazy combinations\n", ":class: warning\n", "\n", "Note that with the default parameters, `xarray` will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily!\n", "```" ] }, { "cell_type": "markdown", "id": "1ddd2811", "metadata": {}, "source": [ "### Concatenate by `realization`\n", "At this stage, xsnow's `read` function does not know how to assign data points to different realizations. It will therefore place all data in one single realization, and it is up to the user to read multiple realizations separately and then combine them into a single `xsnowDataset`." ] }, { "cell_type": "code", "execution_count": 4, "id": "91aacb19", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 13 workers...\n", "[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 13 workers...\n", "[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 2, 'time': 381, 'slope': 1, 'realization': 1, 'layer': 59})\n", "Frozen({'location': 2, 'slope': 1, 'time': 2, 'realization': 1, 'layer': 18})\n" ] } ], "source": [ "# Read two datasets: manual and simulated profiles\n", "datapath = xsnow.sample_data.snp_snowobs_dir()\n", "xs_sim = xsnow.read(f\"{datapath}/pros/\", recursive=True)\n", "xs_obs = xsnow.read(f\"{datapath}/pits/\")\n", "print(xs_sim.sizes)\n", "print(xs_obs.sizes)" ] }, { "cell_type": "markdown", "id": "3ad6e1da", "metadata": {}, "source": [ "The two datasets differ in the time and layer dimensions.\n", "\n", "The following concatenation uses an outer join and creates specific labels for the realization coordinate values by providing a pandas `Index` as the `dim` argument. This helps keep the dimension readable and facilitates label-based selections later." ] }, { "cell_type": "code", "execution_count": 5, "id": "5a00ba90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 2, 'time': 382, 'slope': 1, 'realization': 2, 'layer': 59})\n" ] } ], "source": [ "import pandas as pd\n", "# Concat into two different (renamed) realizations\n", "xs_cat = xsnow.concat(\n", " [xs_sim, xs_obs], \n", " dim=pd.Index([\"simulated\", \"observed\"], name=\"realization\"),\n", " join=\"outer\",\n", ")\n", "print(xs_cat.sizes)" ] }, { "cell_type": "markdown", "id": "ec8e0ffa", "metadata": {}, "source": [ "Let's also compute an inner join. 
This is where xsnow's safety mechanism kicks in. For an inner join, xarray's `concat` function would only return the 18 \"common\" layers, while we actually need all 59 layers in the concatenated dataset." ] }, { "cell_type": "code", "execution_count": 6, "id": "2ba0dff0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frozen({'location': 2, 'time': 1, 'slope': 1, 'realization': 2, 'layer': 59})\n" ] } ], "source": [ "# Perform a different join mode:\n", "xs_cat = xsnow.concat(\n", " [xs_sim, xs_obs],\n", " dim=pd.Index([\"simulated\", \"observed\"], name=\"realization\"),\n", " join=\"inner\",\n", ")\n", "print(xs_cat.sizes)" ] }, { "cell_type": "markdown", "id": "5288aaad", "metadata": {}, "source": [ "The inner join results in one overlapping time stamp, while all 59 layers were preserved." ] }, { "cell_type": "markdown", "id": "212ce26b", "metadata": {}, "source": [ "## 2. Combining datasets with overlapping coordinates\n", "Use `xsnow.combine([...], join=...)` to combine overlapping datasets. They will first be aligned (and broadcast) according to the chosen join (e.g., 'outer', 'left', 'inner', etc.). At each overlapping coordinate, the function determines which dataset to choose and then takes all data points of all variables entirely from the left or the right dataset, so that physical consistency at the profile level is ensured. For that task, `xsnow.combine` chooses the leftmost dataset with a `profile_status > 0`, which represents valid data (possibly without snow) as opposed to unavailable data (`profile_status == 0`) or erroneous data (`profile_status < 0`). This mechanism ensures that for a given coordinate **all variables and all layers always originate either fully from the left or fully from the right dataset**.\n", "\n", "A typical context requiring such an operation is an outdated dataset that needs to be updated with a fresh simulation. To create illustrative examples, let's first build two small demo datasets:\n", "\n", "- `old`: ends earlier and contains one missing profile at 17:00.\n", "- `new`: overlaps with `old`, extends further in time, and modifies density values to mimic an updated simulation." ] }, { "cell_type": "code", "execution_count": 7, "id": "d69f6517", "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "def _format_time_summary(xs):\n", " fromT = pd.to_datetime(xs.time.min().values).strftime('%H:%M')\n", " toT = pd.to_datetime(xs.time.max().values).strftime('%H:%M')\n", " return f\"{xs.sizes['time']} timestamps {fromT}--{toT}\"" ] }, { "cell_type": "code", "execution_count": 8, "id": "918fdd6c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 2 datasets eagerly with 13 workers...\n", "[i] xsnow.utils: Slope coordinate 'inclination' varies by location. Preserving (location, slope) dimensions as allow_per_location=True.\n", "[i] xsnow.utils: Slope coordinate 'azimuth' varies by location. 
Preserving (location, slope) dimensions as allow_per_location=True.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "old: 3 timestamps 16:00--18:00 (gap at 17:00)\n", "new: 3 timestamps 17:00--19:00\n" ] } ], "source": [ "import numpy as np\n", "\n", "xs = xsnow.single_profile_timeseries()\n", "\n", "# 'old': the first three timestamps, with the middle one masked out to create a data gap\n", "old = xs.isel(time=slice(0, 3)).copy(deep=True)\n", "gap_time = old['time'].values[1]\n", "old = old.where(old['time'] != gap_time)\n", "old = old.assign_coords(z=old['z'].where(old['time'] != gap_time))\n", "\n", "# 'new': overlaps with 'old', extends one timestamp further, and has altered densities\n", "new = xs.isel(time=slice(1, 4)).copy(deep=True)\n", "new[\"density\"] = new[\"density\"] + 100\n", "\n", "\n", "print(f\"old: {_format_time_summary(old)} (gap at {pd.to_datetime(gap_time).strftime('%H:%M')})\")\n", "print(f\"new: {_format_time_summary(new)}\")" ] }, { "cell_type": "code", "execution_count": 9, "id": "bf67afcc", "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "import xarray as xr\n", "# Helper (hidden in the rendered tutorial): for each timestamp, report whether\n", "# the combined density profile matches 'old', 'new', both, or neither ('nan')\n", "def _print_profile_source(old, new, comb):\n", " sources = []\n", " old_times = set(old.time.to_index()) if 'time' in old.coords else set()\n", " new_times = set(new.time.to_index()) if 'time' in new.coords else set()\n", "\n", " for t in comb.time.to_index():\n", " comb_prof = comb.density.sel(time=t)\n", "\n", " take_old = False\n", " if t in old_times:\n", " old_prof = old.density.sel(time=t)\n", " comb_old, old_aligned = xr.align(comb_prof, old_prof, join=\"inner\")\n", " if comb_old.size and old_aligned.size:\n", " take_old = np.allclose(comb_old, old_aligned, equal_nan=True)\n", "\n", " take_new = False\n", " if t in new_times:\n", " new_prof = new.density.sel(time=t)\n", " comb_new, new_aligned = xr.align(comb_prof, new_prof, join=\"inner\")\n", " if comb_new.size and new_aligned.size:\n", " take_new = np.allclose(comb_new, new_aligned, equal_nan=True)\n", "\n", " if take_old and take_new:\n", " sources.append(\"both\")\n", " elif take_old:\n", " sources.append(\"old\")\n", " elif take_new:\n", " sources.append(\"new\")\n", " else:\n", " sources.append(\"nan\")\n", "\n", " print(\"Profile source after fill:\", sources)\n" ] }, { "cell_type": "markdown", "id": "5b24b8d8", "metadata": {}, "source": [ "Let's first combine `old` and `new` on the `old` domain (left join), preferring `old` on overlaps:" ] }, { "cell_type": "code", "execution_count": 10, "id": "2bfea787", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "old_updated: 3 timestamps 16:00--18:00\n", "Profile source after fill: ['old', 'new', 'old']\n" ] } ], "source": [ "old_updated = xsnow.combine([old, new], join='left')\n", "\n", "print(f\"old_updated: {_format_time_summary(old_updated)}\")\n", "_print_profile_source(old, new, old_updated)" ] }, { "cell_type": "markdown", "id": "8ff9cd94", "metadata": {}, "source": [ "Let's continue to keep the `old` domain but now prefer `new` on overlaps:" ] }, { "cell_type": "code", "execution_count": 11, "id": "ec9fb478", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "old_overwritten: 3 timestamps 16:00--18:00\n", "Profile source after fill: ['old', 'new', 'new']\n" ] } ], "source": [ "old_overwritten = xsnow.combine([new, old], join='right')\n", "\n", "print(f\"old_overwritten: {_format_time_summary(old_overwritten)}\")\n", "_print_profile_source(old, new, old_overwritten)" ] }, { "cell_type": "markdown", "id": "837b682f", "metadata": {}, "source": [ "Other join modes can be set as well---even on a per-core-dimension basis, via arguments such as `join_time`. 
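For instance, a hedged sketch (the argument values here are purely illustrative; see the function documentation for the exact signature):\n\n```python\n# Outer join on the core dimensions, but keep the time domain of the left dataset\ncombined = xsnow.combine([old, new], join=\"outer\", join_time=\"left\")\n```\n\n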
Consult the documentation for information about other arguments, such as `compat` or `combine_attrs`, which configure the behavior for conflicting coordinates or attributes.\n", "\n", "To make users' lives even more convenient, xsnow implements two convenience methods that wrap `xsnow.combine` for common tasks: `xsnowDataset.fill_gaps_with(...)` (left join, prefer the existing dataset) and `xsnowDataset.overwrite_with(...)` (outer join, prefer the other dataset):" ] }, { "cell_type": "code", "execution_count": 12, "id": "08824c41", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "old_updated2: 3 timestamps 16:00--18:00\n", "Profile source after fill: ['old', 'new', 'old']\n", "old_updated2 is identical to old_updated: True\n" ] } ], "source": [ "old_updated2 = old.fill_gaps_with(new)\n", "\n", "print(f\"old_updated2: {_format_time_summary(old_updated2)}\")\n", "_print_profile_source(old, new, old_updated2)\n", "print(f\"old_updated2 is identical to old_updated: {old_updated.identical(old_updated2)}\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "880e600f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "old_overwritten2: 4 timestamps 16:00--19:00\n", "Profile source after fill: ['old', 'new', 'new', 'new']\n" ] } ], "source": [ "old_overwritten2 = old.overwrite_with(new)\n", "\n", "print(f\"old_overwritten2: {_format_time_summary(old_overwritten2)}\")\n", "_print_profile_source(old, new, old_overwritten2)" ] }, { "cell_type": "markdown", "id": "33bc11d1", "metadata": {}, "source": [ "As we see, `old_overwritten2` differs from `old_overwritten` in that it represents an outer join---in this case resulting in the new, fourth timestamp being added." ] }, { "cell_type": "markdown", "id": "0cde8cff", "metadata": {}, "source": [ "## 3. Merging new variables or coordinates into an xsnowDataset\n", "Now imagine you have two datasets---this time not with the notion of `old` and `new`, but two equally valid datasets with different meteo and snow layer variables. It could well be that both datasets have valid profiles for a given coordinate and that you want to mix the variables from the two datasets. `xsnow.combine` is great at merging new coordinates, as we saw already, but due to its semantics of keeping profiles intact and not mixing variables from different datasets, it is not an ideal tool for merging new data variables into the combined dataset: you would end up with `NaN` values in the new variable from the right dataset wherever the left dataset has a valid profile. Luckily, this is what `xarray.merge` is designed for.\n", "\n", "In contrast to `xsnow.combine`, `xarray.merge` will take all data variables from the left dataset and include any new variables that the right dataset has. To ensure that the layer coordinates are really consistent between the two datasets (e.g., layer x is at the correct height/depth z), we recommend specifically setting safe conflict rules that make the operation raise exceptions early, so you won't suffer silent surprises (see the example below). " ] }, { "cell_type": "markdown", "id": "663ec943", "metadata": {}, "source": [ "```{admonition} xarray.merge does not fill data gaps\n", ":class: note\n", "\n", "Note that `xarray.merge` is meant to add new variables (or coordinates). 
*It does not fill data gaps!* While xarray does offer gap filling functionality, we strongly advise against using xarray's tools for that and instead offer `xsnow.combine` with its convenience wrappers as demonstrated in the previous section.\n", "```" ] }, { "cell_type": "markdown", "id": "45620ac1", "metadata": {}, "source": [ "In the following example, note that I extract the underlying `xarray.Dataset` from the `xsnowDataset` when calling `xarray.merge` (it doesn't know our data class!)." ] }, { "cell_type": "code", "execution_count": 14, "id": "9f985b4e", "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "\n", "# Two overlapping products with different variables\n", "# Product A: keep density for the earliest timestamps\n", "xs_A = xs.isel(time=slice(0, 3)).copy(deep=True)\n", "\n", "# Product B: overlapping time range, but a different density model\n", "xs_B = xs.isel(time=slice(1, 4)).copy(deep=True)\n", "xs_B[\"density_B\"] = xs_B[\"density\"] + 100\n", "\n", "xs_B_conflict = xs_B.copy(deep=True)\n", "xs_B_conflict[\"z\"] = xs_B_conflict[\"z\"] - 2 # <-- 2 cm layer offset" ] }, { "cell_type": "code", "execution_count": 15, "id": "f0f42b67", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conflict caught (as intended): MergeError\n" ] } ], "source": [ "# Strict merge will raise because the two datasets disagree on the\n", "# layer coordinate 'z' (2 cm offset) where their time stamps overlap\n", "try:\n", " xr.merge([xs_A.to_xarray(), xs_B_conflict.to_xarray()],\n", " join='inner',\n", " compat=\"no_conflicts\", combine_attrs=\"no_conflicts\")\n", "except Exception as exc:\n", " print(\"Conflict caught (as intended):\", type(exc).__name__)" ] }, { "cell_type": "code", "execution_count": 16, "id": "79d27991", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "merged: 2 timestamps 17:00--18:00\n", " contains ['density', 'density_B']\n", "Overlap time now carries both sources:\n", " non-null values in density: True\n", " non-null values in density_B: True\n" ] } ], "source": [ "# Since xs_B carries its density model under the new name 'density_B',\n", "# both variables can coexist and the datasets merge safely\n", "merged = xr.merge([xs_A.to_xarray(), xs_B.to_xarray()],\n", " join='inner',\n", " compat=\"no_conflicts\", combine_attrs=\"no_conflicts\")\n", "\n", "merged = merged.drop_vars([var for var in merged.data_vars if \\\n", " var not in [\"density\", \"density_B\"]])\n", "\n", "print(f\"merged: {_format_time_summary(merged)}\")\n", "print(\" contains \", list(merged.data_vars))\n", "overlap_time = merged.time.values[1]\n", "print(\"Overlap time now carries both sources:\")\n", "print(f\" non-null values in density: {\n", " merged['density'].sel(time=overlap_time).notnull().any().values}\")\n", "print(f\" non-null values in density_B: {\n", " merged['density_B'].sel(time=overlap_time).notnull().any().values}\")\n" ] }, { "cell_type": "markdown", "id": "bde5d50c", "metadata": {}, "source": [ "Note that the final object `merged` is an `xarray.Dataset`. You can easily convert it back (without cost) via `xs_merged = xsnow.xsnowDataset(merged)`." ] }, { "cell_type": "markdown", "id": "abc728f4", "metadata": {}, "source": [ "### Merging of new non-layer variables\n", "Merging new meteorological or other scalar variables into an existing `xsnowDataset` is primarily done with `xarray.merge`, analogous to the previous example. 
Since this is such a common case, however, we implemented the `merge_smet()` method that you got to know in [Reading and writing, Adding SMET files...](./reading_writing.ipynb). `merge_smet()` is convenient because you can merge directly upon reading from source files, and you have control over which realization to merge into.\n", "\n", "For completeness, we also show an example of this task without relying on `merge_smet()`. Let's read stratigraphy and meteorological data into two separate datasets, then compute a moving average of air temperature and merge the meteorological dataset back into the stratigraphy dataset:" ] }, { "cell_type": "code", "execution_count": 17, "id": "e0472172", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "'TA_ma' in xs_merged: True\n", "Values as expected: True\n" ] } ], "source": [ "# Read smet and pro into separate datasets\n", "datapath = xsnow.sample_data.snp_gridded_dir()\n", "xs = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.pro\")\n", "xs_smet = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.smet\") # for demo we don't use merge_smet here\n", "\n", "# Create a moving average for a smet variable\n", "xs_smet['TA_ma'] = xs_smet['TA'].rolling(time=6, min_periods=1).mean()\n", "\n", "# Combine both datasets\n", "ds_merged = xr.merge([xs.to_xarray(),\n", " xs_smet.data.drop_vars(\"profile_status\")],\n", " join='left', compat=\"no_conflicts\", combine_attrs=\"no_conflicts\")\n", "xs_merged = xsnow.xsnowDataset(ds_merged)\n", "\n", "# Brief sanity check\n", "print(f\"'TA_ma' in xs_merged: {'TA_ma' in xs_merged}\")\n", "ta_smet, ta_merged = xr.align(xs_smet['TA_ma'], xs_merged['TA_ma'], join=\"inner\")\n", "try:\n", " xr.testing.assert_allclose(ta_smet, ta_merged)\n", " values_as_expected = True\n", "except AssertionError:\n", " values_as_expected = False\n", "print(f\"Values as expected: {values_as_expected}\")" ] }, { "cell_type": "markdown", "id": "ed9778bc", "metadata": {}, "source": [ "As in the previous example, `xarray.merge` accepts `xarray.Dataset`s, which we can easily convert back after the merge. Also note that I used two different ways to access the underlying `xarray.Dataset` (`.to_xarray()` as a method, or `.data` as an attribute). I also dropped the `profile_status` variable from the meteo dataset, since it would raise a conflict exception and I want to keep the status from the stratigraphy dataset anyway." ] }, { "cell_type": "markdown", "id": "c610c1bb", "metadata": {}, "source": [ "## 4. Updating xsnowDatasets with recent data\n", "\n", "Most functionality related to updating `xsnowDatasets` with recent data has already been explained. Since it represents a common operation, we use this section to summarize the different options and introduce one new method. \n", "\n", "1. **Convenient, most powerful, but compute-intensive**: `xsnow.combine` or the convenience methods `.fill_gaps_with()` or `.overwrite_with()`.\n", "2. **Convenient, still versatile, computationally cheaper**: the convenience method `.append_latest()`, which wraps `xsnow.concat`.\n", "\n", "When you want to update an existing dataset with recent data, you should ask yourself whether you can pick *one global timestamp* from which onward you discard the old data in favor of the new data. 
If so, concatenating the old dataset (prior to the cutoff time) and the new dataset (starting with the cutoff time) will be your cheapest approach to an updated dataset. If, however, you need to update your dataset more subtly---for example, when different locations have different timestamps that you want to keep versus update---then you should `xsnow.combine` your datasets. \n", "\n", "`xsnow.combine` and its convenience methods can be applied as demonstrated earlier. `.append_latest()` is a wrapper for `read`ing new data from source files and then `concat`enating it with the existing old dataset. It could be applied like `new = old.append_latest('path/to/new/files')`, in which case it would read only those timestamps newer than the ones in `old`. If you want to update from an earlier timestamp, you can provide that earlier timestamp as an argument. Check out the function documentation for specifics. Instead of using the convenience method, you can also manually assemble an equivalent procedure, for example if both datasets have been read already:" ] }, { "cell_type": "code", "execution_count": 18, "id": "37599193", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "old_updated3: 4 timestamps 16:00--19:00\n", "Profile source after fill: ['old', 'old', 'new', 'new']\n" ] } ], "source": [ "# Keep the first two timestamps of the old dataset...\n", "old_part1 = old.isel(time=slice(0, 2))\n", "# ...and all strictly newer timestamps of the new dataset (boolean indexing)\n", "new_part2 = new.isel(time=new.time > old_part1.time.max())\n", "\n", "old_updated3 = xsnow.concat([old_part1, new_part2], dim='time')\n", "\n", "print(f\"old_updated3: {_format_time_summary(old_updated3)}\")\n", "_print_profile_source(old, new, old_updated3)" ] } ], "metadata": { "authors": [ { "name": "xsnow Documentation Team" } ], "kernelspec": { "display_name": "xsnow-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }