{ "cells": [ { "cell_type": "markdown", "id": "969341a3", "metadata": {}, "source": [ "# Reading and writing\n", "or more explicitly:\n", "# Reading and writing datasets and metadata\n", "\n", "This tutorial shows how `xsnow` handles data I/O and metadata:\n", "\n", "1) Read datasets\n", "2) Inspect dimensions, coordinates, and variables\n", "3) View and edit metadata\n", "4) Write datasets to file\n", "\n", "`xsnow` focuses on parsing SNOWPACK (PRO/SMET) and CROCUS (NetCDF) and will provide basic readers for CAAML/JSON. It also supports an efficient xsnow NetCDF format for caching large datasets.\n", "\n", "- **SNOWPACK**: native PRO and SMET\n", "- **CROCUS**: NetCDF *(planned)*\n", "- **CAAML**: XML/JSON *(planned)*\n", "- **xsnow NetCDF**: optimized for `xsnow`’s data model `xsnowDataset`" ] }, { "cell_type": "markdown", "id": "87cb5286", "metadata": {}, "source": [ "```{admonition} Sample datasets\n", ":class: note\n", "\n", "`xsnow` ships several sample datasets. Two lightweight ones are available via `xsnow.single_profile()` and `xsnow.single_profile_timeseries()`. More datasets live in [`xsnow.sample_data`](../api/_generated/xsnow.sample_data) (see API docs).\n", "```" ] }, { "cell_type": "markdown", "id": "f71271e1", "metadata": {}, "source": [ "```{admonition} Quick overview over relevant functions\n", ":class: note\n", "\n", "| Task | Function/Method |\n", "|---------|---------|\n", "| Read any file into `xsnowDataset` | `read()` |\n", "| Write `xsnowDataset` to file | `.to_netcdf()`, `.to_pro()`, `to_smet()` |\n", "| Append new timesteps from files | `.append_latest()` |\n", "| Merge SMET files | `.merge_smet()` |\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "793e734f", "metadata": {}, "source": [ "## 1. Reading example SNOWPACK datasets\n", "For this SNOWPACK section, we use a sample dataset of gridded simulations shipped in native SNOWPACK formats. Get the sample data path:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b5dc1067", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "import xsnow\n", "datapath = xsnow.sample_data.snp_gridded_dir()" ] }, { "cell_type": "markdown", "id": "70f92471", "metadata": {}, "source": [ "The function returns a directory path. 
{ "cell_type": "markdown", "id": "793e734f", "metadata": {}, "source": [ "## 1. Reading example SNOWPACK datasets\n", "For this SNOWPACK section, we use a sample dataset of gridded simulations shipped in native SNOWPACK formats. Get the sample data path:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b5dc1067", "metadata": { "tags": [ "remove-output" ] }, "outputs": [], "source": [ "import xsnow\n", "datapath = xsnow.sample_data.snp_gridded_dir()" ] }, { "cell_type": "markdown", "id": "70f92471", "metadata": {}, "source": [ "The function returns a directory path. Its structure looks like:" ] }, { "cell_type": "code", "execution_count": 2, "id": "030c182b", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data location: /home/flo/.cache/xsnow-snp-gridded\n", "xsnow-snp-gridded/\n", " pros/\n", " gridded/\n", " VIR1A.pro\n", " VIR1A.smet\n", " VIR1A1.pro\n", " ...\n", " VIR5A4.pro\n", " VIR5A4.smet\n", " smets/\n", " gridded/\n", " forecast/\n", " VIR1A.smet\n", " VIR2A.smet\n", " VIR3A.smet\n", " VIR4A.smet\n", " VIR5A.smet\n", " nowcast/\n", " VIR1A.smet\n", " VIR2A.smet\n", " VIR3A.smet\n", " VIR4A.smet\n", " VIR5A.smet\n" ] } ], "source": [ "# cell hidden through metadata\n", "import os\n", "print(f\"Data location: {datapath}\")\n", "for root, dirs, files in os.walk(datapath):\n", "    level = root.replace(datapath, \"\").count(os.sep)\n", "    indent = \" \" * 4 * level\n", "    print(f\"{indent}{os.path.basename(root)}/\")\n", "    subindent = \" \" * 4 * (level + 1)\n", "    fcounter = 0\n", "    for f in sorted(files):\n", "        if fcounter < 3 or fcounter > len(files)-3:\n", "            print(f\"{subindent}{f}\")\n", "        elif fcounter == 3:\n", "            print(f\"{subindent}...\")\n", "        fcounter += 1\n" ] }, { "cell_type": "markdown", "id": "2e967e8a", "metadata": {}, "source": [ "### Reading a single file\n", "Read a single **PRO** file:" ] }, { "cell_type": "code", "execution_count": 3, "id": "5f433321", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Locations: 1\n", " Timestamps: 416 (2024-01-16--2024-02-02)\n", " Profiles: 416 total | 416 valid | 373 with HS>0\n", "\n", " employing the Size: 545kB\n", " Dimensions: (location: 1, time: 416, slope: 1,\n", " realization: 1, layer: 12)\n", " Coordinates:\n", " altitude (location) float64 8B 2.372e+03\n", " latitude (location) float64 8B 47.15\n", " * location (location) \n" ] } ], "source": [ "# Read a single pro file\n", "xs = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.pro\")\n", "print(xs)" ] }, { "cell_type": "markdown", "id": "c3d5a9f2", "metadata": {}, "source": [ "Read the corresponding **SMET** file (scalar output) in the same way:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b7e4c2a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Locations: 1\n", " Timestamps: 416 (2024-01-16--2024-02-02)\n", " Profiles: 416 total | 0 valid | unavailable with HS>0\n", "\n", " employing the Size: 213kB\n", " Dimensions: (location: 1, time: 416, slope: 1, realization: 1)\n", " Coordinates:\n", " altitude (location) float64 8B 2.372e+03\n", " latitude (location) float64 8B 47.15\n", " * location (location) object 8B 'VIR1A'\n", " longitude (location) float64 8B 11.19\n", " * time (time) datetime64[ns] 3kB 2024-01-16T05:00:00 ... 2...\n", " azimuth (slope) float64 8B 0.0\n", " inclination (slope) float64 8B 0.0\n", " * slope (slope) int64 8B 0\n", " * realization (realization) int64 8B 0\n", " Data variables: (12/64)\n", " ColdContentSnow (location, time, slope, realization) float64 3kB 0....\n", " DW (location, time, slope, realization) float64 3kB 29...\n", " HN12 (location, time, slope, realization) float64 3kB 0....\n", " HN24 (location, time, slope, realization) float64 3kB 0....\n", " HN3 (location, time, slope, realization) float64 3kB 0....\n", " HN6 (location, time, slope, realization) float64 3kB 0....\n", " ... ...\n", " zS4 (location, time, slope, realization) float64 3kB 0....\n", " zS5 (location, time, slope, realization) float64 3kB 0....\n", " zSd (location, time, slope, realization) float64 3kB 0....\n", " zSn (location, time, slope, realization) float64 3kB 0....\n", " zSs (location, time, slope, realization) float64 3kB 0....\n", " profile_status (location, time, slope, realization) int8 416B 0 ... 
0\n", " Attributes:\n", " Conventions: CF-1.8\n", " crs: EPSG:4326\n" ] } ], "source": [ "# Read a single smet file\n", "xs = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.smet\")\n", "print(xs)" ] }, { "cell_type": "markdown", "id": "8e505ab5", "metadata": {}, "source": [ "### Reading multiple files\n", "Read multiple files by (1) passing a list of paths or (2) pointing to a directory. Files are merged on shared coordinates (e.g., `location`, `time`). Note that for parsing SNOWPACK formats, the merging is based on identical `StationName`s and `station_name`s (properties of PRO and SMET files). Therefore, please don't use the same station names for different locations." ] }, { "cell_type": "code", "execution_count": 5, "id": "e0076599", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n" ] } ], "source": [ "# Reading a list of filepaths\n", "xs = xsnow.read([\n", " f\"{datapath}/pros/gridded/VIR1A.pro\", # profile data\n", " f\"{datapath}/pros/gridded/VIR1A.smet\", # scalar output data\n", " f\"{datapath}/smets/gridded/nowcast/VIR1A.smet\", # weather input data\n", "])" ] }, { "cell_type": "code", "execution_count": 6, "id": "2724b140", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Using lazy loading with Dask\n", "[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...\n", "[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user\n", "[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)\n" ] } ], "source": [ "# Reading all files within one directory\n", "xs = xsnow.read(f\"{datapath}/pros/gridded/\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "90169b64", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Locations: 5\n", " Timestamps: 416 (2024-01-16--2024-02-02)\n", " Profiles: 10400 total | 10400 valid | 8861 with HS>0\n", "\n", " employing the Size: 42MB\n", " Dimensions: (location: 5, time: 416, slope: 5,\n", " realization: 1, layer: 33)\n", " Coordinates:\n", " altitude (location) float64 40B 2.372e+03 ... 2.066e+03\n", " latitude (location) float64 40B 47.15 47.44 ... 47.44 47.37\n", " * location (location) \n", " Data variables: (12/91)\n", " ColdContentSnow (location, time, slope, realization) float64 83kB ...\n", " DW (location, time, slope, realization) float64 83kB ...\n", " HN12 (location, time, slope, realization) float64 83kB ...\n", " HN24 (location, time, slope, realization) float64 83kB ...\n", " HN3 (location, time, slope, realization) float64 83kB ...\n", " HN6 (location, time, slope, realization) float64 83kB ...\n", " ... ...\n", " zS4 (location, time, slope, realization) float64 83kB ...\n", " zS5 (location, time, slope, realization) float64 83kB ...\n", " zSd (location, time, slope, realization) float64 83kB ...\n", " zSn (location, time, slope, realization) float64 83kB ...\n", " zSs (location, time, slope, realization) float64 83kB ...\n", " HS (location, time, slope, realization) float32 42kB dask.array\n", " Attributes:\n", " Conventions: CF-1.8\n", " crs: EPSG:4326\n" ] } ], "source": [ "print(xs)" ] }, { "cell_type": "markdown", "id": "414f9367", "metadata": {}, "source": [ "### Adding SMET files to an existing dataset\n", "Add `.smet` files to your `xsnowDataset` after reading it with `xsnowDataset.merge_smet()`. You can pass a single file, a list of files, or a directory." 
] }, { "cell_type": "code", "execution_count": 8, "id": "77258c07", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Loading 1 datasets eagerly with 13 workers...\n" ] } ], "source": [ "xs = xsnow.read(f\"{datapath}/pros/gridded/VIR1A.pro\")\n", "xs = xs.merge_smet(f\"{datapath}/pros/gridded/VIR1A.smet\")" ] }, { "cell_type": "markdown", "id": "dbca8a8e", "metadata": {}, "source": [ "```{tip}\n", "Use `var_prefix` to avoid name collisions (e.g., input SMET variables as `input_TA`, `input_PSUM`, etc.).\n", "This keeps data organized and prevents overwriting variables with identical names across input/output SMETs.\n", "```" ] }, { "cell_type": "markdown", "id": "eb07a6f8", "metadata": {}, "source": [ "### Appending new timesteps from updated source files\n", "If your simulations run operationally during a season on a day-by-day basis, you may want to append new times to the existing dataset. It is much more efficient to cache the dataset in a binary format (e.g., NetCDF) and only read new timesteps from the PRO and SMET files. The method `xsnowDataset.append_latest()` allows you to do that conveniently. Make sure you also read the dedicated section on *Updating xsnowDatasets with recent data* in the [Combining datasets](./combining_data.ipynb) tutorial." ] }, { "cell_type": "code", "execution_count": 9, "id": "a481720f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Appended 58 new timestamps: 2024-01-31T03:00:00--2024-02-02T12:00:00\n" ] } ], "source": [ "nowcast = xsnow.read(f\"{datapath}/smets/gridded/nowcast\")\n", "nowcast_forecast = nowcast.append_latest(f\"{datapath}/smets/gridded/forecast\")" ] }, { "cell_type": "markdown", "id": "65316b2c", "metadata": {}, "source": [ "## 2. Lazy versus eager loading\n", "When reading large datasets into the `xsnow` gridded data structure, it’s important to understand the difference between lazy and eager loading — and how this affects memory usage.\n", "\n", " * Eager loading means: as soon as you call the read function, the **entire dataset is loaded into memory** (i.e., into a concrete `xr.Dataset`). Further operations are applied on that in-memory object.\n", "\n", " * Lazy loading means: the read function returns a deferred object (or a data structure with deferred chunks) that **doesn’t immediately load all data**. You can continue chaining selections/transformations, and only when you trigger a final compute will the data actually be read from disk and loaded into memory.\n", "\n", "For more information, see [Handling large datasets](./lazy_analysis). " ] }, { "cell_type": "markdown", "id": "b56a815d", "metadata": {}, "source": [ "## 3. Subsetting data while reading\n", "\n", "As a convenience, the dataset can already be subset during the call to `read()`. The entire dataset is being parsed first, and then subset to the desired coordinates. This is particularly powerful if you load your data **lazily** first---*\"load only what you need\"*. 
{ "cell_type": "markdown", "id": "b56a815d", "metadata": {}, "source": [ "## 3. Subsetting data while reading\n", "\n", "As a convenience, the dataset can already be subset during the call to `read()`. The entire dataset is parsed first and then subset to the desired coordinates. This is particularly powerful if you load your data **lazily** first---*\"load only what you need\"*. For example:\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "0e055eee", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[i] xsnow.xsnow_io: Using lazy loading with Dask\n", "[i] xsnow.xsnow_io: Creating lazy datasets backed by dask arrays...\n", "[i] xsnow.xsnow_io: Data will NOT be computed until explicitly requested by user\n", "[i] xsnow.xsnow_io: Created 25 lazy datasets (data NOT yet loaded into memory)\n" ] } ], "source": [ "subset = xsnow.read(f\"{datapath}/pros/gridded/\", \n", "                    location=['VIR1A', 'VIR3A'],\n", "                    slope={'inclination': 38, 'azimuth': 180},\n", "                    time=slice('2024-01-16T10:00', '2024-01-16T12:00'),\n", "                    lazy=True)" ] }, { "cell_type": "markdown", "id": "34896154", "metadata": {}, "source": [ "## 4. Inspecting structure and metadata\n", "You can use standard xarray accessors to explore shape and metadata:" ] }, { "cell_type": "code", "execution_count": 11, "id": "eac1fc2f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Frozen({'location': 1, 'time': 416, 'slope': 1, 'realization': 1, 'layer': 12})" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List dimensions and sizes\n", "xs.sizes" ] }, { "cell_type": "code", "execution_count": 12, "id": "2cc1403d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Coordinates:\n", " altitude (location) float64 8B 2.372e+03\n", " latitude (location) float64 8B 47.15\n", " * location (location) Size: 20kB\n", "array([[[[[ nan, nan, nan, ..., nan, nan, nan]]],\n", "\n", "\n", " [[[ nan, nan, nan, ..., nan, nan, nan]]],\n", "\n", "\n", " [[[ nan, nan, nan, ..., nan, nan, nan]]],\n", "\n", "\n", " ...,\n", "\n", "\n", " [[[213.9, 209. , 223.7, ..., 171.9, 142.4, 124.9]]],\n", "\n", "\n", " [[[214. , 209.2, 223.7, ..., 172.6, 143. , 125. ]]],\n", "\n", "\n", " [[[214.2, 209.3, 223.8, ..., 173.3, 143.6, 125. ]]]]],\n", " shape=(1, 416, 1, 1, 12), dtype=float32)\n", "Coordinates:\n", " altitude (location) float64 8B 2.372e+03\n", " latitude (location) float64 8B 47.15\n", " * location (location)