xsnow_io module#

Handles all data parsing and I/O for the xsnow package, enabling scalable and memory-efficient processing of large datasets using Dask.

This module is the primary engine for reading and writing snowpack data in various formats (e.g., SNOWPACK .pro, .smet, NetCDF).

Core Dask Integration#

To handle datasets that are larger than memory, this module employs a lazy, out-of-core loading strategy for .pro files using Dask. The process, illustrated by the code sketch after this list, is:

  1. Pre-Scan for Metadata: A quick, parallel pre-scan of all .pro files is performed to determine essential metadata, such as the maximum number of snow layers (max_layers), without loading the full data into memory.

  2. Lazy Graph Construction: A Dask computation graph is built where each node represents the task of reading and processing a single .pro file into an xarray.Dataset. These tasks are “lazy” and are not executed immediately.

  3. Parallel Computation: When the data is needed, Dask’s scheduler executes the graph in parallel, reading files and creating datasets in chunks. This ensures that only a fraction of the total data resides in memory at any given time, enabling the processing of vast amounts of data on a single machine.
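
A minimal sketch of this pattern using only public Dask and xarray APIs; the parser stub, variable names, and fixed array shapes below are illustrative assumptions, not the module's actual internals:

from pathlib import Path

import dask
import dask.array as da
import numpy as np
import xarray as xr


def parse_pro_block(path: Path, n_times: int, max_layers: int) -> np.ndarray:
    # Placeholder for a real .pro parser: return a (time, layer) array padded
    # with NaN up to max_layers (the value found by the pre-scan in step 1).
    return np.full((n_times, max_layers), np.nan)


def lazy_read_pro(pro_files: list[Path], n_times: int, max_layers: int) -> xr.Dataset:
    per_file = []
    for path in pro_files:
        # Step 2: one delayed task per file; nothing is read or parsed here.
        delayed = dask.delayed(parse_pro_block)(path, n_times, max_layers)
        values = da.from_delayed(delayed, shape=(n_times, max_layers), dtype=float)
        per_file.append(xr.Dataset({"density": (("time", "layer"), values)}))
    # Step 3: combining stays lazy; Dask's scheduler reads the files in parallel
    # only when .compute(), plotting, or writing triggers evaluation.
    return xr.concat(per_file, dim="location")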

xsnow.xsnow_io.read(source, recursive=False, time=None, location=None, slope=None, realization=None, lazy=None, lazy_threshold=30, parallel_lazy_creation=True, n_cpus_use=None, logger=None, chunks=None, **parser_kwargs)#

Reads, parses, and combines snowpack data into a unified xsnowDataset.

This is the primary user-facing function for loading data. It orchestrates the discovery, parallel processing, and merging of snow profile (.pro) and time series (.smet) files into a single, coherent dataset.

Parameters:
  • source (Union[str, List[str], Path, List[Path]]) – The data source. Can be a path to a single file (e.g., ‘.pro’, ‘.smet’, ‘.nc’), a directory containing data files, or a list of file paths.

  • recursive (bool) – If True, search subdirectories recursively for .pro and .smet files when source is a directory. Default is False for backward compatibility and to avoid accidentally including unwanted files.

  • time (Optional[TimeSelector]) – A TimeSelector object for filtering data by time.

  • location (Optional[LocationSelector]) – A LocationSelector object for filtering by station ID.

  • slope (Optional[SlopeSelector]) – A SlopeSelector for filtering by slope angle.

  • realization (Optional[RealizationSelector]) – A RealizationSelector for filtering by model realization.

  • lazy (Optional[bool]) – If True, use Dask for lazy, memory-efficient loading (good for large datasets). If False, load all data eagerly into memory with parallel processing (good for small to medium datasets). If None (default), the mode is chosen automatically based on the file count (see lazy_threshold).

  • lazy_threshold (int) – Number of files above which lazy loading is enabled when lazy=None (auto-detect). Default: 30 files.

  • parallel_lazy_creation (bool) – If True (default), parallelize the creation of lazy datasets using ThreadPoolExecutor. This speeds up the metadata parsing phase by 5-15x when working with hundreds or thousands of files. Recommended to keep True except for debugging.

  • n_cpus_use (Optional[int]) – The number of CPU cores to use. If not provided, defaults to min(32, total_cores - 1), with a minimum of 1.

  • logger (Optional[Logger]) – An optional, pre-configured logger instance.

  • chunks (Optional[Dict[str, int]]) – Chunk sizes for Dask arrays when lazy=True. Dict with keys 'time' and 'layer'. Default: {'time': 100, 'layer': -1}. Example: {'time': 50, 'layer': 100}.

  • **parser_kwargs (Any) – Additional keyword arguments passed to the underlying parser functions:

    • max_layers: Max layer dimension; ‘auto’ (default) scans files to pick the maximum, or set an int to force a fixed size.

    • force_smart_detection: If True, skip file validation for smart detection. Use when you know your files are heterogeneous but want fast loading anyway. Default: False.

    • validation_threshold_days: Maximum time spread (in days) for files to be considered homogeneous. Files modified beyond this threshold will trigger a validation warning. Default: 7.0 days.

    • trim_padding: If True, automatically trim NaN padding after loading. Only recommended for eager loading (lazy=False). For lazy datasets, this triggers computation. Default: False.

    • remove_soil: If True (default), drops soil layers during parsing.

    • add_surface_sh_as_layer: If True (default), explicitly inserts a surface hoar layer when the PRO file’s special code for SH at the surface indicates one.

    • norm_slopes: Controls slope normalization in SNOWPACK parsing / merging. Options:

      • ’auto’ (default): normalize when slopes are consistent across locations; preserve per-location slopes when inconsistent.

      • True: force normalization (collapse to (slope,) even if slopes vary by location).

      • False: skip normalization entirely.

Return type:

Optional[xsnowDataset]

Returns:

A unified xsnowDataset object containing the data from all sources, or None if no valid data files are found.

Examples

>>> # Auto-detect (lazy for many files, eager for few)
>>> ds = read("data/")
>>> # Force lazy loading for large datasets
>>> ds = read("large_archive/", lazy=True, n_cpus_use=8)
>>> # Force eager loading with trimmed padding
>>> ds = read("small_dataset/", lazy=False, trim_padding=True)
>>> # Force smart detection for heterogeneous files (use with caution)
>>> ds = read("mixed_archive/", force_smart_detection=True)
>>> # Force eager loading for quick access
>>> ds = read("small_dataset/", lazy=False, n_cpus_use=4)

See also

to_netcdf

Save datasets to NetCDF with proper formatting

xsnow.xsnow_io.read_smet(filepath, datetime_start=None, datetime_end=None, logger=None)#

Parses a single SMET file into an xarray.Dataset.

Reads a SMET-formatted text file, parsing the header for metadata and the data section into a pandas DataFrame, which is then converted into an xarray.Dataset. The data is filtered by the specified time range.

Parameters:
  • filepath (Union[str, Path]) – Path to the SMET file.

  • datetime_start (Optional[str]) – The start of the time range to read. Data before this time is dropped.

  • datetime_end (Optional[str]) – The end of the time range to read. Data after this time is dropped.

  • logger (Optional[logging.Logger]) – An optional, pre-configured logger instance. If None, the default module logger is used.

Return type:

Optional[Dataset]

Returns:

An xarray.Dataset containing the SMET data, or None if the file cannot be parsed or contains no data in the specified time range.
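
A minimal illustrative call (the file name and the ISO-formatted dates are placeholders):

>>> ds_met = read_smet("station.smet",
...                    datetime_start="2023-10-01T00:00",
...                    datetime_end="2024-06-30T00:00")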

xsnow.xsnow_io.append_latest(existing_ds, source, from_time=None, join='left', logger=None, **kwargs)#

Incrementally extend a dataset by appending newer timestamps from files.

This method is a convenience wrapper for time selection, read, and concat. Semantics:

  1. Determine a start time:
    • If from_time is None, use (max(existing time) + 1s).

    • If from_time is provided, drop all times >= from_time from existing_ds and start reading at from_time (inclusive).

  2. Read source with that time filter.

  3. Concatenate along time using xsnow.concat.
    • join controls non-time dims (default: ‘left’ keeps existing domains).

    • Time is always the union along the concat axis (no extra reindexing).

Parameters:
  • existing_ds (xsnowDataset) – The dataset instance to extend.

  • source (Union[str, List[str], Path]) – Path(s) to files or directories to read new data from (passed to read).

  • from_time (Union[str, datetime, Timestamp, datetime64, None]) – Subset the existing data to times < from_time and read new data starting at from_time (inclusive), so everything read is new; see the semantics above. Accepts str, datetime, pandas.Timestamp, or numpy.datetime64.

  • join (str) – Join mode for non-time dimensions forwarded to xsnow.concat.

  • logger (Optional[Logger]) – Optional logger to use for status/warning messages. If None, a logger is created.

  • **kwargs – Additional keyword args forwarded to xsnow.concat like compat or combine_attrs.

Return type:

xsnowDataset

Returns:

xsnowDataset with the appended data (or just the trimmed existing data if nothing new is found).
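
A typical incremental update, sketched with placeholder paths and dates:

>>> # Trim existing data to times before 2024-02-01 and append everything newer
>>> ds = append_latest(ds, "output/latest_run/", from_time="2024-02-01", join="left")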

xsnow.xsnow_io.to_netcdf(ds, path, logger=None, **kwargs)#

Saves the dataset to a NetCDF file. This method provides a convenient wrapper around the underlying xarray.Dataset.to_netcdf() method for easy caching and interoperability.

Automatically converts slope-varying coordinates (azimuth, aspect, inclination) to data variables to ensure files are mergeable without conflicts.

The location mapping dictionary is converted to NetCDF-compatible format by encoding it as JSON in a string attribute.

Parameters:
  • ds (xsnowDataset) – The dataset instance to save.

  • path (Union[str, Path]) – The destination file path for the .nc file.

  • logger (Optional[logging.Logger]) – An optional, pre-configured logger instance.

  • **kwargs – Additional keyword arguments passed to xarray.Dataset.to_netcdf().
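
A minimal caching round trip (the file name is a placeholder); extra keywords are forwarded to xarray.Dataset.to_netcdf():

>>> to_netcdf(ds, "season_cache.nc")
>>> ds_cached = read("season_cache.nc")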

xsnow.xsnow_io.to_smet(ds, path, max_files=None, **kwargs)#

Saves time-series data from the dataset to a SMET file.

This function extracts data variables that do not have a ‘layer’ dimension (e.g., meteorological data, total snow height) and writes them into the SMET format. It only supports writing data for a single location.

Parameters:
  • ds (xsnowDataset) – The dataset instance to save.

  • path (Union[str, Path]) – The destination file path for the .smet file.

  • max_files (int, optional) – The maximum number of locations allowed. If the dataset contains more locations, a ValueError is raised. Defaults to None (no limit).

  • **kwargs – Reserved for future filtering options.
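
For example, selecting a single station before writing satisfies the single-location restriction (station ID and file name are placeholders):

>>> to_smet(ds.sel(location="WFJ2"), "WFJ2_meteo.smet")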

xsnow.xsnow_io.to_pro(ds, path, max_files=None, **kwargs)#

Saves a single snow profile to a SNOWPACK .pro file.

This function iterates through each timestamp in the dataset and writes the vertical profile data (variables with a ‘layer’ dimension) into the .pro format. It only supports writing data for a single location.

Parameters:
  • ds (xsnowDataset) – The dataset instance to save.

  • path (Union[str, Path]) – The destination file path for the .pro file.

  • max_files (int, optional) – The maximum number of profiles (timestamps) allowed. If the dataset contains more profiles, a ValueError is raised. Defaults to None (no limit).

  • **kwargs – Reserved for future filtering options.

Raises:

ValueError – If the dataset is empty, contains more than one location, or if the number of profiles exceeds max_files.
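
Analogously to to_smet, a single-station selection can be written back to the .pro format (station ID, file name, and limit are placeholders):

>>> to_pro(ds.sel(location="WFJ2"), "WFJ2_profiles.pro", max_files=500)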

xsnow.xsnow_io.to_zarr(ds, path, mode='w', append_dim=None, consolidated=True, **kwargs)#

Save dataset to Zarr format with automatic slope coordinate handling.

The Zarr format provides better stability and performance for large datasets than NetCDF, thanks to its pure-Python implementation (no HDF5 crashes).

This function automatically converts 1D slope coordinates (azimuth, inclination) to 2D data variables with (location, slope) dimensions, ensuring compatibility with the rest of xsnow and avoiding merge conflicts.

Parameters:
  • ds (xsnowDataset) – The xsnowDataset to save

  • path (Union[str, Path]) – Output path (should end in .zarr)

  • mode (Literal['w', 'w-', 'a', 'a-', 'r+', 'r']) – Write mode: ‘w’ (overwrite), ‘a’ (append), ‘w-’ (fail if exists)

  • append_dim (Optional[str]) – Dimension to append along (e.g., ‘time’ for time-series updates)

  • consolidated (bool) – Whether to consolidate metadata (improves read performance)

  • **kwargs – Additional arguments passed to xarray.Dataset.to_zarr()

Return type:

None

Returns:

None

Examples

>>> # Save complete dataset
>>> ds.to_zarr('archive_2024.zarr')
>>> # Append new data along time dimension
>>> new_data.to_zarr('archive_2024.zarr', mode='a', append_dim='time')
>>> # Create with custom compression
>>> import numcodecs
>>> encoding = {var: {'compressor': numcodecs.Blosc(cname='zstd', clevel=5)}
...            for var in ds.data_vars}
>>> ds.to_zarr('archive.zarr', encoding=encoding)

See also

read_zarr

Read Zarr archives

to_netcdf

Save to NetCDF format

xsnow.xsnow_io.to_json(ds, path, **kwargs)#

Saves the dataset to a structured JSON file. (Not Implemented)

xsnow.xsnow_io.to_caaml(ds, path, **kwargs)#

Saves snow profile data to a CAAML V6.0 XML file. (Not Implemented)

xsnow.xsnow_io.to_crocus(ds, path, **kwargs)#

Saves snow profile data to a Crocus model input file. (Not Implemented)

xsnow.xsnow_io.trim_padding(ds, dims=None, keep_buffer=0, inplace=False)#

Remove NaN padding from layer dimension to reduce memory usage.

After reading files with max_layers padding, many trailing layers may be entirely NaN. This method detects the actual maximum layer with data and trims the excess.

IMPORTANT: For lazy-loaded datasets, this will trigger computation of all Dask arrays to check for NaN values. For large datasets, consider using eager loading if you plan to trim padding, or trim after subsetting: ds.sel(location='X').compute().trim_padding()

Parameters:
  • ds (xsnowDataset) – The dataset instance to trim

  • dims (Optional[List[str]]) – List of dimensions to trim. Default: ['layer']

  • keep_buffer (int) – Number of extra layers to keep beyond the maximum non-NaN layer. Useful if you expect to add more data later. Default: 0 (trim to exact maximum)

  • inplace (bool) – If True, modify dataset in place. If False (default), return a new dataset.

Return type:

xsnowDataset

Returns:

Trimmed xsnowDataset (new instance if inplace=False)

Examples

>>> # Dataset padded to 50 layers, but only 12 have data
>>> ds.sizes['layer']
50
>>> ds_trimmed = ds.trim_padding()
>>> ds_trimmed.sizes['layer']
12
>>> # Keep 2 extra layers as buffer
>>> ds_trimmed = ds.trim_padding(keep_buffer=2)
>>> ds_trimmed.sizes['layer']
14
>>> # Trim in place
>>> ds.trim_padding(inplace=True)
>>> ds.sizes['layer']
12

Notes

  • Only works on dimensions where padding is at the end (e.g., layer)

  • Checks all data variables that have the specified dimension

  • For lazy datasets, this triggers computation; consider eager loading