Tabular data: the `DataFrame` container¶

sdata.sclass.dataframe.DataFrame is the self-describing tabular container (it supersedes the deprecated Data class). It wraps a pandas DataFrame together with per-column metadata and dataset-level metadata, and serializes to Parquet, Arrow/Feather, CSV, dict/JSON, JSON-LD/RDF, a Frictionless Data Package and HDF5 — with the qualifying metadata either embedded or written as an independent sidecar.

import pandas as pd
from sdata.sclass.dataframe import DataFrame

df = pd.DataFrame({"weight": [10, 20, 30], "height": [1.5, 1.6, 1.7]})
sdf = DataFrame(df=df, name="specimen_01", description="a tension test")

__init__ accepts df, column_metadata (a dict {col: {unit, label, ...}} or a Metadata) and any Base keyword (name, description, project, …). Passing no df yields an empty table.

The pandas frame¶

The wrapped frame is always available via sdf.df (settable). Thin convenience pass-throughs delegate to it:

sdf.df                 # the pandas DataFrame
len(sdf)               # number of rows
sdf.shape              # (3, 2)
sdf.columns            # Index(['weight', 'height'])
sdf.dtypes             # per-column pandas dtypes
sdf.head(2)            # first n rows
sdf.describe()         # descriptive statistics
repr(sdf)              # (DataFrame <…> shape=(3, 2))

Assigning a new frame (sdf.df = other) keeps the column metadata in sync — see Column metadata.

Dataset metadata¶

Every object carries fully-qualified, machine-readable dataset metadata (sdf.metadata), a free-text sdf.description, and a deterministic identity (sdf.name / sdf.sname / sdf.suuid). Reserved _sdata_* attributes (name, sname, suuid, class, ctime, parent, project, topology) are populated automatically.

sdf.metadata.add("max_force", 12.5, unit="kN", dtype="float",
                 description="max force", ontology="bfo:Quality")
sdf.metadata.df       # the metadata as a pandas table
sdf.udf               # only the user-defined attributes

See Machine-readable metadata for the metadata model, JSON-LD/RDF, schema validation and signing.

Column metadata¶

Each column carries an Attribute (unit/label/description/ontology/required) in sdf.column_metadata (a Metadata). Three views of the same store:

sdf.column_metadata    # the Metadata (one Attribute per column)
sdf.cmd                # alias of column_metadata
sdf.cmdf               # the column metadata rendered as a pandas DataFrame

Annotate columns with set_column (only the fields you pass are changed; existing annotations are preserved) and read them back with get_column:

sdf.set_column("weight", unit="kg", label="Gewicht", ontology="bfo:Quality")
sdf.set_column("height", unit="m")

sdf.get_column("weight").unit        # 'kg'
sdf.column_units                     # {'weight': 'kg', 'height': 'm'}

The col accessor offers attribute-style access with Jupyter tab-completion; the returned Attribute can be mutated in place:

sdf.col.weight                       # -> Attribute (tab-completion on df.col.)
sdf.col["weight"].unit = "kg"        # mutate a field in place

Sync & prune. When the frame is reassigned (sdf.df = other), column_metadata is kept in sync: new columns are added and annotations for removed columns are pruned, while surviving columns keep their unit/label. Column metadata supplied at construction time is preserved as-is (orphan keys — keys that match no column — are only logged as a warning, never dropped).

Serialization¶

sdata writes the data in an efficient, typed container and the metadata either embedded in that container or alongside it as a sidecar. Pick the format by the interop you need:

Format	Write / read	Metadata carrier	Needs
Parquet `.spq`	`to_parquet` / `from_parquet`, `from_parquet_bytes`	`_sdata` JSON blob in the schema	pyarrow
Arrow	`to_arrow` / `from_arrow`	`_sdata` blob + native per-column field metadata	pyarrow
Feather	`to_feather` / `from_feather`	same as Arrow	pyarrow
dict	`to_dict` / `from_dict`	base64 Parquet + explicit `column_metadata`	pyarrow
JSON `.sjson`	`to_json` / `from_json`	via `to_dict`	pyarrow
CSV	`to_csv` / `from_csv`	sidecar only (data-only file)	— (pure pandas)
pandas `df.attrs`	`to_dataframe`	`_sdata` in `df.attrs`	—
JSON-LD / RDF	`to_jsonld` / `to_turtle` / `write_sidecar`	the metadata itself	rdflib (optional)
Data Package `.zip`	`to_datapackage` / `from_datapackage`	`datapackage.json` (Frictionless) + lossless `sdata` block	— (csv) / pyarrow (parquet)
HDF5 `.h5`	`to_hdf` / `from_hdf`	`_sdata` node attribute (PyTables)	tables (`sdata[hdf]`)

All file writers share the same shape: an optional path (writes <sname>.<ext>), an optional exact filename, and a sidecar flag; without a path they return bytes (or, for CSV, a string).

# Parquet (.spq) — metadata embedded in the schema; zstd-compressed
sdf.to_parquet(path="out", sidecar=True)        # -> out/<sname>.spq + sidecar
DataFrame.from_parquet("out/<sname>.spq")
raw = sdf.to_parquet()                           # in-memory bytes
DataFrame.from_parquet_bytes(raw)

# Arrow / Feather — metadata in the Arrow schema (+ native per-column field metadata)
table = sdf.to_arrow()                           # pyarrow.Table
DataFrame.from_arrow(table)
sdf.to_feather(path="out")                       # -> out/<sname>.feather
DataFrame.from_feather("out/<sname>.feather")

# dict / JSON — for nesting in JSON documents
d = sdf.to_dict();      DataFrame.from_dict(d)
sdf.to_json("specimen_01.sjson", sidecar=True)   # text; from_json reconstructs

# CSV — data only (pure pandas); metadata via the sidecar
sdf.to_csv(path="out", sidecar=True)             # index dropped by default
DataFrame.from_csv("out/<sname>.csv")
text = sdf.to_csv()                              # CSV string

# hand back a plain pandas frame with sdata metadata in df.attrs["_sdata"]
plain = sdf.to_dataframe()

# linked data
sdf.to_jsonld();   sdf.to_turtle();   sdf.write_sidecar("out")
from sdata.base import Base
Base.read_sidecar("out/<sname>.meta.jsonld")     # read a sidecar back

Optional backend

Arrow, Feather and Parquet (and therefore to_dict/to_json) need pyarrow (pip install "sdata[parquet]"); CSV, to_dataframe and JSON-LD work with the core install. A missing backend raises a clear ImportError pointing at the extra.

Native per-column metadata (Arrow / Feather)

Besides the _sdata JSON blob, to_arrow() (and therefore to_feather()) attaches each column's unit/label/description/ontology natively to that column's Arrow field metadata. Arrow-aware tools (DuckDB, Polars, pyarrow) can read the per-column annotations without sdata, and from_arrow() merges field metadata from foreign Arrow tables back into column_metadata.

sdf.to_arrow().schema.field("weight").metadata   # {b'unit': b'kg', b'label': ...}

Data Package (portable bundle)¶

to_datapackage() writes a self-contained Frictionless Data Package (.zip): a standard datapackage.json descriptor (so generic Frictionless tooling can read it), the data as CSV (default, no extra dependency) or Parquet, and the full sdata metadata under the descriptor's "sdata" key for a lossless round-trip. The JSON-LD sidecar is embedded too (toggle with sidecar=).

sdf.to_datapackage(path="out")                  # -> out/<sname>.zip  (CSV inside)
sdf.to_datapackage(path="out", fmt="parquet")   # Parquet inside (needs sdata[parquet])
raw = sdf.to_datapackage()                       # zip bytes (no path)

DataFrame.from_datapackage("out/<sname>.zip")    # restores data + metadata losslessly

Column annotations map to Frictionless field properties (title←label, unit, rdfType←ontology, description), and the dtype to a Table Schema type (integer/number/boolean/datetime/string).

HDF5¶

For large/scientific data, to_hdf() writes an HDF5 file (PyTables) with the sdata metadata stored as the node's _sdata attribute; from_hdf() reads it back. Several DataFrames can share one file via distinct keys (HDF5 has no in-memory bytes form, so a path/filename is required). Needs pip install "sdata[hdf]".

sdf.to_hdf(path="out")                           # -> out/<sname>.h5  (key = sname)
DataFrame.from_hdf("out/<sname>.h5")             # default: first key in the file

# several tables in one file
sdf.to_hdf(filename="bundle.h5", key="run1")
other.to_hdf(filename="bundle.h5", key="run2")
DataFrame.from_hdf("bundle.h5", key="run2")

See RFC 0002 for the design rationale.

Table schema validation¶

A TableSchema declares the expected columns (reusing AttrSpec) and validates a DataFrame against them — missing columns, dtype mismatches (against df.dtypes), unit mismatches (against column_metadata) and extra, unspecified columns:

from sdata.schema import TableSchema, AttrSpec

schema = TableSchema("TensileTable", [
    AttrSpec("weight", dtype="int", unit="kg", required=True),
    AttrSpec("height", dtype="float", unit="m"),
])

report = sdf.validate_table(schema)   # ValidationReport (truthy if ok)
schema.apply(sdf)                      # fill missing column_metadata from the schema

A DataFrame subclass may set TABLE_SCHEMA to have its column_metadata auto-completed on construction; sdf.validate_table() then checks against it — analogous to Base.SDATA_SCHEMA for the dataset metadata.

Full API¶

See the API reference for the complete, auto-generated signature list.

Tabular data: the DataFrame container¶