
Tabular data: the DataFrame container

sdata.sclass.dataframe.DataFrame is the self-describing tabular container (it supersedes the deprecated Data class). It wraps a pandas DataFrame together with per-column metadata and dataset-level metadata, and serializes to Parquet, Arrow/Feather, CSV, dict/JSON, JSON-LD/RDF, a Frictionless Data Package and HDF5 — with the qualifying metadata either embedded or written as an independent sidecar.

import pandas as pd
from sdata.sclass.dataframe import DataFrame

df = pd.DataFrame({"weight": [10, 20, 30], "height": [1.5, 1.6, 1.7]})
sdf = DataFrame(df=df, name="specimen_01", description="a tension test")

__init__ accepts df, column_metadata (a dict {col: {unit, label, ...}} or a Metadata) and any Base keyword (name, description, project, …). Passing no df yields an empty table.

The pandas frame

The wrapped frame is always available via sdf.df (settable). Thin convenience pass-throughs delegate to it:

sdf.df                 # the pandas DataFrame
len(sdf)               # number of rows
sdf.shape              # (3, 2)
sdf.columns            # Index(['weight', 'height'])
sdf.dtypes             # per-column pandas dtypes
sdf.head(2)            # first n rows
sdf.describe()         # descriptive statistics
repr(sdf)              # (DataFrame <…> shape=(3, 2))

Assigning a new frame (sdf.df = other) keeps the column metadata in sync — see Column metadata.

Dataset metadata

Every object carries fully-qualified, machine-readable dataset metadata (sdf.metadata), a free-text sdf.description, and a deterministic identity (sdf.name / sdf.sname / sdf.suuid). Reserved _sdata_* attributes (name, sname, suuid, class, ctime, parent, project, topology) are populated automatically.

sdf.metadata.add("max_force", 12.5, unit="kN", dtype="float",
                 description="max force", ontology="bfo:Quality")
sdf.metadata.df       # the metadata as a pandas table
sdf.udf               # only the user-defined attributes

See Machine-readable metadata for the metadata model, JSON-LD/RDF, schema validation and signing.

Column metadata

Each column carries an Attribute (unit/label/description/ontology/required) in sdf.column_metadata (a Metadata). Three views of the same store:

sdf.column_metadata    # the Metadata (one Attribute per column)
sdf.cmd                # alias of column_metadata
sdf.cmdf               # the column metadata rendered as a pandas DataFrame

Annotate columns with set_column (only the fields you pass are changed; existing annotations are preserved) and read them back with get_column:

sdf.set_column("weight", unit="kg", label="Gewicht", ontology="bfo:Quality")
sdf.set_column("height", unit="m")

sdf.get_column("weight").unit        # 'kg'
sdf.column_units                     # {'weight': 'kg', 'height': 'm'}

The col accessor offers attribute-style access with Jupyter tab-completion; the returned Attribute can be mutated in place:

sdf.col.weight                       # -> Attribute (tab-completion on sdf.col.)
sdf.col["weight"].unit = "kg"        # mutate a field in place

Sync & prune. When the frame is reassigned (sdf.df = other), column_metadata is kept in sync: new columns are added and annotations for removed columns are pruned, while surviving columns keep their unit/label. Column metadata supplied at construction time is preserved as-is (orphan keys — keys that match no column — are only logged as a warning, never dropped).
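The sync-and-prune rule can be pictured with plain dictionaries. The sketch below is not sdata's implementation: `sync_column_metadata` is a hypothetical helper, and the empty annotation given to new columns is an assumption.

```python
def sync_column_metadata(columns, column_metadata):
    """Return metadata aligned to `columns` (hypothetical helper, not sdata API).

    - surviving columns keep their existing annotations
    - new columns get an empty annotation (assumed default)
    - annotations for columns that are no longer present are pruned
    """
    return {col: dict(column_metadata.get(col, {})) for col in columns}


old_meta = {
    "weight": {"unit": "kg", "label": "Gewicht"},
    "height": {"unit": "m"},
}
# The reassigned frame drops "height" and adds "width".
new_columns = ["weight", "width"]

synced = sync_column_metadata(new_columns, old_meta)
print(synced)
```

Here `weight` keeps its unit and label, `width` appears with no annotations yet, and the entry for `height` is gone.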

Serialization

sdata writes the data in an efficient, typed container and the metadata either embedded in that container or alongside it as a sidecar. Pick the format by the interop you need:

| Format | Write / read | Metadata carrier | Needs |
| --- | --- | --- | --- |
| Parquet .spq | to_parquet / from_parquet, from_parquet_bytes | _sdata JSON blob in the schema | pyarrow |
| Arrow | to_arrow / from_arrow | _sdata blob + native per-column field metadata | pyarrow |
| Feather | to_feather / from_feather | same as Arrow | pyarrow |
| dict | to_dict / from_dict | base64 Parquet + explicit column_metadata | pyarrow |
| JSON .sjson | to_json / from_json | via to_dict | pyarrow |
| CSV | to_csv / from_csv | sidecar only (data-only file) | none (pure pandas) |
| pandas df.attrs | to_dataframe | _sdata in df.attrs | none |
| JSON-LD / RDF | to_jsonld / to_turtle / write_sidecar | the metadata itself | rdflib (optional) |
| Data Package .zip | to_datapackage / from_datapackage | datapackage.json (Frictionless) + lossless sdata block | none (CSV) / pyarrow (Parquet) |
| HDF5 .h5 | to_hdf / from_hdf | _sdata node attribute (PyTables) | tables (sdata[hdf]) |

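All file writers share the same shape: an optional path (a directory; the file is written as <sname>.<ext>), an optional exact filename, and a sidecar flag. Without a path they return bytes (or, for CSV, a string). HDF5 is the exception: it has no in-memory form and always needs a path or filename.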

# Parquet (.spq) — metadata embedded in the schema; zstd-compressed
sdf.to_parquet(path="out", sidecar=True)        # -> out/<sname>.spq + sidecar
DataFrame.from_parquet("out/<sname>.spq")
raw = sdf.to_parquet()                           # in-memory bytes
DataFrame.from_parquet_bytes(raw)

# Arrow / Feather — metadata in the Arrow schema (+ native per-column field metadata)
table = sdf.to_arrow()                           # pyarrow.Table
DataFrame.from_arrow(table)
sdf.to_feather(path="out")                       # -> out/<sname>.feather
DataFrame.from_feather("out/<sname>.feather")

# dict / JSON — for nesting in JSON documents
d = sdf.to_dict()
DataFrame.from_dict(d)
sdf.to_json("specimen_01.sjson", sidecar=True)   # text; from_json reconstructs

# CSV — data only (pure pandas); metadata via the sidecar
sdf.to_csv(path="out", sidecar=True)             # index dropped by default
DataFrame.from_csv("out/<sname>.csv")
text = sdf.to_csv()                              # CSV string

# hand back a plain pandas frame with sdata metadata in df.attrs["_sdata"]
plain = sdf.to_dataframe()

# linked data
sdf.to_jsonld()
sdf.to_turtle()
sdf.write_sidecar("out")
from sdata.base import Base
Base.read_sidecar("out/<sname>.meta.jsonld")     # read a sidecar back

Optional backend

Arrow, Feather and Parquet (and therefore to_dict/to_json) need pyarrow (pip install "sdata[parquet]"); CSV, to_dataframe and JSON-LD work with the core install. A missing backend raises a clear ImportError pointing at the extra.
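A common way to produce such an error is to import the backend lazily and re-raise with a hint. The following is a minimal sketch of that pattern, not sdata's actual code: `require_backend` is a hypothetical helper and the message wording is an assumption.

```python
import importlib


def require_backend(module_name, extra):
    """Import an optional backend or raise an ImportError naming the pip extra.

    Hypothetical helper for illustration; sdata's own message may differ.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this format; "
            f'install it with: pip install "sdata[{extra}]"'
        ) from exc


# A stdlib module stands in for a backend that is installed.
json_module = require_backend("json", "parquet")

# A deliberately non-existent module name stands in for a missing backend.
message = ""
try:
    require_backend("backend_that_is_not_installed", "parquet")
except ImportError as err:
    message = str(err)
    print(message)
```

Importing inside the function keeps the core install free of the heavy dependency until a format that needs it is actually used.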

Native per-column metadata (Arrow / Feather)

Besides the _sdata JSON blob, to_arrow() (and therefore to_feather()) attaches each column's unit/label/description/ontology natively to that column's Arrow field metadata. Arrow-aware tools (DuckDB, Polars, pyarrow) can read the per-column annotations without sdata, and from_arrow() merges field metadata from foreign Arrow tables back into column_metadata.

sdf.to_arrow().schema.field("weight").metadata   # {b'unit': b'kg', b'label': ...}
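Arrow returns field metadata with bytes keys and bytes values, so a consumer usually decodes it before use. The sketch below uses a literal dictionary as a stand-in for the result of the call above; `decode_field_metadata` is a hypothetical helper, and UTF-8 encoding of the values is an assumption.

```python
def decode_field_metadata(raw):
    """Turn Arrow-style bytes->bytes field metadata into a str->str dict."""
    if not raw:  # pyarrow reports a field without metadata as None
        return {}
    return {k.decode("utf-8"): v.decode("utf-8") for k, v in raw.items()}


# Stand-in for sdf.to_arrow().schema.field("weight").metadata;
# the values are illustrative, taken from the set_column example.
raw = {b"unit": b"kg", b"label": b"Gewicht"}

annotations = decode_field_metadata(raw)
print(annotations["unit"])
```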

Data Package (portable bundle)

to_datapackage() writes a self-contained Frictionless Data Package (.zip): a standard datapackage.json descriptor (so generic Frictionless tooling can read it), the data as CSV (default, no extra dependency) or Parquet, and the full sdata metadata under the descriptor's "sdata" key for a lossless round-trip. The JSON-LD sidecar is embedded too (toggle with sidecar=).

sdf.to_datapackage(path="out")                  # -> out/<sname>.zip  (CSV inside)
sdf.to_datapackage(path="out", fmt="parquet")   # Parquet inside (needs sdata[parquet])
raw = sdf.to_datapackage()                       # zip bytes (no path)

DataFrame.from_datapackage("out/<sname>.zip")    # restores data + metadata losslessly

Column annotations map to Frictionless field properties (title←label, unit, rdfType←ontology, description), and the dtype to a Table Schema type (integer/number/boolean/datetime/string).
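The mapping can be illustrated with a small, dependency-free sketch. It is not sdata's implementation: `frictionless_type` and `frictionless_field` are hypothetical helpers, and the dtype rules (matching on the dtype name's prefix) are an assumption.

```python
def frictionless_type(dtype):
    """Map a pandas dtype name to a Table Schema type (assumed rules)."""
    name = str(dtype).lower()
    if name.startswith("datetime"):
        return "datetime"
    if name.startswith(("int", "uint")):
        return "integer"
    if name.startswith("float"):
        return "number"
    if name.startswith("bool"):
        return "boolean"
    return "string"


def frictionless_field(name, dtype, annotation):
    """Build one Table Schema field descriptor from a column annotation."""
    field = {"name": name, "type": frictionless_type(dtype)}
    # label -> title and ontology -> rdfType are renamed;
    # unit and description keep their names.
    renames = {"label": "title", "unit": "unit",
               "ontology": "rdfType", "description": "description"}
    for source_key, target_key in renames.items():
        value = annotation.get(source_key)
        if value:  # skip annotations that are not set
            field[target_key] = value
    return field


field = frictionless_field(
    "weight", "int64",
    {"unit": "kg", "label": "Gewicht", "ontology": "bfo:Quality"},
)
print(field)
```

Generic Frictionless tooling reads `name`, `type`, `title`, `rdfType` and `description`; `unit` travels along as an additional field property.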

HDF5

For large/scientific data, to_hdf() writes an HDF5 file (PyTables) with the sdata metadata stored as the node's _sdata attribute; from_hdf() reads it back. Several DataFrames can share one file via distinct keys. HDF5 has no in-memory bytes form, so a path or filename is required. Needs pip install "sdata[hdf]".

sdf.to_hdf(path="out")                           # -> out/<sname>.h5  (key = sname)
DataFrame.from_hdf("out/<sname>.h5")             # default: first key in the file

# several tables in one file (other is a second sdata DataFrame)
sdf.to_hdf(filename="bundle.h5", key="run1")
other.to_hdf(filename="bundle.h5", key="run2")
DataFrame.from_hdf("bundle.h5", key="run2")

See RFC 0002 for the design rationale.

Table schema validation

A TableSchema declares the expected columns (reusing AttrSpec) and validates a DataFrame against them. It reports missing columns, dtype mismatches (against df.dtypes), unit mismatches (against column_metadata) and extra, unspecified columns:

from sdata.schema import TableSchema, AttrSpec

schema = TableSchema("TensileTable", [
    AttrSpec("weight", dtype="int", unit="kg", required=True),
    AttrSpec("height", dtype="float", unit="m"),
])

report = sdf.validate_table(schema)   # ValidationReport (truthy if ok)
schema.apply(sdf)                      # fill missing column_metadata from the schema

A DataFrame subclass may set TABLE_SCHEMA to have its column_metadata auto-completed on construction; sdf.validate_table() then checks against it — analogous to Base.SDATA_SCHEMA for the dataset metadata.
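The four checks can be sketched without sdata to show what such a report contains. This is an illustration under stated assumptions, not the library's logic: `ColumnSpec` and `check_table` are hypothetical names, dtypes are compared by name prefix, and only required columns count as missing.

```python
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    """Stand-in for an AttrSpec entry (hypothetical, not sdata's class)."""
    name: str
    dtype: str = ""
    unit: str = ""
    required: bool = False


def check_table(specs, dtypes, units):
    """Compare specs with observed dtypes and units; return a list of problems."""
    problems = []
    for spec in specs:
        if spec.name not in dtypes:
            if spec.required:  # assumption: optional columns may be absent
                problems.append(f"missing column: {spec.name}")
            continue
        if spec.dtype and not dtypes[spec.name].startswith(spec.dtype):
            problems.append(f"dtype mismatch: {spec.name}")
        if spec.unit and units.get(spec.name) != spec.unit:
            problems.append(f"unit mismatch: {spec.name}")
    declared = {spec.name for spec in specs}
    problems += [f"extra column: {col}" for col in dtypes if col not in declared]
    return problems


specs = [
    ColumnSpec("weight", dtype="int", unit="kg", required=True),
    ColumnSpec("height", dtype="float", unit="m"),
]
# Observed state: dtype names per column and units from the column metadata.
dtypes = {"weight": "int64", "height": "float64", "note": "object"}
units = {"weight": "g", "height": "m"}

problems = check_table(specs, dtypes, units)
print(problems)
```

An empty list corresponds to a passing report; here the wrong unit on `weight` and the undeclared `note` column are flagged.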

Full API

See the API reference for the complete, auto-generated signature list.