How do I validate & annotate arbitrary data structures?
This guide walks through the low-level API that lets you validate iterables.
You can then use the records created or inferred during validation to annotate a dataset.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.
CanCurate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
# pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
→ initialized lamindb: testuser1/test-curate-any
Define a test dataset.
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.create(
    (3,),
    dtype=[("temperature", "f8"), ("knockout_gene", "U15"), ("disease", "U16")],
    store="data.zarr",
)
data["knockout_gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
data["disease"] = np.random.default_rng().choice(["MONDO:0004975", "MONDO:0004980"], 3)
→ connected lamindb: testuser1/test-curate-any
Validate and standardize vectors
validate() validates vector-like values against reference values in a registry.
It returns a boolean vector indicating where a value has an exact match in the reference values.
bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
→ use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False])
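To make the exact-match semantics concrete, here is a minimal pure-Python sketch. It is not LaminDB's implementation: `validate_exact` and both reference lists are hypothetical names introduced only for illustration. It shows why an empty registry yields all `False`, and why a value that differs only in case does not count as a match.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True where the value exactly matches a reference entry."""
    reference_set = set(reference)
    return [value in reference_set for value in values]


# Hypothetical registry contents (assumptions for this sketch only).
empty_registry = []
populated_registry = ["MONDO:0004975", "MONDO:0004980"]

values = ["MONDO:0004975", "mondo:0004975", "MONDO:0004980"]

# An empty registry validates nothing.
print(validate_exact(values, empty_registry))      # [False, False, False]
# Only exact matches validate; the lowercase variant does not.
print(validate_exact(values, populated_registry))  # [True, False, True]
```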
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id)
! received 2 unique terms, 1 empty/duplicated term is ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
detected 2 Disease terms in public source for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
→ add records from public source to your Disease registry via .from_values()
<lamin_utils._inspect.InspectResult at 0x7f6702de38c0>
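The note on standardize() above can be pictured as a lookup from synonyms to standardized names. The following is a conceptual sketch only, under the assumption that standardization maps known synonyms to a canonical name and leaves everything else unchanged; `standardize_sketch` and the hand-written synonym table are hypothetical and do not reflect how LaminDB stores or resolves synonyms.

```python
def standardize_sketch(values, synonym_to_name):
    """Replace each known synonym with its standardized name; leave other values unchanged."""
    return [synonym_to_name.get(value, value) for value in values]


# Hypothetical, hand-written synonym table (not pulled from any registry).
synonym_to_name = {"p53": "TP53", "FANCD1": "BRCA2"}

symbols = ["p53", "KRAS", "FANCD1"]
standardized = standardize_sketch(symbols, synonym_to_name)
print(standardized)  # ['TP53', 'KRAS', 'BRCA2']
```

Values that are already standardized, such as KRAS here, pass through untouched, so they would validate without any further curation.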
Bulk-creating records with from_values() returns only validated records.
diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id).save()
Repeat the process for more labels:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-validated labels
).save()
genes = bt.Gene.from_values(data["knockout_gene"], field=bt.Gene.ensembl_gene_id).save()
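The difference between the default behavior and `create=True` can be sketched in plain Python. This is an illustration under stated assumptions, not LaminDB code: `from_values_sketch`, its `registry` and `public` parameters, and the returned list of plain strings are all hypothetical stand-ins for records. The sketch treats a value as validated if it matches the registry or a public reference, drops duplicates, and keeps non-validated values only when `create=True`.

```python
def from_values_sketch(values, registry, public=(), create=False):
    """Return the values that would become records.

    A value counts as validated if it matches the registry or the public
    reference. Non-validated values are kept only when create=True.
    """
    known = set(registry) | set(public)
    kept = []
    for value in dict.fromkeys(values):  # drop duplicates, keep first-seen order
        if value in known or create:
            kept.append(value)
    return kept


# Hypothetical public reference holding the two disease IDs used above.
public_diseases = ["MONDO:0004975", "MONDO:0004980"]
disease_values = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"]

# Empty registry, but both IDs match the public reference.
print(from_values_sketch(disease_values, registry=[], public=public_diseases))
# ['MONDO:0004975', 'MONDO:0004980']

# Free-text labels have no reference to match against.
labels = ["Project A", "Project B"]
print(from_values_sketch(labels, registry=[]))               # []
print(from_values_sketch(labels, registry=[], create=True))  # ['Project A', 'Project B']
```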
Annotate the dataset
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
! no run & transform got linked, call `ln.track()` & re-run
Annotate with features:
ln.Feature(name="project", dtype=ln.ULabel).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
    {"project": projects, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = 'TrggH1Oae53IfIvy0000'
│   ├── .key = 'my_dataset.zarr'
│   ├── .size = 848
│   ├── .hash = 'SilFmsZ-n7ruAxHRzVSG7w'
│   ├── .n_files = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/faq/test-curate-any/.lamindb/TrggH1Oae53IfIvy.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-31 16:09:03
├── Linked features
│   └── disease        cat[bionty.Disease.ontol…  Alzheimer disease, atopic eczema
│       knockout_gene  cat[bionty.Gene.ensembl_…  BRCA2, KRAS, TP53
│       project        cat[ULabel]                Project A, Project B
└── Labels
    └── .genes     bionty.Gene     BRCA2, TP53, KRAS
        .diseases  bionty.Disease  atopic eczema, Alzheimer disease
        .ulabels   ULabel          Project A, Project B
# clean up test instance
!rm -r data.zarr
!rm -r ./test-curate-any
!lamin delete --force test-curate-any
• deleting instance testuser1/test-curate-any