lamindb.Feature¶
- class lamindb.Feature(name: str, dtype: Dtype | Registry | list[Registry] | FieldAttr, type: Feature | None = None, is_type: bool = False, unit: str | None = None, description: str | None = None, synonyms: str | None = None, nullable: bool = True, default_value: str | None = None, coerce_dtype: bool = False, cat_filters: dict[str, str] | None = None)¶
Bases: Record, CanCurate, TracksRun, TracksUpdates
Dataset dimensions.
A feature represents a dimension of a dataset, such as a column in a DataFrame. The Feature registry organizes metadata of features.
The Feature registry helps you organize and query datasets based on their features and corresponding label annotations. For instance, when working with a “T cell” label, it could be measured through different features such as "cell_type_by_expert", where an expert manually classified the cell, or "cell_type_by_model", where a computational model made the classification.
The two most important metadata of a feature are its name and the dtype. In addition to typical data types, LaminDB has a "num" dtype to concisely denote the union of all numerical types.
- Parameters:
name – str
Name of the feature, typically a column name.
dtype – Dtype | Registry | list[Registry] | FieldAttr
See Dtype. For categorical types, you can define to which registry values are restricted, e.g., ULabel or [ULabel, bionty.CellType].
unit – str | None = None
Unit of measure, ideally SI ("m", "s", "kg", etc.) or "normalized" etc.
description – str | None = None
A description.
synonyms – str | None = None
Bar-separated synonyms.
nullable – bool = True
Whether the feature can have null-like values (None, pd.NA, NaN, etc.), see nullable.
default_value – Any | None = None
Default value for the feature.
coerce_dtype – bool = False
When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
cat_filters – dict[str, str] | None = None
Subset a registry by additional filters to define valid categories.
Note
For more control, you can use
bionty
registries to manage simple biological entities like genes, proteins & cell markers. Or you define custom registries to manage high-level derived features like gene sets.See also
Example
A simple "str" feature.
>>> ln.Feature(
...     name="sample_note",
...     dtype="str",
... ).save()
A dtype "cat[ULabel]" can be more easily passed as below.
>>> ln.Feature(
...     name="project",
...     dtype=ln.ULabel,
... ).save()
A dtype "cat[ULabel|bionty.CellType]" can be more easily passed as below.
>>> ln.Feature(
...     name="cell_type",
...     dtype=[ln.ULabel, bt.CellType],
... ).save()
Hint
Features and labels denote two ways of using entities to organize data:
A feature qualifies what is measured, i.e., a numerical or categorical random variable
A label is a measured value, i.e., a category
Consider annotating a dataset by the fact that it measured expression of 30k genes: genes relate to the dataset as feature identifiers through a feature set with 30k members. Now consider annotating the artifact by the fact that it measured the knock-out of 3 genes: here, the 3 genes act as labels of the dataset.
Re-shaping data can introduce ambiguity among features & labels. If this happens, ask yourself what the joint measurement was: a feature qualifies variables in a joint measurement. The canonical data matrix lists jointly measured variables in the columns.
Attributes¶
- property coerce_dtype: bool¶
Whether dtypes should be coerced during validation.
For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is true.
- property default_value: Any¶
A default value that overwrites missing values (default None).
This takes effect when you call Curator.standardize().
If default_value = None, missing values like pd.NA or np.nan are kept.
- property nullable: bool¶
Indicates whether the feature can have nullable values (default True).
Example:
import lamindb as ln
import pandas as pd

disease = ln.Feature(name="disease", dtype=ln.ULabel, nullable=False).save()
schema = ln.Schema(features=[disease]).save()
dataset = {"disease": pd.Categorical([pd.NA, "asthma"])}
df = pd.DataFrame(dataset)
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
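The nullable check described above can be pictured with a small pure-Python sketch. It is not LaminDB's implementation: the helper names find_nulls and check_nullable are hypothetical, only None and float NaN are treated as null-like (pd.NA is left out to avoid a pandas dependency), and a plain ValueError stands in for the library's ValidationError.

```python
import math


def find_nulls(values):
    """Return the positions of null-like entries (None or float NaN)."""
    return [
        i
        for i, v in enumerate(values)
        if v is None or (isinstance(v, float) and math.isnan(v))
    ]


def check_nullable(name, values, nullable=True):
    """Raise ValueError if a non-nullable column contains null-like values.

    Hypothetical helper: mirrors the idea of `nullable`, not the real API.
    """
    nulls = find_nulls(values)
    if not nullable and nulls:
        raise ValueError(f"non-nullable series '{name}' contains null values")
    return nulls


column = [None, "asthma"]
print(check_nullable("disease", column, nullable=True))
try:
    check_nullable("disease", column, nullable=False)
except ValueError as error:
    print(error)
```

With nullable=True the helper only reports where the nulls are; with nullable=False the same column is rejected.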
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- name: str¶
Name of feature (hard unique constraint unique=True).
- is_type: bool¶
Distinguish types from instances of the type.
- unit: str | None¶
Unit of measure, ideally SI (m, s, kg, etc.) or ‘normalized’ etc. (optional).
- description: str | None¶
A description.
- array_rank: int¶
Rank of feature.
Number of indices of the array: 0 for scalar, 1 for vector, 2 for matrix.
Is called .ndim in numpy and pytorch but shouldn’t be confused with the dimension of the feature space.
- array_size: int¶
Number of elements of the feature.
Total number of elements (product of shape components) of the array.
A number or string (a scalar): 1
A 50-dimensional embedding: 50
A 25 x 25 image: 625
- array_shape: list[int] | None¶
Shape of the feature.
A number or string (a scalar): [1]
A 50-dimensional embedding: [50]
A 25 x 25 image: [25, 25]
Is stored as a list rather than a tuple because it’s serialized as JSON.
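As a rough illustration of how array_size follows from array_shape, and why a list is the natural stored form, here is a minimal standard-library sketch. The function describe_array is a hypothetical helper, not part of LaminDB. array_rank is not derived here, because this page lists a scalar as rank 0 but shape [1].

```python
import json
import math


def describe_array(shape):
    """Derive the element count and a JSON-friendly shape from a shape sequence.

    Hypothetical helper for illustration only.
    """
    shape_list = list(shape)  # JSON has no tuple type, so store a list
    return {"array_shape": shape_list, "array_size": math.prod(shape_list)}


embedding = describe_array((50,))
image = describe_array((25, 25))

# A tuple that goes through JSON comes back as a list.
round_tripped = json.loads(json.dumps((25, 25)))

print(embedding)
print(image)
print(round_tripped)
```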
- proxy_dtype: Dtype | None¶
Proxy data type.
If the feature is an image it’s often stored via a path to the image file. Hence, while the dtype might be image with a certain shape, the proxy dtype would be str.
- synonyms: str | None¶
Bar-separated (|) synonyms (optional).
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
- type: Feature | None¶
Type of feature (e.g., ‘Readout’, ‘Metric’, ‘Metadata’, ‘ExpertAnnotation’, ‘ModelPrediction’).
Allows grouping features by type, e.g., all readouts, all metrics, etc.
- values: FeatureValue¶
Values for this feature.
- projects¶
Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.
In the example:
class Pizza(Model):
    toppings = ManyToManyField(Topping, related_name='pizzas')
Pizza.toppings and Topping.pizzas are ManyToManyDescriptor instances.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
- classmethod from_df(df, field=None)¶
Create Feature records for columns.
- Return type:
- classmethod from_values(values, field=None, create=False, organism=None, source=None, mute=False)¶
Bulk create validated records by parsing values for an identifier such as a name or an id.
- Parameters:
values (list[str] | Series | array) – A list of values for an identifier, e.g. ["name1", "name2"].
field (str | DeferredAttribute | None, default: None) – A Record field to look up, e.g., bt.CellMarker.name.
create (bool, default: False) – Whether to create records if they don’t exist.
organism (Record | str | None, default: None) – A bionty.Organism name or record.
source (Record | None, default: None) – A bionty.Source record to validate against to create records for.
mute (bool, default: False) – Whether to mute logging.
- Return type:
- Returns:
A list of validated records. For bionty registries, also returns knowledge-coupled records.
Notes
For more info, see tutorial: Manage biological registries.
Example:
import bionty as bt
import lamindb as ln

# Bulk create from non-validated values will log warnings & returns empty list
ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"])
assert len(ulabels) == 0

# Bulk create records from validated values returns the corresponding existing records
ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], create=True).save()
assert len(ulabels) == 3

# Bulk create records from public reference
bt.CellType.from_values(["T cell", "B cell"]).save()
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples:
ulabel = ln.ULabel.get("FvtpPJLJ")
ulabel = ln.ULabel.get(name="my-label")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
See also
Example:
import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# inspect gene symbols
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
assert result.validated == ["A1CF", "A1BG"]
assert result.non_validated == ["FANCD1", "FANCD20"]
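The exact-match semantics of inspect can be sketched without a database. The snippet below is an illustration under stated assumptions, not the library's code: InspectResult and inspect_values are hypothetical names, and the list of known symbols stands in for the values stored in a registry field.

```python
from typing import NamedTuple


class InspectResult(NamedTuple):
    """Hypothetical container mirroring the two attributes used above."""

    validated: list
    non_validated: list


def inspect_values(values, known):
    """Split values into exact matches and non-matches against known values."""
    known_set = set(known)
    validated = [v for v in values if v in known_set]
    non_validated = [v for v in values if v not in known_set]
    return InspectResult(validated, non_validated)


registry_symbols = ["A1CF", "A1BG", "BRCA2"]  # stand-in for saved records
result = inspect_values(["A1CF", "A1BG", "FANCD1", "FANCD20"], registry_symbols)
print(result.validated)
print(result.non_validated)
```

Note that "FANCD1" is not validated even though it appears as a synonym of BRCA2 in the standardize example on this page: being mappable requires an exact match on the chosen field.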
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
See also
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym()
Add synonyms.
remove_synonym()
Remove synonyms.
Example:
import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# standardize gene synonyms
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.standardize(gene_synonyms)
#> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
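To make the mapping from synonyms to standardized names concrete, here is a minimal pure-Python sketch of the idea. It is an assumption-laden illustration rather than the library's algorithm: build_synonym_mapper and standardize_values are hypothetical helpers, the records dictionary is made-up stand-in data (name mapped to a bar-separated synonyms string), and only the keep="first", return_mapper and case_sensitive behaviors are imitated.

```python
def build_synonym_mapper(records, case_sensitive=False):
    """Map each synonym to a standardized name; the first record wins."""
    mapper = {}
    for name, synonyms in records.items():
        for synonym in filter(None, (synonyms or "").split("|")):
            key = synonym if case_sensitive else synonym.lower()
            mapper.setdefault(key, name)  # keep="first" behavior
    return mapper


def standardize_values(values, records, return_mapper=False, case_sensitive=False):
    """Replace known synonyms with standardized names; leave the rest untouched."""
    mapper = build_synonym_mapper(records, case_sensitive)
    names = set(records)
    mapped = {}
    for value in values:
        key = value if case_sensitive else value.lower()
        if value not in names and key in mapper:
            mapped[value] = mapper[key]
    if return_mapper:
        return mapped
    return [mapped.get(value, value) for value in values]


# Illustrative stand-in data, not taken from a real registry.
records = {"A1CF": "ACF|ASP", "A1BG": None, "BRCA2": "FANCD1|BRCC2"}
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

print(standardize_values(gene_synonyms, records))
print(standardize_values(gene_synonyms, records, return_mapper=True))
```

Values that are already standardized names, or that match nothing, pass through unchanged, which is why "FANCD20" survives in the output list.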
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note this is strict_source validation, only asserts exact matches.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
See also
Example:
import bionty as bt

bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
#> array([ True, True, False, False])
Methods¶
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym()
Remove synonyms.
Example:
import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# add a synonym
record.add_synonym("T cells")
record.synonyms
#> "T cells|T-cell|T-lymphocyte|T lymphocyte"
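Because synonyms are stored as a single bar-separated string, adding or removing one amounts to splitting, editing and re-joining that string. The sketch below shows that idea with hypothetical standalone helpers (split_synonyms, with_synonym, without_synonym). It is not how the library implements add_synonym() or remove_synonym(), and the ordering of the real synonyms string may differ, as the output above suggests.

```python
def split_synonyms(synonyms):
    """Split a bar-separated synonyms string into a list (None gives [])."""
    return [s for s in (synonyms or "").split("|") if s]


def with_synonym(synonyms, new):
    """Return the synonyms string with `new` appended if it is not present."""
    if "|" in new:
        raise ValueError("a synonym must not contain the '|' separator")
    items = split_synonyms(synonyms)
    if new not in items:
        items.append(new)
    return "|".join(items)


def without_synonym(synonyms, old):
    """Return the synonyms string without `old`, or None if nothing is left."""
    items = [s for s in split_synonyms(synonyms) if s != old]
    return "|".join(items) or None


stored = "T-cell|T lymphocyte|T-lymphocyte"
print(with_synonym(stored, "T cells"))
print(without_synonym(stored, "T-cell"))
```

Rejecting values that contain the separator is a design choice of this sketch: it keeps the stored string unambiguous when it is split again.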
- async adelete(using=None, keep_parents=False)¶
- async arefresh_from_db(using=None, fields=None, from_queryset=None)¶
- async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)¶
- clean()¶
Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- date_error_message(lookup_type, field_name, unique_for)¶
- delete()¶
Delete.
- Return type:
None
- get_constraints()¶
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- prepare_database_save(field)¶
- refresh_from_db(using=None, fields=None, from_queryset=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym()
Add synonyms
Example:
import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# remove a synonym
record.remove_synonym("T-cell")
record.synonyms
#> "T lymphocyte|T-lymphocyte"
- save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)¶
Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.
The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.
- serializable_value(field_name)¶
Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.
Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
See also
Example:
import bionty as bt

# save an experimental factor record
scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
assert scrna.abbr is None
assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"

# set abbreviation
scrna.set_abbr("scRNA")
assert scrna.abbr == "scRNA"
# synonyms are updated
assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"
- unique_error_message(model_class, unique_check)¶
- validate_constraints(exclude=None)¶
- validate_unique(exclude=None)¶
Check unique constraints on the model and raise ValidationError if any failed.