Standardize and append a dataset
Here, we’ll learn how to standardize a less well-curated dataset and how to append it to the growing versioned collection.
import lamindb as ln
import bionty as bt
ln.track("ManDYgmftZ8C")
Let’s now consider a less well-curated dataset:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
adata
Create a curator:
curator = ln.Curator.from_anndata(
adata,
var_index=bt.Gene.symbol,
categoricals={"cell_type_untrusted": bt.CellType.name},
organism="human",
)
curator.validate()
Standardize & validate genes
Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this is a non-unique mapping and the first match is kept because the keep parameter in .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
adata.var.index,
field=bt.Gene.symbol,
return_field=bt.Gene.ensembl_gene_id,
organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
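The effect of keeping only the first match in a non-unique mapping can be illustrated with a minimal pure-Python sketch. This is not bionty's implementation: the helper `standardize_first`, the lookup records, and the placeholder symbols and IDs are all hypothetical. It only shows why each symbol ends up with exactly one ID and why values without a match are filtered out by a validation mask.

```python
# Toy illustration of "first match wins" for a one-to-many mapping.
# All names and IDs below are made up; bionty's internals may differ.

def standardize_first(symbols, records):
    """Map each symbol to the first ID listed for it in `records`.

    `records` is an ordered list of (symbol, id) pairs and may contain
    several IDs for the same symbol. In this toy version, symbols without
    a match are returned unchanged so a later validation step can flag them.
    """
    first_match = {}
    for symbol, gene_id in records:
        # setdefault keeps an existing entry, so only the first ID is stored
        first_match.setdefault(symbol, gene_id)
    return [first_match.get(symbol, symbol) for symbol in symbols]


records = [
    ("GENE1", "ID_0001"),
    ("GENE1", "ID_0002"),  # second match for GENE1 is ignored
    ("GENE2", "ID_0003"),
]
known_ids = {gene_id for _, gene_id in records}

mapped = standardize_first(["GENE1", "GENE2", "GENE3"], records)
# Boolean mask, analogous to the `validated` array above
validated_mask = [value in known_ids for value in mapped]
kept = [value for value, ok in zip(mapped, validated_mask) if ok]
```

Here `GENE3` has no match, so it stays a symbol, fails the mask, and is dropped from `kept`.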
Here, we’ll also subset .raw to the validated genes and align its var index with that of adata_validated:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curator = ln.Curator.from_anndata(
adata_validated,
var_index=bt.Gene.ensembl_gene_id,
categoricals={"cell_type_untrusted": bt.CellType.name},
organism="human",
)
curator.validate()
Standardize & validate cell types
None of the cell type names are valid. We’ll now look up the non-validated cell types using the values of the public ontology and create a mapping.
curator.non_validated["cell_type_untrusted"]
ct_public_lo = bt.CellType.public().lookup()
name_mapping = {
"Dendritic cells": ct_public_lo.dendritic_cell.name,
"CD19+ B": ct_public_lo.b_cell_cd19_positive.name,
"CD4+/CD45RO+ Memory": ct_public_lo.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
"CD8+ Cytotoxic T": ct_public_lo.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
"CD4+/CD25 T Reg": ct_public_lo.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
"CD14+ Monocytes": ct_public_lo.cd14_positive_monocyte.name,
"CD56+ NK": ct_public_lo.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
"CD8+/CD45RA+ Naive Cytotoxic": ct_public_lo.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
"CD34+": ct_public_lo.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
}
We can now standardize cell type names using the lookup-based mapper:
adata_validated.obs["cell_type_untrusted_original"] = adata_validated.obs[
"cell_type_untrusted"
] # copy the original annotations
adata_validated.obs["cell_type_untrusted"] = adata_validated.obs[
"cell_type_untrusted_original"
].map(name_mapping)
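Because pandas `Series.map` with a dictionary yields `NaN` for any value missing from the dictionary, it can help to check that the mapping covers every original label before applying it. The sketch below uses plain Python lists in place of the `obs` column; the helper `find_unmapped` and the example labels are hypothetical.

```python
def find_unmapped(labels, mapping):
    """Return the distinct labels without an entry in `mapping`, in first-seen order."""
    seen = set()
    missing = []
    for label in labels:
        if label not in mapping and label not in seen:
            seen.add(label)
            missing.append(label)
    return missing


# Made-up labels standing in for an annotation column
labels = ["CD19+ B", "CD14+ Monocytes", "CD19+ B", "Unknown cluster"]
mapping = {
    "CD19+ B": "B cell, CD19-positive",
    "CD14+ Monocytes": "CD14-positive monocyte",
}

unmapped = find_unmapped(labels, mapping)
# dict.get returns None for labels that the mapping does not cover
mapped = [mapping.get(label) for label in labels]
```

An empty `unmapped` list means every original label has a target term, so no missing values are introduced by the mapping step.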
Now, all cell types are validated:
curator.validate()
Register
artifact = curator.save_artifact(description="10x reference adata")
artifact.view_lineage()
artifact.describe()
Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: 6vAcF5cDU5zEKpDm0000          hash: 8cSIZsvUrKeGfL64-H9RLw
│   ├── size: 839.7 KB                     n_observations: 70
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-14 06:41:53    created_by: testuser1 (Test User1)
│   ├── storage location / path:
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/6vAcF5cDU5zEKpDm0000.h5ad
│   ├── description: 10x reference adata
│   └── transform: scrna2.ipynb
├── Dataset features
│   ├── var • 754 [bionty.Gene]
│   │   HES4        float
│   │   TNFRSF4     float
│   │   SSU72       float
│   │   PARK7       float
│   │   RBP7        float
│   │   SRM         float
│   │   MAD2L2      float
│   │   AGTRAP      float
│   │   TNFRSF1B    float
│   │   EFHD2       float
│   │   NECAP2      float
│   │   HP1BP3      float
│   │   C1QA        float
│   │   C1QB        float
│   │   HNRNPR      float
│   │   GALE        float
│   │   STMN1       float
│   │   CD52        float
│   │   FGR         float
│   │   ATP5IF1     float
│   └── obs • 1 [Feature]
│       cell_type_untrusted    cat[bionty.CellType]    B cell, CD19-positive, CD14-positive mo…
└── Labels
    └── .cell_types    bionty.CellType    CD8-positive, alpha-beta memory T cell,…
Re-curate
We review the dataset and find all annotations trustworthy except for a 'CD38-positive naive B cell'. Inspecting the name_mapping in detail tells us that 'CD8+/CD45RA+ Naive Cytotoxic' was erroneously mapped to a B cell. Let us correct this and create a 'cell_type' feature that we can now trust.
# the key must be the original label so that .map() picks up the correction
name_mapping["CD8+/CD45RA+ Naive Cytotoxic"] = "cytotoxic T cell"
adata_validated.obs["cell_type"] = adata_validated.obs[
"cell_type_untrusted_original"
].map(name_mapping)
adata_validated.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
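When a mapping is corrected, one quick way to confirm that the correction changed the intended labels is to compare the mapped values before and after. A minimal sketch, assuming both mappings are plain dictionaries keyed by the original labels; the helper `changed_labels` and the example entries are hypothetical:

```python
def changed_labels(old_mapping, new_mapping):
    """Return {original_label: (old_value, new_value)} for entries that differ."""
    changes = {}
    for label in sorted(set(old_mapping) | set(new_mapping)):
        old_value = old_mapping.get(label)
        new_value = new_mapping.get(label)
        if old_value != new_value:
            changes[label] = (old_value, new_value)
    return changes


old = {"label A": "term 1", "label B": "term 2"}
new = dict(old)
new["label B"] = "term 3"  # correct the entry for an original label

diff = changed_labels(old, new)
```

If the only difference is a new key that never occurs in the original annotations, the mapped column stays the same, which usually means the update targeted the wrong key.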
artifact_trusted = ln.Curator.from_anndata(
adata_validated,
var_index=bt.Gene.ensembl_gene_id,
categoricals={
"cell_type": bt.CellType.name,
"cell_type_untrusted": bt.CellType.name,
},
organism="human",
).save_artifact(
description="10x reference adata, trusted cell type annotation",
revises=artifact,
)
! organism is ignored, define it on the dtype level
→ returning existing schema with same hash: Schema(uid='GMgfPJDRhj7KVLKn', n=754, is_type=False, itype='bionty.Gene', dtype='float', hash='cFusP7E9awXecG_aP3LCDA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=2, created_at=2025-07-14 06:41:53 UTC)
! 4 unique terms (66.70%) are not validated for name: 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original'
artifact_trusted.describe()
Append the dataset to the collection
Query the previous collection:
collection_v1 = ln.Collection.get(key="scrna/collection1")
Create a new version of the collection by sharding it across the new artifact
and the artifact underlying version 1 of the collection:
collection_v2 = collection_v1.append(artifact_trusted).save()
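Conceptually, the new version references both the artifact behind version 1 and the newly appended one. The toy model below illustrates only that idea. It is not how lamindb implements `Collection.append`: the class `ToyCollection`, its fields, its integer version counter, and the artifact placeholder strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollection:
    """Immutable stand-in for a versioned collection of artifacts."""

    key: str
    artifacts: tuple
    version: int = 1

    def append(self, artifact):
        """Return a new version holding the existing artifacts plus `artifact`."""
        return ToyCollection(
            key=self.key,
            artifacts=self.artifacts + (artifact,),
            version=self.version + 1,
        )


v1 = ToyCollection(key="scrna/collection1", artifacts=("artifact_1",))
v2 = v1.append("artifact_trusted")
```

In this sketch `v1` is left unchanged and `v2` shares its key, so both versions remain addressable under the same name.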
If you want, you can label the collection’s version by setting .version.
collection_v2.version = "2"
collection_v2.save()
Version 2 of the collection covers significantly more conditions.
collection_v2.describe()
View data lineage:
collection_v2.view_lineage()