CELLxGENE: scRNA-seq¶
CZ CELLxGENE hosts the world’s largest standardized collection of scRNA-seq datasets.
LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).
You can use the CELLxGENE data in two ways:
1. Query collections of AnnData objects.
2. Slice a big array store produced by concatenating AnnData objects, via tiledbsoma.
If you are interested in building similar data assets in-house:
See the transfer guide to zero-copy data to your own LaminDB instance.
See the scRNA guide to create a growing, standardized & versioned scRNA-seq dataset collection.
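For a sense of what the transfer guide covers: from your own instance, you can query the public cellxgene instance and save records into yours. A minimal sketch, assuming you are connected to your own instance; the artifact description is just the example used later in this guide:
import lamindb as ln

# from within your own instance, query the public cellxgene instance and
# transfer the record; the transfer is zero-copy: only metadata is saved,
# the underlying data stays in its original storage location
artifact = ln.Artifact.using("laminlabs/cellxgene").get(
    description="Mature kidney dataset: immune", is_latest=True
)
artifact.save()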
Connect to the public LaminDB instance that mirrors cellxgene:
# pip install 'lamindb[bionty,jupyter]'
!lamin connect laminlabs/cellxgene
import lamindb as ln
import bionty as bt
Query & understand metadata¶
Auto-complete metadata¶
You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.
Let’s use auto-complete to look up cell types:
cell_types = bt.CellType.lookup()
cell_types.effector_t_cell
You can also arbitrarily chain filters and create lookups from them:
users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup() # labels for experimental factors
tissues = bt.Tissue.lookup() # tissue labels
suspension_types = ln.ULabel.filter(type__name="SuspensionType").lookup()
# here we choose to return .name directly
features = ln.Feature.lookup(return_field="name")
assays = bt.ExperimentalFactor.lookup(return_field="name")
Search & filter metadata¶
You can search & filter metadata:
bt.CellType.search("effector T cell").df().head()
And use a uid to fetch exactly one metadata record:
effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell
Understand ontologies¶
View the related ontology terms:
effector_t_cell.view_parents(distance=2, with_children=True)
Or access them programmatically:
effector_t_cell.children.df()
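Traversal works upward as well: the .parents accessor mirrors .children and returns the direct parent terms:
# direct parent terms of effector T cell in the ontology
effector_t_cell.parents.df()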
Query for individual datasets¶
Every individual dataset in CELLxGENE is an .h5ad file that is stored as an artifact in LaminDB. Here is an exemplary query:
ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[
        cell_types.b_cell,
        cell_types.t_cell,
    ],  # cell types measured in AnnData
    created_by=users.sunnyosun,  # creator
).order_by("created_at").df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
What happens under the hood?
As you saw from inspecting ln.Artifact, ln.Artifact.cell_types relates artifacts with bt.CellType. An expression like cell_types__name__in performs the join of the underlying registries and matches bt.CellType.name against ["B cell", "T cell"]. The same holds for created_by, which relates artifacts with ln.User.
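For example, the query above can equivalently be written with plain strings instead of lookup objects; a sketch of the same filter using the Django-style double-underscore syntax:
# the double underscore traverses the cell_types relation and matches bt.CellType.name
ln.Artifact.filter(cell_types__name__in=["B cell", "T cell"]).df().head()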
To see what you can query for, look at the registry representation.
ln.Artifact
Slice an individual dataset¶
Let’s look at an artifact and show its metadata using .describe().
artifact = ln.Artifact.get(description="Mature kidney dataset: immune", is_latest=True)
artifact.describe()
More ways of accessing metadata
Access just features:
artifact.features
Or get labels given a feature:
artifact.labels.get(features.tissue).df()
If you want to query a slice of the array data, you have three options:
1. Cache the artifact on disk via artifact.cache() -> Path and work with the path to the cached data; this doesn’t download anything if the artifact is already in the cache.
2. Cache & load the entire artifact into memory via artifact.load() -> AnnData.
3. Stream the array using a (cloud-backed) accessor via artifact.open() -> AnnDataAccessor.
All of these run much faster in the AWS us-west-2 data center, where the data is stored.
Cache:
cache_path = artifact.cache()
cache_path
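If you prefer working with the cached file directly, you can read it with anndata (a minimal sketch; anndata is assumed to be available, as it is a lamindb dependency):
import anndata as ad

# read the cached .h5ad file from the local cache path
adata_from_cache = ad.read_h5ad(cache_path)
adata_from_cache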
Cache & load:
adata = artifact.load()
adata
Now we have an AnnData object, which stores the observation annotations matching our artifact-level query in its .obs slot, and we can re-use almost the same query at the array level.
See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query
collection = ln.Collection.filter(key="cellxgene-census", version="2024-07-01").one()
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.
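For instance, the same cell type labels can be tallied with plain pandas on the loaded object (a quick illustration using only standard AnnData/pandas operations):
# count cells per cell type in the loaded AnnData
adata.obs.cell_type.value_counts().head()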
Stream, slice and load the slice into memory:
with artifact.open() as adata_backed:
    display(adata_backed)
We now have an AnnDataAccessor object, which behaves much like an AnnData, and slicing looks similar to the query above.
See the slicing operation
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_backed_slice.to_memory()
Query collections of datasets¶
Let’s search collections from CELLxGENE within the 2024-07-01 release:
ln.Collection.filter(version="2024-07-01").search("human retina", limit=10)
Let’s get the record of the top hit collection:
collection = ln.Collection.get("quQDnLsMLkP3JRsC8gp4")
collection
It’s a Science paper, and we can find more information on it using the DOI or the CELLxGENE collection ID. There are multiple versions of this collection.
collection.versions.df()
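The DOI or CELLxGENE collection ID is stored on the collection record itself; a hedged peek, assuming the reference fields of this instance are populated as usual:
# reference is assumed to hold the DOI / CELLxGENE collection id
collection.reference, collection.reference_type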
The collection groups artifacts.
collection.artifacts.df()
Let’s now look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts.
collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")
collection
You can count all contained artifacts or get them as a dataframe.
collection.artifacts.count()
collection.artifacts.df().head() # not tracking run & transform because read-only instance
You can query across artifacts by arbitrary metadata combinations, for instance:
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame
Slice a concatenated array¶
Let us now use the concatenated version of the Census collection: a tiledbsoma array store that concatenates all AnnData arrays present in the collection we just explored. Slicing tiledbsoma works similarly to slicing a DataFrame or AnnData.
value_filter = (
    f'{features.tissue} == "{tissues.brain.name}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell.name}", "{cell_types.neuron.name}"] and'
    f' {features.suspension_type} == "{suspension_types.cell.name}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
Query for the tiledbsoma array store that contains all concatenated expression data. It’s a new dataset produced by concatenating all AnnData-like artifacts in the Census collection.
census_artifact = ln.Artifact.get(description="Census 2024-07-01")
Run the slicing operation.
human = "homo_sapiens" # subset to human data
# open the array store for queries
with census_artifact.open() as store:
# read SOMADataFrame as a slice
cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
# concatenate results to pyarrow.Table
cell_metadata = cell_metadata.concat()
# convert to pandas.DataFrame
cell_metadata = cell_metadata.to_pandas()
cell_metadata.head()
Create an AnnData object.
from tiledbsoma import AxisQuery

with census_artifact.open() as store:
    experiment = store["census_data"][human]
    adata = experiment.axis_query(
        "RNA", obs_query=AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )
adata.var = adata.var.set_index("feature_id")
adata
! run input wasn't tracked, call `ln.track()` and re-run
AnnData object with n_obs × n_vars = 66418 × 60530
obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
Train ML models¶
You can train ML models either directly on very large collections of AnnData-like artifacts or on a single concatenated tiledbsoma-like artifact. For the pros & cons of these approaches, see this blog post.
On a collection of arrays¶
mapped() caches AnnData objects on disk and creates a map-style dataset that performs a virtual join of the features of the underlying AnnData objects.
from torch.utils.data import DataLoader

census_collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")
dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for batch in dataloader:
    pass
dataset.close()
For more background, see Train a machine learning model on a collection.
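To make the training loop concrete, here is a minimal sketch of fitting a linear classifier on the mapped batches. It assumes each batch is a dict with an "X" expression matrix and an integer-encoded "cell_type" label, as yielded by the mapped dataset above; the model and the way n_classes is inferred are purely illustrative:
import torch

# re-create the mapped dataset, since it was closed above
dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

# infer dimensions from one batch; a real pipeline would use the dataset's label encoders
first_batch = next(iter(dataloader))
n_genes = first_batch["X"].shape[1]
n_classes = int(first_batch["cell_type"].max()) + 1  # crude assumption for illustration

model = torch.nn.Linear(n_genes, n_classes)  # purely illustrative classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for batch in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch["X"].float()), batch["cell_type"].long())
    loss.backward()
    optimizer.step()

dataset.close()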
On a concatenated array¶
You can create streaming PyTorch dataloaders from tiledbsoma stores using the cellxgene_census package.
import cellxgene_census.experimental.ml as census_ml

store = census_artifact.open()
experiment = store["census_data"][human]
experiment_datapipe = census_ml.ExperimentDataPipe(
    experiment,
    measurement_name="RNA",
    X_name="raw",
    obs_query=AxisQuery(value_filter=value_filter),
    obs_column_names=[features.cell_type],
    batch_size=128,
    shuffle=True,
    soma_chunk_size=10000,
)
experiment_dataloader = census_ml.experiment_dataloader(experiment_datapipe)
for batch in experiment_dataloader:
    pass
store.close()
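Each batch yielded by this dataloader is a pair of torch tensors: the expression matrix and the encoded obs columns. A minimal hedged inspection, reusing the last batch from the loop above (treat the exact obs column layout as an assumption):
# assumption: each batch is an (X, obs) pair of torch tensors
X_batch, obs_batch = batch  # last batch from the loop above
X_batch.shape, obs_batch.shape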
For more background see this guide.