Project flow#
LaminDB allows tracking data flow at the level of an entire project.
Here, we walk through example app uploads, pipelines & notebooks following Schmidt et al., 2022.
A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.
These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.
More specifically: Why should I care about data flow?
Data flow tracks data sources & transformations to trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research and optimize the feedback loop of team-wide learning iterations.
While tracking data flow is easier when it's governed by deterministic pipelines, it becomes hard when it's governed by interactive human-driven analyses.
LaminDB interfaces with workflow managers for the former and embraces the latter.
Setup#
Init a test instance:
!lamin init --storage ./mydata
💡 creating schemas: core==0.47.5
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-05 09:42:14)
✅ saved: Storage(id='9XR1I5WM', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata', type='local', updated_at=2023-09-05 09:42:14, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)
Import lamindb:
import lamindb as ln
from IPython.display import Image, display
✅ loaded instance: testuser1/mydata (lamindb 0.52.2)
Steps#
In the following, we walk through example steps covering different types of transforms (Transform).
Note
The full notebooks are in this repository.
App upload of phenotypic data #
Register data through app upload from the wetlab by testuser1:
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)
output_path = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
output_file = ln.File(output_path, description="Raw data of schmidt22 crispra GWS")
output_file.save()
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
✅ saved: Transform(id='wBNFtaVQVr61m7', name='Upload GWS CRISPRa result', type='app', updated_at=2023-09-05 09:42:18, created_by_id='DzTjkKse')
✅ saved: Run(id='AlYtOZxNKqqtrNbbbHjU', run_at=2023-09-05 09:42:18, transform_id='wBNFtaVQVr61m7', created_by_id='DzTjkKse')
💡 file in storage 'mydata' with key 'schmidt22-crispra-gws-IFNG.csv'
Hit identification in notebook #
Access, transform & register data in the drylab by testuser2:
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)
# access
input_file = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
# identify hits
input_df = input_file.load().set_index("id")
output_df = input_df[input_df["pos|fdr"] < 0.01].copy()
# register hits in output file
ln.File(output_df, description="hits from schmidt22 crispra GWS").save()
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-09-05 09:42:20)
✅ saved: Transform(id='tJG0vmmWAsib0o', name='GWS CRIPSRa analysis', type='notebook', updated_at=2023-09-05 09:42:20, created_by_id='bKeW4T6E')
✅ saved: Run(id='HF8uBxPJOu3oqaeQC2cA', run_at=2023-09-05 09:42:20, transform_id='tJG0vmmWAsib0o', created_by_id='bKeW4T6E')
💡 adding file QqiEFIPBkCdUKOFYelrz as input for run HF8uBxPJOu3oqaeQC2cA, adding parent transform wBNFtaVQVr61m7
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/hlSmmFWbA3fT1rfIMiGD.parquet')
💡 data is a dataframe, consider using .from_df() to link column names as features
✅ storing file 'hlSmmFWbA3fT1rfIMiGD' at '.lamindb/hlSmmFWbA3fT1rfIMiGD.parquet'
Inspect data flow:
file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_flow()
Sequencer upload #
Upload files from sequencer:
ln.setup.login("testuser1")
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
# register output files of upload
upload_dir = ln.dev.datasets.dir_scrnaseq_cellranger(
"perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.File(upload_dir.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(upload_dir.parent / "fastq/perturbseq_R2_001.fastq.gz").save()
ln.setup.login("testuser2")
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
✅ saved: Transform(id='KqMfv7kntvpUYf', name='Chromium 10x upload', type='pipeline', updated_at=2023-09-05 09:42:22, created_by_id='DzTjkKse')
✅ saved: Run(id='yOZnXri3kXvYg5oKQuAy', run_at=2023-09-05 09:42:22, transform_id='KqMfv7kntvpUYf', created_by_id='DzTjkKse')
❗ file has more than one suffix (path.suffixes), inferring:'.fastq.gz'
💡 file in storage 'mydata' with key 'fastq/perturbseq_R1_001.fastq.gz'
❗ file has more than one suffix (path.suffixes), inferring:'.fastq.gz'
💡 file in storage 'mydata' with key 'fastq/perturbseq_R2_001.fastq.gz'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
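The suffix warning in the output above refers to Python's `path.suffixes`. As a small illustration of why a file name with several dots is ambiguous, the sketch below prints what `pathlib` reports for one of the uploaded keys and for a made-up file name (`sample.v2.fastq.gz` is hypothetical and not part of the dataset). It only shows the standard library behavior and makes no claim about how LaminDB chooses between the candidates.

```python
from pathlib import PurePosixPath

# First key is taken from the upload above; the second name is hypothetical.
keys = [
    "fastq/perturbseq_R1_001.fastq.gz",
    "sample.v2.fastq.gz",
]

report = {}
for key in keys:
    path = PurePosixPath(key)
    # .suffix is only the last extension, .suffixes lists every dot-separated part
    report[key] = (path.suffix, path.suffixes)
    print(key, "->", path.suffix, path.suffixes)
```

Joining all entries of `suffixes` gives a sensible result for the first key but not for the second, which is why some form of inference is needed.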
scRNA-seq bioinformatics pipeline #
Process the uploaded files using a script or workflow manager (see Pipelines) and obtain 3 output files in a directory filtered_feature_bc_matrix/:
transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")
ln.track(transform)
# access uploaded files as inputs for the pipeline
input_files = ln.File.filter(key__startswith="fastq/perturbseq").all()
input_paths = [file.stage() for file in input_files]
# register output files
output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)
✅ saved: Transform(id='JRWbXQhR4erjb0', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-09-05 09:42:24, created_by_id='bKeW4T6E')
✅ saved: Run(id='67o0j7ZZjmFeZtoLAzHw', run_at=2023-09-05 09:42:24, transform_id='JRWbXQhR4erjb0', created_by_id='bKeW4T6E')
💡 adding file hZm9RFhjApgUm1vCC6gp as input for run 67o0j7ZZjmFeZtoLAzHw, adding parent transform KqMfv7kntvpUYf
💡 adding file K8C1oxvTfwKa4Wk5rw84 as input for run 67o0j7ZZjmFeZtoLAzHw, adding parent transform KqMfv7kntvpUYf
❗ file has more than one suffix (path.suffixes), inferring:'.tsv.gz'
❗ file has more than one suffix (path.suffixes), inferring:'.tsv.gz'
❗ file has more than one suffix (path.suffixes), using only last suffix: '.gz'
✅ created 3 files from directory using storage /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata and key = perturbseq/filtered_feature_bc_matrix/
Post-process these 3 files:
transform = ln.Transform(name="Postprocess Cell Ranger", version="2.0", type="pipeline")
ln.track(transform)
input_files = [f.stage() for f in output_files]
output_path = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
output_file = ln.File(output_path, description="perturbseq counts")
output_file.save()
❗ record with similar name exist! did you mean to load it?
name | id | __ratio__ |
---|---|---|
Cell Ranger | JRWbXQhR4erjb0 | 90.0 |
✅ saved: Transform(id='n6dAT1x2WfMjd5', name='Postprocess Cell Ranger', version='2.0', type='pipeline', updated_at=2023-09-05 09:42:24, created_by_id='bKeW4T6E')
✅ saved: Run(id='5EiTNxhsaurSHWXVXuUo', run_at=2023-09-05 09:42:24, transform_id='n6dAT1x2WfMjd5', created_by_id='bKeW4T6E')
💡 adding file GmMx0Grahn4dXmmh5NMp as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 adding file rzIgE34O7VirmEbMPa5f as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 adding file mPVMtwHx6poqLcEMlrGV as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 file in storage 'mydata' with key 'schmidt22_perturbseq.h5ad'
💡 data is AnnDataLike, consider using .from_anndata() to link var_names and obs.columns as features
Inspect data flow:
output_files[0].view_flow()
Integrate scRNA-seq & phenotypic data #
Integrate data in a notebook:
transform = ln.Transform(
name="Perform single cell analysis, integrate with CRISPRa screen",
type="notebook",
)
ln.track(transform)
file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
file_hits = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
screen_hits = file_hits.load()
import scanpy as sc
sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
❗ records with similar names exist! did you mean to load one of them?
name | id | __ratio__ |
---|---|---|
Cell Ranger | JRWbXQhR4erjb0 | 85.5 |
GWS CRIPSRa analysis | tJG0vmmWAsib0o | 85.5 |
Postprocess Cell Ranger | n6dAT1x2WfMjd5 | 85.5 |
Upload GWS CRISPRa result | wBNFtaVQVr61m7 | 85.5 |
✅ saved: Transform(id='MQ6l5hB4xg9xLT', name='Perform single cell analysis, integrate with CRISPRa screen', type='notebook', updated_at=2023-09-05 09:42:25, created_by_id='bKeW4T6E')
✅ saved: Run(id='RMCrFvPxOQimm3gbjSG8', run_at=2023-09-05 09:42:25, transform_id='MQ6l5hB4xg9xLT', created_by_id='bKeW4T6E')
💡 adding file eLZEJDOwapnJA6AHGxT0 as input for run RMCrFvPxOQimm3gbjSG8, adding parent transform n6dAT1x2WfMjd5
💡 adding file hlSmmFWbA3fT1rfIMiGD as input for run RMCrFvPxOQimm3gbjSG8, adding parent transform tJG0vmmWAsib0o
WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png
💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'
✅ storing file 'G8K8uJCveZyQb6Gu9FVY' at 'figures/umap_fig1_score-wgs-hits.png'
WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png
💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
✅ storing file 'JECTpprgQOzf4g6aT7VX' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
Review results#
Let's load one of the plots:
ln.track()
file = ln.File.filter(key__contains="figures/matrixplot").one()
file.stage()
💡 notebook imports: ipython==8.15.0 lamindb==0.52.2 scanpy==1.9.4
✅ saved: Transform(id='1LCd8kco9lZUz8', name='Project flow', short_name='project-flow', version='0', type=notebook, updated_at=2023-09-05 09:42:28, created_by_id='bKeW4T6E')
✅ saved: Run(id='zKwc8B9EEvpIkwWkfuHm', run_at=2023-09-05 09:42:28, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')
💡 adding file JECTpprgQOzf4g6aT7VX as input for run zKwc8B9EEvpIkwWkfuHm, adding parent transform MQ6l5hB4xg9xLT
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/figures/matrixplot_fig2_score-wgs-hits-per-cluster.png')
display(Image(filename=file.path))
We see that the image file is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:
file.view_flow()
Alternatively, we can also look at the sequence of transforms:
transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()
transform.parents.df()
id | name | short_name | version | type | reference | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
n6dAT1x2WfMjd5 | Postprocess Cell Ranger | None | 2.0 | pipeline | None | None | 2023-09-05 09:42:25 | bKeW4T6E |
tJG0vmmWAsib0o | GWS CRIPSRa analysis | None | None | notebook | None | None | 2023-09-05 09:42:20 | bKeW4T6E |
transform.view_parents()
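Conceptually, the parent view is a walk over the parent links between transforms. The following is a minimal sketch in plain Python, not LaminDB's implementation: the parent links are copied from the "adding parent transform" log messages in this walkthrough, and the `ancestors` helper is a hypothetical name introduced only for illustration.

```python
from collections import deque

# Child transform -> parent transforms, as reported in the logs of this page.
PARENTS = {
    "GWS CRIPSRa analysis": ["Upload GWS CRISPRa result"],
    "Cell Ranger": ["Chromium 10x upload"],
    "Postprocess Cell Ranger": ["Cell Ranger"],
    "Perform single cell analysis, integrate with CRISPRa screen": [
        "Postprocess Cell Ranger",
        "GWS CRIPSRa analysis",
    ],
    "Project flow": ["Perform single cell analysis, integrate with CRISPRa screen"],
}


def ancestors(transform, parents):
    """Return all upstream transforms in breadth-first order, without duplicates."""
    seen = []
    queue = deque(parents.get(transform, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        # Transforms without an entry (the uploads) have no parents.
        queue.extend(parents.get(current, []))
    return seen


lineage = ancestors("Project flow", PARENTS)
for name in lineage:
    print(name)
```

Starting from the current notebook, the walk reaches both uploads, which is the project-level view this page is about.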
Understand runs#
We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.
File objects are the inputs and outputs of runs.
What if I don't want a global context?
Sometimes, we don't want to create a global run context but instead manually pass a run when creating a file:
run = ln.Run(transform=transform)
ln.File(filepath, run=run)
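To make the difference between a global context and an explicitly passed run concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions and does not reflect LaminDB's internals: the classes, the `track` and `new_file` helpers, and the names "Manual upload" and "notes.txt" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform


@dataclass
class File:
    path: str
    run: Optional[Run] = None


# Hypothetical stand-in for a global run context: a single module-level slot.
_current_run: Optional[Run] = None


def track(transform: Transform) -> Run:
    """Start a new run of `transform` and make it the global context."""
    global _current_run
    _current_run = Run(transform)
    return _current_run


def new_file(path: str, run: Optional[Run] = None) -> File:
    """Attach the explicitly passed run, else fall back to the global one."""
    return File(path, run if run is not None else _current_run)


track(Transform("Cell Ranger", "pipeline"))
implicit = new_file("matrix.mtx.gz")  # picks up the global run

manual_run = Run(Transform("Manual upload", "app"))
explicit = new_file("notes.txt", run=manual_run)  # bypasses the global run

print(implicit.run.transform.name)
print(explicit.run.transform.name)
```

The explicit variant is useful when several runs are handled in one process and a single global slot would be ambiguous.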
When does a file appear as a run input?
When accessing a file via stage(), load() or backed(), two things happen:
The current run gets added to file.input_of.
The transform of that file gets added as a parent of the current transform.
You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False (see: Can I disable tracking run inputs?).
You can also track run inputs on a case-by-case basis via is_run_input=True, e.g., here:
file.load(is_run_input=True)
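The bookkeeping described above can be sketched with a few plain Python objects. This is an illustrative model only, not LaminDB code: the classes, the `Settings` holder, and the `load` function are hypothetical stand-ins that mimic the two effects (run added to `input_of`, parent transform added) and the two switches (`track_run_inputs`, `is_run_input`). The transform name "Ad hoc check" is made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare by identity, like distinct database records
class Transform:
    name: str
    parents: list = field(default_factory=list)


@dataclass(eq=False)
class Run:
    transform: Transform


@dataclass(eq=False)
class File:
    key: str
    transform: Transform  # transform that created the file
    input_of: list = field(default_factory=list)  # runs that consumed it


class Settings:
    track_run_inputs = True  # global default


def load(file: File, current_run: Run, is_run_input: Optional[bool] = None) -> str:
    """Pretend to load `file`; record it as a run input if tracking applies."""
    # A per-call flag wins over the global setting.
    should_track = Settings.track_run_inputs if is_run_input is None else is_run_input
    if should_track:
        if current_run not in file.input_of:
            file.input_of.append(current_run)
        if file.transform not in current_run.transform.parents:
            current_run.transform.parents.append(file.transform)
    return f"contents of {file.key}"


upload = Transform("Upload GWS CRISPRa result")
raw = File("schmidt22-crispra-gws-IFNG.csv", transform=upload)

run_a = Run(Transform("GWS CRIPSRa analysis"))
load(raw, run_a)  # tracked by default

Settings.track_run_inputs = False
run_b = Run(Transform("Ad hoc check"))
load(raw, run_b)  # not tracked

run_c = Run(Transform("Ad hoc check"))
load(raw, run_c, is_run_input=True)  # tracked for this call only

print([run.transform.name for run in raw.input_of])
```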
Query by provenance#
We can query or search for the notebook that created the file:
transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()
And then find all the files created by that notebook:
ln.File.filter(transform=transform).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hlSmmFWbA3fT1rfIMiGD | 9XR1I5WM | None | .parquet | DataFrame | hits from schmidt22 crispra GWS | None | 18368 | O2Owo0_QlM9JBS2zAZD4Lw | md5 | tJG0vmmWAsib0o | HF8uBxPJOu3oqaeQC2cA | None | 2023-09-05 09:42:20 | bKeW4T6E |
Which transform ingested a given file?
file = ln.File.filter().first()
file.transform
Transform(id='wBNFtaVQVr61m7', name='Upload GWS CRISPRa result', type='app', updated_at=2023-09-05 09:42:18, created_by_id='DzTjkKse')
And which user?
file.created_by
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-05 09:42:22)
Which transforms were created by a given user?
users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser2).df()
id | name | short_name | version | type | reference | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
tJG0vmmWAsib0o | GWS CRIPSRa analysis | None | None | notebook | None | None | 2023-09-05 09:42:20 | bKeW4T6E |
JRWbXQhR4erjb0 | Cell Ranger | None | 7.2.0 | pipeline | None | None | 2023-09-05 09:42:24 | bKeW4T6E |
n6dAT1x2WfMjd5 | Postprocess Cell Ranger | None | 2.0 | pipeline | None | None | 2023-09-05 09:42:25 | bKeW4T6E |
MQ6l5hB4xg9xLT | Perform single cell analysis, integrate with C... | None | None | notebook | None | None | 2023-09-05 09:42:27 | bKeW4T6E |
1LCd8kco9lZUz8 | Project flow | project-flow | 0 | notebook | None | None | 2023-09-05 09:42:28 | bKeW4T6E |
Which notebooks were created by a given user?
ln.Transform.filter(created_by=users.testuser2, type="notebook").df()
id | name | short_name | version | type | reference | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
tJG0vmmWAsib0o | GWS CRIPSRa analysis | None | None | notebook | None | None | 2023-09-05 09:42:20 | bKeW4T6E |
MQ6l5hB4xg9xLT | Perform single cell analysis, integrate with C... | None | None | notebook | None | None | 2023-09-05 09:42:27 | bKeW4T6E |
1LCd8kco9lZUz8 | Project flow | project-flow | 0 | notebook | None | None | 2023-09-05 09:42:28 | bKeW4T6E |
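Provenance queries like the ones above boil down to joins between file, transform and user records. As a rough illustration, the sketch below rebuilds a tiny subset of the records from this page in an in-memory SQLite database and asks the same two questions in SQL. The three-table schema is a simplified assumption made for this example and is not LaminDB's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE users (id TEXT PRIMARY KEY, handle TEXT);
    CREATE TABLE transforms (
        id TEXT PRIMARY KEY, name TEXT, type TEXT,
        created_by_id TEXT REFERENCES users(id)
    );
    CREATE TABLE files (
        id TEXT PRIMARY KEY, description TEXT,
        transform_id TEXT REFERENCES transforms(id)
    );
    """
)
# Records copied from the outputs on this page (subset only).
con.executemany("INSERT INTO users VALUES (?, ?)", [
    ("DzTjkKse", "testuser1"),
    ("bKeW4T6E", "testuser2"),
])
con.executemany("INSERT INTO transforms VALUES (?, ?, ?, ?)", [
    ("wBNFtaVQVr61m7", "Upload GWS CRISPRa result", "app", "DzTjkKse"),
    ("tJG0vmmWAsib0o", "GWS CRIPSRa analysis", "notebook", "bKeW4T6E"),
    ("JRWbXQhR4erjb0", "Cell Ranger", "pipeline", "bKeW4T6E"),
])
con.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("QqiEFIPBkCdUKOFYelrz", "Raw data of schmidt22 crispra GWS", "wBNFtaVQVr61m7"),
    ("hlSmmFWbA3fT1rfIMiGD", "hits from schmidt22 crispra GWS", "tJG0vmmWAsib0o"),
])

# Which files were created by a given notebook?
files_by_notebook = con.execute(
    "SELECT f.id, f.description FROM files f "
    "JOIN transforms t ON f.transform_id = t.id WHERE t.name = ?",
    ("GWS CRIPSRa analysis",),
).fetchall()

# Which notebooks were created by a given user?
notebooks_by_user = con.execute(
    "SELECT t.name FROM transforms t "
    "JOIN users u ON t.created_by_id = u.id "
    "WHERE u.handle = ? AND t.type = ? ORDER BY t.name",
    ("testuser2", "notebook"),
).fetchall()

print(files_by_notebook)
print(notebooks_by_user)
```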
We can also view all recent additions to the entire database:
ln.view()
File
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JECTpprgQOzf4g6aT7VX | 9XR1I5WM | figures/matrixplot_fig2_score-wgs-hits-per-clu... | .png | None | None | None | 28814 | JYIPcat0YWYVCX3RVd3mww | md5 | MQ6l5hB4xg9xLT | RMCrFvPxOQimm3gbjSG8 | None | 2023-09-05 09:42:27 | bKeW4T6E |
G8K8uJCveZyQb6Gu9FVY | 9XR1I5WM | figures/umap_fig1_score-wgs-hits.png | .png | None | None | None | 118999 | laQjVk4gh70YFzaUyzbUNg | md5 | MQ6l5hB4xg9xLT | RMCrFvPxOQimm3gbjSG8 | None | 2023-09-05 09:42:27 | bKeW4T6E |
eLZEJDOwapnJA6AHGxT0 | 9XR1I5WM | schmidt22_perturbseq.h5ad | .h5ad | AnnData | perturbseq counts | None | 20659936 | la7EvqEUMDlug9-rpw-udA | md5 | n6dAT1x2WfMjd5 | 5EiTNxhsaurSHWXVXuUo | None | 2023-09-05 09:42:25 | bKeW4T6E |
rzIgE34O7VirmEbMPa5f | 9XR1I5WM | perturbseq/filtered_feature_bc_matrix/features... | .tsv.gz | None | None | None | 6 | -I3mht0rimsP3zbzNsyZjg | md5 | JRWbXQhR4erjb0 | 67o0j7ZZjmFeZtoLAzHw | None | 2023-09-05 09:42:24 | bKeW4T6E |
mPVMtwHx6poqLcEMlrGV | 9XR1I5WM | perturbseq/filtered_feature_bc_matrix/matrix.m... | .gz | None | None | None | 6 | Owckv-3wNqtc2lH5QLTKQg | md5 | JRWbXQhR4erjb0 | 67o0j7ZZjmFeZtoLAzHw | None | 2023-09-05 09:42:24 | bKeW4T6E |
GmMx0Grahn4dXmmh5NMp | 9XR1I5WM | perturbseq/filtered_feature_bc_matrix/barcodes... | .tsv.gz | None | None | None | 6 | tErw-NvJmpLGIvbI2Rv7Aw | md5 | JRWbXQhR4erjb0 | 67o0j7ZZjmFeZtoLAzHw | None | 2023-09-05 09:42:24 | bKeW4T6E |
K8C1oxvTfwKa4Wk5rw84 | 9XR1I5WM | fastq/perturbseq_R2_001.fastq.gz | .fastq.gz | None | None | None | 6 | ntMqLgB2ytCIza2x1q_gDg | md5 | KqMfv7kntvpUYf | yOZnXri3kXvYg5oKQuAy | None | 2023-09-05 09:42:22 | DzTjkKse |
Run
id | transform_id | run_at | created_by_id | reference | reference_type |
---|---|---|---|---|---|
AlYtOZxNKqqtrNbbbHjU | wBNFtaVQVr61m7 | 2023-09-05 09:42:18 | DzTjkKse | None | None |
HF8uBxPJOu3oqaeQC2cA | tJG0vmmWAsib0o | 2023-09-05 09:42:20 | bKeW4T6E | None | None |
yOZnXri3kXvYg5oKQuAy | KqMfv7kntvpUYf | 2023-09-05 09:42:22 | DzTjkKse | None | None |
67o0j7ZZjmFeZtoLAzHw | JRWbXQhR4erjb0 | 2023-09-05 09:42:24 | bKeW4T6E | None | None |
5EiTNxhsaurSHWXVXuUo | n6dAT1x2WfMjd5 | 2023-09-05 09:42:24 | bKeW4T6E | None | None |
RMCrFvPxOQimm3gbjSG8 | MQ6l5hB4xg9xLT | 2023-09-05 09:42:25 | bKeW4T6E | None | None |
zKwc8B9EEvpIkwWkfuHm | 1LCd8kco9lZUz8 | 2023-09-05 09:42:28 | bKeW4T6E | None | None |
Storage
id | root | type | region | updated_at | created_by_id |
---|---|---|---|---|---|
9XR1I5WM | /home/runner/work/lamin-usecases/lamin-usecase... | local | None | 2023-09-05 09:42:14 | DzTjkKse |
Transform
id | name | short_name | version | type | reference | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
1LCd8kco9lZUz8 | Project flow | project-flow | 0 | notebook | None | None | 2023-09-05 09:42:28 | bKeW4T6E |
MQ6l5hB4xg9xLT | Perform single cell analysis, integrate with C... | None | None | notebook | None | None | 2023-09-05 09:42:27 | bKeW4T6E |
n6dAT1x2WfMjd5 | Postprocess Cell Ranger | None | 2.0 | pipeline | None | None | 2023-09-05 09:42:25 | bKeW4T6E |
JRWbXQhR4erjb0 | Cell Ranger | None | 7.2.0 | pipeline | None | None | 2023-09-05 09:42:24 | bKeW4T6E |
KqMfv7kntvpUYf | Chromium 10x upload | None | None | pipeline | None | None | 2023-09-05 09:42:22 | DzTjkKse |
tJG0vmmWAsib0o | GWS CRIPSRa analysis | None | None | notebook | None | None | 2023-09-05 09:42:20 | bKeW4T6E |
wBNFtaVQVr61m7 | Upload GWS CRISPRa result | None | None | app | None | None | 2023-09-05 09:42:18 | DzTjkKse |
User
id | handle | email | name | updated_at |
---|---|---|---|---|
bKeW4T6E | testuser2 | testuser2@lamin.ai | Test User2 | 2023-09-05 09:42:24 |
DzTjkKse | testuser1 | testuser1@lamin.ai | Test User1 | 2023-09-05 09:42:22 |
!lamin login testuser1
!lamin delete --force mydata
!rm -r ./mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata