
Project flow

LaminDB allows tracking data flow on the entire project level.

Here, we walk through example app uploads, pipelines & notebooks, following Schmidt et al., 2022.

A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.

These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.

More specifically: Why should I care about data flow?

Data flow tracks data sources & transformations to trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research and optimize the feedback loop of team-wide learning iterations.

While tracking data flow is easier when it’s governed by deterministic pipelines, it becomes hard when it’s governed by interactive human-driven analyses.

LaminDB interfaces with workflow managers for the former and embraces the latter.

Setup

Init a test instance:

!lamin init --storage ./mydata
💡 creating schemas: core==0.47.5
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-05 09:42:14)
✅ saved: Storage(id='9XR1I5WM', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata', type='local', updated_at=2023-09-05 09:42:14, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)

Import lamindb:

import lamindb as ln
from IPython.display import Image, display
✅ loaded instance: testuser1/mydata (lamindb 0.52.2)

Steps

In the following, we walk through example steps covering different types of transforms (Transform).

Note

The full notebooks are in this repository.

App upload of phenotypic data

Register data through app upload from wetlab by testuser1:

ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)
output_path = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
output_file = ln.File(output_path, description="Raw data of schmidt22 crispra GWS")
output_file.save()
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
✅ saved: Transform(id='wBNFtaVQVr61m7', name='Upload GWS CRISPRa result', type='app', updated_at=2023-09-05 09:42:18, created_by_id='DzTjkKse')
✅ saved: Run(id='AlYtOZxNKqqtrNbbbHjU', run_at=2023-09-05 09:42:18, transform_id='wBNFtaVQVr61m7', created_by_id='DzTjkKse')
💡 file in storage 'mydata' with key 'schmidt22-crispra-gws-IFNG.csv'

Hit identification in notebook

Access, transform & register data in drylab by testuser2:

ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)
# access
input_file = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
# identify hits
input_df = input_file.load().set_index("id")
output_df = input_df[input_df["pos|fdr"] < 0.01].copy()
# register hits in output file
ln.File(output_df, description="hits from schmidt22 crispra GWS").save()
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-09-05 09:42:20)
✅ saved: Transform(id='tJG0vmmWAsib0o', name='GWS CRIPSRa analysis', type='notebook', updated_at=2023-09-05 09:42:20, created_by_id='bKeW4T6E')
✅ saved: Run(id='HF8uBxPJOu3oqaeQC2cA', run_at=2023-09-05 09:42:20, transform_id='tJG0vmmWAsib0o', created_by_id='bKeW4T6E')
💡 adding file QqiEFIPBkCdUKOFYelrz as input for run HF8uBxPJOu3oqaeQC2cA, adding parent transform wBNFtaVQVr61m7
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/hlSmmFWbA3fT1rfIMiGD.parquet')
💡 data is a dataframe, consider using .from_df() to link column names as features
✅ storing file 'hlSmmFWbA3fT1rfIMiGD' at '.lamindb/hlSmmFWbA3fT1rfIMiGD.parquet'
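
To make the hit-calling step concrete, here is a minimal, self-contained sketch of the same thresholding logic using only the standard library. The gene names and FDR values are invented for illustration; only the column names `id` and `pos|fdr` and the 0.01 cutoff come from the cell above, and `select_hits` is a hypothetical helper.

```python
import csv
import io

# Toy screen results: the rows are invented, the column names mirror the cell above.
TOY_CSV = """id,pos|fdr
GENE_A,0.0004
GENE_B,0.2500
GENE_C,0.0099
GENE_D,0.0100
"""

FDR_CUTOFF = 0.01  # same threshold as in the notebook cell


def select_hits(csv_text: str, cutoff: float = FDR_CUTOFF) -> dict[str, float]:
    """Return {id: fdr} for rows whose `pos|fdr` value is strictly below `cutoff`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["id"]: float(row["pos|fdr"])
        for row in reader
        if float(row["pos|fdr"]) < cutoff
    }


hits = select_hits(TOY_CSV)
print(sorted(hits))
```

Because the comparison is strict, a row sitting exactly at the cutoff (`GENE_D` in the toy data) is not reported as a hit.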

Inspect data flow:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_flow()
https://d33wubrfki0l68.cloudfront.net/2fbe78b4467692daca11c04823e46fcf566e7e63/388c9/_images/9993a68c1eaedb5b6566cd840fc6fafe99fe4157bf8225ac61f63cdb50bb5df6.svg

Sequencer upload

Upload files from sequencer:

ln.setup.login("testuser1")
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
# register output files of upload
upload_dir = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.File(upload_dir.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(upload_dir.parent / "fastq/perturbseq_R2_001.fastq.gz").save()
ln.setup.login("testuser2")
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
✅ saved: Transform(id='KqMfv7kntvpUYf', name='Chromium 10x upload', type='pipeline', updated_at=2023-09-05 09:42:22, created_by_id='DzTjkKse')
✅ saved: Run(id='yOZnXri3kXvYg5oKQuAy', run_at=2023-09-05 09:42:22, transform_id='KqMfv7kntvpUYf', created_by_id='DzTjkKse')
❗ file has more than one suffix (path.suffixes), inferring:'.fastq.gz'
💡 file in storage 'mydata' with key 'fastq/perturbseq_R1_001.fastq.gz'
❗ file has more than one suffix (path.suffixes), inferring:'.fastq.gz'
💡 file in storage 'mydata' with key 'fastq/perturbseq_R2_001.fastq.gz'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
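
The warnings about multiple suffixes stem from how compound extensions look to `pathlib`. The sketch below shows what `suffixes` returns for the uploaded file names and one plausible way to infer a composite suffix. The `KNOWN_COMPOUND` set and the `infer_suffix` helper are hypothetical illustrations, not LaminDB's actual implementation.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list of compound suffixes; not taken from LaminDB.
KNOWN_COMPOUND = {".fastq.gz", ".tsv.gz"}


def infer_suffix(name: str) -> str:
    """Return a known compound suffix if the name ends with one, else the last suffix."""
    path = PurePosixPath(name)
    joined = "".join(path.suffixes[-2:])  # e.g. ".fastq" + ".gz"
    if joined in KNOWN_COMPOUND:
        return joined
    return path.suffix  # falls back to the last suffix only


print(PurePosixPath("fastq/perturbseq_R1_001.fastq.gz").suffixes)  # ['.fastq', '.gz']
print(infer_suffix("fastq/perturbseq_R1_001.fastq.gz"))  # .fastq.gz
print(infer_suffix("matrix.mtx.gz"))  # .gz
```

With this allow-list, an unrecognized combination such as `.mtx.gz` is reduced to its last suffix.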

scRNA-seq bioinformatics pipeline

Process the uploaded files using a script or workflow manager (see Pipelines) and obtain 3 output files in a directory filtered_feature_bc_matrix/:

transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")
ln.track(transform)
# access uploaded files as inputs for the pipeline
input_files = ln.File.filter(key__startswith="fastq/perturbseq").all()
input_paths = [file.stage() for file in input_files]
# register output files
output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)
✅ saved: Transform(id='JRWbXQhR4erjb0', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-09-05 09:42:24, created_by_id='bKeW4T6E')
✅ saved: Run(id='67o0j7ZZjmFeZtoLAzHw', run_at=2023-09-05 09:42:24, transform_id='JRWbXQhR4erjb0', created_by_id='bKeW4T6E')
💡 adding file hZm9RFhjApgUm1vCC6gp as input for run 67o0j7ZZjmFeZtoLAzHw, adding parent transform KqMfv7kntvpUYf
💡 adding file K8C1oxvTfwKa4Wk5rw84 as input for run 67o0j7ZZjmFeZtoLAzHw, adding parent transform KqMfv7kntvpUYf
❗ file has more than one suffix (path.suffixes), inferring:'.tsv.gz'
❗ file has more than one suffix (path.suffixes), inferring:'.tsv.gz'
❗ file has more than one suffix (path.suffixes), using only last suffix: '.gz'
✅ created 3 files from directory using storage /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata and key = perturbseq/filtered_feature_bc_matrix/
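
Filter keywords such as `key__startswith` combine a field name and a comparison separated by a double underscore, in the style of Django field lookups. The toy matcher below illustrates that convention on plain dictionaries. `LOOKUPS` and `matches` are hypothetical names, only a small subset of lookups is covered, and a real query is evaluated by the database rather than in Python.

```python
# Comparison functions keyed by lookup name (a small, illustrative subset).
LOOKUPS = {
    "exact": lambda value, arg: value == arg,
    "startswith": lambda value, arg: value is not None and value.startswith(arg),
    "contains": lambda value, arg: value is not None and arg in value,
    "icontains": lambda value, arg: value is not None and arg.lower() in value.lower(),
}


def matches(record: dict, **conditions) -> bool:
    """Check a record against keyword conditions of the form `field` or `field__lookup`."""
    for keyword, arg in conditions.items():
        field, _, lookup = keyword.partition("__")
        if not LOOKUPS[lookup or "exact"](record.get(field), arg):
            return False
    return True


# Keys and descriptions as they appear in this guide.
records = [
    {"key": "fastq/perturbseq_R1_001.fastq.gz", "description": None},
    {"key": "fastq/perturbseq_R2_001.fastq.gz", "description": None},
    {"key": "schmidt22_perturbseq.h5ad", "description": "perturbseq counts"},
]

fastqs = [r["key"] for r in records if matches(r, key__startswith="fastq/perturbseq")]
print(fastqs)
```

A keyword without a double underscore, such as `key="..."`, is treated as an exact match in this sketch.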

Post-process these 3 files:

transform = ln.Transform(name="Postprocess Cell Ranger", version="2.0", type="pipeline")
ln.track(transform)
input_files = [f.stage() for f in output_files]
output_path = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
output_file = ln.File(output_path, description="perturbseq counts")
output_file.save()
❗ record with similar name exist! did you mean to load it?
name         id              __ratio__
Cell Ranger  JRWbXQhR4erjb0  90.0
✅ saved: Transform(id='n6dAT1x2WfMjd5', name='Postprocess Cell Ranger', version='2.0', type='pipeline', updated_at=2023-09-05 09:42:24, created_by_id='bKeW4T6E')
✅ saved: Run(id='5EiTNxhsaurSHWXVXuUo', run_at=2023-09-05 09:42:24, transform_id='n6dAT1x2WfMjd5', created_by_id='bKeW4T6E')
💡 adding file GmMx0Grahn4dXmmh5NMp as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 adding file rzIgE34O7VirmEbMPa5f as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 adding file mPVMtwHx6poqLcEMlrGV as input for run 5EiTNxhsaurSHWXVXuUo, adding parent transform JRWbXQhR4erjb0
💡 file in storage 'mydata' with key 'schmidt22_perturbseq.h5ad'
💡 data is AnnDataLike, consider using .from_anndata() to link var_names and obs.columns as features
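
The "similar name" warning in the output is the result of a fuzzy comparison between the new transform name and existing ones. As a rough illustration of the idea, the sketch below scores names with `difflib` from the standard library. The `similar_names` helper and its threshold are hypothetical, and the `__ratio__` values LaminDB reports may be computed differently, so the numbers need not match.

```python
from difflib import SequenceMatcher

# Transform names that already exist at this point in the guide.
existing = ["Cell Ranger", "GWS CRIPSRa analysis", "Upload GWS CRISPRa result"]


def similar_names(
    new_name: str, names: list[str], threshold: float = 0.6
) -> list[tuple[str, float]]:
    """Return (name, similarity) pairs at or above `threshold`, most similar first."""
    scored = [
        (name, SequenceMatcher(None, new_name.lower(), name.lower()).ratio())
        for name in names
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


result = similar_names("Postprocess Cell Ranger", existing)
print(result)
```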

Inspect data flow:

output_files[0].view_flow()
https://d33wubrfki0l68.cloudfront.net/b43d83853c80e9b655bf9210136f826dbe0efa57/ba1b2/_images/0495dda5e1f5122398c99fe432a2fda4ca4e40d9f5b2a00a6ae91eec95c0c3b8.svg

Integrate scRNA-seq & phenotypic data

Integrate data in a notebook:

transform = ln.Transform(
    name="Perform single cell analysis, integrate with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
file_hits = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
screen_hits = file_hits.load()

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
❗ records with similar names exist! did you mean to load one of them?
name                       id              __ratio__
Cell Ranger                JRWbXQhR4erjb0  85.5
GWS CRIPSRa analysis       tJG0vmmWAsib0o  85.5
Postprocess Cell Ranger    n6dAT1x2WfMjd5  85.5
Upload GWS CRISPRa result  wBNFtaVQVr61m7  85.5
✅ saved: Transform(id='MQ6l5hB4xg9xLT', name='Perform single cell analysis, integrate with CRISPRa screen', type='notebook', updated_at=2023-09-05 09:42:25, created_by_id='bKeW4T6E')
✅ saved: Run(id='RMCrFvPxOQimm3gbjSG8', run_at=2023-09-05 09:42:25, transform_id='MQ6l5hB4xg9xLT', created_by_id='bKeW4T6E')
💡 adding file eLZEJDOwapnJA6AHGxT0 as input for run RMCrFvPxOQimm3gbjSG8, adding parent transform n6dAT1x2WfMjd5
💡 adding file hlSmmFWbA3fT1rfIMiGD as input for run RMCrFvPxOQimm3gbjSG8, adding parent transform tJG0vmmWAsib0o
WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png
💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'
✅ storing file 'G8K8uJCveZyQb6Gu9FVY' at 'figures/umap_fig1_score-wgs-hits.png'
WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png
💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
✅ storing file 'JECTpprgQOzf4g6aT7VX' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

Review results

Let’s load one of the plots:

ln.track()
file = ln.File.filter(key__contains="figures/matrixplot").one()
file.stage()
💡 notebook imports: ipython==8.15.0 lamindb==0.52.2 scanpy==1.9.4
✅ saved: Transform(id='1LCd8kco9lZUz8', name='Project flow', short_name='project-flow', version='0', type=notebook, updated_at=2023-09-05 09:42:28, created_by_id='bKeW4T6E')
✅ saved: Run(id='zKwc8B9EEvpIkwWkfuHm', run_at=2023-09-05 09:42:28, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')
💡 adding file JECTpprgQOzf4g6aT7VX as input for run zKwc8B9EEvpIkwWkfuHm, adding parent transform MQ6l5hB4xg9xLT
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/figures/matrixplot_fig2_score-wgs-hits-per-cluster.png')
display(Image(filename=file.path))
https://d33wubrfki0l68.cloudfront.net/dcbd1e67232f2ede82171ba02237575cc586c2b7/1ceff/_images/45891ad4693b5bfeb52a48b2ab2e5d0a82220b9482360ee1a8757fad581fffdc.png

We see that the image file is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

file.view_flow()
https://d33wubrfki0l68.cloudfront.net/dbcbd4f53e26a9c7d5544d488ece8dddaa1c0daf/261fb/_images/a2621ca2b2ef9c197565a2d134d24685a1dfa5fe46a911c86bedfb49e1c531df.svg

Alternatively, we can also look at the sequence of transforms:

transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()
transform.parents.df()
id              name                     short_name  version  type      reference  initial_version_id  updated_at           created_by_id
n6dAT1x2WfMjd5  Postprocess Cell Ranger  None        2.0      pipeline  None       None                2023-09-05 09:42:25  bKeW4T6E
tJG0vmmWAsib0o  GWS CRIPSRa analysis     None        None     notebook  None       None                2023-09-05 09:42:20  bKeW4T6E
transform.view_parents()
https://d33wubrfki0l68.cloudfront.net/f79a9007f1d4548ed1034bf75d0b3f6e6f2ef61f/9e3f0/_images/c7a20f4b00c2b94618ac187acb7e36681e4e3c4a8dbbc41799ea99b1341688e7.svg
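
The parent relations reported in the "adding parent transform" log messages of this guide form a small directed graph. As a rough sketch of what walking that graph looks like, the snippet below records those edges in a dictionary and collects all upstream transforms. The `PARENTS` mapping and the `ancestors` function are hypothetical helpers, not part of LaminDB.

```python
# Child transform -> parent transforms, transcribed from the log messages in this guide.
PARENTS = {
    "GWS CRIPSRa analysis": ["Upload GWS CRISPRa result"],
    "Cell Ranger": ["Chromium 10x upload"],
    "Postprocess Cell Ranger": ["Cell Ranger"],
    "Perform single cell analysis, integrate with CRISPRa screen": [
        "Postprocess Cell Ranger",
        "GWS CRIPSRa analysis",
    ],
    "Project flow": ["Perform single cell analysis, integrate with CRISPRa screen"],
}


def ancestors(name: str) -> list[str]:
    """Return all upstream transforms of `name`, nearest first, without duplicates."""
    seen: list[str] = []
    queue = list(PARENTS.get(name, []))
    while queue:
        current = queue.pop(0)  # breadth-first: direct parents before grandparents
        if current not in seen:
            seen.append(current)
            queue.extend(PARENTS.get(current, []))
    return seen


for step in ancestors("Project flow"):
    print(step)
```

Transforms without recorded parents, such as the two uploads, are the roots of the graph and yield an empty list.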

Understand runs

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

File objects are the inputs and outputs of runs.

What if I don’t want a global context?

Sometimes, we don’t want to create a global run context but manually pass a run when creating a file:

run = ln.Run(transform=transform)
ln.File(filepath, run=run)

When does a file appear as a run input?

When accessing a file via stage(), load() or backed(), two things happen:

  1. The current run gets added to file.input_of

  2. The transform of that file gets added as a parent of the current transform

You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False (see: Can I disable tracking run inputs?).

You can also track run inputs on a case-by-case basis via is_run_input=True, e.g.:

file.load(is_run_input=True)
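
The rules above can be mimicked with a few lines of plain Python. The following sketch is a toy model under stated assumptions, not LaminDB's implementation: `ToyTransform`, `ToyRun` and `ToyFile` are hypothetical stand-ins, the `track_run_inputs` argument stands in for the global setting, and it is assumed that an explicit `is_run_input` takes precedence over that setting.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class ToyTransform:
    name: str
    parents: list = field(default_factory=list)


@dataclass(eq=False)
class ToyRun:
    transform: ToyTransform


@dataclass(eq=False)
class ToyFile:
    key: str
    transform: ToyTransform  # the transform that created this file
    input_of: list = field(default_factory=list)

    def load(
        self,
        run: Optional[ToyRun],
        track_run_inputs: bool = True,  # stand-in for the global setting
        is_run_input: Optional[bool] = None,  # per-call override (assumed precedence)
    ) -> str:
        """Pretend to load the file and record provenance if tracking applies."""
        track = track_run_inputs if is_run_input is None else is_run_input
        if track and run is not None:
            if run not in self.input_of:
                self.input_of.append(run)  # 1. the current run consumes this file
            if self.transform not in run.transform.parents:
                run.transform.parents.append(self.transform)  # 2. link parent transform
        return f"contents of {self.key}"


upload = ToyTransform("Upload GWS CRISPRa result")
analysis = ToyTransform("GWS CRIPSRa analysis")
raw = ToyFile("schmidt22-crispra-gws-IFNG.csv", transform=upload)
run = ToyRun(analysis)

raw.load(run, track_run_inputs=False)  # tracking switched off: nothing recorded
untracked = len(raw.input_of)
raw.load(run, track_run_inputs=False, is_run_input=True)  # explicit opt-in
print(untracked, len(raw.input_of), [t.name for t in analysis.parents])
# 0 1 ['Upload GWS CRISPRa result']
```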

Query by provenance

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix    accessor   description                      version  size   hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
hlSmmFWbA3fT1rfIMiGD  9XR1I5WM    None  .parquet  DataFrame  hits from schmidt22 crispra GWS  None     18368  O2Owo0_QlM9JBS2zAZD4Lw  md5        tJG0vmmWAsib0o  HF8uBxPJOu3oqaeQC2cA  None                2023-09-05 09:42:20  bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform
Transform(id='wBNFtaVQVr61m7', name='Upload GWS CRISPRa result', type='app', updated_at=2023-09-05 09:42:18, created_by_id='DzTjkKse')

And which user?

file.created_by
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-05 09:42:22)

Which transforms were created by a given user?

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser2).df()
id              name                                               short_name    version  type      reference  initial_version_id  updated_at           created_by_id
tJG0vmmWAsib0o  GWS CRIPSRa analysis                               None          None     notebook  None       None                2023-09-05 09:42:20  bKeW4T6E
JRWbXQhR4erjb0  Cell Ranger                                        None          7.2.0    pipeline  None       None                2023-09-05 09:42:24  bKeW4T6E
n6dAT1x2WfMjd5  Postprocess Cell Ranger                            None          2.0      pipeline  None       None                2023-09-05 09:42:25  bKeW4T6E
MQ6l5hB4xg9xLT  Perform single cell analysis, integrate with C...  None          None     notebook  None       None                2023-09-05 09:42:27  bKeW4T6E
1LCd8kco9lZUz8  Project flow                                       project-flow  0        notebook  None       None                2023-09-05 09:42:28  bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()
id              name                                               short_name    version  type      reference  initial_version_id  updated_at           created_by_id
tJG0vmmWAsib0o  GWS CRIPSRa analysis                               None          None     notebook  None       None                2023-09-05 09:42:20  bKeW4T6E
MQ6l5hB4xg9xLT  Perform single cell analysis, integrate with C...  None          None     notebook  None       None                2023-09-05 09:42:27  bKeW4T6E
1LCd8kco9lZUz8  Project flow                                       project-flow  0        notebook  None       None                2023-09-05 09:42:28  bKeW4T6E

We can also view all recent additions to the entire database:

ln.view()
File

id                    storage_id  key                                                suffix     accessor  description        version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
JECTpprgQOzf4g6aT7VX  9XR1I5WM    figures/matrixplot_fig2_score-wgs-hits-per-clu...  .png       None      None               None     28814     JYIPcat0YWYVCX3RVd3mww  md5        MQ6l5hB4xg9xLT  RMCrFvPxOQimm3gbjSG8  None                2023-09-05 09:42:27  bKeW4T6E
G8K8uJCveZyQb6Gu9FVY  9XR1I5WM    figures/umap_fig1_score-wgs-hits.png               .png       None      None               None     118999    laQjVk4gh70YFzaUyzbUNg  md5        MQ6l5hB4xg9xLT  RMCrFvPxOQimm3gbjSG8  None                2023-09-05 09:42:27  bKeW4T6E
eLZEJDOwapnJA6AHGxT0  9XR1I5WM    schmidt22_perturbseq.h5ad                          .h5ad      AnnData   perturbseq counts  None     20659936  la7EvqEUMDlug9-rpw-udA  md5        n6dAT1x2WfMjd5  5EiTNxhsaurSHWXVXuUo  None                2023-09-05 09:42:25  bKeW4T6E
rzIgE34O7VirmEbMPa5f  9XR1I5WM    perturbseq/filtered_feature_bc_matrix/features...  .tsv.gz    None      None               None     6         -I3mht0rimsP3zbzNsyZjg  md5        JRWbXQhR4erjb0  67o0j7ZZjmFeZtoLAzHw  None                2023-09-05 09:42:24  bKeW4T6E
mPVMtwHx6poqLcEMlrGV  9XR1I5WM    perturbseq/filtered_feature_bc_matrix/matrix.m...  .gz        None      None               None     6         Owckv-3wNqtc2lH5QLTKQg  md5        JRWbXQhR4erjb0  67o0j7ZZjmFeZtoLAzHw  None                2023-09-05 09:42:24  bKeW4T6E
GmMx0Grahn4dXmmh5NMp  9XR1I5WM    perturbseq/filtered_feature_bc_matrix/barcodes...  .tsv.gz    None      None               None     6         tErw-NvJmpLGIvbI2Rv7Aw  md5        JRWbXQhR4erjb0  67o0j7ZZjmFeZtoLAzHw  None                2023-09-05 09:42:24  bKeW4T6E
K8C1oxvTfwKa4Wk5rw84  9XR1I5WM    fastq/perturbseq_R2_001.fastq.gz                   .fastq.gz  None      None               None     6         ntMqLgB2ytCIza2x1q_gDg  md5        KqMfv7kntvpUYf  yOZnXri3kXvYg5oKQuAy  None                2023-09-05 09:42:22  DzTjkKse

Run

id                    transform_id    run_at               created_by_id  reference  reference_type
AlYtOZxNKqqtrNbbbHjU  wBNFtaVQVr61m7  2023-09-05 09:42:18  DzTjkKse       None       None
HF8uBxPJOu3oqaeQC2cA  tJG0vmmWAsib0o  2023-09-05 09:42:20  bKeW4T6E       None       None
yOZnXri3kXvYg5oKQuAy  KqMfv7kntvpUYf  2023-09-05 09:42:22  DzTjkKse       None       None
67o0j7ZZjmFeZtoLAzHw  JRWbXQhR4erjb0  2023-09-05 09:42:24  bKeW4T6E       None       None
5EiTNxhsaurSHWXVXuUo  n6dAT1x2WfMjd5  2023-09-05 09:42:24  bKeW4T6E       None       None
RMCrFvPxOQimm3gbjSG8  MQ6l5hB4xg9xLT  2023-09-05 09:42:25  bKeW4T6E       None       None
zKwc8B9EEvpIkwWkfuHm  1LCd8kco9lZUz8  2023-09-05 09:42:28  bKeW4T6E       None       None

Storage

id        root                                               type   region  updated_at           created_by_id
9XR1I5WM  /home/runner/work/lamin-usecases/lamin-usecase...  local  None    2023-09-05 09:42:14  DzTjkKse

Transform

id              name                                               short_name    version  type      reference  initial_version_id  updated_at           created_by_id
1LCd8kco9lZUz8  Project flow                                       project-flow  0        notebook  None       None                2023-09-05 09:42:28  bKeW4T6E
MQ6l5hB4xg9xLT  Perform single cell analysis, integrate with C...  None          None     notebook  None       None                2023-09-05 09:42:27  bKeW4T6E
n6dAT1x2WfMjd5  Postprocess Cell Ranger                            None          2.0      pipeline  None       None                2023-09-05 09:42:25  bKeW4T6E
JRWbXQhR4erjb0  Cell Ranger                                        None          7.2.0    pipeline  None       None                2023-09-05 09:42:24  bKeW4T6E
KqMfv7kntvpUYf  Chromium 10x upload                                None          None     pipeline  None       None                2023-09-05 09:42:22  DzTjkKse
tJG0vmmWAsib0o  GWS CRIPSRa analysis                               None          None     notebook  None       None                2023-09-05 09:42:20  bKeW4T6E
wBNFtaVQVr61m7  Upload GWS CRISPRa result                          None          None     app       None       None                2023-09-05 09:42:18  DzTjkKse

User

id        handle     email               name        updated_at
bKeW4T6E  testuser2  testuser2@lamin.ai  Test User2  2023-09-05 09:42:24
DzTjkKse  testuser1  testuser1@lamin.ai  Test User1  2023-09-05 09:42:22

!lamin login testuser1
!lamin delete --force mydata
!rm -r ./mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata