Module molcrawl.rna.dataset.cellxgene.prepare_cellxgene
This script downloads the cellxgene dataset. It creates several directories inside the output_dir provided in the configuration:
- download_dir: Raw archive files downloaded from the cellxgene database
- extract: h5ad files extracted from the archives
- parquet_files: parquet files containing tokenized gene and expression values
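
The directory layout listed above could be set up along the following lines. This is only a minimal sketch, not the module's actual code: the helper name `make_output_layout` and the constant `SUBDIRS` are hypothetical, and it assumes `output_dir` is a plain filesystem path read from the configuration.

```python
from pathlib import Path
import tempfile

# Subdirectory names taken from the list above.
SUBDIRS = ("download_dir", "extract", "parquet_files")


def make_output_layout(output_dir):
    """Create the three subdirectories under output_dir (hypothetical helper).

    Returns a dict mapping each subdirectory name to its Path.
    """
    root = Path(output_dir)
    layout = {name: root / name for name in SUBDIRS}
    for path in layout.values():
        # parents=True also creates output_dir itself if it is missing;
        # exist_ok=True makes repeated runs harmless.
        path.mkdir(parents=True, exist_ok=True)
    return layout


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real output_dir.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in make_output_layout(tmp).items():
            print(name, path.is_dir())
```

Returning the paths as a dict lets later stages (download, extraction, tokenization) look up their target directory by name without rebuilding the path.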