Module molcrawl.rna.dataset.cellxgene.prepare_cellxgene

This script downloads the cellxgene dataset. It creates several directories inside the output_dir provided in the configuration:

  • download_dir: raw archive files downloaded from the cellxgene database
  • extract: h5ad files extracted from the archives
  • parquet_files: parquet files containing the tokenized gene and expression values
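
As a rough illustration of the layout listed above, the following sketch creates the three subdirectories under a given output directory. The helper name make_output_layout and the SUBDIRS constant are hypothetical and are not taken from molcrawl; only the three directory names come from this description. The script itself may create these directories differently.

```python
from pathlib import Path

# Subdirectory names as listed in the module description.
SUBDIRS = ("download_dir", "extract", "parquet_files")


def make_output_layout(output_dir):
    """Create the three subdirectories under output_dir and return their paths.

    Hypothetical helper for illustration only; not part of the molcrawl API.
    """
    root = Path(output_dir)
    layout = {name: root / name for name in SUBDIRS}
    for path in layout.values():
        # parents=True also creates output_dir itself if it is missing;
        # exist_ok=True makes repeated runs harmless.
        path.mkdir(parents=True, exist_ok=True)
    return layout
```

Calling make_output_layout("some/output_dir") returns a dict mapping each name to its Path, so later stages (download, extraction, tokenization) can look up where to read and write.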