Module molcrawl.rna.dataset.cellxgene.script.download_by_dataset
Per-dataset download strategy for CellxGene census data.
Problem with the original download.py approach: get_anndata(obs_coords=[5000 arbitrary soma_joinids]) pulls IDs that span many datasets/TileDB fragments, which forces expensive random I/O.
This module instead:
1. Queries census obs metadata once per tissue (no X, fast) to build a soma_joinid → dataset_id mapping.
2. Groups the needed obs_ids by dataset_id.
3. For each dataset, queries get_anndata in sub-batches of _MAX_CELLS_PER_BATCH cells, routes the results into chunk buffers, and flushes complete chunks to disk immediately to keep peak memory low.
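Steps 1–2 above amount to a group-by over the soma_joinid → dataset_id mapping. A minimal sketch of that grouping, assuming a plain dict mapping and toy IDs (the helper name is illustrative, not the module's actual code):

```python
from collections import defaultdict

def group_obs_by_dataset(joinid_to_dataset: dict[int, str],
                         needed_obs_ids: list[int]) -> dict[str, list[int]]:
    """Group requested soma_joinids by the dataset that owns them (step 2)."""
    by_dataset: dict[str, list[int]] = defaultdict(list)
    for obs_id in needed_obs_ids:
        by_dataset[joinid_to_dataset[obs_id]].append(obs_id)
    return dict(by_dataset)

# Toy mapping: three cells spread over two datasets.
mapping = {10: "ds_a", 11: "ds_a", 50: "ds_b"}
print(group_obs_by_dataset(mapping, [10, 50, 11]))
# {'ds_a': [10, 11], 'ds_b': [50]}
```

Querying each dataset with only its own contiguous IDs is what turns the random I/O of the original approach into mostly-sequential reads.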
Compatible drop-in replacement for download.download().
Functions
def download(output_dir: str, version: str, num_worker: int, size_workload: int) ‑> None
def download(
    output_dir: str,
    version: str,
    num_worker: int,
    size_workload: int,
) -> None:
    """Drop-in replacement for download.download() using per-dataset queries.

    Processes tissues sequentially; within each tissue, dataset queries run
    sequentially to avoid SOMA rate-limiting. num_worker is accepted for
    interface compatibility but currently unused at the tissue level.
    """
    output_path = Path(output_dir)
    output_path.joinpath("download_dir").mkdir(exist_ok=True, parents=True)

    remaining = _find_remaining_chunks(output_path, size_workload)
    if not remaining:
        logger.info("All chunks already downloaded.")
        return

    total_chunks = sum(len(v) for v in remaining.values())
    logger.info(
        f"download_by_dataset: {len(remaining)} tissues, "
        f"{total_chunks} remaining chunks"
    )
    for tissue, n in sorted(
        ((t, len(c)) for t, c in remaining.items()), key=lambda x: -x[1]
    ):
        logger.info(f"  {tissue}: {n} chunks remaining")

    total_written = 0
    for tissue, chunks in sorted(remaining.items(), key=lambda x: -len(x[1])):
        logger.info(f"=== Processing tissue: {tissue} ({len(chunks)} chunks) ===")
        written = _process_tissue(output_path, version, tissue, chunks)
        total_written += written
        logger.info(f"[{tissue}] Done: {written}/{len(chunks)} chunks written")

    logger.info(
        f"download_by_dataset complete: {total_written}/{total_chunks} chunks written"
    )

Drop-in replacement for download.download() using per-dataset queries.

Processes tissues sequentially; within each tissue, dataset queries run sequentially to avoid SOMA rate-limiting. num_worker is accepted for interface compatibility but currently unused at the tissue level.
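The per-dataset sub-batching described in step 3 keeps each get_anndata call bounded by _MAX_CELLS_PER_BATCH. A minimal sketch of that slicing, with a deliberately tiny batch size for illustration (the helper name and constant value are assumptions, not the module's actual internals):

```python
# Illustrative batch size; the real _MAX_CELLS_PER_BATCH is module-defined.
_MAX_CELLS_PER_BATCH = 2

def iter_batches(obs_ids: list[int], batch_size: int = _MAX_CELLS_PER_BATCH):
    """Yield fixed-size slices of a dataset's obs_ids so each
    get_anndata call stays within the per-batch cell limit."""
    for start in range(0, len(obs_ids), batch_size):
        yield obs_ids[start:start + batch_size]

print(list(iter_batches([1, 2, 3, 4, 5])))
# [[1, 2], [3, 4], [5]]
```

Each yielded slice would be fetched, routed into its chunk buffer, and any buffer that reaches size_workload cells flushed to disk before the next batch is requested, which is what keeps peak memory low.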