Module molcrawl.compounds.dataset.download_chembl
Download ChEMBL database (SQLite) and extract canonical SMILES strings.
Downloads the ChEMBL 36 SQLite archive from the EBI FTP server, unpacks it, then
queries the compound_structures table for canonical SMILES with valid,
non-null values and writes them, one SMILES string per line, to a text file
ready for the subsequent prepare_chembl.py tokenisation step.
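The query-and-write step described above can be sketched roughly as follows. This is a minimal illustration, not the module's own `_extract_smiles`: the helper name `extract_smiles_sketch` is hypothetical, and treating "valid" as "non-null and non-empty after trimming" is an assumption. It relies only on the `compound_structures.canonical_smiles` column named in the description.

```python
import sqlite3
from pathlib import Path


def extract_smiles_sketch(db_path: str, smiles_path: str) -> int:
    """Write one canonical SMILES per line; return the number of lines written.

    Hypothetical helper: the filter below (non-null, non-empty) is an
    assumption about what the module treats as a valid value.
    """
    query = (
        "SELECT canonical_smiles FROM compound_structures "
        "WHERE canonical_smiles IS NOT NULL "
        "AND TRIM(canonical_smiles) != ''"
    )
    count = 0
    conn = sqlite3.connect(db_path)
    try:
        with Path(smiles_path).open("w", encoding="utf-8") as fh:
            # Iterate over the cursor so the full result set is never
            # held in memory at once.
            for (smiles,) in conn.execute(query):
                fh.write(smiles.strip() + "\n")
                count += 1
    finally:
        conn.close()
    return count
```

Streaming rows straight from the cursor to the file keeps memory use flat, which matters for a table with millions of rows.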
Output layout under output_dir
─────────────────────────────────
chembl_db/
    chembl_36_sqlite.tar.gz    ← downloaded archive (kept for resumability)
    chembl_36.db               ← unpacked SQLite database
    smiles.txt                 ← one canonical SMILES per line
    download_complete.marker   ← written when this script finishes cleanly
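To see how far a previous run got, the files in the layout above can be checked directly. The sketch below is an illustration only: `layout_status` and `EXPECTED_FILES` are hypothetical names, and the file names are copied from the layout rather than read from the module's constants. Pass the directory that actually contains the files.

```python
from __future__ import annotations

from pathlib import Path

# File names copied from the documented layout; treating them as fixed
# is an assumption (the module defines its own constants for these).
EXPECTED_FILES = (
    "chembl_36_sqlite.tar.gz",
    "chembl_36.db",
    "smiles.txt",
    "download_complete.marker",
)


def layout_status(data_dir: str) -> dict[str, bool]:
    """Map each expected file name to whether it exists under data_dir."""
    root = Path(data_dir)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}
```

A result in which the archive exists but the marker does not would suggest an interrupted run that can resume without downloading again.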
Functions
def download_chembl(output_dir: str, force: bool = False) -> bool
def download_chembl(output_dir: str, force: bool = False) -> bool:
    """Download ChEMBL SQLite and extract canonical SMILES.

    Args:
        output_dir: Root directory for ChEMBL data (e.g. ``CHEMBL_SOURCE_DIR``).
        force: Re-run all steps even if marker file exists.

    Returns:
        ``True`` on success, ``False`` on failure.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    marker = out / "download_complete.marker"
    if not force and marker.exists():
        logger.info("ChEMBL download already completed. Skipping. (use force=True to re-run)")
        return True

    archive_path = out / CHEMBL_ARCHIVE_NAME
    db_path = out / CHEMBL_DB_NAME
    smiles_path = out / SMILES_FILE_NAME

    try:
        # Step 1 – download archive
        if not archive_path.exists():
            _download_with_progress(CHEMBL_ARCHIVE_URL, archive_path)
        else:
            logger.info(f"Archive already present at {archive_path}, skipping download.")

        # Step 2 – extract SQLite DB
        if not db_path.exists():
            db_path = _extract_db(archive_path, out)
        else:
            logger.info(f"SQLite DB already present at {db_path}, skipping extraction.")

        # Step 3 – extract canonical SMILES
        _extract_smiles(db_path, smiles_path)

        marker.touch()
        logger.info(f"ChEMBL download pipeline complete. Marker written to {marker}")
        return True

    except Exception as exc:
        logger.error(f"ChEMBL download failed: {exc}", exc_info=True)
        return False

Download ChEMBL SQLite and extract canonical SMILES.
Args
output_dir
    Root directory for ChEMBL data (e.g. CHEMBL_SOURCE_DIR).
force
    Re-run all steps even if marker file exists.
Returns
True on success, False on failure.
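The marker-file and `force` semantics can be illustrated in isolation with a small stand-in. This is a sketch under assumptions, not the module's code: `run_guarded` and its `steps` parameter are hypothetical, and the download, unpack and query steps are replaced by arbitrary callables so the control flow can be exercised without network access.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def run_guarded(
    output_dir: str,
    steps: Iterable[Callable[[Path], None]],
    force: bool = False,
) -> bool:
    """Run steps once per output_dir, guarded by a completion marker.

    Hypothetical stand-in that mirrors the documented contract:
    True on success (or when skipped), False on any failure.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    marker = out / "download_complete.marker"
    if not force and marker.exists():
        # A previous run finished cleanly, so there is nothing to do.
        return True

    try:
        for step in steps:
            step(out)
        # The marker is only written after every step has succeeded.
        marker.touch()
        return True
    except Exception as exc:
        logger.error("Pipeline failed: %s", exc)
        return False
```

Because failures are reported through the return value rather than an exception, callers of a function with this contract need to check the result explicitly before starting a later stage.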