Module molcrawl.compounds.dataset.download_chembl

Download ChEMBL database (SQLite) and extract canonical SMILES strings.

Downloads the ChEMBL 36 SQLite archive from the EBI FTP server, unpacks it, and queries the compound_structures table for valid, non-null canonical SMILES. The results are written to a text file, one SMILES string per line, ready for the subsequent prepare_chembl.py tokenisation step.
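The query-and-write step described above can be sketched with the standard-library sqlite3 module. This is an illustration only, not the module's own _extract_smiles: the function name extract_smiles_sketch is hypothetical, and treating "valid" as "non-empty after trimming whitespace" is an assumption. The table and column names (compound_structures, canonical_smiles) follow the ChEMBL schema.

```python
import sqlite3
from pathlib import Path


def extract_smiles_sketch(db_path: Path, smiles_path: Path) -> int:
    """Write one canonical SMILES per line; return the number of lines written.

    Hypothetical helper for illustration. "Valid" is assumed to mean
    non-null and non-empty after trimming whitespace.
    """
    query = (
        "SELECT canonical_smiles FROM compound_structures "
        "WHERE canonical_smiles IS NOT NULL "
        "AND TRIM(canonical_smiles) != ''"
    )
    count = 0
    conn = sqlite3.connect(str(db_path))
    try:
        with open(smiles_path, "w", encoding="utf-8") as fh:
            # Iterating the cursor streams rows instead of loading them all.
            for (smiles,) in conn.execute(query):
                fh.write(smiles.strip() + "\n")
                count += 1
    finally:
        conn.close()
    return count
```

Streaming rows from the cursor keeps memory use flat, which matters for a table holding millions of structures.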

Output layout under output_dir
─────────────────────────────────
chembl_db/
    chembl_36_sqlite.tar.gz   ← downloaded archive (kept for resumability)
    chembl_36.db              ← unpacked SQLite database
    smiles.txt                ← one canonical SMILES per line
    download_complete.marker  ← written when this script finishes cleanly
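To see how far a previous run got, the layout above can be checked file by file. The sketch below is a hypothetical helper (layout_status is not part of the module) and assumes the four files sit directly inside the directory passed as output_dir, which is what the path handling in download_chembl suggests.

```python
from pathlib import Path

# File names as listed in the output layout.
EXPECTED_FILES = (
    "chembl_36_sqlite.tar.gz",
    "chembl_36.db",
    "smiles.txt",
    "download_complete.marker",
)


def layout_status(output_dir: str) -> dict:
    """Map each expected file name to whether it exists under output_dir."""
    root = Path(output_dir)
    return {name: (root / name).exists() for name in EXPECTED_FILES}
```

A run that was interrupted after the download would report the archive as present and the remaining three entries as missing.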

Functions

def download_chembl(output_dir: str, force: bool = False) -> bool
def download_chembl(output_dir: str, force: bool = False) -> bool:
    """Download ChEMBL SQLite and extract canonical SMILES.

    Args:
        output_dir: Root directory for ChEMBL data (e.g. ``CHEMBL_SOURCE_DIR``).
        force: Ignore the marker file and run the pipeline again. An
            existing archive or SQLite DB is still reused, not re-fetched.

    Returns:
        ``True`` on success, ``False`` on failure.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    marker = out / "download_complete.marker"
    if not force and marker.exists():
        logger.info("ChEMBL download already completed. Skipping. (use force=True to re-run)")
        return True

    archive_path = out / CHEMBL_ARCHIVE_NAME
    db_path = out / CHEMBL_DB_NAME
    smiles_path = out / SMILES_FILE_NAME

    try:
        # Step 1 – download archive
        if not archive_path.exists():
            _download_with_progress(CHEMBL_ARCHIVE_URL, archive_path)
        else:
            logger.info(f"Archive already present at {archive_path}, skipping download.")

        # Step 2 – extract SQLite DB
        if not db_path.exists():
            db_path = _extract_db(archive_path, out)
        else:
            logger.info(f"SQLite DB already present at {db_path}, skipping extraction.")

        # Step 3 – extract canonical SMILES
        _extract_smiles(db_path, smiles_path)

        marker.touch()
        logger.info(f"ChEMBL download pipeline complete. Marker written to {marker}")
        return True

    except Exception as exc:
        logger.error(f"ChEMBL download failed: {exc}", exc_info=True)
        return False

Download ChEMBL SQLite and extract canonical SMILES.

Args

output_dir
Root directory for ChEMBL data (e.g. CHEMBL_SOURCE_DIR).
force
Ignore the marker file and run the pipeline again. An existing archive or SQLite DB is still reused, not re-fetched.

Returns

True on success, False on failure.
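The marker and return-value behaviour can be reduced to a small, self-contained pattern. The sketch below is not the module's code: run_with_marker and its pipeline argument are hypothetical stand-ins for the three download and extraction steps, used only to show how the marker file, the force flag, and the boolean result interact.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def run_with_marker(
    output_dir: str,
    pipeline: Callable[[Path], None],
    force: bool = False,
) -> bool:
    """Run pipeline once per output_dir, guarded by a marker file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    marker = out / "download_complete.marker"

    # A marker from an earlier clean run short-circuits the work.
    if not force and marker.exists():
        return True

    try:
        pipeline(out)
        # Written only after every step succeeded.
        marker.touch()
        return True
    except Exception:
        # Failures are reported via the return value, not re-raised.
        logger.exception("Pipeline failed")
        return False
```

Because failures surface as False rather than as an exception, a caller has to check the result explicitly, for example by exiting with a non-zero status when it is False.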