Module molcrawl.protein_sequence.dataset.download_proteingym
Download ProteinGym v1.3 DMS substitution data for fine-tuning.
The downloaded CSV files are saved to save_path so that prepare_proteingym.py can read them directly.
Usage
from molcrawl.protein_sequence.dataset.download_proteingym import download_proteingym download_proteingym("/path/to/proteingym_v1.3")
Functions
def download_proteingym(save_path: str | pathlib.Path) ‑> pathlib.Path-
Expand source code
def download_proteingym(save_path: Union[str, Path]) -> Path: """ Download ProteinGym v1.3 DMS substitution CSV files into *save_path*. After this function returns, *save_path* will contain: DMS_ProteinGym_substitutions/ ← one CSV per assay DMS_substitutions.csv ← reference metadata (one row per assay) Args: save_path: Destination directory (created if missing). Returns: Path to the directory containing individual assay CSV files. """ save_path = Path(save_path) save_path.mkdir(parents=True, exist_ok=True) # --- substitutions zip -------------------------------------------------- zip_path = save_path / "DMS_ProteinGym_substitutions.zip" _download_with_progress(_SUBSTITUTIONS_URL, zip_path) csv_dir = save_path / "DMS_ProteinGym_substitutions" if not csv_dir.exists(): logger.info("Extracting %s ...", zip_path) with zipfile.ZipFile(zip_path, "r") as zf: zf.extractall(save_path) logger.info("Extracted to %s", csv_dir) else: logger.info("Already extracted: %s", csv_dir) # --- reference CSV ------------------------------------------------------ ref_path = save_path / "DMS_substitutions.csv" _download_with_progress(_REFERENCE_URL, ref_path) return csv_dirDownload ProteinGym v1.3 DMS substitution CSV files into save_path.
After this function returns, save_path will contain: DMS_ProteinGym_substitutions/ ← one CSV per assay DMS_substitutions.csv ← reference metadata (one row per assay)
Args
save_path- Destination directory (created if missing).
Returns
Path to the directory containing individual assay CSV files.