Module molcrawl.protein_sequence.utils.bert_tokenizer
BERT-compatible tokenizer wrapper for protein sequences. Wraps the ESM tokenizer in a format compatible with BERT training.
Functions
def create_bert_protein_tokenizer(**kwargs) -> BertProteinSequenceTokenizer
Source code
def create_bert_protein_tokenizer(**kwargs) -> BertProteinSequenceTokenizer: """ Create a BERT-compatible protein sequence tokenizer Returns: BertProteinSequenceTokenizer instance """ return BertProteinSequenceTokenizer(**kwargs)Create a BERT-compatible protein sequence tokenizer
Returns
BertProteinSequenceTokenizer instance
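A minimal usage sketch (an illustration, not part of the module): it assumes the import path shown in the module header, and relies only on the fact that the factory forwards keyword arguments to BertProteinSequenceTokenizer, as the source above shows.

from molcrawl.protein_sequence.utils.bert_tokenizer import create_bert_protein_tokenizer

# Keyword arguments are passed through to BertProteinSequenceTokenizer,
# and from there to EsmSequenceTokenizer.__init__.
tokenizer = create_bert_protein_tokenizer()

# The wrapper exposes the standard BERT input names.
print(tokenizer.model_input_names)  # ['input_ids', 'attention_mask']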
Classes
class BertProteinSequenceTokenizer (*args, **kwargs)
Source code
class BertProteinSequenceTokenizer(EsmSequenceTokenizer):
    """
    ESM tokenizer modified for BERT training compatibility

    This class wraps the original EsmSequenceTokenizer to make it compatible
    with BERT training by overriding model_input_names to use standard BERT format
    """

    # Override model input names to use standard BERT format
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Ensure pad_token is set for BERT compatibility
        if not hasattr(self, "pad_token") or self.pad_token is None:
            self.pad_token = self.unk_token
            self.pad_token_id = self.unk_token_id

ESM tokenizer modified for BERT training compatibility
This class wraps the original EsmSequenceTokenizer to make it compatible with BERT training by overriding model_input_names to use standard BERT format
Ancestors
EsmSequenceTokenizer
Class variables
var model_input_names
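A hedged end-to-end sketch of using the class directly. It assumes EsmSequenceTokenizer follows the standard Hugging Face tokenizer call interface (padding, return_tensors), which the source above does not show; the amino-acid sequences are hypothetical.

from molcrawl.protein_sequence.utils.bert_tokenizer import BertProteinSequenceTokenizer

tokenizer = BertProteinSequenceTokenizer()

# Batch-encode two hypothetical amino-acid sequences. padding=True relies on
# pad_token, which __init__ guarantees is set (falling back to unk_token).
encoded = tokenizer(
    ["MKTAYIAKQR", "GAVLK"],
    padding=True,
    return_tensors="pt",
)

# model_input_names = ["input_ids", "attention_mask"] tells BERT-style
# collators and models which fields to expect in each batch.
print(encoded["input_ids"].shape)
print(encoded["attention_mask"].shape)

Overriding model_input_names matters because Hugging Face data collators and padding utilities consult it when batching, so the ESM tokenizer's output can be fed into standard BERT-style masked-language-model training code without adjustment.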