hub / github.com/huggingface/transformers / DataCollatorForSeq2Seq

Class DataCollatorForSeq2Seq

src/transformers/data/data_collator.py:487–615

Data collator that will dynamically pad the inputs received, as well as the labels.
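To make "dynamically pad the labels" concrete, here is a minimal pure-Python sketch of the behaviour the docstring describes for `label_pad_token_id` and `pad_to_multiple_of`. It is not the library's implementation: `pad_labels` is a hypothetical helper, and its `padding_side` argument is an assumption standing in for the tokenizer's padding side. The real collator also returns tensors rather than lists.

```python
from typing import List, Optional


def pad_labels(
    labels: List[List[int]],
    label_pad_token_id: int = -100,
    pad_to_multiple_of: Optional[int] = None,
    padding_side: str = "right",  # assumption: stands in for the tokenizer's padding side
) -> List[List[int]]:
    """Hypothetical helper: pad every label sequence to one shared length."""
    # The target length is the longest label sequence in the batch ...
    max_len = max(len(seq) for seq in labels)
    # ... rounded up to the next multiple when pad_to_multiple_of is set.
    if pad_to_multiple_of is not None:
        max_len = -(-max_len // pad_to_multiple_of) * pad_to_multiple_of
    padded = []
    for seq in labels:
        filler = [label_pad_token_id] * (max_len - len(seq))
        padded.append(list(seq) + filler if padding_side == "right" else filler + list(seq))
    return padded


batch = pad_labels([[5, 6, 7], [8]], pad_to_multiple_of=4)
print(batch)  # [[5, 6, 7, -100], [8, -100, -100, -100]]
```

Because the target length is computed per batch, a batch of short sequences is padded less than a batch of long ones, which is what "dynamic" refers to here.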

Source from the content-addressed store, hash-verified

@dataclass
class DataCollatorForSeq2Seq:
    """
    Data collator that will dynamically pad the inputs received, as well as the labels.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        model ([`PreTrainedModel`], *optional*):
            The model that is being trained. If set and has the *prepare_decoder_input_ids_from_labels*, use it to
            prepare the *decoder_input_ids*

            This is useful when using *label_smoothing* to avoid calculating loss twice.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        label_pad_token_id (`int`, *optional*, defaults to -100):
            The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", or "pt".
    """

    tokenizer: PreTrainedTokenizerBase
    model: Any | None = None
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"

    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors

        label_name = "label" if "label" in features[0] else "labels"
        labels = [feature[label_name] for feature in features] if label_name in features[0] else None
        # reconvert list[None] to None if necessary
        # this might occur when we pass {..., "labels": None}
        if labels is not None and all(label is None for label in labels):
            labels = None
        non_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]

        # run through tokenizer without labels to ensure no side effects
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            non_labels_features,
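
The label handling at the start of `__call__` can be tried in isolation. The sketch below copies that logic into a standalone function so its edge cases are easy to check. `split_labels` is a hypothetical name used only for illustration, and the remainder of `__call__` (which the listing above cuts off) is not reproduced.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_labels(
    features: List[Dict[str, Any]],
) -> Tuple[Optional[List[Any]], List[Dict[str, Any]]]:
    """Hypothetical helper mirroring the label extraction shown in __call__."""
    # Use "label" when the first feature has it, otherwise fall back to "labels".
    label_name = "label" if "label" in features[0] else "labels"
    labels = [f[label_name] for f in features] if label_name in features[0] else None
    # A batch whose labels are all None is treated as having no labels at all.
    if labels is not None and all(label is None for label in labels):
        labels = None
    # New dicts are built here, so the caller's feature dicts are left unchanged.
    non_labels_features = [{k: v for k, v in f.items() if k != label_name} for f in features]
    return labels, non_labels_features


features = [
    {"input_ids": [1, 2], "labels": [3]},
    {"input_ids": [4], "labels": [5, 6]},
]
labels, rest = split_labels(features)
print(labels)  # [[3], [5, 6]]
print(rest)    # [{'input_ids': [1, 2]}, {'input_ids': [4]}]
```

Note that only the first feature is inspected to choose the key, so a batch mixing `label` and `labels` keys would raise a `KeyError` in this sketch.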

Callers 15

test_return_sequences · Method · 0.90
test_padding · Method · 0.90
test_do_not_pad · Method · 0.90
test_without_labels · Method · 0.90
test_numpy_output · Method · 0.90
test_immutability · Method · 0.90
main · Function · 0.90

Calls

no outgoing calls

Tested by 11

test_return_sequences · Method · 0.72
test_padding · Method · 0.72
test_do_not_pad · Method · 0.72
test_without_labels · Method · 0.72
test_numpy_output · Method · 0.72
test_immutability · Method · 0.72