@dataclass
class DataCollatorForSeq2Seq:
    """
    Data collator that will dynamically pad the inputs received, as well as the labels.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        model ([`PreTrainedModel`], *optional*):
            The model that is being trained. If set and it has a *prepare_decoder_input_ids_from_labels* method,
            that method is used to prepare the *decoder_input_ids*.

            This is useful when using *label_smoothing* to avoid calculating loss twice.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set, pads the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        label_pad_token_id (`int`, *optional*, defaults to -100):
            The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are `"np"` or `"pt"`.
    """

    tokenizer: PreTrainedTokenizerBase
    model: Any | None = None
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"

    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors

        # features may store targets under either "label" or "labels"
        label_name = "label" if "label" in features[0] else "labels"
        labels = [feature[label_name] for feature in features] if label_name in features[0] else None
        # reconvert list[None] to None if necessary
        # this might occur when we pass {..., "labels": None}
        if labels is not None and all(label is None for label in labels):
            labels = None
        non_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]

        # run through tokenizer without labels to ensure no side effects
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            non_labels_features,
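The excerpt stops before the labels are padded, so the following is only a minimal sketch of the behaviour the docstring describes, not the library's implementation. It uses plain Python lists and two hypothetical helpers, `split_labels` and `pad_labels`, which are illustrative names and not part of the library. It assumes "longest" padding, a `padding_side` argument standing in for the tokenizer's padding side, and rounding up to `pad_to_multiple_of` when that value is given.

```python
def split_labels(features):
    """Separate labels from the remaining feature keys (hypothetical helper).

    Mirrors the label handling shown in __call__: the key may be "label" or
    "labels", and a list made only of None collapses to None.
    """
    label_name = "label" if "label" in features[0] else "labels"
    labels = [f[label_name] for f in features] if label_name in features[0] else None
    if labels is not None and all(label is None for label in labels):
        labels = None
    rest = [{k: v for k, v in f.items() if k != label_name} for f in features]
    return labels, rest


def pad_labels(labels, label_pad_token_id=-100, pad_to_multiple_of=None, padding_side="right"):
    """Pad every label sequence to a common length (hypothetical helper).

    Assumption: the target length is the longest sequence in the batch,
    rounded up to the next multiple of pad_to_multiple_of if that is set.
    """
    target = max(len(seq) for seq in labels)
    if pad_to_multiple_of is not None:
        # Ceiling division, then scale back up to the multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    padded = []
    for seq in labels:
        filler = [label_pad_token_id] * (target - len(seq))
        padded.append(list(seq) + filler if padding_side == "right" else filler + list(seq))
    return padded


features = [
    {"input_ids": [1, 2, 3], "labels": [4, 5]},
    {"input_ids": [1], "labels": [6, 7, 8]},
]
labels, non_labels_features = split_labels(features)
padded_labels = pad_labels(labels, pad_to_multiple_of=4)
```

Padding labels with -100 rather than the tokenizer's pad id keeps padded positions out of the loss, since PyTorch's cross-entropy loss ignores index -100 by default.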