
Class DataCollatorForSeq2Seq

src/transformers/data/data_collator.py:487–615

Data collator that will dynamically pad the inputs received, as well as the labels.

Source from the content-addressed store, hash-verified

@dataclass
class DataCollatorForSeq2Seq:
    """
    Data collator that will dynamically pad the inputs received, as well as the labels.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        model ([`PreTrainedModel`], optional):
            The model that is being trained. If set and has the prepare_decoder_input_ids_from_labels, use it to
            prepare the decoder_input_ids

            This is useful when using label_smoothing to avoid calculating loss twice.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], optional, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, optional):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, optional):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        label_pad_token_id (`int`, optional, defaults to -100):
            The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
        return_tensors (`str`, optional, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", or "pt".
    """

    tokenizer: PreTrainedTokenizerBase
    model: Any | None = None
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"

    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors

        label_name = "label" if "label" in features[0] else "labels"
        labels = [feature[label_name] for feature in features] if label_name in features[0] else None
        # reconvert list[None] to None if necessary
        # this might occur when we pass {..., "labels": None}
        if labels is not None and all(label is None for label in labels):
            labels = None
        non_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]

        # run through tokenizer without labels to ensure no side effects
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            non_labels_features,
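The excerpt above stops before the label-padding step, so the following is only a minimal, dependency-free sketch of the behaviour the docstring describes, not the library's implementation. The helper names `split_labels` and `pad_labels` are hypothetical. The first mirrors the label-extraction lines shown in `__call__`; the second assumes that "longest" padding extends every label list to the batch maximum (rounded up to `pad_to_multiple_of` when set) using `label_pad_token_id`, on an assumed padding side.

```python
from typing import Any


def split_labels(features: list[dict[str, Any]]):
    """Separate labels from the remaining features, as in the excerpt above."""
    label_name = "label" if "label" in features[0] else "labels"
    labels = [f[label_name] for f in features] if label_name in features[0] else None
    # A batch whose labels are all None is treated as having no labels.
    if labels is not None and all(label is None for label in labels):
        labels = None
    # New dicts are built, so the caller's feature dicts are not mutated.
    non_label_features = [
        {k: v for k, v in f.items() if k != label_name} for f in features
    ]
    return labels, non_label_features


def pad_labels(labels, label_pad_token_id=-100, pad_to_multiple_of=None,
               padding_side="right"):
    """Hypothetical helper: pad label lists to the longest one in the batch."""
    target = max(len(label) for label in labels)
    if pad_to_multiple_of is not None:
        # Round the target length up to the next multiple (ceiling division).
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    padded = []
    for label in labels:
        filler = [label_pad_token_id] * (target - len(label))
        if padding_side == "right":
            padded.append(list(label) + filler)
        else:
            padded.append(filler + list(label))
    return padded


features = [
    {"input_ids": [5, 6, 7], "labels": [1, 2]},
    {"input_ids": [8, 9], "labels": [3, 4, 5, 6, 7]},
]
labels, rest = split_labels(features)
padded = pad_labels(labels, pad_to_multiple_of=8)
print(padded)
```

With `pad_to_multiple_of=8`, the longest label list (length 5) sets a target length of 8, and the shorter list is filled with `-100` so that PyTorch loss functions skip those positions.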

Callers 15

test_return_sequences · Method · 0.90
test_bad_generation_config_fail_early · Method · 0.90
test_padding · Method · 0.90
test_with_tensor_inputs · Method · 0.90
test_max_length_padding · Method · 0.90
test_pad_to_multiple_of · Method · 0.90
test_custom_label_pad_token · Method · 0.90
test_do_not_pad · Method · 0.90
test_without_labels · Method · 0.90
test_numpy_output · Method · 0.90
test_immutability · Method · 0.90
main · Function · 0.90

Calls

no outgoing calls

Tested by 11

test_return_sequences · Method · 0.72
test_bad_generation_config_fail_early · Method · 0.72
test_padding · Method · 0.72
test_with_tensor_inputs · Method · 0.72
test_max_length_padding · Method · 0.72
test_pad_to_multiple_of · Method · 0.72
test_custom_label_pad_token · Method · 0.72
test_do_not_pad · Method · 0.72
test_without_labels · Method · 0.72
test_numpy_output · Method · 0.72
test_immutability · Method · 0.72