hub / github.com/huggingface/transformers / DataCollatorWithPadding

Class DataCollatorWithPadding

src/transformers/data/data_collator.py:191–239

Data collator that dynamically pads the inputs it receives. By default (`padding=True`), each batch is padded to the longest sequence in that batch. It takes a `tokenizer` ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]) and the optional `padding`, `max_length`, `pad_to_multiple_of`, and `return_tensors` arguments documented in the source below.
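To make "dynamic" padding concrete, here is a minimal pure-Python sketch of padding to the longest sequence in a batch. It is not the library's implementation: the helper name `pad_to_longest`, the `pad_id=0` default, and the choice of right-padding are assumptions made for illustration only (a real tokenizer supplies its own padding side and padding index).

```python
from typing import Any


def pad_to_longest(features: list[dict[str, Any]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    """Pad every `input_ids` list to the longest one in this batch (hypothetical helper)."""
    target = max(len(f["input_ids"]) for f in features)
    batch: dict[str, list[list[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = target - len(ids)
        # Assumption: pad on the right with `pad_id`; real tokenizers may pad on the left.
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# The padded width depends on each batch, not on a global maximum.
short_batch = pad_to_longest([{"input_ids": [5, 6]}, {"input_ids": [7, 8, 9]}])
long_batch = pad_to_longest([{"input_ids": [1] * 4}, {"input_ids": [2] * 10}])
print(len(short_batch["input_ids"][0]))  # 3
print(len(long_batch["input_ids"][0]))   # 10
```

Because the width is chosen per batch, batches of short sequences carry little padding, which is the main reason to pad in the collator rather than once over the whole dataset.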

Source from the content-addressed store, hash-verified

@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", or "pt".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    return_tensors: str = "pt"

    def __call__(self, features: list[dict[str, Any]]) -> dict[str, Any]:
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
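Two details of the source above are easy to miss: how `pad_to_multiple_of` changes the padded width, and how `__call__` normalizes label keys after padding. The sketch below illustrates both in plain Python. The helper names `padded_length` and `rename_label_keys` are hypothetical and not part of transformers; the rounding rule is an assumption based on the docstring's "pad the sequence to a multiple of the provided value", while the renaming logic mirrors the two `if` blocks in `__call__` (working on a copy rather than in place).

```python
from typing import Any, Optional


def padded_length(longest: int, pad_to_multiple_of: Optional[int] = None) -> int:
    """Round `longest` up to the next multiple of `pad_to_multiple_of` (hypothetical helper)."""
    if pad_to_multiple_of is None:
        return longest
    # Ceiling division, then scale back up to the multiple.
    return -(-longest // pad_to_multiple_of) * pad_to_multiple_of


def rename_label_keys(batch: dict[str, Any]) -> dict[str, Any]:
    """Move `label` / `label_ids` to `labels`, as the end of `__call__` does."""
    out = dict(batch)  # shallow copy so the caller's dict is left untouched
    if "label" in out:
        out["labels"] = out.pop("label")
    if "label_ids" in out:
        # Checked second, so `label_ids` wins if both keys are present.
        out["labels"] = out.pop("label_ids")
    return out


print(padded_length(13))      # 13
print(padded_length(13, 8))   # 16
print(padded_length(16, 8))   # 16
print(rename_label_keys({"input_ids": [[1, 2]], "label": [0]}))
# {'input_ids': [[1, 2]], 'labels': [0]}
```

Renaming to `labels` matters because that is the keyword most model `forward` methods expect, whereas datasets often store the target under `label` or `label_ids`.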

Callers 15

main · Function · 0.90

test_dynamic_padding · Method · 0.90

test_max_length_padding · Method · 0.90

test_pad_to_multiple_of · Method · 0.90

test_numpy_output · Method · 0.90

test_attention_mask_generated · Method · 0.90

test_immutability · Method · 0.90

test_4d_attention_mask_preserved · Method · 0.90

test_4d_attention_mask_asymmetric_kv · Method · 0.90

main · Function · 0.90

Calls

no outgoing calls

Tested by 8

test_dynamic_padding · Method · 0.72

test_max_length_padding · Method · 0.72

test_pad_to_multiple_of · Method · 0.72

test_numpy_output · Method · 0.72

test_attention_mask_generated · Method · 0.72

test_immutability · Method · 0.72

test_4d_attention_mask_preserved · Method · 0.72

test_4d_attention_mask_asymmetric_kv · Method · 0.72