hub / github.com/huggingface/transformers / DataCollatorWithPadding

Class DataCollatorWithPadding

src/transformers/data/data_collator.py:191–239

Data collator that will dynamically pad the inputs received.
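To make "dynamic padding" concrete, here is a minimal pure-Python sketch of the padding strategies the class documents (`True`/`'longest'`, `'max_length'`, `False`/`'do_not_pad'`, plus `pad_to_multiple_of`). It is an illustration only, not the library's implementation: `pad_batch` and its `pad_id` parameter are hypothetical names, right-side padding is assumed, and the real collator delegates this work to the tokenizer.

```python
import math


def pad_batch(sequences, padding=True, max_length=None,
              pad_to_multiple_of=None, pad_id=0):
    """Right-pad lists of token ids (hypothetical helper, not a transformers API)."""
    if padding in (False, "do_not_pad"):
        # No padding: sequences may keep different lengths.
        return [list(seq) for seq in sequences]
    if padding in (True, "longest"):
        # Dynamic padding: the target is the longest sequence in this batch.
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        # Assumption of this sketch: max_length must be given explicitly.
        if max_length is None:
            raise ValueError("this sketch needs max_length for 'max_length' padding")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    if pad_to_multiple_of is not None:
        # Round the target length up to the next multiple.
        target = math.ceil(target / pad_to_multiple_of) * pad_to_multiple_of
    # Sequences already longer than the target are left unchanged (no truncation).
    return [list(seq) + [pad_id] * (target - len(seq)) for seq in sequences]


batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
print(pad_batch(batch))                        # padded to length 5
print(pad_batch(batch, pad_to_multiple_of=8))  # padded to length 8
```

Because the target length is computed per batch, short batches stay short, which is the main advantage over padding every example to a fixed length up front.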

Source from the content-addressed store, hash-verified

@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", or "pt".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    return_tensors: str = "pt"

    def __call__(self, features: list[dict[str, Any]]) -> dict[str, Any]:
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
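The last step of `__call__` renames the `label` and `label_ids` keys to `labels`. The sketch below isolates that step on a plain dictionary. `normalize_label_keys` is a hypothetical helper written for illustration, not part of transformers. Unlike the method above, which mutates the padded batch in place, this version works on a shallow copy.

```python
def normalize_label_keys(batch):
    """Return a copy of `batch` with `label` / `label_ids` renamed to `labels`.

    Hypothetical helper mirroring the key handling in `__call__`.
    """
    batch = dict(batch)  # shallow copy so the caller's dict is left untouched
    for old_key in ("label", "label_ids"):
        if old_key in batch:
            # Same order as the source: if both keys exist, `label_ids` wins.
            batch["labels"] = batch.pop(old_key)
    return batch


features = {"input_ids": [[101, 102]], "label": [1]}
print(normalize_label_keys(features))
```

The renaming means a dataset that stores its targets under `label` or `label_ids` still produces a batch keyed by `labels`.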

Callers 15

main · Function · 0.90
test_dynamic_padding · Method · 0.90
test_numpy_output · Method · 0.90
test_immutability · Method · 0.90
main · Function · 0.90
main · Function · 0.90
main · Function · 0.90

Calls

no outgoing calls