@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set, pads the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np" or "pt".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: bool | str | PaddingStrategy = True
    max_length: int | None = None
    pad_to_multiple_of: int | None = None
    return_tensors: str = "pt"

    def __call__(self, features: list[dict[str, Any]]) -> dict[str, Any]:
        # Pad every feature in the batch according to the configured strategy.
        batch = pad_without_fast_tokenizer_warning(
            self.tokenizer,
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        # Rename `label` / `label_ids` to `labels`; if both are present, `label_ids` wins.
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch


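The padding behaviour described in the docstring can be illustrated with a small, dependency-free sketch. This is not the library's implementation: `pad_longest` is a hypothetical helper, it pads on the right only, it assumes a pad token id of `0`, and it handles just `input_ids`, `attention_mask`, and the label keys. It shows the `'longest'` strategy, the rounding applied by `pad_to_multiple_of`, and the renaming of `label` / `label_ids` to `labels`.

```python
from typing import Any, Optional


def pad_longest(
    features: list[dict[str, Any]],
    pad_token_id: int = 0,  # assumption: real tokenizers define their own pad id
    pad_to_multiple_of: Optional[int] = None,
) -> dict[str, list]:
    """Hypothetical stand-in for the collator: right-pad `input_ids` to the longest sequence."""
    target = max(len(f["input_ids"]) for f in features)
    if pad_to_multiple_of is not None:
        # Round the target length up to the next multiple (ceiling division).
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    batch: dict[str, list] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = target - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)

    # Mirror the key renaming in `__call__`: `label` / `label_ids` -> `labels`.
    for key in ("label", "label_ids"):
        if all(key in f for f in features):
            batch["labels"] = [f[key] for f in features]
    return batch


features = [
    {"input_ids": [101, 7592, 102], "label": 1},
    {"input_ids": [101, 7592, 2088, 999, 102], "label": 0},
]
batch = pad_longest(features, pad_to_multiple_of=8)
print(batch["input_ids"][0])
print(batch["labels"])
```

With `pad_to_multiple_of=8`, the longest sequence (5 tokens) is rounded up to 8, so both rows come out with length 8. Without it, both rows would be padded to length 5.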
@dataclass