
@dataclass
class TrainingArguments:
    """
    Configuration class for controlling all aspects of model training with the Trainer.
    TrainingArguments centralizes all hyperparameters, optimization settings, logging preferences, and infrastructure choices needed for training.

    [`HfArgumentParser`] can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        output_dir (`str`, *optional*, defaults to `"trainer_output"`):
            The output directory where the model predictions and checkpoints will be written.

        > Training Duration and Batch Size

        per_device_train_batch_size (`int`, *optional*, defaults to 8):
            The batch size *per device*. The **global batch size** is computed as
            `per_device_train_batch_size * number_of_devices` in multi-GPU or distributed setups.
        num_train_epochs (`float`, *optional*, defaults to 3.0):
            Total number of training epochs to perform. If not an integer, training runs the fractional part of
            the last epoch before stopping (for example, 2.5 stops halfway through the third epoch).
        max_steps (`int`, *optional*, defaults to -1):
            If set to a positive number, the total number of training steps to perform. Overrides `num_train_epochs`.
            For a finite dataset, training iterates over the dataset again (once all data is exhausted) until
            `max_steps` is reached.

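The interaction between the batch size and duration parameters above can be sketched in plain Python. This is only an illustration under stated assumptions: `global_batch_size` and `total_training_steps` are hypothetical helper names, not part of the Trainer API, and the sketch assumes no gradient accumulation and that the last, smaller batch of an epoch still counts as a step.

```python
import math


def global_batch_size(per_device_train_batch_size: int, number_of_devices: int) -> int:
    # Global batch size as described above: per-device batch size times device count.
    return per_device_train_batch_size * number_of_devices


def total_training_steps(
    num_examples: int,
    per_device_train_batch_size: int = 8,
    number_of_devices: int = 1,
    num_train_epochs: float = 3.0,
    max_steps: int = -1,
) -> int:
    # Hypothetical helper that estimates the number of optimizer steps.
    # A positive max_steps overrides num_train_epochs.
    if max_steps > 0:
        return max_steps
    # Assumption: the last, smaller batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(
        num_examples / global_batch_size(per_device_train_batch_size, number_of_devices)
    )
    # A fractional epoch count stops part-way through the last epoch.
    return math.ceil(steps_per_epoch * num_train_epochs)


if __name__ == "__main__":
    print(global_batch_size(8, 4))  # 32
    print(total_training_steps(1000, 8, 4, num_train_epochs=2.5))  # 80
    print(total_training_steps(1000, 8, 4, max_steps=50))  # 50
```

With 1000 examples and a global batch size of 32, one epoch is 32 steps, so 2.5 epochs end after 80 steps, while a positive `max_steps` simply replaces the epoch-based count.
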
        > Learning Rate & Scheduler

        learning_rate (`float`, *optional*, defaults to 5e-5):
            The initial learning rate for the optimizer. This is typically the peak learning rate when using a scheduler with warmup.
        lr_scheduler_type (`str` or [`SchedulerType`], *optional*, defaults to `"linear"`):
            The learning rate scheduler type to use. See [`SchedulerType`] for all possible values. Common choices:
                - `"linear"` = [`get_linear_schedule_with_warmup`]
                - `"cosine"` = [`get_cosine_schedule_with_warmup`]
                - `"constant"` = [`get_constant_schedule`]
                - `"constant_with_warmup"` = [`get_constant_schedule_with_warmup`]
        lr_scheduler_kwargs (`dict` or `str`, *optional*, defaults to `None`):
            The extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
        warmup_steps (`int` or `float`, *optional*, defaults to 0):
            Number of steps for a linear warmup from 0 to `learning_rate`. Warmup helps stabilize training in the initial phase. Can be:
                - An integer: exact number of warmup steps
                - A float in range [0, 1): interpreted as ratio of total training steps

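To make the warmup semantics above concrete, here is a minimal pure-Python sketch of resolving `warmup_steps` and of a linear schedule with warmup. The function names are hypothetical and the code is not the library implementation: rounding a ratio up to a whole step is an assumption, and the decay branch simply follows the usual shape of a linear schedule (ramp from 0 to `learning_rate`, then decay linearly to 0).

```python
import math


def resolve_warmup_steps(warmup_steps, total_steps: int) -> int:
    # A float in [0, 1) is read as a ratio of the total training steps.
    # Assumption: the resulting step count is rounded up.
    if isinstance(warmup_steps, float) and 0 <= warmup_steps < 1:
        return math.ceil(total_steps * warmup_steps)
    # Anything else is treated as an exact number of warmup steps.
    return int(warmup_steps)


def linear_lr(step: int, learning_rate: float, warmup: int, total_steps: int) -> float:
    # Warmup phase: ramp linearly from 0 up to learning_rate.
    if step < warmup:
        return learning_rate * step / max(1, warmup)
    # Decay phase: go linearly from learning_rate down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup)
    return learning_rate * max(0.0, remaining)


if __name__ == "__main__":
    total = 1000
    warmup = resolve_warmup_steps(0.25, total)  # ratio of total steps
    for step in (0, warmup // 2, warmup, total):
        print(step, linear_lr(step, 5e-5, warmup, total))
```

The learning rate reaches its peak exactly at the end of warmup, which is why `learning_rate` is described as the peak value when a scheduler with warmup is used.
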
        > Optimizer

        optim (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_torch"` (`"adamw_torch_fused"` for torch>=2.8)):
            The optimizer to use. Common options:
                - `"adamw_torch"`: PyTorch's AdamW (recommended default)
                - `"adamw_torch_fused"`: Fused AdamW kernel
                - `"adamw_hf"`: Hugging Face's AdamW implementation
                - `"sgd"`: Stochastic Gradient Descent with momentum
                - `"adafactor"`: Memory-efficient optimizer for large models
                - `"adamw_8bit"`: 8-bit AdamW (requires bitsandbytes)
            See [`OptimizerNames`] for the complete list.
        optim_args (`str`, *optional*):
            Optional arguments that are supplied to optimizers such as AnyPrecisionAdamW, AdEMAMix, and GaLore.
        weight_decay (`float`, *optional*, defaults to 0):
            Weight decay coefficient applied by the optimizer (not the loss function). Adds L2