Class TrainingArguments

src/transformers/training_args.py:179–2848

Configuration class for controlling all aspects of model training with the Trainer. TrainingArguments centralizes all hyperparameters, optimization settings, logging preferences, and infrastructure choices needed for training. [`HfArgumentParser`] can turn this class into [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the command line.
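
To make the dataclass-to-command-line idea concrete, here is a minimal standard-library sketch of the general mechanism. It is not how `HfArgumentParser` is implemented. `MiniTrainingArguments` and `parse_into_dataclass` are hypothetical names introduced only for illustration; the field defaults are copied from the docstring.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class MiniTrainingArguments:
    # Hypothetical stand-in for TrainingArguments with a handful of fields.
    output_dir: str = "trainer_output"
    per_device_train_batch_size: int = 8
    num_train_epochs: float = 3.0
    max_steps: int = -1
    learning_rate: float = 5e-5


def parse_into_dataclass(cls, argv):
    """Expose every dataclass field as a --flag and build an instance."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        # Assumption: each field has a simple default whose type can be
        # reused to convert the command-line string.
        parser.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
    namespace = parser.parse_args(argv)
    return cls(**vars(namespace))


# Equivalent to: python train.py --learning_rate 3e-5 --max_steps 1000
args = parse_into_dataclass(
    MiniTrainingArguments, ["--learning_rate", "3e-5", "--max_steps", "1000"]
)
print(args)
```

Flags that are not passed keep their dataclass defaults, so the configuration stays defined in one place.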

Source from the content-addressed store, hash-verified


@dataclass
class TrainingArguments:
    """
    Configuration class for controlling all aspects of model training with the Trainer.
    TrainingArguments centralizes all hyperparameters, optimization settings, logging preferences, and infrastructure choices needed for training.

    [`HfArgumentParser`] can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        output_dir (`str`, *optional*, defaults to `"trainer_output"`):
            The output directory where the model predictions and checkpoints will be written.

        > Training Duration and Batch Size

        per_device_train_batch_size (`int`, *optional*, defaults to 8):
            The batch size *per device*. The **global batch size** is computed as:
            `per_device_train_batch_size * number_of_devices` in multi-GPU or distributed setups.
        num_train_epochs (`float`, *optional*, defaults to 3.0):
            Total number of training epochs to perform (if not an integer, the fractional part is run as that
            fraction of the last epoch before training stops).
        max_steps (`int`, *optional*, defaults to -1):
            Overrides `num_train_epochs`. If set to a positive number, the total number of training steps to perform.
            For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until
            `max_steps` is reached.

        > Learning Rate & Scheduler

        learning_rate (`float`, *optional*, defaults to 5e-5):
            The initial learning rate for the optimizer. This is typically the peak learning rate when using a scheduler with warmup.
        lr_scheduler_type (`str` or [`SchedulerType`], *optional*, defaults to `"linear"`):
            The learning rate scheduler type to use. See [`SchedulerType`] for all possible values. Common choices:
                - "linear" = [`get_linear_schedule_with_warmup`]
                - "cosine" = [`get_cosine_schedule_with_warmup`]
                - "constant" = [`get_constant_schedule`]
                - "constant_with_warmup" = [`get_constant_schedule_with_warmup`]
        lr_scheduler_kwargs (`dict` or `str`, *optional*, defaults to `None`):
            The extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
        warmup_steps (`int` or `float`, *optional*, defaults to 0):
            Number of steps for a linear warmup from 0 to `learning_rate`. Warmup helps stabilize training in the initial phase. Can be:
                - An integer: exact number of warmup steps
                - A float in range [0, 1): interpreted as ratio of total training steps

        > Optimizer

        optim (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_torch"` (for torch>=2.8 `"adamw_torch_fused"`)):
            The optimizer to use. Common options:
                - `"adamw_torch"`: PyTorch's AdamW (recommended default)
                - `"adamw_torch_fused"`: Fused AdamW kernel
                - `"adamw_hf"`: HuggingFace's AdamW implementation
                - `"sgd"`: Stochastic Gradient Descent with momentum
                - `"adafactor"`: Memory-efficient optimizer for large models
                - `"adamw_8bit"`: 8-bit AdamW (requires bitsandbytes)
            See [`OptimizerNames`] for the complete list.
        optim_args (`str`, *optional*):
            Optional arguments that are supplied to optimizers such as AnyPrecisionAdamW, AdEMAMix, and GaLore.
        weight_decay (`float`, *optional*, defaults to 0):
            Weight decay coefficient applied by the optimizer (not the loss function). Adds L2
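
The batch-size, step-count, and warmup rules stated in the docstring above can be sketched as plain arithmetic. This is a simplified illustration, not the Trainer's actual code. The helper names are hypothetical, and the rounding (`math.ceil`) and the omission of gradient accumulation are assumptions made to keep the example short.

```python
import math


def global_batch_size(per_device_train_batch_size: int, number_of_devices: int) -> int:
    # Formula quoted in the docstring for multi-GPU or distributed setups.
    return per_device_train_batch_size * number_of_devices


def total_training_steps(num_examples: int, global_batch: int,
                         num_train_epochs: float = 3.0, max_steps: int = -1) -> int:
    # A positive max_steps overrides num_train_epochs.
    if max_steps > 0:
        return max_steps
    # Assumption: one optimizer step per global batch, last partial batch kept.
    steps_per_epoch = math.ceil(num_examples / global_batch)
    # A fractional epoch count runs only that fraction of the last epoch.
    return math.ceil(steps_per_epoch * num_train_epochs)


def resolve_warmup_steps(warmup_steps, total_steps: int) -> int:
    # A float in [0, 1) is read as a ratio of the total training steps.
    if isinstance(warmup_steps, float) and 0 <= warmup_steps < 1:
        return math.ceil(total_steps * warmup_steps)  # rounding is an assumption
    # Otherwise the value is taken as an exact number of steps.
    return int(warmup_steps)


batch = global_batch_size(per_device_train_batch_size=8, number_of_devices=4)
steps = total_training_steps(num_examples=1000, global_batch=batch)
capped = total_training_steps(num_examples=1000, global_batch=batch, max_steps=500)
warmup_from_ratio = resolve_warmup_steps(0.1, steps)
warmup_exact = resolve_warmup_steps(100, steps)
print(batch, steps, capped, warmup_from_ratio, warmup_exact)
```

With 1000 examples on 4 devices, the default of 3 epochs gives 96 steps under these assumptions, while `max_steps=500` makes training loop over the dataset again until 500 steps are reached.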

Calls 2

is_torch_available  ·  Function  ·  0.85
keys  ·  Method  ·  0.45