hub / github.com/huggingface/transformers / log_metrics

Function log_metrics

src/transformers/trainer_pt_utils.py:830–917

Log metrics in a specially formatted way. In a distributed environment, this is done only for the process with rank 0. Args: split (`str`): Mode/split name: one of `train`, `eval`, `test` metrics (`dict[str, float]`): The metrics returned from tra

(self, split, metrics)

Source

# Trainer helper method: imported into the Trainer class and used as a method (takes `self` as first argument).
def log_metrics(self, split, metrics):
    """
    Log metrics in a specially formatted way.

    In a distributed environment, this is done only for the process with rank 0.

    Args:
        split (`str`):
            Mode/split name: one of `train`, `eval`, `test`
        metrics (`dict[str, float]`):
            The metrics returned from train/evaluate/predict

    Notes on memory reports:

    To get a memory usage report you need to install `psutil`. You can do that with `pip install psutil`.

    Now when this method is run, you will see a report that includes:

    ```
    init_mem_cpu_alloc_delta = 1301MB
    init_mem_cpu_peaked_delta = 154MB
    init_mem_gpu_alloc_delta = 230MB
    init_mem_gpu_peaked_delta = 0MB
    train_mem_cpu_alloc_delta = 1345MB
    train_mem_cpu_peaked_delta = 0MB
    train_mem_gpu_alloc_delta = 693MB
    train_mem_gpu_peaked_delta = 7MB
    ```

    Understanding the reports:

    - the first segment, e.g., `train_`, tells you which stage the metrics are for. Reports starting with `init_`
      will be added to the first stage that gets run. So if only evaluation is run, the memory usage for the
      `__init__` will be reported along with the `eval_` metrics.
    - the third segment is either `cpu` or `gpu` and tells you whether it's the general RAM or the gpu0 memory
      metric.
    - `*_alloc_delta` is the difference in the used/allocated memory counter between the end and the start of the
      stage. It can be negative if a function released more memory than it allocated.
    - `*_peaked_delta` is any extra memory that was consumed and then freed, relative to the current allocated
      memory counter. It is never negative. When you look at the metrics of any stage, you add up `alloc_delta` +
      `peaked_delta` and you know how much memory was needed to complete that stage.

    The reporting happens only for the process of rank 0 and gpu 0 (if there is a gpu). Typically this is enough since
    the main process does the bulk of the work, but this may not hold if model parallelism is used, since other GPUs
    may then use a different amount of gpu memory. It also does not hold under DataParallel, where gpu0 may require
    much more memory than the rest since it stores the gradient and optimizer states for all participating GPUs.
    Perhaps in the future these reports will evolve to measure those too.

    The CPU RAM metric measures RSS (Resident Set Size), which includes both the memory that is unique to the process
    and the memory shared with other processes. It is important to note that it does not include swapped out memory,
    so the reports could be imprecise.

    The CPU peak memory is measured using a sampling thread. Due to Python's GIL it may miss some of the peak memory
    if that thread didn't get a chance to run when the highest memory was used. Therefore this report can be less than
    reality. Using `tracemalloc` would have reported the exact peak memory, but it doesn't report memory allocations
    outside of Python. So if some C++ CUDA extension allocated its own memory, it won't be reported. It was therefore
    dropped in favor of the memory sampling approach, which reads the current process memory usage.
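
The sampling approach described in the docstring can be sketched as follows. This is a minimal illustration, not the library's tracker: `PeakSampler`, `read_bytes`, and the fake `usage` counter are hypothetical names, and the `peaked_delta` formula is an assumption based on the docstring's wording (measured relative to the current counter, never negative). For real RSS readings one could pass `lambda: psutil.Process().memory_info().rss` as `read_bytes`.

```python
import threading
import time


class PeakSampler:
    """Track a memory counter from a background thread (illustrative only)."""

    def __init__(self, read_bytes, interval=0.001):
        self.read_bytes = read_bytes  # callable returning the current usage
        self.interval = interval      # seconds between samples
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample until asked to stop; a spike that falls between samples is missed.
        while not self._stop.is_set():
            self.peak = max(self.peak, self.read_bytes())
            self._stop.wait(self.interval)

    def __enter__(self):
        self.start = self.peak = self.read_bytes()
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        end = self.read_bytes()
        self.alloc_delta = end - self.start          # may be negative
        self.peaked_delta = max(0, self.peak - end)  # assumed formula, never negative
        return False


# Hypothetical workload: a fake counter stands in for real process memory.
usage = {"bytes": 100}

with PeakSampler(lambda: usage["bytes"]) as sampler:
    usage["bytes"] = 180  # temporary spike
    time.sleep(0.05)      # give the sampling thread a chance to observe it
    usage["bytes"] = 140  # part of the spike is released

print(sampler.alloc_delta, sampler.peaked_delta)
```

Here `alloc_delta` is 40. `peaked_delta` is 40 only if the thread happened to sample during the spike and 0 otherwise, which is the undercounting the docstring warns about.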
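
To make the `alloc_delta` + `peaked_delta` rule concrete, the sketch below parses the sample report from the docstring and adds the two deltas per stage and device. The `stage_totals` helper and its regular expression are illustrative assumptions, not part of transformers.

```python
import re

# Sample report copied from the docstring.
REPORT = """\
init_mem_cpu_alloc_delta = 1301MB
init_mem_cpu_peaked_delta = 154MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 1345MB
train_mem_cpu_peaked_delta = 0MB
train_mem_gpu_alloc_delta = 693MB
train_mem_gpu_peaked_delta = 7MB
"""

# <stage>_mem_<cpu|gpu>_<alloc|peaked>_delta = <N>MB
LINE = re.compile(r"^(\w+?)_mem_(cpu|gpu)_(alloc|peaked)_delta\s*=\s*(-?\d+)MB$")


def stage_totals(report):
    """Return {(stage, device): alloc_delta + peaked_delta} in MB."""
    totals = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a memory metric
        stage, device, _kind, megabytes = match.groups()
        key = (stage, device)
        totals[key] = totals.get(key, 0) + int(megabytes)
    return totals


totals = stage_totals(REPORT)
for (stage, device), megabytes in sorted(totals.items()):
    print(f"{stage} {device}: {megabytes}MB")
```

For the sample report, the `train` stage needed 693 + 7 = 700MB on gpu0 and 1345 + 0 = 1345MB of CPU RAM, while `init` needed 1301 + 154 = 1455MB of CPU RAM and 230MB on gpu0.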

Callers

nothing calls this directly

Calls 4

metrics_format · Function · 0.85

is_world_process_zero · Method · 0.80

values · Method · 0.45

keys · Method · 0.45
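
The callees listed above suggest the general shape of the function body, which the truncated listing does not show. The sketch below is a guess at that shape, not the library's code: `log_metrics_sketch` and `metrics_format_sketch` are hypothetical stand-ins, the rank check is reduced to a boolean argument, and the banner text and four-decimal float format are assumptions.

```python
def metrics_format_sketch(metrics):
    """Hypothetical stand-in for `metrics_format`: turn every value into a string."""
    formatted = {}
    for key, value in metrics.items():
        if isinstance(value, float):
            formatted[key] = f"{value:.4f}"  # assumed precision
        else:
            formatted[key] = str(value)
    return formatted


def log_metrics_sketch(split, metrics, is_world_process_zero=True):
    """Print metrics as an aligned table, on the main process only."""
    if not is_world_process_zero:
        return []  # other ranks stay silent
    formatted = metrics_format_sketch(metrics)
    # Column widths come from the longest key and the longest formatted value.
    key_width = max(len(key) for key in formatted.keys())
    value_width = max(len(value) for value in formatted.values())
    lines = [f"***** {split} metrics *****"]
    for key in sorted(formatted):
        lines.append(f"  {key:<{key_width}} = {formatted[key]:>{value_width}}")
    print("\n".join(lines))
    return lines


lines = log_metrics_sketch("eval", {"eval_loss": 0.51234, "eval_samples": 872})
```

Returning the printed lines is a convenience of this sketch so the output can be inspected; nothing in the listing indicates that the real method returns anything.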

Tested by

no test coverage detected