class DebugUnderflowOverflow:
    """
This debug class helps detect and understand where the model's weights or activations start getting very large or
very small and, more importantly, where `nan` or `inf` weight and activation elements appear.

There are 2 working modes:

1. Underflow/overflow detection (default)
2. Specific batch absolute min/max tracing without detection

Mode 1: Underflow/overflow detection

To activate the underflow/overflow detection, initialize the object with the model:

```python
debug_overflow = DebugUnderflowOverflow(model)
```

Then run the training as normal. If `nan` or `inf` is detected in at least one of the weight, input or output
elements, this module throws an exception and prints the `max_frames_to_save` frames that led up to the event,
each frame reporting:

1. the fully qualified module name plus the class name whose `forward` was run
2. the absolute min and max value of all elements of each of the module's weights, inputs and outputs

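The reporting scheme described above can be sketched in plain Python. This is only an illustration under stated assumptions, not this class's actual implementation: `FrameBuffer`, `InfNanDetected`, `abs_min_max` and `has_inf_or_nan` are hypothetical names, plain lists of floats stand in for tensors, and a bounded `deque` stands in for whatever structure holds the last `max_frames_to_save` frames.

```python
import math
from collections import deque


class InfNanDetected(Exception):
    # Hypothetical exception type that carries the saved frames.
    def __init__(self, frames):
        super().__init__("Detected inf/nan")
        self.frames = frames


def abs_min_max(values):
    # Absolute min and max over a flat list of floats (a stand-in for a tensor).
    magnitudes = [abs(v) for v in values]
    return min(magnitudes), max(magnitudes)


def has_inf_or_nan(values):
    return any(math.isinf(v) or math.isnan(v) for v in values)


class FrameBuffer:
    # Hypothetical helper: remembers only the most recent `max_frames_to_save` frames.
    def __init__(self, max_frames_to_save):
        self.frames = deque(maxlen=max_frames_to_save)

    def record(self, module_name, class_name, tensors):
        # `tensors` maps a label ("weight", "input[0]", "output") to a list of floats.
        frame = [f"{module_name} {class_name}"]
        detected = False
        for label, values in tensors.items():
            abs_min, abs_max = abs_min_max(values)
            frame.append(f"{abs_min:8.2e} {abs_max:8.2e} {label}")
            detected = detected or has_inf_or_nan(values)
        self.frames.append(frame)
        if detected:
            # Report the frames that led up to the event, oldest first.
            raise InfNanDetected(list(self.frames))


buffer = FrameBuffer(max_frames_to_save=2)
saved = []
try:
    buffer.record("block.0.wi", "Linear", {"input[0]": [0.5, -2.0], "output": [1.0, 3.0]})
    buffer.record("block.0.wo", "Linear", {"input[0]": [1.0, 3.0], "output": [4.0, -6.0e4]})
    buffer.record("block.0.dropout", "Dropout", {"input[0]": [4.0, -6.0e4], "output": [0.0, float("inf")]})
except InfNanDetected as err:
    saved = err.frames

for frame in saved:
    for line in frame:
        print(line)
```

Because the buffer is bounded, only the two most recent frames survive in this toy run: the oldest one (`block.0.wi`) has already been dropped by the time `inf` is detected.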
For example, here is the header and the last few frames of the detection report for `google/mt5-small` run in fp16
mixed precision:

```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
[...]
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
```

You can see here that `T5DenseGatedGeluDense.forward` produced output activations whose absolute max value was
around 62.7K, which is very close to fp16's upper limit of 65504 (about 64K). In the next frame, `Dropout` zeroes
some of the elements and rescales the remaining ones, which pushes the absolute max value above that limit,
and we get an overflow.

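The arithmetic behind this overflow can be checked with a few lines of standard-library Python. The sketch assumes a dropout probability of `p = 0.1`, which the report does not state, and that dropout scales every surviving element by `1 / (1 - p)`, as PyTorch's `Dropout` does in training mode. `fits_in_fp16` is a hypothetical helper built on the half-precision `e` format of the `struct` module.

```python
import struct

FP16_MAX = 65504.0  # largest finite value representable in IEEE 754 half precision


def fits_in_fp16(value):
    # Hypothetical helper: struct's "e" format packs a half-precision float and
    # raises OverflowError when the value is too large to represent.
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


abs_max_before_dropout = 6.27e04  # abs max of the `output` row that feeds `Dropout`
p = 0.1  # assumed dropout probability, not stated in the report

# Dropout in training mode scales every surviving element by 1 / (1 - p).
abs_max_after_dropout = abs_max_before_dropout / (1 - p)

print(fits_in_fp16(abs_max_before_dropout))
print(abs_max_after_dropout > FP16_MAX)
print(fits_in_fp16(abs_max_after_dropout))
```

Under that assumption the largest surviving element grows to roughly 69.7K, which no longer fits in fp16 and matches the `inf` reported for the `Dropout` output.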