| 167 | |
| 168 | @dataclass |
| 169 | class BitsAndBytesConfig(QuantizationConfigMixin): |
| 170 | """ |
| 171 | This is a wrapper class for all the attributes and features that you can configure on a model that has been |
| 172 | loaded using `bitsandbytes`. |
| 173 | |
| 174 | This replaces `load_in_8bit` or `load_in_4bit`; the two options are mutually exclusive. |
| 175 | |
| 176 | Currently only supports `LLM.int8()`, `FP4`, and `NF4` quantization. If more methods are added to `bitsandbytes`, |
| 177 | then more arguments will be added to this class. |
| 178 | |
| 179 | Args: |
| 180 | load_in_8bit (`bool`, *optional*, defaults to `False`): |
| 181 | This flag is used to enable 8-bit quantization with LLM.int8(). |
| 182 | load_in_4bit (`bool`, *optional*, defaults to `False`): |
| 183 | This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from |
| 184 | `bitsandbytes`. |
| 185 | llm_int8_threshold (`float`, *optional*, defaults to 6.0): |
| 186 | This corresponds to the outlier threshold for outlier detection as described in the `LLM.int8(): 8-bit Matrix |
| 187 | Multiplication for Transformers at Scale` paper (https://huggingface.co/papers/2208.07339). Any hidden-state |
| 188 | value that is above this threshold will be considered an outlier and the operation on those values will be |
| 189 | done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], |
| 190 | but there are some exceptional systematic outliers that are very differently distributed for large models. |
| 191 | These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of |
| 192 | magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, |
| 193 | but a lower threshold might be needed for more unstable models (small models, fine-tuning). |
| 194 | llm_int8_skip_modules (`list[str]`, *optional*): |
| 195 | An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as |
| 196 | Jukebox, which has several heads in different places and not necessarily at the last position. For example, |
| 197 | for `CausalLM` models, the last `lm_head` is typically kept in its original `dtype`. |
| 198 | llm_int8_enable_fp32_cpu_offload (`bool`, *optional*, defaults to `False`): |
| 199 | This flag is used for advanced use cases and users that are aware of this feature. If you want to split |
| 200 | your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use |
| 201 | this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8 |
| 202 | operations will not be run on CPU. |
| 203 | llm_int8_has_fp16_weight (`bool`, *optional*, defaults to `False`): |
| 204 | This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not |
| 205 | have to be converted back and forth for the backward pass. |
| 206 | bnb_4bit_compute_dtype (`torch.dtype` or `str`, *optional*, defaults to `torch.float32`): |
| 207 | This sets the computational type, which might be different from the input type. For example, inputs might be |
| 208 | fp32, but computation can be set to bf16 for speedups. |
| 209 | bnb_4bit_quant_type (`str`, *optional*, defaults to `"fp4"`): |
| 210 | This sets the quantization data type in the `bnb.nn.Linear4bit` layers. Options are FP4 and NF4 data types |
| 211 | which are specified by `fp4` or `nf4`. |
| 212 | bnb_4bit_use_double_quant (`bool`, *optional*, defaults to `False`): |
| 213 | This flag is used for nested quantization where the quantization constants from the first quantization are |
| 214 | quantized again. |
| 215 | bnb_4bit_quant_storage (`torch.dtype` or `str`, *optional*, defaults to `torch.uint8`): |
| 216 | This sets the storage type used to pack the quantized 4-bit params. |
| 217 | kwargs (`dict[str, Any]`, *optional*): |
| 218 | Additional parameters from which to initialize the configuration object. |
| 219 | """ |
| 220 | |
| 221 | _exclude_attributes_at_init = ["_load_in_4bit", "_load_in_8bit", "quant_method"] |
| 222 | |
| 223 | def __init__( |
| 224 | self, |
| 225 | load_in_8bit=False, |
| 226 | load_in_4bit=False, |
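
As a hedged usage sketch of the arguments documented in the docstring above: the first helper collects keyword arguments for a 4-bit NF4 setup with nested quantization and a bf16 compute dtype, and the second passes them to `BitsAndBytesConfig`. The helper names `nf4_kwargs` and `build_config` are hypothetical and not part of the library, and `build_config` assumes that `transformers`, `torch`, and `bitsandbytes` are installed.

```python
def nf4_kwargs():
    """Keyword arguments for a 4-bit NF4 config (hypothetical helper, not library API)."""
    return {
        "load_in_4bit": True,                  # replace Linear layers with 4-bit layers
        "bnb_4bit_quant_type": "nf4",          # "fp4" is the documented default
        "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
        "bnb_4bit_compute_dtype": "bfloat16",  # the docstring allows a torch.dtype or a str
    }


def build_config():
    """Build the config; the import is lazy because it needs transformers installed."""
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**nf4_kwargs())
```

The object returned by `build_config()` would typically be passed as `quantization_config=` to a model's `from_pretrained` call. Note that `load_in_8bit` is left out on purpose, since the docstring states the two loading modes are mutually exclusive.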
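
To make the `llm_int8_threshold` description concrete, here is a minimal pure-Python sketch of the outlier split it describes: feature columns containing any value whose magnitude exceeds the threshold are routed to the higher-precision path, and the remaining columns to int8. This is an illustration under simplifying assumptions (plain lists instead of tensors, a hypothetical `split_outlier_columns` helper, a strict `>` comparison on absolute value), not the `bitsandbytes` implementation.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Return (outlier_cols, regular_cols) as lists of column indices.

    A column counts as an outlier column if any entry has |value| > threshold.
    """
    n_cols = len(hidden_states[0])
    outlier_cols, regular_cols = [], []
    for j in range(n_cols):
        if any(abs(row[j]) > threshold for row in hidden_states):
            outlier_cols.append(j)   # would be computed in fp16
        else:
            regular_cols.append(j)   # would be quantized to int8
    return outlier_cols, regular_cols


# Toy batch of two hidden-state rows with four features each.
x = [
    [0.5, -1.2, 45.0, 3.1],
    [2.0, 0.3, -7.5, -3.4],
]
outliers, regular = split_outlier_columns(x)
print(outliers, regular)
```

Lowering the threshold, as the docstring suggests for less stable models, moves more columns onto the higher-precision path: with `threshold=3.0` the last column of the toy batch also becomes an outlier column.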