Class BitsAndBytesConfig

src/diffusers/quantizers/quantization_config.py:169–405

A configuration wrapper for the attributes and features available when a model is loaded with `bitsandbytes`. It replaces the separate `load_in_8bit` and `load_in_4bit` options, which are mutually exclusive. Currently only `LLM.int8()`, `FP4`, and `NF4` quantization are supported.
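The mutual exclusivity of the two loading flags can be illustrated with a small stand-in. This is a minimal sketch, not the diffusers implementation: `SketchQuantConfig` and its `quantization_method` helper are hypothetical names introduced here, and only the flag names and the `"fp4"` default are taken from the documented arguments.

```python
from dataclasses import dataclass


@dataclass
class SketchQuantConfig:
    """Hypothetical stand-in for illustration; not the diffusers class."""

    load_in_8bit: bool = False
    load_in_4bit: bool = False
    bnb_4bit_quant_type: str = "fp4"  # documented default

    def __post_init__(self):
        # The two loading modes are documented as mutually exclusive.
        if self.load_in_8bit and self.load_in_4bit:
            raise ValueError("load_in_8bit and load_in_4bit cannot both be True")
        # Only the two documented 4-bit data types are accepted in this sketch.
        if self.bnb_4bit_quant_type not in ("fp4", "nf4"):
            raise ValueError("bnb_4bit_quant_type must be 'fp4' or 'nf4'")

    def quantization_method(self):
        """Return a label for the active mode, or None when neither flag is set."""
        if self.load_in_8bit:
            return "llm_int8"
        if self.load_in_4bit:
            return self.bnb_4bit_quant_type
        return None
```

Under these assumptions, `SketchQuantConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")` selects the NF4 mode, while setting both flags raises a `ValueError` at construction time.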

Source

@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    """
    This is a wrapper class for all the attributes and features that you can use with a model that has been
    loaded using `bitsandbytes`.

    This replaces `load_in_8bit` or `load_in_4bit`; therefore, both options are mutually exclusive.

    Currently only supports `LLM.int8()`, `FP4`, and `NF4` quantization. If more methods are added to `bitsandbytes`,
    then more arguments will be added to this class.

    Args:
        load_in_8bit (`bool`, *optional*, defaults to `False`):
            This flag is used to enable 8-bit quantization with LLM.int8().
        load_in_4bit (`bool`, *optional*, defaults to `False`):
            This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from
            `bitsandbytes`.
        llm_int8_threshold (`float`, *optional*, defaults to 6.0):
            This corresponds to the outlier threshold for outlier detection as described in the `LLM.int8() : 8-bit
            Matrix Multiplication for Transformers at Scale` paper: https://huggingface.co/papers/2208.07339. Any
            hidden-state value above this threshold is considered an outlier, and the operation on those values is
            done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5],
            but there are some exceptional systematic outliers that are very differently distributed for large models.
            These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of
            magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6,
            but a lower threshold might be needed for more unstable models (small models, fine-tuning).
        llm_int8_skip_modules (`list[str]`, *optional*):
            An explicit list of the modules that should not be converted to 8-bit. This is useful for models such as
            Jukebox that have several heads in different places and not necessarily at the last position. For example,
            for `CausalLM` models, the last `lm_head` is typically kept in its original `dtype`.
        llm_int8_enable_fp32_cpu_offload (`bool`, *optional*, defaults to `False`):
            This flag is used for advanced use cases and users that are aware of this feature. If you want to split
            your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use
            this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8
            operations will not be run on CPU.
        llm_int8_has_fp16_weight (`bool`, *optional*, defaults to `False`):
            This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not
            have to be converted back and forth for the backward pass.
        bnb_4bit_compute_dtype (`torch.dtype` or str, *optional*, defaults to `torch.float32`):
            This sets the computational type, which might be different from the input type. For example, inputs might
            be fp32, but computation can be set to bf16 for speedups.
        bnb_4bit_quant_type (`str`, *optional*, defaults to `"fp4"`):
            This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types
            which are specified by `fp4` or `nf4`.
        bnb_4bit_use_double_quant (`bool`, *optional*, defaults to `False`):
            This flag is used for nested quantization where the quantization constants from the first quantization are
            quantized again.
        bnb_4bit_quant_storage (`torch.dtype` or str, *optional*, defaults to `torch.uint8`):
            This sets the storage type to pack the quantized 4-bit params.
        kwargs (`dict[str, Any]`, *optional*):
            Additional parameters from which to initialize the configuration object.
    """

    _exclude_attributes_at_init = ["_load_in_4bit", "_load_in_8bit", "quant_method"]

    def __init__(
        self,
        load_in_8bit=False,
        load_in_4bit=False,
Calls

no outgoing calls