Class BitsAndBytesConfig

src/diffusers/quantizers/quantization_config.py:169–405

This class wraps all the attributes and features available for a model loaded with `bitsandbytes`. It replaces `load_in_8bit` and `load_in_4bit`; the two options are mutually exclusive. Currently only supports `LLM.int8()`, `FP4`, and `NF4` quantization.
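The mutual-exclusivity rule can be illustrated with a small stand-in. The sketch below is not the diffusers implementation: `QuantFlags`, its `mode` method, and the error message are hypothetical names, used only to show how a config object might reject both flags being enabled at once.

```python
from dataclasses import dataclass


@dataclass
class QuantFlags:
    """Hypothetical stand-in for the two loading flags (not the diffusers class)."""

    load_in_8bit: bool = False
    load_in_4bit: bool = False

    def __post_init__(self) -> None:
        # The two modes are mutually exclusive: at most one may be enabled.
        if self.load_in_8bit and self.load_in_4bit:
            raise ValueError("load_in_8bit and load_in_4bit cannot both be True")

    def mode(self) -> str:
        """Return a label for the selected quantization mode."""
        if self.load_in_8bit:
            return "int8"
        if self.load_in_4bit:
            return "4bit"
        return "none"
```

Constructing `QuantFlags(load_in_4bit=True)` succeeds, while passing both flags as `True` raises `ValueError` at construction time, so an invalid combination never reaches model loading.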

@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    """
    This is a wrapper class about all possible attributes and features that you can play with a model that has been
    loaded using `bitsandbytes`.

    This replaces `load_in_8bit` or `load_in_4bit` therefore both options are mutually exclusive.

    Currently only supports `LLM.int8()`, `FP4`, and `NF4` quantization. If more methods are added to `bitsandbytes`,
    then more arguments will be added to this class.

    Args:
        load_in_8bit (`bool`, optional, defaults to `False`):
            This flag is used to enable 8-bit quantization with LLM.int8().
        load_in_4bit (`bool`, optional, defaults to `False`):
            This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from
            `bitsandbytes`.
        llm_int8_threshold (`float`, optional, defaults to 6.0):
            This corresponds to the outlier threshold for outlier detection as described in the `LLM.int8(): 8-bit
            Matrix Multiplication for Transformers at Scale` paper (https://huggingface.co/papers/2208.07339). Any
            hidden states value that is above this threshold will be considered an outlier and the operation on those
            values will be done in fp16. Values are usually normally distributed, that is, most values are in the
            range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently
            distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8
            quantization works well for values of magnitude ~5, but beyond that, there is a significant performance
            penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models
            (small models, fine-tuning).
        llm_int8_skip_modules (`list[str]`, optional):
            An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as
            Jukebox that have several heads in different places and not necessarily at the last position. For example
            for `CausalLM` models, the last `lm_head` is typically kept in its original `dtype`.
        llm_int8_enable_fp32_cpu_offload (`bool`, optional, defaults to `False`):
            This flag is used for advanced use cases and users that are aware of this feature. If you want to split
            your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use
            this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8
            operations will not be run on CPU.
        llm_int8_has_fp16_weight (`bool`, optional, defaults to `False`):
            This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not
            have to be converted back and forth for the backward pass.
        bnb_4bit_compute_dtype (`torch.dtype` or str, optional, defaults to `torch.float32`):
            This sets the computational type which might be different than the input type. For example, inputs might be
            fp32, but computation can be set to bf16 for speedups.
        bnb_4bit_quant_type (`str`, optional, defaults to `"fp4"`):
            This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types
            which are specified by `fp4` or `nf4`.
        bnb_4bit_use_double_quant (`bool`, optional, defaults to `False`):
            This flag is used for nested quantization where the quantization constants from the first quantization are
            quantized again.
        bnb_4bit_quant_storage (`torch.dtype` or str, optional, defaults to `torch.uint8`):
            This sets the storage type to pack the quantized 4-bit params.
        kwargs (`dict[str, Any]`, optional):
            Additional parameters from which to initialize the configuration object.
    """

    _exclude_attributes_at_init = ["_load_in_4bit", "_load_in_8bit", "quant_method"]

    def __init__(
        self,
        load_in_8bit=False,
        load_in_4bit=False,
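The `llm_int8_threshold` description in the docstring can be made concrete with a toy example. The sketch below is plain Python, not the `bitsandbytes` kernels: the helper names `split_outliers` and `absmax_quantize_int8` are hypothetical, absmax scaling is assumed as the 8-bit scheme for illustration, and the split is done per value for simplicity. It separates values whose magnitude exceeds the threshold (which would stay in higher precision) from the rest, then quantizes only the rest to int8.

```python
from typing import List, Tuple


def split_outliers(values: List[float], threshold: float = 6.0) -> Tuple[List[float], List[float]]:
    """Split values into (regular, outliers) by magnitude against the threshold."""
    regular = [v for v in values if abs(v) <= threshold]
    outliers = [v for v in values if abs(v) > threshold]
    return regular, outliers


def absmax_quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Quantize to the int8 range with absmax scaling; returns (codes, scale)."""
    absmax = max((abs(v) for v in values), default=0.0)
    if absmax == 0.0:
        # Nothing to scale: every code is zero and the scale is arbitrary.
        return [0 for _ in values], 1.0
    scale = absmax / 127.0
    codes = [round(v / scale) for v in values]
    return codes, scale


# Toy hidden-state values: mostly small, with two large-magnitude outliers.
hidden = [0.5, -3.2, 1.1, 42.0, -2.7, -8.5]

regular, outliers = split_outliers(hidden, threshold=6.0)
codes, scale = absmax_quantize_int8(regular)

# Dequantize to see how closely the int8 codes approximate the originals.
dequantized = [c * scale for c in codes]
```

If the value 42.0 were quantized together with the others, the absmax scale would be 42/127 (about 0.33 per step), and the small values would be rounded far more coarsely. Keeping outliers out of the int8 path is what lets the remaining values use the full int8 range.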

Callers 15

test_quant_config_repr · Method · 0.90
setUp · Method · 0.90
test_model_memory_usage · Method · 0.90
test_keep_modules_in_fp32 · Method · 0.90
test_bnb_4bit_wrong_config · Method · 0.90
test_bnb_4bit_errors_loading_incorrect_state_dict · Method · 0.90
test_bnb_4bit_logs_warning_for_no_quantization · Method · 0.90
setUp · Method · 0.90
test_moving_to_cpu_throws_warning · Method · 0.90
test_pipeline_cuda_placement_works_with_nf4 · Method · 0.90
test_device_map · Method · 0.90

Calls

no outgoing calls

Tested by 15

test_quant_config_repr · Method · 0.72
setUp · Method · 0.72
test_model_memory_usage · Method · 0.72
test_keep_modules_in_fp32 · Method · 0.72
test_bnb_4bit_wrong_config · Method · 0.72
test_bnb_4bit_errors_loading_incorrect_state_dict · Method · 0.72
test_bnb_4bit_logs_warning_for_no_quantization · Method · 0.72
setUp · Method · 0.72
test_moving_to_cpu_throws_warning · Method · 0.72
test_pipeline_cuda_placement_works_with_nf4 · Method · 0.72
test_device_map · Method · 0.72
