hub / github.com/huggingface/transformers / StaticCache

Class StaticCache

src/transformers/cache_utils.py:1510–1601

Static cache class to be used with `torch.compile(model)` and `torch.export()`. It checks the `config` for a potential hybrid cache structure and initializes each layer accordingly. See `Cache` for details on the common methods implemented by all cache classes.


class StaticCache(Cache):
    """
    Static Cache class to be used with `torch.compile(model)` and `torch.export()`. It will check the `config`
    for potential hybrid cache structure, and initialize each layer accordingly.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        config (`PreTrainedConfig`):
            The config of the model for which this Cache will be used. It will be used to check for sliding
            or hybrid layer structure, and initialize each layer accordingly.
        max_cache_len (`int`):
            The maximum number of tokens that this Cache should hold.
        offloading (`bool`, optional, defaults to `False`):
            Whether to perform offloading of the layers to `cpu`, to save GPU memory.
        offload_only_non_sliding (`bool`, optional, defaults to `True`):
            If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded (because
            usually the sliding layers are small in size, so there is no need to offload them, and skipping it is faster).

    Example:

    ```python
    >>> from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

    >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    >>> inputs = tokenizer(text="My name is Llama", return_tensors="pt")

    >>> # Prepare a cache class and pass it to model's forward
    >>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
    >>> max_generated_length = inputs.input_ids.shape[1] + 10
    >>> past_key_values = StaticCache(config=model.config, max_cache_len=max_generated_length)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values # access cache filled with key/values from generation
    StaticCache()
    ```
    """

    # Pass-in kwargs as well to avoid crashing for BC (it used more arguments before)
    def __init__(
        self,
        config: PreTrainedConfig,
        max_cache_len: int,
        offloading: bool = False,
        offload_only_non_sliding: bool = True,
        **kwargs,
    ):
        config = config.get_text_config(decoder=True)
        layer_types = getattr(config, "layer_types", None)
        # If `layer_types` is not explicitly provided, infer if the model is fully sliding
        if layer_types is None:
            if getattr(config, "sliding_window", None) is not None:
                layer_types = ["sliding_attention" for _ in range(config.num_hidden_layers)]
            elif getattr(config, "attention_chunk_size", None) is not None:
                layer_types = ["chunked_attention" for _ in range(config.num_hidden_layers)]
            else:
                layer_types = ["full_attention" for _ in range(config.num_hidden_layers)]
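
The listing stops partway through `__init__`, but the layer-type fallback it shows can be tried in isolation. The following is a minimal sketch, not the library's code: `infer_layer_types` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for a real `PreTrainedConfig`. It only mirrors the branch order visible above: an explicit `layer_types` wins, then `sliding_window`, then `attention_chunk_size`, then full attention.

```python
from types import SimpleNamespace


def infer_layer_types(config):
    """Hypothetical helper mirroring the fallback shown in the excerpt."""
    layer_types = getattr(config, "layer_types", None)
    # If `layer_types` is not set, assume every layer uses the same attention type.
    if layer_types is None:
        if getattr(config, "sliding_window", None) is not None:
            kind = "sliding_attention"
        elif getattr(config, "attention_chunk_size", None) is not None:
            kind = "chunked_attention"
        else:
            kind = "full_attention"
        layer_types = [kind for _ in range(config.num_hidden_layers)]
    return layer_types


# Stand-in configs (assumptions for illustration, not real model configs).
full = SimpleNamespace(num_hidden_layers=2)
sliding = SimpleNamespace(num_hidden_layers=2, sliding_window=4096)
chunked = SimpleNamespace(num_hidden_layers=3, attention_chunk_size=8192)
hybrid = SimpleNamespace(
    num_hidden_layers=2,
    sliding_window=4096,
    layer_types=["full_attention", "sliding_attention"],
)

for name, cfg in [("full", full), ("sliding", sliding), ("chunked", chunked), ("hybrid", hybrid)]:
    print(name, infer_layer_types(cfg))
```

Note that in this sketch, as in the excerpt, a config that sets both `sliding_window` and `attention_chunk_size` without `layer_types` is treated as fully sliding, because the `sliding_window` check comes first.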

Callers 15

run_benchmark · Function · 0.90

test_static_cache_mha_mqa_gqa · Method · 0.90

test_cache_copy · Method · 0.90

test_static_cache_out_of_bounds · Method · 0.90

test_static_cache · Method · 0.90

test_sliding_window_cache · Method · 0.90

test_hybrid_cache · Method · 0.90

test_hybrid_chunked_cache · Method · 0.90

test_hybrid_chunked_cache_extra_cases · Method · 0.90

test_quantized_model_compile · Method · 0.90

_test_continuous_batching_parity · Method · 0.90

Calls

no outgoing calls

Tested by 15

test_static_cache_mha_mqa_gqa · Method · 0.72

test_cache_copy · Method · 0.72

test_static_cache_out_of_bounds · Method · 0.72

test_static_cache · Method · 0.72

test_sliding_window_cache · Method · 0.72

test_hybrid_cache · Method · 0.72

test_hybrid_chunked_cache · Method · 0.72

test_hybrid_chunked_cache_extra_cases · Method · 0.72

test_quantized_model_compile · Method · 0.72

_test_continuous_batching_parity · Method · 0.72

test_init_static_cache_multi_accelerator · Method · 0.72