hub / github.com/huggingface/transformers / StaticCache

Class StaticCache

src/transformers/cache_utils.py:1510–1601

Static Cache class to be used with `torch.compile(model)` and `torch.export()`. It will check the `config` for potential hybrid cache structure, and initialize each layer accordingly. See `Cache` for details on common methods that are implemented by all cache classes.

Source from the content-addressed store, hash-verified

class StaticCache(Cache):
    """
    Static Cache class to be used with `torch.compile(model)` and `torch.export()`. It will check the `config`
    for potential hybrid cache structure, and initialize each layer accordingly.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        config (`PreTrainedConfig`):
            The config of the model for which this Cache will be used. It will be used to check for sliding
            or hybrid layer structure, and initialize each layer accordingly.
        max_cache_len (`int`):
            The maximum number of tokens that this Cache should hold.
        offloading (`bool`, *optional*, defaults to `False`):
            Whether to perform offloading of the layers to `cpu`, to save GPU memory.
        offload_only_non_sliding (`bool`, *optional*, defaults to `True`):
            If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded (because
            usually the sliding layers are small in size, so there is no need to offload them, and skipping it is faster).

    Example:

    ```python
    >>> from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

    >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    >>> inputs = tokenizer(text="My name is Llama", return_tensors="pt")

    >>> # Prepare a cache class and pass it to model's forward
    >>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
    >>> max_generated_length = inputs.input_ids.shape[1] + 10
    >>> past_key_values = StaticCache(config=model.config, max_cache_len=max_generated_length)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values # access cache filled with key/values from generation
    StaticCache()
    ```
    """

    # Pass-in kwargs as well to avoid crashing for BC (it used more arguments before)
    def __init__(
        self,
        config: PreTrainedConfig,
        max_cache_len: int,
        offloading: bool = False,
        offload_only_non_sliding: bool = True,
        **kwargs,
    ):
        config = config.get_text_config(decoder=True)
        layer_types = getattr(config, "layer_types", None)
        # If `layer_types` is not explicitly provided, infer if the model is fully sliding
        if layer_types is None:
            if getattr(config, "sliding_window", None) is not None:
                layer_types = ["sliding_attention" for _ in range(config.num_hidden_layers)]
            elif getattr(config, "attention_chunk_size", None) is not None:
                layer_types = ["chunked_attention" for _ in range(config.num_hidden_layers)]
            else:
                layer_types = ["full_attention" for _ in range(config.num_hidden_layers)]
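
The layer-type inference at the start of `__init__` can be tried in isolation. The snippet below is a minimal sketch of that fallback order only, not the library's code: `infer_layer_types` is a hypothetical helper name, the `SimpleNamespace` objects stand in for a real `PreTrainedConfig`, and the `get_text_config(decoder=True)` step is skipped. It does not cover the rest of `__init__`, which is not shown in the excerpt above.

```python
from types import SimpleNamespace


def infer_layer_types(config):
    """Hypothetical helper mirroring the fallback order shown in the excerpt."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is not None:
        # An explicit per-layer list (e.g. a hybrid model) is used as-is.
        return list(layer_types)
    if getattr(config, "sliding_window", None) is not None:
        kind = "sliding_attention"
    elif getattr(config, "attention_chunk_size", None) is not None:
        kind = "chunked_attention"
    else:
        kind = "full_attention"
    # Without an explicit list, every layer gets the same inferred type.
    return [kind] * config.num_hidden_layers


# Stand-in configs (assumed values, for illustration only).
full = SimpleNamespace(num_hidden_layers=2)
sliding = SimpleNamespace(num_hidden_layers=2, sliding_window=4096)
chunked = SimpleNamespace(num_hidden_layers=2, attention_chunk_size=8192)
hybrid = SimpleNamespace(
    num_hidden_layers=2,
    layer_types=["sliding_attention", "full_attention"],
)

for name, cfg in [("full", full), ("sliding", sliding), ("chunked", chunked), ("hybrid", hybrid)]:
    print(name, infer_layer_types(cfg))
```

Note the precedence this implies: an explicit `layer_types` always wins, and `sliding_window` is checked before `attention_chunk_size`, so a config that sets both without `layer_types` is treated as fully sliding.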

Calls

no outgoing calls