Static cache class for use with `torch.compile(model)` and `torch.export()`. It checks the `config` for a potential hybrid cache structure and initializes each layer accordingly. See `Cache` for details on the common methods implemented by all cache classes.

````python
class StaticCache(Cache):
    """
    Static Cache class to be used with `torch.compile(model)` and `torch.export()`. It will check the `config`
    for potential hybrid cache structure, and initialize each layer accordingly.

    See `Cache` for details on common methods that are implemented by all cache classes.

    Args:
        config (`PreTrainedConfig`):
            The config of the model for which this Cache will be used. It will be used to check for sliding
            or hybrid layer structure, and initialize each layer accordingly.
        max_cache_len (`int`):
            The maximum number of tokens that this Cache should hold.
        offloading (`bool`, *optional*, defaults to `False`):
            Whether to perform offloading of the layers to `cpu`, to save GPU memory.
        offload_only_non_sliding (`bool`, *optional*, defaults to `True`):
            If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded (because
            usually the sliding layers are small in size, so there is no need to offload them, and skipping it is faster).

    Example:

    ```python
    >>> from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

    >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    >>> inputs = tokenizer(text="My name is Llama", return_tensors="pt")

    >>> # Prepare a cache class and pass it to model's forward
    >>> # Leave empty space for 10 new tokens, which can be used when calling forward iteratively 10 times to generate
    >>> max_generated_length = inputs.input_ids.shape[1] + 10
    >>> past_key_values = StaticCache(config=model.config, max_cache_len=max_generated_length)
    >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
    >>> outputs.past_key_values  # access cache filled with key/values from generation
    StaticCache()
    ```
    """

    # Pass-in kwargs as well to avoid crashing for BC (it used more arguments before)
    def __init__(
        self,
        config: PreTrainedConfig,
        max_cache_len: int,
        offloading: bool = False,
        offload_only_non_sliding: bool = True,
        **kwargs,
    ):
        config = config.get_text_config(decoder=True)
        layer_types = getattr(config, "layer_types", None)
        # If `layer_types` is not explicitly provided, infer if the model is fully sliding
        if layer_types is None:
            if getattr(config, "sliding_window", None) is not None:
                layer_types = ["sliding_attention" for _ in range(config.num_hidden_layers)]
            elif getattr(config, "attention_chunk_size", None) is not None:
                layer_types = ["chunked_attention" for _ in range(config.num_hidden_layers)]
            else:
                layer_types = ["full_attention" for _ in range(config.num_hidden_layers)]
````
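
The `layer_types` fallback at the start of `__init__` can be tried out in isolation. The following is a minimal sketch, not part of the library: `infer_layer_types` is a hypothetical helper name, and `types.SimpleNamespace` objects stand in for real model configs (an assumption made so the example runs without `transformers` installed). It follows the same order of checks as the listing: an explicit `layer_types`, then `sliding_window`, then `attention_chunk_size`, and finally full attention.

```python
from types import SimpleNamespace


def infer_layer_types(config):
    """Hypothetical helper that mirrors the fallback order used in the listing."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is not None:
        # An explicit per-layer list wins over any inference.
        return list(layer_types)
    if getattr(config, "sliding_window", None) is not None:
        kind = "sliding_attention"
    elif getattr(config, "attention_chunk_size", None) is not None:
        kind = "chunked_attention"
    else:
        kind = "full_attention"
    # Without an explicit list, every layer gets the same inferred type.
    return [kind] * config.num_hidden_layers


# Stand-in configs (assumed values, for illustration only).
full = SimpleNamespace(num_hidden_layers=2)
sliding = SimpleNamespace(num_hidden_layers=2, sliding_window=4096)
chunked = SimpleNamespace(num_hidden_layers=2, attention_chunk_size=8192)
hybrid = SimpleNamespace(
    num_hidden_layers=2,
    sliding_window=4096,
    layer_types=["sliding_attention", "full_attention"],
)

for name, cfg in [("full", full), ("sliding", sliding), ("chunked", chunked), ("hybrid", hybrid)]:
    print(name, infer_layer_types(cfg))
```

Note that `sliding_window` is checked before `attention_chunk_size`, so a config that sets both is treated as fully sliding unless it also provides `layer_types`.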