hub / github.com/langchain-ai/langchain / index

Function index

libs/core/langchain_core/indexing/api.py:194–397

Index data from the loader into the vector store. Indexing uses a record manager to track which documents are in the vector store. This makes it possible to tell which documents were updated, which were deleted, and which should be skipped.

(
    docs_source: Union[BaseLoader, Iterable[Document]],
    record_manager: RecordManager,
    vector_store: VectorStore,
    *,
    batch_size: int = 100,
    cleanup: Literal["incremental", "full", None] = None,
    source_id_key: Union[str, Callable[[Document], str], None] = None,
    cleanup_batch_size: int = 1_000,
    force_update: bool = False,
)
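
The skip behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: `ToyDocument`, `content_hash`, and `toy_index` are hypothetical names invented for this example, and a plain `set` and `dict` stand in for the record manager and the vector store. The only idea taken from the source is that documents are keyed by a hash of their content, so re-indexing unchanged documents is a no-op.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def content_hash(doc: ToyDocument) -> str:
    """Hash content and metadata so that any change yields a new key."""
    payload = doc.page_content + "|" + repr(sorted(doc.metadata.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def toy_index(docs, seen_hashes: set, store: dict) -> dict:
    """Add unseen documents to `store` and skip ones already recorded."""
    result = {"num_added": 0, "num_skipped": 0}
    for doc in docs:
        key = content_hash(doc)
        if key in seen_hashes:
            # Already indexed with identical content: nothing to do.
            result["num_skipped"] += 1
            continue
        seen_hashes.add(key)
        store[key] = doc
        result["num_added"] += 1
    return result


seen: set = set()   # assumption: plays the role of the record manager
store: dict = {}    # assumption: plays the role of the vector store
docs = [
    ToyDocument("hello", {"source": "a.txt"}),
    ToyDocument("world", {"source": "b.txt"}),
]
first = toy_index(docs, seen, store)
second = toy_index(docs, seen, store)
print(first, second)
```

Running the same documents through `toy_index` twice adds both on the first pass and skips both on the second, which is the behaviour the record manager is meant to enable.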

Source from the content-addressed store, hash-verified

def index(
    docs_source: Union[BaseLoader, Iterable[Document]],
    record_manager: RecordManager,
    vector_store: VectorStore,
    *,
    batch_size: int = 100,
    cleanup: Literal["incremental", "full", None] = None,
    source_id_key: Union[str, Callable[[Document], str], None] = None,
    cleanup_batch_size: int = 1_000,
    force_update: bool = False,
) -> IndexingResult:
    """Index data from the loader into the vector store.

    Indexing functionality uses a manager to keep track of which documents
    are in the vector store.

    This allows us to keep track of which documents were updated, and which
    documents were deleted, which documents should be skipped.

    For the time being, documents are indexed using their hashes, and users
    are not able to specify the uid of the document.

    IMPORTANT:
       * if auto_cleanup is set to True, the loader should be returning
         the entire dataset, and not just a subset of the dataset.
         Otherwise, the auto_cleanup will remove documents that it is not
         supposed to.
       * In incremental mode, if documents associated with a particular
         source id appear across different batches, the indexing API
         will do some redundant work. This will still result in the
         correct end state of the index, but will unfortunately not be
         100% efficient. For example, if a given document is split into 15
         chunks, and we index them using a batch size of 5, we'll have 3 batches
         all with the same source id. In general, to avoid doing too much
         redundant work select as big a batch size as possible.

    Args:
        docs_source: Data loader or iterable of documents to index.
        record_manager: Timestamped set to keep track of which documents were
            updated.
        vector_store: Vector store to index the documents into.
        batch_size: Batch size to use when indexing.
        cleanup: How to handle clean up of documents.
            - Incremental: Cleans up all documents that haven't been updated AND
              that are associated with source ids that were seen
              during indexing.
              Clean up is done continuously during indexing helping
              to minimize the probability of users seeing duplicated
              content.
            - Full: Delete all documents that have not been returned by the loader
              during this run of indexing.
              Clean up runs after all documents have been indexed.
              This means that users may see duplicated content during indexing.
            - None: Do not delete any documents.
        source_id_key: Optional key that helps identify the original source
            of the document.
        cleanup_batch_size: Batch size to use when cleaning up documents.
        force_update: Force update documents even if they are present in the
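
The three `cleanup` modes and the batching remark in the docstring can be compared with a small toy model. This is a sketch under stated assumptions, not the code in `api.py`: `batched` and `toy_sync` are hypothetical helpers, records are a plain dict mapping a document key to `(source_id, last_seen_run)`, and an integer `run_id` replaces the record manager's timestamps.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def toy_sync(records: dict, docs, cleanup=None, batch_size=100, run_id=1) -> int:
    """Record `docs` (pairs of key, source_id) and return how many records were deleted.

    Assumption: `records` maps key -> (source_id, last_seen_run).
    """
    deleted = 0
    for batch in batched(docs, batch_size):
        for key, source in batch:
            records[key] = (source, run_id)
        if cleanup == "incremental":
            # After each batch, drop stale records for the sources just seen.
            sources = {source for _, source in batch}
            stale = [
                k for k, (s, seen) in records.items()
                if s in sources and seen < run_id
            ]
            for k in stale:
                del records[k]
            deleted += len(stale)
    if cleanup == "full":
        # After all batches, drop everything not returned in this run.
        stale = [k for k, (_, seen) in records.items() if seen < run_id]
        for k in stale:
            del records[k]
        deleted += len(stale)
    return deleted


def fresh_records() -> dict:
    """State left behind by a hypothetical earlier run (run 0)."""
    return {"a1": ("a.txt", 0), "a2": ("a.txt", 0), "b1": ("b.txt", 0)}


new_docs = [("a1", "a.txt"), ("a3", "a.txt")]  # only a.txt is re-loaded

none_records = fresh_records()
none_deleted = toy_sync(none_records, new_docs, cleanup=None)

incr_records = fresh_records()
incr_deleted = toy_sync(incr_records, new_docs, cleanup="incremental")

full_records = fresh_records()
full_deleted = toy_sync(full_records, new_docs, cleanup="full")

# 15 chunks with a batch size of 5 give 3 batches, as in the docstring.
num_batches = len(list(batched(range(15), 5)))
print(none_deleted, incr_deleted, full_deleted, num_batches)
```

In this toy run, incremental mode removes only the stale `a.txt` record and leaves `b.txt` alone, while full mode also removes `b1` because `b.txt` was not returned in the run. That is why the docstring asks for the loader to return the entire dataset when full cleanup is used.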

Calls 15

_get_source_id_assigner · Function · 0.85
_batch · Function · 0.85
list · Function · 0.85
append · Method · 0.80
_deduplicate_in_order · Function · 0.70
lazy_load · Method · 0.45
load · Method · 0.45
get_time · Method · 0.45
from_document · Method · 0.45
exists · Method · 0.45
add · Method · 0.45
to_document · Method · 0.45