hub / github.com/langchain-ai/langchain / ingest_documents

Function ingest_documents

templates/intel-rag-xeon/ingest.py:10–45

Ingest a PDF from the data/ directory, which contains Edgar 10-K filings data for Nike, into a Chroma vector store. (The original docstring says Redis, but the code writes to Chroma.)

()


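import os

# Imports assumed from the langchain / langchain_community package layout;
# adjust them to the installed version.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def ingest_documents():
    """
    Ingest a PDF from the data/ directory, which contains Edgar 10-K
    filings data for Nike, into a Chroma vector store.
    """
    # Take the first entry in data/ (os.listdir order is not guaranteed)
    data_path = "data/"
    pdf_path = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]

    print("Parsing 10k filing doc for NIKE", pdf_path)

    # Split into ~1500-character chunks with 100 characters of overlap;
    # add_start_index stores each chunk's offset in its metadata
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(pdf_path, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")

    # Create the embedding model used by the vector store
    embedder = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    # Copy each chunk into a fresh Document
    documents = []
    for chunk in chunks:
        doc = Document(page_content=chunk.page_content, metadata=chunk.metadata)
        documents.append(doc)

    # Embed the documents and persist them to the Chroma collection
    _ = Chroma.from_documents(
        documents=documents,
        collection_name="xeon-rag",
        embedding=embedder,
        persist_directory="/tmp/xeon_rag_db",
    )


if __name__ == "__main__":
    # Body inferred: the listing ends at the guard, and ingest.py is the only listed caller.
    ingest_documents()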
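
To make the splitter settings concrete, here is a minimal sketch of what chunk_size=1500 and chunk_overlap=100 imply for chunk boundaries. It is a simplified fixed-width window, not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit). The function name chunk_with_overlap is hypothetical and exists only for illustration. The returned start offsets play the role that add_start_index=True plays in the listing, where each chunk's offset is recorded in its metadata.

```python
def chunk_with_overlap(text, chunk_size=1500, chunk_overlap=100):
    """Return (start_index, chunk) pairs from a fixed-width sliding window.

    Hypothetical helper: a simplified stand-in for illustration only,
    not the separator-aware logic of RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts chunk_overlap characters before the previous one ends.
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 4000
pieces = chunk_with_overlap(sample)
starts = [start for start, _ in pieces]
lengths = [len(chunk) for _, chunk in pieces]
print(starts)   # [0, 1400, 2800]
print(lengths)  # [1500, 1500, 1200]
```

With these settings each window advances 1400 characters, so consecutive chunks share 100 characters. That overlap lets a sentence that straddles a boundary appear whole in at least one chunk, as long as it is no longer than the overlap.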

Callers 1

ingest.py · File · 0.70

Calls 7

Document · Class · 0.90
append · Method · 0.80
load_and_split · Method · 0.45
from_documents · Method · 0.45

Tested by

no test coverage detected
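
One likely reason the function is hard to cover is that file discovery, parsing, embedding, and persistence all happen inline. As a sketch only, the file-discovery step could be pulled into a small helper that can be tested with nothing but a temporary directory. The helper name find_first_pdf is hypothetical and not part of the template. Unlike the listing, which takes whatever os.listdir returns first, this version filters to .pdf files and sorts them so the choice is deterministic.

```python
import os
import tempfile


def find_first_pdf(data_path):
    """Return the path of the alphabetically first .pdf file in data_path.

    Hypothetical helper for illustration; the template itself takes
    the first entry of os.listdir(data_path) without filtering.
    """
    pdfs = sorted(
        name
        for name in os.listdir(data_path)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(data_path, name))
    )
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {data_path!r}")
    return os.path.join(data_path, pdfs[0])


# Exercise the helper against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("notes.txt", "b-report.pdf", "a-report.PDF"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder")
    chosen = os.path.basename(find_first_pdf(tmp))
    print(chosen)  # a-report.PDF
```

The heavier steps (loading, embedding, writing to Chroma) would still need mocks or an integration test, so this covers only the part that can be checked without external dependencies.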