Data Indexing
We follow five steps for data indexing:
- Step 1: Getting the uploaded PDF file
- Step 2: Convert the uploaded PDF to text
- Step 3: Split the extracted text into chunks
- Step 4: Generate embeddings from chunks
- Step 5: Delete the uploaded PDF after processing
Let us create a POST route, api/v1/pdf/indexing/new, for PDF indexing.
Step 1: Getting the uploaded PDF file
First, we will handle the PDF upload using the multer library. Multer is a Node.js middleware for handling multipart/form-data, which is primarily used for uploading files.
Install multer from the terminal:
npm install multer
Import multer, then set up disk storage for storing the PDF. In disk storage we specify two things: the destination and the filename of the PDF.
Step 2: Convert the uploaded PDF to text
Extract text from the PDF using the pdf-parse library. Install it from the terminal with the following command:
npm install pdf-parse
First, read the PDF into a buffer, then parse the buffer to extract the plain text.
Step 3: Split the extracted text into chunks
The next step is splitting the text into chunks so that the AI model can process it more effectively.
Imagine you're reading a book with 320 pages. One option is to read all the pages in one go, while another is to read one page at a time. Which option makes more sense?
Obviously, reading one page, then pausing for a while before reading the next, is more manageable and effective. AI models operate in a similar way. Understanding a small portion of text is much easier than trying to process a large amount at once. This process is known as chunking.
There are three common chunking strategies:
- Fixed window size without token overlapping
- Fixed window size with token overlapping
- Page-wise (can be useful for presentations)
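As a sketch, the second strategy (fixed window with overlap) can be implemented like this; the window and overlap sizes are measured in characters here and are illustrative only:

```javascript
// Split text into fixed-size windows, sliding forward by (size - overlap)
// characters so that consecutive chunks share some context.
function chunkText(text, size = 1000, overlap = 200) {
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```

Setting overlap to 0 gives the first strategy (fixed window without overlapping).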
Step 4: Generate embeddings from chunks
There are 2 steps involved in this task:
- Creating the vector embedding from the chunks (text-embedding-3-small)
- Storing the embedding in vector database
Let us first create the vector embeddings from the chunks.
Install Required Packages
npm install openai dotenv
Create a .env file in the root directory:
OPENAI_API_KEY=your_openai_api_key_here
Set Up OpenAI Client
const OpenAI = require('openai');
require('dotenv').config();
const openaiClient = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
Create Embeddings from Chunks
// forEach does not await async callbacks, so requests would fire all at
// once and errors could go unhandled; a for...of loop awaits each call.
for (const chunk of chunks) {
  const vectorEmbeddingResult = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk,
  });
  console.log(vectorEmbeddingResult?.data[0]?.embedding);
}
The embedding will look like this: [-0.0123, 0.0987, 0.0432, -0.0714, 0.0023, …, 0.0019]