Data Indexing in RAG

Data Indexing

We will follow five steps for data indexing:

  1. Get the uploaded PDF file
  2. Convert the uploaded PDF to text
  3. Split the extracted text into chunks
  4. Generate embeddings from chunks
  5. Delete the uploaded PDF after processing

Let us create a route, i.e. api/v1/pdf/indexing/new, with the POST method for PDF indexing.

Step 1: Get the uploaded PDF file

First, we will handle the PDF upload using the multer library. Multer is a Node.js middleware for handling multipart/form-data, which is primarily used for uploading files.

Install multer from the terminal:

npm install multer

Import multer, then set up disk storage for storing the PDF. In the disk storage configuration we specify two things: the destination and the filename of the PDF.
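
A minimal sketch of this setup, wired into the route above, might look like the following. The uploads/ folder and the pdf form-field name are assumptions for illustration, not fixed by the article:

const express = require('express');
const multer = require('multer');

// Disk storage: where the uploaded PDF is saved and what it is named.
const storage = multer.diskStorage({
  destination: (req, file, cb) => cb(null, 'uploads/'), // assumed folder
  filename: (req, file, cb) => cb(null, Date.now() + '-' + file.originalname),
});
const upload = multer({ storage });

const app = express();

// POST api/v1/pdf/indexing/new: multer stores the PDF before the handler runs.
app.post('/api/v1/pdf/indexing/new', upload.single('pdf'), (req, res) => {
  // req.file.path points to the stored PDF on disk.
  res.json({ path: req.file.path });
});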

Step 2: Convert the uploaded PDF to text

Extract the text from the PDF using the pdf-parse library. Install the library from the terminal using the command below.

npm install pdf-parse

First, read the PDF into a buffer, then parse the buffer to extract the plain text.
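
A minimal sketch, assuming this runs inside an async route handler and that req.file.path is the PDF stored by multer in Step 1:

const fs = require('fs');
const pdfParse = require('pdf-parse');

// Read the PDF from disk into a buffer, then parse it to plain text.
const buffer = fs.readFileSync(req.file.path);
const parsed = await pdfParse(buffer);
const text = parsed.text;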

Step 3: Split the extracted text into chunks

The next step is splitting the text into chunks so that the AI model can process it more effectively.

Imagine you're reading a book with 320 pages. One option is to read all the pages in one go, while another is to read one page at a time. Which option makes more sense?

Obviously, reading one page, then pausing for a while before reading the next, is more manageable and effective. AI models operate in a similar way. Understanding a small portion of text is much easier than trying to process a large amount at once. This process is known as chunking.

There are three chunking strategies, mentioned below; a sketch of the second strategy follows the list:

  1. Fixed window size without token overlapping
  2. Fixed window size with token overlapping
  3. Page-wise (can be useful for presentations)
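
As a sketch of the second strategy, here is a simple splitter with a fixed window and overlap. It splits on characters rather than tokens for simplicity, and the 1000/200 sizes are arbitrary illustration values:

// Fixed window size with overlap (character-based for simplicity).
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

const chunks = chunkText(text);

The overlap ensures that a sentence cut at a chunk boundary still appears intact at the start of the next chunk, which helps retrieval quality.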

Step 4: Generate embeddings from chunks

There are two steps involved in this task:

  1. Creating the vector embeddings from the chunks (using the text-embedding-3-small model)
  2. Storing the embeddings in a vector database

Let us first create the vector embeddings from the chunks.

Install Required Packages

npm install openai dotenv

Create a .env file in the root directory:

OPENAI_API_KEY=your_openai_api_key_here

Set Up OpenAI Client

const OpenAI = require('openai');
require('dotenv').config();

// Load the API key from .env and initialise the OpenAI client.
const openaiClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

Create Embeddings from Chunks

// forEach does not await async callbacks, so use a for...of loop instead.
// This must run inside an async function (or a module with top-level await).
const embeddings = [];
for (const chunk of chunks) {
  const result = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk,
  });
  embeddings.push(result?.data[0]?.embedding);
  console.log(result?.data[0]?.embedding);
}

The embedding will look like this: [-0.0123, 0.0987, 0.0432, -0.0714, 0.0023, …, 0.0019]
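
The second step, storing the embeddings in a vector database, is not tied to a particular product here. As a hedged sketch, here is what an upsert could look like with the Qdrant JavaScript client (installed with npm install @qdrant/js-client-rest); the collection name pdf_chunks, the local URL, and the embeddings array from the loop above are assumptions for illustration:

const { QdrantClient } = require('@qdrant/js-client-rest');

const qdrant = new QdrantClient({ url: 'http://localhost:6333' }); // assumed local instance

// text-embedding-3-small produces 1536-dimensional vectors by default.
await qdrant.createCollection('pdf_chunks', {
  vectors: { size: 1536, distance: 'Cosine' },
});

// Store each chunk's embedding, keeping the original text as payload
// so it can be returned at retrieval time.
await qdrant.upsert('pdf_chunks', {
  wait: true,
  points: chunks.map((chunk, index) => ({
    id: index,
    vector: embeddings[index],
    payload: { text: chunk },
  })),
});

Any other vector store (Pinecone, pgvector, Chroma, etc.) works the same way: persist each vector together with its chunk text so retrieval can map matches back to readable content.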