From Markdown to Machine Learning: Automating RAG Database Creation for Enhanced LLM Performance

Thursday, July 11, 2024

Retrieval-Augmented Generation (RAG) is a powerful technique for enhancing Large Language Models (LLMs) with custom, up-to-date information. Integrating RAG into LLM workflows allows organizations to leverage their proprietary data to generate more accurate, relevant, and contextually appropriate responses.

This approach bridges the gap between the LLM’s pre-trained knowledge and specific, often confidential, organizational information that is crucial for many business applications. In this article, I’ll explore the process of building a dynamic RAG system in Python, focusing on efficient document loading, sentence embedding, and the creation of a searchable vector database. I’ll cover key steps including loading and processing Markdown files, splitting documents into manageable chunks, generating sentence embeddings, and storing those embeddings in a Chroma database. By the end of this post, you’ll have a solid understanding of how to create a flexible, automated RAG system that can significantly enhance the capabilities of your LLM-powered applications, allowing them to tap into your organization’s unique knowledge base.

I wrote this code specifically to improve a workflow involving the numerous reports written during a theatrical stage production’s run. With one report per show, the number of reports, and the information in them, adds up quickly. The RAG vector database is where the LLM retrieves this information. Now anyone, even with little technical expertise, can find trends and gain insights on the production run almost in real time, all from a browser-based, chat-style interface.

Loading Documents

The code snippet below demonstrates an efficient approach to loading multiple Markdown files from a specified directory for further processing in a RAG system. After importing the needed frameworks, I structured the code around two main functions: read_markdown_file() and get_file_paths(). The last three lines of this section of the code then call those functions.

Locate The Files

The get_file_paths() function is where the power of the glob module comes into play. This function takes a directory path and a list of file patterns as input. It then uses glob.glob() to find all files in the specified directory that match any of the given patterns. The glob module is flexible and efficient at matching files based on wildcard patterns, which makes it easy to select files with specific prefixes or extensions.

A key benefit of using glob is the ability to handle multiple file patterns efficiently. In the example, three patterns are used: ‘stage_*.md’, ‘report_*.md’, and ‘control_*.md’. This approach makes it easy to categorize and select different types of reports or documents, which is crucial when organizing and processing large amounts of data for a RAG system.
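
A simplified sketch of this function looks something like the following (the directory path and the sorted ordering are my illustrative choices, not requirements):

    import glob
    import os

    def get_file_paths(directory, patterns):
        # Collect the paths of every file in `directory` matching any glob pattern.
        file_paths = []
        for pattern in patterns:
            file_paths.extend(glob.glob(os.path.join(directory, pattern)))
        return sorted(file_paths)

    # Gather the three report types described above; "reports/" is a placeholder path.
    patterns = ["stage_*.md", "report_*.md", "control_*.md"]
    paths = get_file_paths("reports/", patterns)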

Read The Files

The read_markdown_file() function opens a given file path (which we obtained from the previous function), reads its content, and returns it as a string. This function handles Markdown files specifically, using UTF-8 encoding to ensure proper handling of special characters. I also have versions of this that import CSV files, PDF files, and standard .txt files. By setting this up as a separate function, the code promotes reusability and cleaner organization.
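
A sketch of the reader, along with the driver logic that ties the two functions together, might look like this (simplified; it assumes the paths list from the previous snippet):

    def read_markdown_file(file_path):
        # Read a Markdown file and return its content as a string.
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()

    # Driver: load every matched file into memory for chunking and embedding.
    documents = [read_markdown_file(p) for p in paths]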

The potential for automating the regeneration of the RAG database is significant with this approach. By running the script from a crontab entry, the import runs automatically on a regular schedule, and each run picks up any new files and adds them to the vector database.
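
For example, a crontab entry along these lines (the schedule, interpreter, and paths are all illustrative) would rebuild the database every night at 2 a.m.:

    0 2 * * * /usr/bin/python3 /opt/rag/build_rag_db.py >> /var/log/rag_build.log 2>&1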

This flexibility makes it straightforward to schedule regular updates to the RAG database, ensuring that the LLM always has access to the most recent and relevant information. The same structure also allows for easy integration with automated workflows or continuous integration systems.

Sentence Embedding and RAG Creation

Sentence embedding is a powerful technique in natural language processing that transforms sentences into dense vector representations that capture semantic meaning, allowing machines to understand and compare textual data in a more nuanced way than traditional keyword-based approaches.
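
As a quick illustration of the idea, two differently worded sentences about the same event score a high cosine similarity (the model name here is just a common default, not necessarily the one used in this project):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
    embeddings = model.encode([
        "The lighting cue in act two fired late.",
        "A lighting effect was delayed during the second act.",
    ])
    # Semantically similar sentences land close together in vector space.
    print(util.cos_sim(embeddings[0], embeddings[1]))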

The next code snippet performs sentence embedding using the SentenceTransformer library. The process begins with splitting the documents into smaller chunks using the RecursiveCharacterTextSplitter. This step is crucial for maintaining context while keeping the text segments manageable for the embedding model. A chunk size of 1024 characters with a 64-character overlap strikes a balance between preserving local context and creating distinct text units.
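
The splitting step might look like this (the import path varies slightly between LangChain versions; `documents` is the list of raw Markdown strings loaded earlier):

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    docs = splitter.create_documents(documents)  # returns LangChain Document objects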

The SentenceTransformer model is then initialized with a pre-trained model specified by EMBED_MODEL (a string constant defined in a separate file, for privacy and security purposes). This model converts each text chunk into a fixed-size vector representation. These embeddings capture the semantic essence of the text, allowing for sophisticated similarity comparisons and information retrieval. The embedding function is wrapped in a SentenceTransformerEmbeddings object, which is then used to create a Chroma vector database from the documents. The full code is available on my GitHub repository.
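
Put together, the embedding and database-creation step looks roughly like this (here EMBED_MODEL is inlined with a common default, and the persist directory is a placeholder):

    from langchain_community.embeddings import SentenceTransformerEmbeddings
    from langchain_community.vectorstores import Chroma

    EMBED_MODEL = "all-MiniLM-L6-v2"  # in my setup this constant lives in a separate file

    embedding_function = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)
    db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")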

Benefits of RAG

The benefits of using sentence embeddings in a RAG system are numerous. Firstly, they allow for semantic search capabilities, where queries can retrieve relevant information based on meaning rather than exact keyword matches. This leads to more accurate and contextually appropriate results. Secondly, embeddings enable efficient similarity comparisons between different pieces of text, facilitating tasks like document clustering, recommendation systems, and duplicate detection. Lastly, by reducing text to dense vector representations, embeddings allow for faster processing and reduced storage requirements compared to working with raw text, especially in large-scale applications.
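
To make the semantic-search point concrete, querying the Chroma database from the earlier snippet takes only a couple of lines; the query below is illustrative:

    results = db.similarity_search("Which shows had sound problems?", k=3)
    for doc in results:
        print(doc.page_content[:200])  # preview the retrieved chunks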

Ciao! I'm Scott Sullivan, a software engineer specializing in machine learning. I split my time between the tranquil countryside of Lancaster, Pennsylvania, and northern Italy, visiting family near Cinque Terre and La Spezia. Professionally, I put my Master's in Data Analytics and my Bachelor's degree in Computer Science to work turning code into insights with Python, PyTorch, and DFE superpowers, all on a quest to create AI that's smarter than your average bear.