Primary Goal:
Transition from “black-box” AI responses to verifiable, grounded, and explainable safety auditing.
Jacob Mathew
April 3, 2026
To build a Retrieval-Augmented Generation (RAG) system that bridges the gap between raw federal regulations (FRA manuals) and practical, field-level maintenance queries.
8-Week LLM Project Roadmap (Provided by Gemini Pro)
I’ve downloaded 5 publicly available FRA PDFs to use as my source data.
Any code I wrote is publicly available here: https://github.com/jac0bmath3w/rail-safety-ai. Please note that the code snippets in this blog may not be up to date.
This is my understanding of transformers, attention, and what happens inside them. I read the Model Architecture section (Section 3 of the Attention Is All You Need paper, available here, referred to as ‘the paper’ going forward in this blog) and watched the videos by 3Blue1Brown and Andrej Karpathy to nail down my understanding of it.
Videos by 3Blue1Brown
Video by Andrej Karpathy - Let’s build GPT: from scratch, in code, spelled out.
The transformer has an encoder part as well as a decoder part, but in this project I’m only interested in the decoder. The steps involved are as follows:
Note: Since I am not covering the encoder here, I skipped the cross-attention part.
This is a parameter you divide the logits by before the softmax when sampling the next token (not to be confused with the \(1/\sqrt{d_k}\) scaling inside the attention block). \[ \text{Softmax with Temperature} = \frac{e^{x_i/T}}{\sum_{n=0}^{N-1}e^{x_n/T}} \]
For FRA manuals, you’ll want a low temperature because you want facts, not “creative” safety rules.
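The effect is easy to see numerically: dividing by a small \(T\) sharpens the distribution toward the top logit, while a large \(T\) flattens it. A quick sketch in plain Python:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.1)   # low T: nearly deterministic
flat = softmax_with_temperature(logits, T=10.0)   # high T: nearly uniform
print(sharp[0] > 0.99)                 # low T concentrates mass on the top logit
print(max(flat) - min(flat) < 0.1)     # high T flattens the distribution
```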
The latest version of the code would be available at https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/ingest.py
I create a class called RailDocumentProcessor. I construct the splitter in the initializer so that the RecursiveCharacterTextSplitter is created just once when an object of the class is initialized (and not once for every document). The splitter uses the parameters chunk_size=1000 and chunk_overlap=150. I used a chunk_size of 1000 so as not to lose the context of a specific railway rule, and an overlap to ensure that any ‘bridge’ between chunks is not lost. Also note: ‘len’ is the built-in Python function that returns the length of an object. If we were counting tokens instead of characters, we would replace ‘len’ with a tokenizer-based length function (e.g., one that counts tokenizer.encode(text)).
Why use RecursiveCharacterTextSplitter as opposed to CharacterTextSplitter? RecursiveCharacterTextSplitter splits text intelligently by natural boundaries (paragraphs, sentences, spaces) to keep semantic context intact. CharacterTextSplitter splits simply by a fixed character count, which often breaks words and ruins context.
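To see the difference, here is a toy illustration in plain Python (not the LangChain classes themselves): a fixed-width split chops mid-word, while a separator-aware split keeps whole words together.

```python
text = "Track inspections must be performed weekly. Defects must be logged."

# Fixed-width split (the CharacterTextSplitter failure mode):
fixed = [text[i:i+20] for i in range(0, len(text), 20)]

# Separator-aware split: pack whole words into chunks of up to 20 characters,
# mimicking the idea behind RecursiveCharacterTextSplitter (no overlap here).
chunks, current = [], ""
for word in text.split():
    candidate = (current + " " + word).strip()
    if len(candidate) <= 20:
        current = candidate
    else:
        chunks.append(current)
        current = word
chunks.append(current)

print(fixed[0])   # 'Track inspections mu' -- chops mid-word
print(chunks[0])  # 'Track inspections'   -- respects word boundaries
```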
We use a chunk_size instead of processing the entire document because of the context window: LLMs (and embedding models) have a limit on how many tokens they can see at once. And the longer the context, the more expensive attention becomes, since its compute grows quadratically with sequence length.
This class has the process_directory() method. It goes through each .pdf file in the directory, extracts text using the pypdf library, and then creates chunks using the splitter tool. At the same time it saves metadata for each chunk, including the source (i.e., the filename), the page number, and the character count for that chunk.
```python
import pypdf
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter

class RailDocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=150):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len
        )

    def process_directory(self, directory_path):
        """Processes all PDFs and returns a tuple: (list_of_chunks, list_of_metadatas)."""
        all_chunks = []
        all_metadatas = []
        if not os.path.exists(directory_path):
            print(f"Error: Directory {directory_path} not found.")
            return [], []
        for filename in os.listdir(directory_path):
            if filename.endswith(".pdf"):
                print(f"Processing: {filename}")
                path = os.path.join(directory_path, filename)
                try:
                    reader = pypdf.PdfReader(path)
                    for i, page in enumerate(reader.pages):
                        page_text = page.extract_text()
                        if not page_text:
                            continue
                        # Create chunks for this specific page
                        page_chunks = self.splitter.split_text(page_text)
                        for chunk in page_chunks:
                            all_chunks.append(chunk)
                            # Enhanced metadata for citations and filtering
                            all_metadatas.append({
                                "source": filename,
                                "page": i + 1,
                                "char_count": len(chunk)
                            })
                except Exception as e:
                    print(f"Could not process {filename}: {e}")
        return all_chunks, all_metadatas
```

For the 5 PDFs, this extracted 2633 chunks. I was surprised to see ‘only’ ~3,000 chunks. But \[\text{Total Characters} \approx \text{Chunks} \times \text{Chunk Size}\] which gives \[3{,}000 \times 1{,}000 = 3{,}000{,}000 \text{ characters}\] Even though the 5 PDF files were approximately 25MB, only about 2MB of it is useful text data.
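A quick back-of-the-envelope check explains the ~2MB figure: the 150-character overlap is counted twice across neighboring chunks, so the unique text is roughly chunks × (chunk_size − overlap). (This assumes every chunk is full, which slightly overestimates.)

```python
chunks = 2633
chunk_size = 1000
overlap = 150

raw_chars = chunks * chunk_size                  # counts the overlapping text twice
unique_chars = chunks * (chunk_size - overlap)   # roughly de-duplicates the overlap

print(raw_chars)      # 2,633,000 characters (~2.6 MB at 1 byte/char)
print(unique_chars)   # 2,238,050 characters (~2.2 MB of unique text)
```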
The chunks obtained at the end of ingestion are sent to the embedder. This is available at https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/embed.py
```python
from sentence_transformers import SentenceTransformer
import torch

class RailEmbedder:
    def __init__(self, model_name='BAAI/bge-base-en-v1.5'):
        # Check if GPU is available
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Initializing Embedder on: {self.device}")
        self.model = SentenceTransformer(model_name, device=self.device)

    def generate_embeddings(self, text_chunks):
        embeddings = self.model.encode(text_chunks, show_progress_bar=True, convert_to_tensor=False)
        return embeddings
```

For this purpose, I defined a class called RailEmbedder, initialized with the sentence-transformers model ‘BAAI/bge-base-en-v1.5’, available at https://huggingface.co/BAAI/bge-base-en-v1.5. Other options for the sentence-transformer model include all-mpnet-base-v2 and all-MiniLM-L6-v2 (people on the internet say the latter is dated and not to use it).
This class has a method called generate_embeddings(), which takes the text_chunks as its argument and creates a 768-dimensional embedding for each chunk. A matrix of size 2633×768 (768 is the embedding dimension of BAAI/bge-base-en-v1.5) is returned by generate_embeddings(). If a GPU is available it uses that; otherwise it falls back to the (much slower) CPU. Earlier, I mentioned that each token has a vector embedding, and each token is approximately 4 characters (source). So why do we have 2633 vectors at the end of generate_embeddings() instead of 2633 (chunks) × 250 (1000 chunk_size / 4 characters per token) × 768? This is because of pooling: during model.encode(), the per-token vectors are pooled (collapsed) into a single vector per chunk.
What does the convert_to_tensor argument in model.encode() do? With convert_to_tensor=True it returns a torch.Tensor; with convert_to_tensor=False (our case) it returns a NumPy array, which is convenient to hand off to the vector store.
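Here is a toy illustration of the pooling idea using mean pooling (one common strategy; BGE models pool differently under the hood, so treat this as a sketch of the concept, not the exact method):

```python
import numpy as np

num_tokens, dim = 250, 768
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(num_tokens, dim))  # one vector per token in the chunk

# Pooling collapses the (num_tokens, dim) matrix into a single (dim,) chunk vector.
chunk_vector = token_vectors.mean(axis=0)

print(token_vectors.shape)  # (250, 768)
print(chunk_vector.shape)   # (768,)
```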
TODO:Add a filter to the RailDocumentProcessor class: “If a chunk is less than 100 characters or contains more than 50% periods/dots (like a TOC), discard it.”
TODO:Add a list of already processed files to ensure this can scale if more PDFs are added.
This class acts as the “Librarian” of the system and handles storage, indexing, and retrieval. The embeddings generated by RailEmbedder are stored on disk using this class. I use the chromadb library to store these vectors. I created a class RailVectorVault for this purpose, available at https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/vector_store.py. I pass in an embedder_instance of class RailEmbedder and create a collection.
```python
import chromadb

class RailVectorVault:
    def __init__(self, embedder_instance, db_path="./vector_db", collection_name="rail_safety"):
        # We pass the embedder IN. This is called 'Dependency Injection'.
        self.embedder = embedder_instance
        self.client = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(name=collection_name)

    def add_documents(self, chunks, metadatas):
        # The Vault asks the Embedder to do its job
        vectors = self.embedder.generate_embeddings(chunks)
        ids = [f"id_{i}" for i in range(len(chunks))]
        self.collection.add(
            documents=chunks,
            embeddings=vectors.tolist(),
            metadatas=metadatas,
            ids=ids
        )

    def query(self, question, n_results=3):
        """
        Performs a semantic search.
        1. Embeds the question using the injected embedder.
        2. Queries ChromaDB for the closest matches.
        """
        # Embed the query string
        query_vector = self.embedder.generate_embeddings([question])
        # Search the collection
        results = self.collection.query(
            query_embeddings=query_vector.tolist(),
            n_results=n_results
        )
        return results
```

The method add_documents() takes the chunks, embeddings, and metadata and stores them on disk. The method query():
1. Turns the question into a vector.
2. Finds the n_results most similar chunks.
3. Returns a dictionary containing the text and the metadata.
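Under the hood, the vector search boils down to comparing the query vector against every stored vector and keeping the closest ones. A minimal sketch of that idea with toy 3-dimensional vectors and cosine similarity (Chroma uses optimized indexes, not a brute-force loop like this):

```python
import numpy as np

stored = np.array([
    [0.9, 0.1, 0.0],   # chunk 0: about gate arms
    [0.0, 1.0, 0.1],   # chunk 1: about hazmat penalties
    [0.8, 0.2, 0.1],   # chunk 2: also about gate arms
])
query = np.array([1.0, 0.0, 0.0])  # embedded question, "closest" to gate-arm chunks

# Cosine similarity between the query and every stored vector
sims = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

# Indices of the top-2 matches, most similar first (i.e., n_results=2)
top2 = np.argsort(sims)[::-1][:2]
print(top2)  # chunks 0 and 2 are closest to the query
```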
I defined the class RailSafetyEngine, which serves as the reasoning brain of the RAG system. When you query the vector database, it returns n_results chunks of text that most closely match the embeddings we stored in the database. Think of it as the librarian returning the relevant books you need to answer the query. The brain then processes them to return the answer to the question. I use the following imports from the transformers library.
When this class is initialized, the device is mapped (CPU or GPU), quantization is set up (bnb_config), the model is loaded (using the scaled-dot-product-attention implementation, shortened to ‘sdpa’), and the pipeline is created. ‘sdpa’ and 4-bit quantization are the “Optimization Duo” that allow a model meant for a $20,000 server to run on a free Google Colab instance.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
import torch

class RailSafetyEngine:
    def __init__(self, model_id="unsloth/Llama-3.2-3B-Instruct"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Ensure a padding token exists
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",
            attn_implementation="sdpa"
        )
        # device placement is already handled by device_map="auto" above
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=512,
            temperature=0.1
        )

    def generate_answer(self, question, context_chunks):
        if not context_chunks:
            return "No relevant safety manual excerpts were found."
        context_text = "\n\n".join(context_chunks)
        # 1. Structure the data as a standard list of messages
        messages = [
            {
                "role": "system",
                "content": "You are a Senior FRA Safety Consultant. Use ONLY the provided context. Cite Page numbers."
            },
            {
                "role": "user",
                "content": f"CONTEXT:\n{context_text}\n\nQUESTION: {question}"
            }
        ]
        # 2. Use the model-agnostic chat template
        # tokenize=False returns a string; add_generation_prompt=True adds the 'assistant' header
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        # 3. Generate
        outputs = self.pipe(
            prompt,
            do_sample=True,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.pad_token_id
        )
        # 4. Extract the response
        # The pipeline output contains the prompt verbatim, so slice it off to get just the new text
        return outputs[0]["generated_text"][len(prompt):].strip()
```

This class has the generate_answer() method, which takes the question and the text returned (by the librarian), and returns the answer. Some prompt engineering is involved to make the prompt as specific as possible for the model while keeping it a general wrapper for many questions. The table below shows some of the iterations I went through to get to my current result.
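The extraction trick in step 4 relies on text-generation pipelines returning the prompt plus the completion; slicing off the prompt leaves only the new text. A stand-alone sketch of that slicing, with made-up strings and no model involved:

```python
# Stand-in strings: in the real engine, `prompt` comes from apply_chat_template
# and `generated_text` comes from the pipeline output.
prompt = "<system>You are a safety consultant.</system><user>What is a defect notice?</user>"
generated_text = prompt + "  A defect notice is a written notification of noncompliance."

# The pipeline echoes the prompt verbatim, so slice it off and strip whitespace
answer = generated_text[len(prompt):].strip()
print(answer)  # "A defect notice is a written notification of noncompliance."
```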
| Phase | Problem Encountered | Solution | Result |
|---|---|---|---|
| 01: The “Dumb” Retriever | The AI kept talking about removing bridges when I asked about creating new crossings. | Metadata Awareness: We moved from flat text to structured chunks that know their File and Page. | Better citations, but still missing key technical data. |
| 02: The “Blind” Brain | The AI missed the specific numbers on Page 140 (AADT, speeds) because the search didn’t find that page. | Increasing \(n_{result}\) (Retrieval Depth): I bumped \(n_{result}\) from 3 to 15 to ensure the “Librarian” checked more “shelves.” | The model finally “saw” the specific thresholds in the manual. |
| 03: The “Memory Leak” | The GPU RAM kept climbing to 12GB+ and hanging for 10 minutes. | Quantization & Cache Clearing: We used 4-bit BitsAndBytes and added torch.cuda.empty_cache() to “reset” the brain. | Inference time dropped from 10 mins to under 60 seconds. |
| 04: The “Brittle” Prompt | Using hardcoded, model-specific prompt tags broke whenever the model changed. | Chat Templating: Switched to model-agnostic templates to allow for easy model swapping. | Portability across different LLMs. |
| 05: The “Black Box” | The AI couldn’t find the author names (Brent Ogden/Kimley-Horn) because they were on the cover. | Strategic Context: We learned that “Admin” questions need different retrieval strategies (like looking at Page 1). | Higher professional accuracy. |
| 06: The Reasoning Gap | The AI’s logic engine wasn’t strong enough to weigh the “OR” triggers of the prompt | Chain of Thought Prompting | Correct answers to reasoning type questions. |
What enforcement tools are available when enforcing the hazardous materials regulations?
[THINKING PROCESS]
PHASE 1: CONTEXTUAL AUDIT
The user question pertains to the enforcement tools available when enforcing the Hazardous Materials Regulations. This falls under the scope of Chapter 10 – Enforcement and Compliance Program in the FRA-Hazardous Materials Compliance Manual.
PHASE 2: EVIDENCE MAPPING
From the provided manuals, we can extract the following relevant information:
* Section 10.3 Enforcement Tools (Source 9, Page 27)
* Section 10.3.1 Defect Notices (Source 9, Page 27)
* Section 10.3.2 Violation with Civil Monetary Penalties (Source 9, Page 27)
* Section 10.3.3 Enhanced Penalties (Source 9, Page 28)
* Section 10.3.4 Railworthiness Directives (Source 9, Page 28)
* Section 10.3.5 Compliance Orders (Source 9, Page 28)
* Section 10.3.6 Emergency Orders (Source 9, Page 28)
PHASE 3: SYNTHESIS
Based on the extracted information, the available enforcement tools when enforcing the Hazardous Materials Regulations are:
1. Defect Notices
2. Violation with Civil Monetary Penalties
3. Enhanced Penalties
4. Railworthiness Directives
5. Compliance Orders
6. Emergency Orders
These tools can be used by FRA inspectors to address noncompliance with the Hazardous Materials Regulations, ranging from less stringent measures like Defect Notices to more severe actions like Emergency Orders.
PHASE 4: VERIFICATION
All the extracted information is backed by specific citations from the provided manuals.
ANSWER
The available enforcement tools when enforcing the Hazardous Materials Regulations are:
1. Defect Notices
2. Violation with Civil Monetary Penalties
3. Enhanced Penalties
4. Railworthiness Directives
5. Compliance Orders
6. Emergency Orders
A rural highway-rail grade crossing has these conditions:
- posted highway speed: 60 mph
- AADT: 18,500
- freight trains per day: 28
- maximum authorized train speed: 70 mph
- expected accident frequency with gates: 0.3 per year
- vehicle delay: 35 vehicle-hours per day
- acceptable alternate access exists within 0.8 mile
- closing the crossing would increase the median trip by 2.2 miles
Based on the Handbook’s Chapter 3 guidance, which of the following is best supported?
A. The crossing should primarily be considered for closure only
B. The crossing should primarily be considered for grade separation
C. Neither closure nor grade separation is supported
D. Both closure and grade separation are supported by the listed criteria
[THINKING PROCESS]
PHASE 1: CONTEXTUAL AUDIT
- The query is related to the decision-making process for highway-rail grade crossings.
- The provided context is from the FRA-HRGC-Handbook-2019.
PHASE 2: EVIDENCE MAPPING
- Source 1: Grade separation is recommended for crossings with AADT > 30,000 or when posted highway speed > 55 mph.
- Source 2: Grade separation is justified if the benefit-to-cost ratio is < 1.0 or if there are no viable alternatives.
- Source 3: Grade separation is considered for crossings with AADT < 1,000, acceptable alternate access, and median trip length not increased by > 2.5 miles.
- Source 4: Grade separation is recommended for crossings with train speeds > 40 mph, AADT > 2,000, and multiple lanes of traffic.
- Source 5: Grade separation is considered for crossings with high accident frequencies, vehicle delays, and limited alternate access.
- Source 6: Closure is the first alternative to consider, followed by grade separation.
- Source 7: Closure provides the highest level of crossing safety.
- Source 8: Closure is considered for crossings with high train speeds (> 125 mph) and limited sight distances.
- Source 9: Vehicles must stop within 50 feet of the crossing before proceeding.
- Source 10: Barrier systems are required for high-speed trains (> 110 mph).
- Source 11: Closure is considered for crossings with insufficient sight distances and limited alternate access.
- Source 12: Sight distances, approach sight distances, and clearing sight distances are critical for determining the need for grade separation.
- Source 13: The NCDOT considers various factors when deciding whether to close or improve a crossing.
- Source 14: Grade separation is considered for crossings with high accident frequencies and limited alternate access.
- Source 15: The NCDOT considers various factors when deciding whether to close or improve a crossing.
PHASE 3: SYNTHESIS
- Based on the provided criteria, the crossing should be considered for grade separation due to the high AADT, posted highway speed, and maximum authorized train speed.
- The presence of freight trains per day and expected accident frequency with gates also support grade separation.
- However, the acceptable alternate access and median trip length increase do not strongly support grade separation.
- The NCDOT's guidelines and the FRA's regulations support the consideration of grade separation for the given conditions.
PHASE 4: VERIFICATION
- All points in the synthesis are supported by specific citations from the provided manuals.
ANSWER
B
TODO: Look at Few Shot Prompting (by providing example answer), Self-Correction System (a second pass where the AI reviews its own answer) and Multi-Agent Systems
At this point, the retrieval is working (pulling appropriate chunks from the vector database) and the engine is generating good results. Still, it could be improved. The latency of the model on query 2 was around 2 minutes, and to get the correct answer I had to send a massive system prompt with 15 context chunks. Both could be brought down.
There are several avenues for fine-tuning this system. They include tuning the chunk size, the overlap window, the prompt, or the number of chunks retrieved. We could also tune the embedding model (the retriever), a re-ranker, or the generator LLM.
I’m starting by training the generator first. To do this, we need to create ‘training data’: a list of queries, the thinking process associated with each query, and their answers. Ideally, it would be created by a panel of experts, but that is a costly endeavor. For this project, I’m making API calls to Gemini 2.5 to do this for me. I created a RailDataGenerator class, available at https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/generator.py.
```python
import os
import json
import time
import random
import requests

class RailDataGenerator:
    def __init__(self, vault_instance, api_url, api_key):
        self.vault = vault_instance
        self.api_url = api_url  # e.g. "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-09-2025:generateContent"
        self.api_key = api_key

    def _call_teacher(self, system_prompt, user_query):
        payload = {
            "contents": [{"parts": [{"text": user_query}]}],
            "systemInstruction": {"parts": [{"text": system_prompt}]}
        }
        # Increased backoff steps to handle stricter 2026 rate limits
        for delay in [2, 4, 8, 16, 32]:
            try:
                # Added a 30s timeout to prevent the script from hanging forever
                response = requests.post(
                    f"{self.api_url}?key={self.api_key}",
                    json=payload,
                    timeout=30
                )
                if response.status_code == 200:
                    result = response.json()
                    return result.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', "")
                elif response.status_code == 429:
                    print(f"Rate limit hit. Retrying in {delay}s...")
                else:
                    print(f"Teacher API Error {response.status_code}: {response.text}")
            except requests.exceptions.Timeout:
                print("Request timed out. Retrying...")
            except Exception as e:
                print(f"Request Exception: {e}")
            time.sleep(delay)
        return None

    def generate_training_sample(self, chunk_text, file_name, page_num):
        """
        Ask the Teacher to generate a complex question + reasoning
        process for one chunk pulled from the vault.
        """
        system_prompt = (
            "You are a Senior FRA Rail Safety Expert. Your task is to generate training data "
            "for a student model. Based on the provided manual excerpt, create a challenging "
            "technical question and a perfect response following the 4-Phase Thinking Process.\n\n"
            "PHASE 1: CONTEXTUAL AUDIT\nPHASE 2: EVIDENCE MAPPING\nPHASE 3: SYNTHESIS\nPHASE 4: VERIFICATION\n\n"
            "Output MUST be in valid JSON format: "
            "{'question': '...', 'thinking': '...', 'answer': '...'}"
        )
        user_query = f"MANUAL EXCERPT ({file_name}, Page {page_num}):\n{chunk_text}"
        raw_output = self._call_teacher(system_prompt, user_query)
        if not raw_output:
            return None
        # Clean JSON if model included markdown blocks
        clean_json = raw_output.replace("```json", "").replace("```", "").strip()
        try:
            return json.loads(clean_json)
        except json.JSONDecodeError:
            return None

    def create_dataset(self, num_samples=100, output_path="data/training/rail_dataset.jsonl"):
        samples = []
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        print(f"Starting generation of {num_samples} sample(s)...")
        # Get all chunks from the collection, then sample without replacement
        all_data = self.vault.collection.get()
        if not all_data or not all_data['documents']:
            return None
        total_chunks = len(all_data['documents'])
        actual_sample_count = min(num_samples, total_chunks)
        indices = random.sample(range(total_chunks), actual_sample_count)
        if num_samples >= total_chunks:
            print(f"Only {total_chunks} chunk(s) available, so creating {actual_sample_count} unique sample(s)")
        for i, idx in enumerate(indices):
            chunk_text = all_data['documents'][idx]
            file_name = all_data['metadatas'][idx].get('source', 'Unknown')
            page_num = all_data['metadatas'][idx].get('page', '?')
            sample = self.generate_training_sample(chunk_text, file_name, page_num)
            if sample:
                samples.append(sample)
                # Append to file immediately so you don't lose data if it crashes
                with open(output_path, 'a') as f:
                    f.write(json.dumps(sample) + "\n")
            print(f"Generated {i+1}/{num_samples}")
            # MANDATORY COOL-DOWN:
            # 3 seconds between requests helps stay under the 20 RPM limit
            time.sleep(3)
        return output_path
```

Any object of this class is initialized with an object of the RailVectorVault class. The create_dataset() method creates a JSON Lines file. It retrieves random chunks (sampled without replacement) from the vector DB along with their metadata, and passes an engineered system prompt in an API call to Gemini (or any other service). A few other ways to generate the training examples might include
If using proprietary data, it is imperative either to mask any PII (or any information that shouldn’t leave a closed ecosystem) or to use an offline model to generate these samples. Also, if the training data is 100% AI-generated, it cannot be copyrighted, so it is a good idea to mix in a percentage of human-in-the-loop data while generating this training data.
Here is an example of a generated sample:
{"question": "A maintenance technician inspects an active grade crossing warning system and notes that one of the gate arm lights is illuminated but appears significantly dimmer than the other lights on the same gate arm. During daylight hours, this particular light is difficult to discern from the prescribed approach distance of 200 feet, although it is clearly visible at night. The light unit itself and its wiring are observed to be securely fastened to the gate arm.\n\nBased on FRA regulations under Section 234-29 (implied 234.219), which specific defect classification(s) apply to this situation, and what is the overarching intent of this section that is being violated by this condition?", "thinking": "PHASE 1: CONTEXTUAL AUDIT\n1. **Source Document**: FRA-Signal_Train_Control-2012.pdf, Page 124, Section 234-29 (which corresponds to 234.219 for defect codes).\n2. **Core Subject**: Maintenance, visibility, and securement of gate arm lights and wires at active warning systems.\n3. **Key Requirements/Provisions**: Each gate arm light must be properly visible to approaching highway users (and pedestrians), lights and wires must be securely fastened, and maintenance must adhere to design specifications.\n4. **Specific Defect Classifications**: A1 (burned out/missing light), A2 (defective/not visible/missing light unit), A3 (light unit not secured), A4 (wires not secured), A5 (not maintained per design specifications).\n5. **Intent Statement**: \"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\" This is critical.\n6. **Question Goal**: Create a challenging scenario requiring differentiation between similar defect codes and linking to the section's overarching intent.\n\nPHASE 2: EVIDENCE MAPPING\n1. 
**Analyze Scenario Elements**: \n * \"illuminated but appears significantly dimmer\": Rules out A1 ('burned out or missing' literally).\n * \"difficult to discern from the prescribed approach distance of 200 feet during daylight hours\": Directly violates the requirement for lights to be \"properly visible to approaching highway users.\" This strongly points to A2 ('not visible') and A5 ('not maintained per design specifications').\n * \"clearly visible at night\": Confirms the light is functional but performance is compromised under specific conditions (daylight), reinforcing a maintenance/design specification issue.\n * \"light unit itself and its wiring are observed to be securely fastened\": Rules out A3 ('Light unit not securely fastened') and A4 ('Light wires not securely fastened').\n\n2. **Map Scenario to Defect Codes**: \n * **234.219.A1 (Gate arm light burned out or missing)**: *Not applicable* because the light is illuminated.\n * **234.219.A2 (Gate arm light unit defective, not visible, or missing)**: *Applicable*. The light is \"not visible\" from the required distance during daylight, and its dimness suggests the unit is \"defective\" in its performance.\n * **234.219.A3 (Light unit not securely fastened)**: *Not applicable* as it's stated to be securely fastened.\n * **234.219.A4 (Light wires not securely fastened)**: *Not applicable* as wiring is stated to be securely fastened.\n * **234.219.A5 (Gate arm light unit not maintained per design specifications)**: *Applicable*. The diminished brightness and failure to be visible under daylight conditions indicate a failure to meet design specifications for performance and maintenance. This is further supported by the section's stated intent.\n\n3. 
**Identify Overarching Intent**: The excerpt explicitly states: \"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\" The scenario directly demonstrates a failure to meet this, as the light's performance (visibility) has degraded from its specified standard.\n\nPHASE 3: SYNTHESIS\n* **Applicable Defects**: 234.219.A2 (due to being not visible/defective) and 234.219.A5 (due to not meeting design specs for visibility/brightness). \n* **Non-Applicable Defects**: 234.219.A1, A3, A4 (with clear justifications).\n* **Overarching Intent Violation**: Failure to maintain lights in accordance with design specifications, which directly impacts their intended visibility for safety.\n\nPHASE 4: VERIFICATION\n* **Completeness**: All parts of the question (specific classifications, overarching intent) are addressed.\n* **Accuracy**: The classifications and reasoning directly align with the provided text. The distinction between A1 and A2/A5 is carefully made.\n* **Clarity**: The answer is structured logically, explaining why each code applies or doesn't apply.\n* **Challenging Aspect**: The question forces the student to go beyond simple identification (e.g., 'burned out') to interpret 'not visible' and 'not maintained per design specifications' in a nuanced scenario where the light is technically 'on'. It also requires understanding the explicit 'intent' statement.", "answer": "The specific defect classifications applicable to this situation are:\n\n1. **234.219.A2: Gate arm light unit defective, not visible, or missing.**\n * The scenario clearly states that the light, despite being illuminated, is \"difficult to discern from the prescribed approach distance of 200 feet during daylight hours.\" This directly meets the \"not visible\" criterion as required for approaching highway users. 
Furthermore, the description of it being \"significantly dimmer\" strongly indicates that the light unit itself is \"defective\" in its performance, as its output has degraded below acceptable operating levels.\n\n2. **234.219.A5: Gate arm light unit not maintained per design specifications.**\n * The observed diminished brightness and subsequent lack of visibility under daylight conditions demonstrate a failure to maintain the gate arm light unit to its intended design specifications for luminance and conspicuity. The proper functioning and visibility of the light are fundamental design requirements, and their degradation constitutes a failure in maintenance according to those specifications.\n\nThe following defect classifications are *not* applicable:\n\n* **234.219.A1 (Gate arm light burned out or missing)**: This is not applicable because the light is explicitly stated to be \"illuminated,\" meaning it is neither burned out nor physically missing.\n* **234.219.A3 (Light unit not securely fastened to gate arm)**: This is not applicable as the scenario states the \"light unit itself... [is] observed to be securely fastened.\"\n* **234.219.A4 (Light wires not securely fastened to gate arm)**: This is not applicable as the scenario states the \"wiring [is] observed to be securely fastened.\"\n\nThe **overarching intent** of this section that is being violated is precisely stated in the manual excerpt: **\"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\"** The observed condition\u2014a significantly dimmer light that is not adequately visible in daylight\u2014directly contravenes this intent by demonstrating a failure to maintain the gate arm light unit's performance to its designed safety and operational standards for ensuring proper warning to approaching highway users."}
The training is done in https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/03_fine_tuning.ipynb.
The basic steps involve
I use Unsloth’s FastLanguageModel class for loading and preparing the LLM for efficient fine-tuning. Unsloth is specifically meant to reduce memory use and make QLoRA (Quantized Low-Rank Adaptation) practical on smaller hardware. In the from_pretrained() function, the most important parameter is load_in_4bit=True. This compresses the weights of the model aggressively, and thus takes less memory on the GPU (with a tradeoff of slightly lower numerical precision). This function returns the model (the neural network) and the tokenizer (which turns text into tokens).
In the next step we call get_peft_model (PEFT = Parameter-Efficient Fine-Tuning). This is the step where I turn the base model into a QLoRA fine-tuned model. The “parameter-efficient” part of PEFT means we do not retrain the large model; we train only an extra, small set of weights (trainable parameters = 24,313,856 of 3,237,063,680, i.e. 0.75% trained). The arguments for this function include
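The trainable fraction quoted above is easy to verify with a one-line calculation:

```python
trainable = 24_313_856       # LoRA adapter weights
total = 3_237_063_680        # all weights in the 3B base model

fraction = trainable / total
print(f"{fraction:.2%}")     # only a tiny slice of the network is trained
```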
In the next step, the JSONL file is loaded. This file is structured with three columns: 'question', 'thinking', and 'answer'. I wrote a function that takes each line and turns it into a conversation string, since the base model is an instruction/chat model. Instead of three separate fields, I end up with one long text field containing the full conversation (stored in the variable messages).
```python
def formatting_prompts_func(examples):
    instructions = examples["question"]
    thoughts = examples["thinking"]
    answers = examples["answer"]
    texts = []
    for instruction, thought, answer in zip(instructions, thoughts, answers):
        # We wrap the thinking and answer together as the 'Assistant' response
        full_response = f"[THINKING PROCESS]\n{thought}\n\n[ANSWER]\n{answer}"
        messages = [
            {"role": "system", "content": "You are a Senior FRA Safety Consultant. Use a 4-Phase Thinking Process."},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": full_response},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}
```

The function apply_chat_template() takes a structured chat message and converts it into the exact string format expected by the model. tokenize=False returns the result as a string (not tokens), and add_generation_prompt=False means no additional generation cue is appended (the assistant turn is already present in messages). I don't tokenize here; I pass the raw text to the SFTTrainer (described in the next step). Unsloth uses specialized fast kernels that handle padding more efficiently than plain PyTorch.
In the next step, I use the SFTTrainer and SFTConfig classes from trl (Transformer Reinforcement Learning), which are high-level training helpers (SFT = Supervised Fine-Tuning). In this step we show the model prompts and desired responses (the training data) and teach it to imitate them. The SFTTrainer does
SFTConfig stores the configuration settings for the trainer. One important argument is the data_collator. This component takes a list of dataset examples and turns them into one batch. The SFTTrainer tokenizes the data but also keeps the original text field, which needs to be removed. The DropTextCollator acts as a gatekeeper and throws away the human-readable text before batching.
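The real DropTextCollator is in the notebook; a minimal pure-Python sketch of the idea (the wrapper pattern here is my illustration, not the notebook's exact code) might look like:

```python
from dataclasses import dataclass

@dataclass
class DropTextCollator:
    """Wraps a base collator and strips the human-readable 'text' field
    from each example before batching, leaving only tokenized fields."""
    base_collator: object

    def __call__(self, examples):
        cleaned = [{k: v for k, v in ex.items() if k != "text"} for ex in examples]
        return self.base_collator(cleaned)

# Toy usage with an identity "collator" (just returns the list):
batch = DropTextCollator(base_collator=list)(
    [{"input_ids": [1, 2], "text": "hello"}, {"input_ids": [3], "text": "world"}]
)
print(batch)  # -> [{'input_ids': [1, 2]}, {'input_ids': [3]}]
```

In the notebook, the wrapped collator would be the trainer's default collator that pads and stacks `input_ids` into tensors.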
Once the trainer is initialized, trainer.train() is called to run the training. Below is the output from the training process.
Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 101 | Num Epochs = 5 | Total steps = 60
Batch size per device = 1 | Gradient accumulation steps = 8
Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)
Here is an explanation of this output:
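The numbers in the log are internally consistent; a quick arithmetic check (the assumption that the last partial accumulation step of each epoch is dropped is mine, inferred from the totals):

```python
num_examples = 101
per_device_batch = 1
grad_accum = 8
epochs = 5

# "Total batch size (1 x 8 x 1) = 8" in the log
effective_batch = per_device_batch * grad_accum

# 101 // 8 = 12 full optimizer steps per epoch (partial step dropped)
steps_per_epoch = num_examples // effective_batch
total_steps = steps_per_epoch * epochs
print(total_steps)  # -> 60, matching "Total steps = 60"
```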
A few other optimization tweaks included:
I tried a new prompt which was not present in the training data to see if it worked well.
A rail yard manager is overseeing the movement of a tank car containing
liquefied petroleum gas (LPG). The car has a small leak at the top valve.
The manager wants to move the car to a remote siding for repair without
a special permit, arguing it's safer to move it than leave it near the main office.
Based on the Hazardous Materials Compliance Manual, is this allowed?
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, file page 28, Chapter 4, § 4.6.1-1.
2. Core Subject: Movement of hazardous materials in rail yards.
3. Key Manual Points: rail yards are subject to the HMR; rail yards are considered a rail-yard operation; movement of hazardous materials is subject to the permit system; the permit system applies even if the movement is within the yard.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The car contains LPG, a listed hazardous material.
2. The leak is on the car, not the yard.
3. The proposed movement is within the yard.
4. The page explicitly says the permit system applies to all rail-yard operations involving hazardous materials.
PHASE 3: SYNTHESIS
- Conclusion: the movement requires a special permit, regardless of the yard location.
- Why alternatives are wrong: they treat the yard as a non-permit operation.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the file page number as metadata.
ANSWER
No. The manual states that rail yards are subject to the permit system, even if the movement is within the yard. It also states that the permit system applies to all rail-yard operations involving hazardous materials. Therefore, the manager must obtain a special permit before moving the car, even if it's within the yard.
Note that,
Fine-tuning a 3B model teaches it Reasoning, not Knowledge Retrieval. Even after tuning, the model will hallucinate citations if context isn’t provided. This proves that Fine-Tuning is not a replacement for RAG, but a way to make RAG more intelligent. In the next phase, I added chunks to the context to see if it helped solve the hallucination problem.
I passed in the question "A rail yard manager wants to move a leaking LPG tank car to a remote siding for repair without a permit. Is this allowed?". With this question, I tested n_results = 5 and compared the results with the source filter on and off. (For this particular question, the answer is in the Hazardous Materials Compliance Manual.)
Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (Preview: Materials Guidance (HMG)-127 available for review on FRA’s website.
11.2.2 COFC/TOFC Approval
A...)
Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL
30
CHAPTER 11 – NOTIFICATIONS AND APPROVALS
11.1 Notifica...)
Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 10 (Preview: transportation. Section 172.820, Additional planning requirements for transportation by rail,
requi...)
Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 36 (Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL
31
11.2.3 Cylinders in COFC/TOFC Service
Cylinders may...)
Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when the
violation is kno...)
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, Chapter 11.2.2 One-Time Movement Approvals.
2. Core subject: One-time movement approval for non-complying bulk packages.
3. Key manual points: FRA may allow movement of non-complying bulk packages for a special purpose (e.g., repair) under specified conditions ensuring the safety of the rail movement; the approval does not relieve a party from any liability applicable to such movements.
4. Question goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The page specifically addresses non-complying bulk packages.
3. The cited approval process is for special purposes like repair.
PHASE 3: SYNTHESIS
- Conclusion: the movement is allowed under the one-time movement approval process for repair purposes.
- Why alternatives are wrong: they assume the repair is not a special purpose.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the fact pattern to test precise application of the manual to the scenario.
ANSWER
No. The page says that FRA may allow movement of non-complying bulk packages for a special purpose (e.g., repair) under specified conditions ensuring the safety of the rail movement. The cited approval process is for repair purposes, not for transportation. The page also says that the approval does not relieve a party from any liability applicable to such movements.
Chunk 1: Source: FRA-Signal_Train_Control-2012.pdf, Page: 27 (Preview: arrangements for a tentative inspection.
While an unaccompanied inspection is permitted, such ins...)
Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 279 (Preview: The provision of exception (2) does not apply to maintenance-of-way work equipment. Such
maintenan...)
Chunk 3: Source: FRA-Track_Structures_CM_2026_Vol_III_Ch1_Railroad_Workplace_Safety_General.pdf, Page: 8 (Preview: switches and bridges that are moveable via a control machine (by a train dispatcher or
control oper...)
Chunk 4: Source: FRA-Signal_Train_Control-2012.pdf, Page: 278 (Preview: except signaled sidings, exceeds 20 mph, each hand-operated switch must comply with this
section. ...)
Chunk 5: Source: FRA-Signal_Train_Control-2012.pdf, Page: 48 (Preview: the requirement.
§ 236.16 When an inspector finds a main track electric lock releasing circuit t...)
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Signal_Train_Control-2012.pdf, page 27.
2. Core Subject: Limited unaccompanied inspection of signals, switches, and appliances.
3. Key Manual Points: inspectors are limited to visual inspections; no access to tests, relays, circuit plans, etc.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a non-permitted move.
2. The page explicitly limits inspection activity to visual inspections.
3. The page also says such inspection should be arranged at a later date if a railroad officer declines to accompany or fails to provide someone else.
PHASE 3: SYNTHESIS
- Conclusion: the move is not allowed because it involves a non-permitted action.
- Why alternatives are wrong: they ignore the page's explicit limitation on inspection activity and the page's instruction to arrange a later inspection if a permit is not available.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the page's language to resolve the mixed-fact scenario.
ANSWER
No. The page says that on an unaccompanied inspection, the inspector's activities are limited to making visual inspections of signals, switches, and appliances. The page also says that such inspection should be arranged at a later date if a railroad officer declines to accompany or fails to provide someone else to accompany the inspector. The cited page does not permit the unaccompanied movement of a non-permitted item like a leaking LPG tank car.
I see that without a source filter, the chunks retrieved are not even from the Hazardous Materials Compliance Manual. This tells me that we need to improve the retriever. To fix this, I updated the RailVectorVault class and added the following two methods.
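The two methods themselves live in the repo; to illustrate the hybrid idea, here is a minimal sketch of reciprocal rank fusion, which is my choice of fusion scheme for illustration (the repo's implementation may instead weight vector and BM25 scores directly):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs (e.g., one from vector
    search, one from BM25). Each document earns 1/(k + rank) per list;
    documents ranked well by BOTH retrievers float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example using chunk IDs from this post:
vector_hits = ["id_571", "id_2304", "id_573"]
bm25_hits = ["id_571", "id_573", "id_564"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# -> ['id_571', 'id_573', 'id_2304', 'id_564']
```

Note how id_573, ranked third and second, overtakes id_2304, which only one retriever liked; that agreement bonus is exactly what pushes off-topic chunks down. A cross-encoder reranker would then rescore this fused shortlist.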
| Method | hit@1 | hit@3 | hit@5 | MRR |
|---|---|---|---|---|
| Vector | 0.216 | 0.317 | 0.378 | 0.284 |
| Hybrid | 0.250 | 0.409 | 0.508 | 0.362 |
| Rerank | 0.467 | 0.663 | 0.724 | 0.568 |
| Rerank with BGE | 0.416 | 0.665 | 0.734 | 0.544 |
| Rerank with BGE and rerank instructions | 0.520 | 0.720 | 0.760 | 0.619 |
| Rerank with BGE and rerank instructions (increase n_initial to 50) | 0.429 | 0.703 | 0.781 | 0.569 |
By moving from simple Vector search to a Hybrid-Rerank architecture, we increased our Hit@1 accuracy by 116% (from 21% to 46%). While the retrieval takes longer, it ensures the LLM is reasoning over the correct safety regulations.
The metrics for Rerank with BGE model went down. However, when I added a rerank instruction (“Query” and “Passage”) the Hit@1 jumped from 0.416 to 0.520. That is a 25% relative improvement just by changing a few characters of text. A Hit@5 of 0.760 means that for 3 out of every 4 safety questions, the LLM will have the correct regulation in its immediate context. An MRR of 0.62 confirms that the correct answer is almost always the very first or second thing the model suggests.
I also tried rerank with an increased n_initial, which reduced the MRR, while Hit@5 went UP (\(0.760 \rightarrow 0.781\)). This proves that increasing n_initial worked as intended for recall. However, if I increase the number of initially retrieved samples to 50 or 100, we are inviting "hard negatives" into the room. Based on this, we can choose the n_initial parameter as follows:

- If the goal is LLM accuracy: stick with the pool size that yields the highest Hit@5, because the fine-tuned Llama model is smart enough to ignore the four "distractor" chunks if the "gold" chunk is present.
- If the goal is retrieval efficiency: stick with the n_initial that yields the highest MRR (which seems to be \(n=25\) in this case).
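For reference, here is how hit@k and MRR can be computed, assuming one "gold" chunk per query (my minimal implementation, not necessarily the evaluation code in the repo):

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold chunk appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold chunk (1-indexed), or 0.0 if it was missed."""
    return 1.0 / (ranked_ids.index(gold_id) + 1) if gold_id in ranked_ids else 0.0

# Toy eval set: (retrieved ranking, gold chunk id) per query
queries = [
    (["a", "gold", "c"], "gold"),  # gold at rank 2
    (["gold", "b"], "gold"),       # gold at rank 1
    (["x", "y", "z"], "gold"),     # gold missed entirely
]
hit1 = sum(hit_at_k(r, g, 1) for r, g in queries) / len(queries)
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
print(round(hit1, 3), round(mrr, 3))  # -> 0.333 0.5
```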
In this step, I added the retrieved chunks for each query and added it to the context. This is available at https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/04_ra_rag.ipynb. The function below takes the question (prompt), the vault (or the vector store), along with other arguments (n_results to select how many chunks we want the query to retrieve, source_filter to filter which source we want the chunks from, and show_context to help in debugging by showing the retrieved contexts).
```python
def run_integrated_audit(question, vault, n_results=10, source_filter=None, show_context=False):
    search_params = {"n_results": n_results}
    if source_filter:
        search_params["where"] = {"source": source_filter}

    query_vector = vault.embedder.generate_embeddings([question])
    query_list = query_vector.tolist() if hasattr(query_vector, 'tolist') else query_vector

    results = vault.collection.query(
        query_embeddings=query_list,
        **search_params
    )

    if show_context:
        print(f"--- LIBRARIAN REPORT ({len(results['documents'][0])} chunks found) ---")

    # Extract text and metadata for the prompt
    context_parts = []
    if results['documents']:
        for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
            if show_context:
                print(f"Chunk {i+1}: Source: {meta['source']}, Page: {meta['page']} (Preview: {doc[:100]}...)")
            context_parts.append(f"[SOURCE: {meta['source']}, PAGE: {meta['page']}]\n{doc}")
    if show_context:
        print("-" * 40)

    context = "\n---\n".join(context_parts)
    messages = [
        {"role": "system", "content": "You are a Senior FRA Safety Consultant. Use your 4-Phase Thinking Process. Answer ONLY based on the provided context."},
        {"role": "user", "content": f"CONTEXT FROM MANUALS:\n{context}\n\nQUESTION:\n{question}"},
    ]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=1024,
            use_cache=True,
            temperature=0,
            do_sample=False
        )
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return response
```

Rerunning with the reranker retriever:
```python
test_question = "A rail yard manager wants to move a leaking LPG tank car to a remote siding for repair without a permit. Is this allowed?"
```
--- LIBRARIAN REPORT: Question 1 (5 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_525) Preview: investigation and decision-making efforts on issues related to tank car safety. ...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_572) Preview: applicable to such movements. Additionally, an OTMA approval issued by FRA is on...

The generation call also printed a library warning: `Both max_new_tokens (=1024) and max_length (=131072) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information.` (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, § 11.2.2 COFC/TOFC Approval.
2. Core Subject: Movement of non-complying bulk packages by rail within the United States.
3. Key Manual Points: COFC/TOFC movement is allowed only under conditions specified in 49 C.F.R. § 174.63 or approved by the Associate Administrator for Safety, FRA.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario lacks a permit.
2. The page says permits are required for such movements.
3. The cited section also says the approval is only applicable to movements within the United States.
PHASE 3: SYNTHESIS
- Conclusion: the movement is not allowed without a permit.
- Why alternatives are wrong: they ignore the cited permit requirement.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the manual's language exactly.
ANSWER
No. The page says that movement of a bulk packaging (e.g., portable tank, IM portable tank, IBC, Large Packaging, cargo tank, or multi-unit tank car tank) containing a hazardous material in container-on-flatcar (COFC) or trailer-on-flatcar (TOFC) service is allowed only under conditions specified in 49 C.F.R. § 174.63 or approved by the Associate Administrator for Safety, FRA. The cited section also says the approval is only applicable to movements of a nonconforming bulk package/packaging by rail within the United States. The cited page does not allow the movement without a permit.
--- LIBRARIAN REPORT: Question 1 (5 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 276 (ID: id_2304) Preview: 236-108 points is occupied. The exception to the traffic locking requireme...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Chapter 11, page 35, § 11.2 One-Time Movement Approvals.
2. Core Subject: One-time movement approval for non-complying bulk packages for special purposes.
3. Key Manual Points: FRA may grant one-time movement approval for special purposes under specified conditions; the approval does not relieve liability for the movement; the approval is only for non-complying bulk packages.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The repair is in a remote siding.
3. The page says the approval is for non-complying bulk packages for special purposes.
4. The page also says the approval does not relieve liability.
PHASE 3: SYNTHESIS
- Conclusion: the one-time movement approval is not a blanket permission to move any leaking tank car.
- Why alternatives are wrong: they treat the repair as a general justification for movement, ignoring the page's specific limitation.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the manual's language exactly.
ANSWER
No. The page says one-time movement approval is available for non-complying bulk packages for special purposes under specified conditions, not for any leaking tank car. The approval does not relieve the carrier from liability for the movement. The page specifically limits the approval to non-complying bulk packages.
Even though both versions correctly answered “No,” the first version was “smarter” (citing § 174.63) while the second version was more “general.”
The “174.63” Disappearing In the Filtered run, the Librarian gave the Analyst five high-quality Hazmat chunks. Chunk 5 (ID 572) was included in the context. Because the Analyst had that specific data point, it could provide a “High-Resolution” answer. In the Unfiltered run, look at Chunk 2 (ID 2304): Source: FRA-Signal_Train_Control-2012.pdf, Page: 276. This Signal Manual chunk is not relevant for the question asked. Because it took up a slot in the context window, it displaced Chunk 5 (ID 572). Without ID 572, the Analyst lost the specific knowledge of § 174.63. It had to rely on the remaining chunks (like Chapter 11.2) to find a reason to say “No.”
The Success of Phase 3 (Reasoning) The impressive thing here is that both answers are still correct. Even when the retriever made a mistake and brought a Signal book to the desk, the LLM was trained well enough to:
The fact that the Signal manual (ID 2304) still shows up in the Top 2 of the Unfiltered Rerank is proof of why we need the source_filter or a stronger Hybrid weight.
On looking at the two interesting chunks, the reason why 2304 was picked (with no source filter) becomes clearer.
This is id_572
applicable to such movements. Additionally, an OTMA approval issued by FRA is only
applicable to movements of a nonconforming bulk package/packaging by rail within the United
States.
38
The OMTA process also provides an informational database that can identify systemic problems
with a particular series of tank car or valve, as well as documents the root cause and contributing
factors of defective conditions.
The information obtained as a result of the movement approval process may identify facilities
that have procedural problems and require assistance in ensuring that their practices, in regard to
package preparation for transportation, are adequate to ensure regulatory compliance and safety.
Further guidance on the OTMA process is explained in the current version of Hazardous
Materials Guidance (HMG)-127 available for review on FRA’s website.
11.2.2 COFC/TOFC Approval
{'char_count': 905, 'source': 'FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf', 'page': 35, 'type': 'technical_manual'}
This is id_2304
236-108
points is occupied.
The exception to the traffic locking requirements (§ 236.405) applies only in instances when a
train is left on the main track while its engine and/or cars move into an adjacent siding or yard
for switching purposes and must, in returning to its train, reverse its direction for a short
distance. It is permissible in such instances to permit such movements to be made with a signal
aspect indicating “Proceed at Restricted Speed” into the occupied block.
CLASSIFICATION OF DEFECTS
236.405.A1 On track signaled for movements in bot h directions, occupancy of track between
opposing signals at adjacent controlled points does not prevent changing the
direction of traffic from that which obtained at the time the track became
occupied. (Note: An exception added January 24, 1966, permits display of an
aspect not less restrictive than that indicating “proceed at restricted speed” by a
{'type': 'technical_manual', 'source': 'FRA-Signal_Train_Control-2012.pdf', 'char_count': 938, 'page': 276}
To a vector model, id_2304 looks like a perfect match because it describes the exact physical action the manager wants to take: moving a car into a siding.
Because id_572 is written in “Regulatory Speak” and your question was written in “Operational Speak,” the distance-based search (Vector) and the keyword search (BM25) both found id_2304 more compelling than id_572.
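A crude way to see this lexical pull (a toy illustration; the two passages below are trimmed paraphrases of the chunks, not the full text):

```python
def overlap(query, passage):
    """Naive keyword overlap: how many exact word tokens are shared."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

question = "move a leaking tank car to a remote siding for repair"
snippet_2304 = "cars move into an adjacent siding or yard for switching purposes"
snippet_572 = "applicable to movements of a nonconforming bulk package by rail"

print(overlap(question, snippet_2304), overlap(question, snippet_572))  # -> 3 2
```

The operational chunk shares the literal words "move", "siding", and "for" with the question, while the regulatory chunk's "movements" and "package/packaging" never match exactly. Embeddings soften this somewhat, but the surface-vocabulary gap between "Operational Speak" and "Regulatory Speak" still skews both retrievers.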
Increasing n_results to 10 caused a context window overflow. I updated the max_seq_length of the model to 4096 to be able to pass all 10 chunks as context into the LLM.
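A back-of-the-envelope token budget shows why 10 chunks overflow a smaller window. Both the earlier 2048-token limit and the 4-characters-per-token heuristic are my assumptions:

```python
chunk_chars = 1000        # chunk_size used at ingestion
n_chunks = 10
chars_per_token = 4       # rough heuristic for English prose

context_tokens = n_chunks * chunk_chars // chars_per_token  # ~2500
prompt_overhead = 200     # system prompt, question, separators (a guess)
max_new_tokens = 1024     # room reserved for the generated answer

total = context_tokens + prompt_overhead + max_new_tokens
print(total)  # -> 3724, which does not fit in 2048 but fits in 4096
```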
--- LIBRARIAN REPORT: Question 1 (10 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website.11.2.2 C...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_525) Preview: investigation and decision-making efforts on issues related to tank car safety. ...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_572) Preview: applicable to such movements. Additionally, an OTMA approval issued by FRA is on...
- Chunk 6: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 36 (ID: id_575) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 31 11.2.3 Cylinders in COFC/TOFC Se...
- Chunk 7: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
- Chunk 8: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_523) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 13 3.4 Responsibilities of Tank Car Te...
- Chunk 9: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 11 (ID: id_504) Preview: • Part 174 – Carriage by Rail • Part 178 – Specifications for Packagings • Par...
- Chunk 10: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_524) Preview: organizations for tank car safety initiatives. • Provides subject matter expert...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, Chapter 11.1 Notifications.
2. Core Subject: Notification requirements for certain tank car movements.
3. Key Manual Points: the shipper must notify FRA of non-received tank cars within 20 days; the notification may be made by email or phone; the movement is only allowed if the car is repaired or replaced within 20 days.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a tank car that is leaking.
2. The scenario involves movement without a permit.
3. The manual specifically addresses notification for non-received tank cars.
PHASE 3: SYNTHESIS
- Conclusion: the movement is not allowed without a permit.
- Why alternatives are wrong: they ignore the notification requirement.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the page number as metadata.
ANSWER
No. The manual says that the shipper must notify FRA of non-received tank cars within 20 days from the date of shipment. The page also says that the movement is only allowed if the car is repaired or replaced within 20 days. The cited page does not allow the movement to proceed without a permit.
--- LIBRARIAN REPORT: Question 1 (10 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 276 (ID: id_2304) Preview: 236-108 points is occupied. The exception to the traffic locking requireme...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
- Chunk 6: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_523) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 13 3.4 Responsibilities of Tank Car Te...
- Chunk 7: Source: FRA-Signal_Train_Control-2012.pdf, Page: 279 (ID: id_2319) Preview: The provision of exception (2) does not apply to maintenance-of-way work equipme...
- Chunk 8: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 25 (ID: id_543) Preview: (e.g., classification, training, OTMA implementation, package securement). Shipp...
- Chunk 9: Source: FRA-HRGC-Handbook-2019.pdf, Page: 171 (ID: id_1088) Preview: Solutions to Blocked Crossings Appendix C to the Noteworthy Practices Guide add...
- Chunk 10: Source: FRA-Signal_Train_Control-2012.pdf, Page: 278 (ID: id_2313) Preview: except signaled sidings, exceeds 20 mph, each hand-operated switch must comply w...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Chapter 7.2.6 Tank Car Owner Inspection.
2. Core Subject: Tank car owner inspection and repair of leaking tank cars.
3. Key Manual Points: the tank car owner is responsible for ensuring the tank car is in compliance with the functions listed; the inspection includes classification, training, OTMA implementation, package securement, and safety and security; the inspection is comprised of four general regulatory elements.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The page lists four inspection elements.
3. The repair is on a remote siding, not at the facility.
PHASE 3: SYNTHESIS
- Conclusion: the repair must be done under the inspection program, not outside of it.
- Why alternatives are wrong: the page lists inspection as a regulatory function, not repair.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the cited manual points as the basis for the conclusion.
ANSWER
No. The page says the tank car owner is responsible for ensuring the tank car is in compliance with the functions listed, including classification, training, OTMA implementation, package securement, and safety and security. The inspection is comprised of four general regulatory elements. The page does not allow repair outside the inspection program. The cited page says the inspection is comprised of four general regulatory elements: Special Permits, registration, training, document retention, package selection, as well as safety and security. Repair of a leaking tank car is one of the functions the page says the owner must perform.
In this case, more data made the model “dumber.” In the 5-chunk run, the “Signal-to-Noise Ratio” was high. The model had fewer options, so it had to look more closely at the text (e.g. OTMA (One-Time Movement Approval) text.) In the 10-chunk run, the Information Density dropped.
The RA-RAG system is then evaluated using a bigger model as an LLM-as-a-judge. I used the following rubrics.
1. FAITHFULNESS (1-5): Is the answer derived ONLY from the context provided in the thinking process? (1 = hallucinated/used external knowledge, 5 = perfectly grounded)
2. REGULATORY ACCURACY (1-5): Compare the AI answer to the ground-truth reference. Does the logic match? (1 = dangerous/incorrect, 5 = expert accuracy)
3. CITATION QUALITY (1-5): Did the model cite specific pages/sections correctly within the final answer, as per the thinking process?
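A minimal sketch of how such a judge prompt could be assembled; the build_judge_prompt helper and the exact wording are mine, not the notebook's:

```python
RUBRIC = """Score the AI answer on a 1-5 scale for each criterion:
1. FAITHFULNESS: grounded only in the provided context?
2. REGULATORY ACCURACY: does the logic match the ground truth?
3. CITATION QUALITY: are pages/sections cited correctly?
Return JSON like {"faithfulness": 5, "accuracy": 4, "citations": 3}."""

def build_judge_prompt(context, ground_truth, ai_answer):
    """Assemble one judging request for the larger 'judge' model."""
    return (
        f"{RUBRIC}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"GROUND TRUTH:\n{ground_truth}\n\n"
        f"AI ANSWER:\n{ai_answer}"
    )

prompt = build_judge_prompt("manual excerpt...", "No, a permit is required.", "No.")
print(prompt.startswith("Score the AI answer"))  # -> True
```

Asking the judge for structured JSON makes it easy to aggregate scores across the whole test set rather than reading free-form critiques.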
TODO: Complete this.
---
title: "Building a Rail Safety AI Engine"
author: "Jacob Mathew"
date: Fri Apr 3 18:12:24 CDT 2026
categories: [AI, LLM, transformers, python, torch]
format:
html:
theme: flatly
toc: true
toc-depth: 3
toc-location: right
code-fold: true
code-tools: true
code-copy: true
code-overflow: wrap
smooth-scroll: true
anchor-sections: true
link-external-newwindow: true
title-block-banner: true
page-layout: article
include-in-header:
text: |
<style>
:root {
--rs-accent: #0f4c81;
--rs-accent-soft: #eaf3fb;
--rs-border: #d9e4ef;
--rs-ink: #1f2937;
--rs-muted: #5f6b7a;
--rs-code-bg: #0b1220;
--rs-warm: #fff8ea;
--rs-warm-border: #f3d28b;
}
body {
text-rendering: optimizeLegibility;
}
.quarto-title-block .quarto-title-banner {
background:
linear-gradient(135deg, rgba(15,76,129,.95), rgba(31,111,235,.78)),
radial-gradient(circle at top right, rgba(255,255,255,.22), transparent 32%);
padding-top: 3.25rem;
padding-bottom: 3rem;
border-bottom: 1px solid rgba(255,255,255,.12);
}
.quarto-title-block .title {
font-weight: 800;
letter-spacing: -0.03em;
max-width: 12ch;
}
.quarto-title-meta {
font-size: .95rem;
}
main.content {
max-width: 900px;
}
.content p,
.content li {
color: var(--rs-ink);
font-size: 1.04rem;
line-height: 1.8;
}
.content p {
margin-bottom: 1rem;
}
.content h1,
.content h2,
.content h3,
.content h4 {
color: #10273f;
font-weight: 750;
letter-spacing: -0.02em;
}
.content h1 {
margin-top: 2.75rem;
margin-bottom: 1rem;
padding-bottom: .45rem;
border-bottom: 2px solid var(--rs-border);
}
.content h2 {
margin-top: 2.35rem;
margin-bottom: .75rem;
}
.content h3 {
margin-top: 1.75rem;
}
.content a {
color: var(--rs-accent);
text-decoration-thickness: .08em;
text-underline-offset: .14em;
}
.content ul,
.content ol {
padding-left: 1.35rem;
}
.content li + li {
margin-top: .35rem;
}
pre,
div.sourceCode {
border-radius: 18px;
border: 1px solid #172033;
box-shadow: 0 14px 36px rgba(15, 23, 42, 0.16);
}
pre code,
div.sourceCode code {
font-size: .92rem;
}
code:not(pre code) {
background: var(--rs-accent-soft);
color: #17476d;
border-radius: .45rem;
padding: .14rem .38rem;
}
.cell {
margin-top: 1.2rem;
margin-bottom: 1.5rem;
}
table {
width: 100%;
border-collapse: separate;
border-spacing: 0;
margin: 1.5rem 0 2rem;
overflow: hidden;
border: 1px solid var(--rs-border);
border-radius: 16px;
box-shadow: 0 10px 30px rgba(15, 23, 42, 0.06);
}
thead th {
background: #f4f8fc;
color: #17324d;
font-weight: 700;
border-bottom: 1px solid var(--rs-border);
}
th, td {
padding: .9rem .95rem;
vertical-align: top;
}
tbody tr:nth-child(even) {
background: #fbfdff;
}
blockquote {
border-left: 4px solid var(--rs-accent);
background: #f7fbff;
border-radius: 0 14px 14px 0;
padding: .9rem 1rem;
color: var(--rs-muted);
}
hr {
border-top: 1px solid var(--rs-border);
opacity: 1;
}
.sidebar nav[role="doc-toc"] {
border-left: 1px solid var(--rs-border);
padding-left: 1rem;
}
.sidebar nav[role="doc-toc"] .active {
color: var(--rs-accent) !important;
font-weight: 700;
}
span[style*="color: red"] {
display: inline-block;
background: var(--rs-warm);
color: #8a5a00 !important;
border: 1px solid var(--rs-warm-border);
border-radius: 999px;
padding: .12rem .5rem;
font-weight: 700;
margin-right: .35rem;
}
img,
.quarto-figure,
.figure {
border-radius: 18px;
}
@media (max-width: 991px) {
.quarto-title-block .title {
max-width: none;
}
main.content {
max-width: 100%;
}
}
</style>
---
# Summary
## Project Overview
### Core Objective
To build a Retrieval-Augmented Generation (RAG) system that bridges the gap between raw federal regulations (FRA manuals) and practical, field-level maintenance queries.
### Primary Goal
Transition from "black-box" AI responses to verifiable, grounded, and explainable safety auditing.
### Technical Architecture
- **The "Brain" (LLM):**\
Utilizes a fine-tuned **Llama-3.2-3B-Instruct model**, optimized with 4-bit quantization (BitsAndBytes) and Unsloth to run efficiently on consumer-grade hardware like a T4 GPU.
- **The "Librarian" (Vector Store):**\
Powered by **ChromaDB** using **BAAI/bge-base-en-v1.5** embeddings for semantic indexing of over 2,600 document chunks.
- **Multi-Stage Retrieval:**\
Employs a sophisticated funnel starting with **Hybrid Search** (lexical BM25 + semantic vector) followed by a **Cross-Encoder Reranker**.
### Retrieval Performance Milestones
- **Accuracy Boost:** Moving from simple Vector search to a Hybrid-Rerank architecture increased **Hit@1 accuracy by 116%** (from 0.216 to 0.467).
- **The "Instruct" Breakthrough:** Adding "Query" and "Passage" prefixes to the BGE-reranker boosted **Hit@1 to 0.520 and MRR to 0.619**.
- **Reliability:** Achieved a **Hit@5 of 0.760**, ensuring the correct regulation is in the model's context 3 out of 4 times.
### Key Engineering Insights
- **The Paradox of Context:** Discovered that increasing the initial candidate pool ($n_{initial}$) beyond 25 introduced "Hard Negatives" (e.g., Signal manual rules "poisoning" Hazmat queries), which actually degraded MRR.
- **Metadata Filtering:** Validated that source filtering is mandatory in specialized domains to prevent semantic overlap between different regulatory universes.
- **Behavioral Fine-Tuning:** Used QLoRA to train the model on a 4-Phase Thinking Process (Audit, Map, Synthesize, Verify) to enforce structural reasoning rather than just factual recall.
### Evaluation Strategy
- **Synthetic Data Generation:** Used a "Teacher" model (Gemini) to generate a technical golden set of questions and reasoning chains from raw manual excerpts.
- **LLM-as-a-Judge:** Implemented a secondary high-level auditor to grade the system on Faithfulness, Regulatory Accuracy, and Citation Quality.
# Project Roadmap
8-Week LLM Project Roadmap (Provided by Gemini Pro)
### Goal
Build a "Rail" RAG system that bridges the gap between raw federal regulations and field-level maintenance queries.
### Personal Goals
- Build proficiency in the core architecture of LLMs (e.g., transformers and attention mechanisms)
- Develop experience with prompt engineering techniques, including chain-of-thought prompting, Retrieval-Augmented Generation (RAG), fine-tuning of language models, and evaluation methodologies.
- Gain experience with Vector Databases and embeddings.
- Experiment with Model Fine-tuning.
- Experiment with GPU optimization.
<!--
### Phase 1: Foundation & Architecture (Weeks 1-2)
Goal: Move from "using an API" to understanding the math and structure of LLMs.
#### Week 1: Transformers & Embeddings
Day 01: Initialize GitHub Repo. Download the 5 FRA Safety PDFs.
- Create structure: /data/raw/, /notebooks/, /src/.
- Upload the 5 FRA Safety PDFs to data/raw/.
- Create a requirements.txt with core libraries (transformers, torch, langchain).
Day 02: High-Level Architecture. Read the "Attention is All You Need" paper (focus on the "Encoder/Decoder" diagram).
Day 03: Self-Attention Deep Dive. Study the math: $Attention(Q, K, V)$. Write a short explanation of why "Weighting" matters.
Day 04: Tokenizer Study. Experiment with Hugging Face's AutoTokenizer. See how specialized terms like "Trespass Prevention" are tokenized.
Day 05: Embeddings. Use sentence-transformers to generate vectors for 10 sentences from the Hazardous Material Manual. Visualize similarity.
#### Week 2: Vector Databases & Chunking
Day 06: Chunking Strategy. Write a Python script to split your PDFs into 250-token chunks. Explain "Overlap" (10-15%) for context preservation.
Day 07: Vector DB Setup. Set up a local Qdrant or ChromaDB instance (or use Pinecone Serverless).
Day 08: Indexing. Upload your 5 PDFs into the Vector DB. This is your "Long-term memory."
Day 09: The Retrieval Loop. Write a query script. If you ask "How to handle a Hazmat leak?" does it return the right page?
Day 10: Metadata Filtering. Learn how to tag chunks by PDF source (e.g., source: "Hazmat Manual 2025") to filter results.
### Phase 2: RAG & Advanced Prompting (Weeks 3-4)
Goal: Build the reasoning engine and handle safety edge cases.
#### Week 3: Prompt Engineering & CoT
Day 11: System Prompting. Create a "Safety Officer" persona. Force the model to cite the specific Chapter/Section.
Day 12: Chain-of-Thought (CoT). Implement a "Reasoning Step." The model must first identify the hazard, then the regulation, then the solution.
Day 13: Few-Shot Prompting. Provide the model with 3 examples of "Perfect Answers" to improve its output consistency.
Day 14: Self-Correction. Add a prompt step where the model checks its own answer against the retrieved text for "hallucinations."
Day 15: Blog Post #1. Write about "Building the Knowledge Base." Focus on why safety data requires high-precision retrieval.
#### Week 4: Evaluation
Day 16: Introduction to RAGAS. Install the library. Understand "Faithfulness" (Is the answer in the text?) and "Relevance."
Day 17: Synthetic Dataset. Generate 20 questions based on your 5 PDFs to use as an evaluation benchmark.
Day 18: Running Evaluations. Run your first "Eval." Document your score (e.g., 0.75).
Day 19: Optimization Iteration. Change a chunking parameter or prompt. Re-run the eval. Did the score improve?
Day 20: Documentation. Add an /evals folder to your repo with these results. This proves data-driven DS skill.
### Phase 3: Fine-Tuning & GPU Optimization (Weeks 5-6)
Goal: The "Hard Tech" skills. Running models efficiently on hardware.
#### Week 5: GPU Optimization
Day 21: Quantization Basics. Use bitsandbytes to load a Llama-3 model in 4-bit on Google Colab.
Day 22: VRAM Profiling. Use nvidia-smi or Python scripts to measure memory usage before and after quantization.
Day 23: Flash Attention 2. Research and enable Flash Attention to speed up inference for long safety documents.
Day 24: KV-Caching. Learn how the model "remembers" previous words in a conversation to speed up the chat experience.
Day 25: Blog Post #2. Write about "Efficiency at Scale." Explain how one could save money by using quantized models.
#### Week 6: Fine-Tuning (LoRA/QLoRA)
Day 26: Preparing Fine-Tuning Data. Format 50 pairs of Question/Answers from your manuals into a JSONL file.
Day 27: QLoRA Setup. Set up a training script using the peft library.
Day 28: The Training Run. Start a fine-tuning job on a small model (e.g., Llama 3.2 1B or 3B) in Colab.
Day 29: Validation. Compare the "Base Model" vs. "Fine-tuned Model." Does the fine-tuned one sound more like a Rail engineer?
Day 30: Model Merging. Learn how to merge your LoRA weights back into the main model.
### Phase 4: Deployment & Final Polish (Weeks 7-8)
Goal: Show you can build a production-ready interface.
#### Week 7: The Prototype
Day 31-33: Build a simple UI using Streamlit. Allow users to "Upload" a manual and ask questions.
Day 34-35: Implement Tracing. Use LangSmith or Phoenix to "see" inside the RAG loop during a live chat.
#### Week 8: The Staff Level Portfolio
Day 36-38: Final Blog Post summarizing the entire journey. Include your RAGAS scores and GPU benchmarks.
Day 39: Cleanup GitHub Repo. Ensure the requirements.txt and README are perfect.
Day 40: Re-apply. Update your resume with these specific keywords: RAG Evaluation, QLoRA Fine-tuning, 4-bit Quantization, Vector DB Architecture.
-->
# Phase 0: Data and Basic Theory
## Data
I've downloaded 5 publicly available FRA PDFs to use as training data.
1. HRGC handbook (2019)
2. Hazardous Material Compliance Manual (2025)
3. Signal & Train Control All Chapters (2012)
4. HRGC & Trespass Prevention: Compliance, Procedures & Program Manual (2019)
5. Track Structures CM (2026) Vol III Ch1: Railroad Workplace Safety General
## Git Repo
Any code I wrote is available publicly here : [https://github.com/jac0bmath3w/rail-safety-ai](https://github.com/jac0bmath3w/rail-safety-ai). Please note that any code snippets in this blog may not be up to date.
## Summary of the Paper
This is my understanding of transformers, attention, and what happens inside them. I read the model architecture (Section 3 of the Attention is All You Need paper, available [here](https://arxiv.org/abs/1706.03762), referred to as 'the paper' going forward in this blog) and viewed the videos by 3Blue1Brown & Andrej Karpathy to nail down my understanding of it.
Videos by 3Blue1Brown
- [Transformers, the tech behind LLMs | Deep Learning Chapter 5](https://www.youtube.com/watch?v=wjZofJX0v4M)
- [Attention in transformers, step-by-step | Deep Learning Chapter 6](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- [How might LLMs store facts | Deep Learning Chapter 7](https://www.youtube.com/watch?v=9-Jl0dxWQs8)
Video by Andrej Karpathy
- [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)
The transformer has an encoder part as well as a decoder part. But in this project, I'm only interested in the decoder part of the transformer. The steps involved are as follows:
1) We embed the tokens (the text being read is broken into tokens). A token could be a word, part of a word, a special character, etc. Embedding a token means representing it as a vector in a high-dimensional space (512 dimensions in the paper; modern models use more).
2) We add a position component for each token. Otherwise the model doesn't know what token appears where in the string.
3) Attention Block \
- 3.1 Each token has a query and key vector ($Q$ and $K$ respectively) that is obtained by multiplying the $W_q$ and $W_k$ respectively with the embedding.
- 3.2 We take the dot product of the $Q$ and $K$ vectors to get a score. If the key of token 1 aligns with the query of token 2, the score is large, meaning token 2 "attends to" token 1: token 1 "has something to say about" token 2.
- 3.3 We divide this score by the square root of the key dimension ($\sqrt{d_k}$) to prevent the dot products from growing too large.
- 3.4 We mask the score of future words by setting the value to $-\infty$. This is done so the model doesn't "see the answer." For example, in the string "I read the paper and it was interesting," once we reach "I read the paper," we don't want the model to see the next word. (Note: Masking is not done in an encoder). We use $-\infty$ because the softmax function will then turn that value into $0$.
- 3.5 We apply the softmax function to get the probabilities.
- 3.6 We calculate the value vector by multiplying the embedding with $W_v$ (Value weight matrix).
- 3.7 We compute the output for each token as the weighted sum of the value vectors, using these probabilities as the weights.
- 3.8 The whole transformer block (attention plus the layers described below) is stacked sequentially, 6 times in the original paper, or $n_{layer}$ times in general.
4) Everything in the attention block is computed in parallel across different heads; the $n_{heads}$ parameter determines how many heads we want. The results from all heads are concatenated into one vector, which is multiplied by a final weight matrix to fuse them back into a single $d_{model}$-dimensional vector.
5) Add & Norm Layer \
- 5.1 This result is added back to the original embedding to get the modified embedding. This is called the 'add' step, or the 'residual connection'.
- 5.2 To keep the activations within a consistent range, we apply a normalization layer (LayerNorm in the paper; modern models often use RMSNorm).
6) We have an MLP at the next level, where each of the embedding vectors is processed in parallel.
7) We add another Add & Norm layer as described before
8) A final linear layer projects each embedding to the vocabulary size.
9) A softmax turns these logits into a probability distribution over the next token.
Note: Since I am not talking about the encoder here, I skipped the cross attention part.
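The masked self-attention steps above (3.1 through 3.7) can be sketched in a few lines of NumPy. This is a single head with random illustrative weights, not code from the repo:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head masked self-attention over a sequence of embeddings X (T x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # step 3.1: query, key, value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # steps 3.2-3.3: scaled dot products
    mask = np.triu(np.ones_like(scores), k=1)    # step 3.4: hide future tokens
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step 3.5: softmax
    return weights @ V                           # step 3.7: weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the mask, the first token can only attend to itself, so its output is exactly its own value vector.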
## Other items
### Temperature
This is a parameter you divide the logits by before the final softmax when sampling the next token (it is not applied inside the attention block).
$$
\text{Softmax with Temperature} = \frac{e^{x_i/T}}{\sum_{n=0}^{N-1} e^{x_n/T}}
$$
- Low Temp (0.1): Makes the highest probability even higher (very focused).
- High Temp (1.0+): Flattens the distribution (more random/creative).
For FRA manuals, you’ll want a low temperature because you want facts, not "creative" safety rules.
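A quick NumPy sketch of how dividing the logits by $T$ sharpens or flattens the distribution (the logit values are arbitrary examples):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax; subtract the max for numerical stability."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.1)  # sharply peaked on the top logit
hot = softmax_with_temperature(logits, 2.0)   # flatter, more "creative" distribution
```

At T = 0.1 the top token gets essentially all of the probability mass, which is exactly the behavior we want when quoting safety regulations.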
# Phase 1: Building the ingestion engine, embeddings and vector stores.
The latest version of the code is available at [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/ingest.py](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/ingest.py)
## Ingestion
I create a class called RailDocumentProcessor. The class is initialized with the splitter tool, so that RecursiveCharacterTextSplitter is constructed just once when an object of the class is created (and not for every document). The splitter uses the parameters chunk_size = 1000 and chunk_overlap = 150. I chose a chunk_size of 1000 so as not to lose the context of a specific railway rule, and an overlap so that any 'bridge' between chunks is not lost. Also note: 'len' is the built-in Python function that returns the length of an object. If we were measuring in tokens instead of characters, we would replace 'len' with a token-counting function, e.g. lambda text: len(tokenizer.encode(text)).
Why use RecursiveCharacterTextSplitter as opposed to CharacterTextSplitter? RecursiveCharacterTextSplitter splits text intelligently by natural boundaries (paragraphs, sentences, spaces) to keep semantic context intact. CharacterTextSplitter splits simply by a fixed character count, which often breaks words and ruins context.
We use a chunk_size instead of processing the entire document because of the context window: LLMs have a limit on how many tokens they can see at once, and the compute cost of attention grows quadratically with context length.
This class has the process_directory() method. It goes through each .pdf file in the directory, extracts the text using the pypdf library, and creates chunks using the splitter tool. At the same time, it saves metadata for each chunk, including the source (i.e. the filename), the page number, and the character count of that chunk.
```python
import pypdf
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter

class RailDocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=150):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len
        )

    def process_directory(self, directory_path):
        """Processes all PDFs and returns a tuple: (list_of_chunks, list_of_metadatas)."""
        all_chunks = []
        all_metadatas = []
        if not os.path.exists(directory_path):
            print(f"Error: Directory {directory_path} not found.")
            return [], []
        for filename in os.listdir(directory_path):
            if filename.endswith(".pdf"):
                print(f"Processing: {filename}")
                path = os.path.join(directory_path, filename)
                try:
                    reader = pypdf.PdfReader(path)
                    for i, page in enumerate(reader.pages):
                        page_text = page.extract_text()
                        if not page_text:
                            continue
                        # Create chunks for this specific page
                        page_chunks = self.splitter.split_text(page_text)
                        for chunk in page_chunks:
                            all_chunks.append(chunk)
                            # Enhanced metadata for citations and filtering
                            all_metadatas.append({
                                "source": filename,
                                "page": i + 1,
                                "char_count": len(chunk)
                            })
                except Exception as e:
                    print(f"Could not process {filename}: {e}")
        return all_chunks, all_metadatas
```
For the 5 PDFs, this extracted 2,633 chunks. It surprised me to see 'only' ~3,000 chunks. But $$\text{Total Characters} \approx \text{Chunks} \times \text{Chunk Size}$$ which gives $$3{,}000 \times 1{,}000 = 3{,}000{,}000 \text{ characters}$$ (and less unique text once the 150-character overlap is subtracted). Even though the 5 PDF files were approximately 25MB, only about 2MB of it is useful text data.
## Embedding
The chunks obtained at the end of ingestion are sent to the embedder. This is available at [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/embed.py](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/embed.py)
```python
from sentence_transformers import SentenceTransformer
import torch

class RailEmbedder:
    def __init__(self, model_name='BAAI/bge-base-en-v1.5'):
        # Check if GPU is available
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Initializing Embedder on: {self.device}")
        self.model = SentenceTransformer(model_name, device=self.device)

    def generate_embeddings(self, text_chunks):
        embeddings = self.model.encode(text_chunks, show_progress_bar=True, convert_to_tensor=False)
        return embeddings
```
For this purpose, I defined a class called RailEmbedder, initialized with the sentence-transformers model 'BAAI/bge-base-en-v1.5', available at [https://huggingface.co/BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). Other sentence-transformer models include all-mpnet-base-v2 and all-MiniLM-L6-v2 (which people on the internet say is dated and should not be used).
This class has a method called generate_embeddings(), which takes the text_chunks as its argument and creates a 768-dimensional embedding for each chunk (768 is the output dimension of BAAI/bge-base-en-v1.5). The return value is a matrix of size 2633x768. If a GPU is available it uses that; otherwise it falls back to the (much slower) CPU. Earlier, I mentioned that each token has a vector embedding, and a token is approximately 4 characters ([source](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)). So why do we get 2,633 vectors out of generate_embeddings, instead of 2633 (chunks) x 250 (1000 chunk_size / 4 characters per token) token vectors of 768 dimensions each? This is because of pooling. What happens during model.encode is
1. Tokenizes the chunk into $N$ tokens.
2. Generates $N$ vectors (one for every token).
3. Collapses those $N$ vectors into one single vector that represents the average meaning of the whole chunk.
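The collapse step can be illustrated with mean pooling over toy data (note: BGE models actually use CLS pooling by default rather than a plain mean, so this sketch simplifies step 3):

```python
import numpy as np

# Toy illustration of pooling: N token vectors collapse into one chunk vector.
token_vectors = np.random.default_rng(1).standard_normal((250, 768))  # ~250 tokens x 768 dims
chunk_vector = token_vectors.mean(axis=0)                             # one 768-dim vector
print(chunk_vector.shape)  # (768,)
```

Whatever the pooling method, the result is one fixed-size vector per chunk, which is what makes the 2633x768 matrix possible.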
What does the convert_to_tensor argument in model.encode() do?\
- convert_to_tensor=True: \
The model returns a PyTorch tensor. If you are using a GPU, the tensor stays in VRAM (video RAM). This is extremely fast if you are immediately passing the data into another neural network layer (e.g. for training).\
- convert_to_tensor=False: \
The model returns a NumPy array, which is moved from the GPU back to system RAM (CPU memory). Vector databases (chromadb in this project) expect data in a standard format (NumPy). Since storage is the next step, I want this set to False.
[TODO: ]{style="color: red;"}Add a filter to the RailDocumentProcessor class: “If a chunk is less than 100 characters or contains more than 50% periods/dots (like a TOC), discard it.”\
[TODO: ]{style="color: red;"}Add a list of already processed files to ensure this can scale if more PDFs are added.
## Vector Store
This class acts as the "Librarian" of the system and handles storage, indexing, and retrieval. The embeddings generated by RailEmbedder are stored on disk using this class, via the chromadb library. I created a class RailVectorVault for this purpose, available at [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/vector_store.py](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/vector_store.py). I pass in an embedder_instance of class RailEmbedder and create a collection.
```python
import chromadb

class RailVectorVault:
    def __init__(self, embedder_instance, db_path="./vector_db", collection_name="rail_safety"):
        # We pass the embedder IN. This is called 'Dependency Injection'.
        self.embedder = embedder_instance
        self.client = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(name=collection_name)

    def add_documents(self, chunks, metadatas):
        # The Vault asks the Embedder to do its job
        vectors = self.embedder.generate_embeddings(chunks)
        ids = [f"id_{i}" for i in range(len(chunks))]
        self.collection.add(
            documents=chunks,
            embeddings=vectors.tolist(),
            metadatas=metadatas,
            ids=ids
        )

    def query(self, question, n_results=3):
        """
        Performs a semantic search.
        1. Embeds the question using the injected embedder.
        2. Queries ChromaDB for the closest matches.
        """
        # Embed the query string
        query_vector = self.embedder.generate_embeddings([question])
        # Search the collection
        results = self.collection.query(
            query_embeddings=query_vector.tolist(),
            n_results=n_results
        )
        return results
```
The add_documents() method takes the chunks and the metadata, computes the embeddings, and stores everything on disk. The query() method (1) turns the question into a vector, (2) finds the n_results most similar chunks, and (3) returns a dictionary containing the text and the metadata.
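Under the hood, the vector search amounts to ranking stored vectors by similarity to the query vector. A self-contained NumPy sketch of that ranking step (top_k_chunks is a made-up helper; ChromaDB's actual default metric is L2 distance and it uses an approximate index, so this is illustrative only):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Rank stored chunk vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = C @ q                       # cosine similarity of every chunk to the query
    order = np.argsort(-sims)[:k]      # indices of the k most similar chunks
    return order, sims[order]

rng = np.random.default_rng(2)
chunks = rng.standard_normal((100, 768))                # pretend embeddings of 100 chunks
query = chunks[42] + 0.01 * rng.standard_normal(768)    # a query very close to chunk 42
idx, scores = top_k_chunks(query, chunks, k=3)
print(idx[0])  # 42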
# Phase 2: RAG Pipeline
## The Reasoning Engine
I defined the class RailSafetyEngine, which serves as the reasoning brain of the RAG system. When you query the vector database, it returns the n_results chunks of text whose stored embeddings most closely match the query. Think of it as the librarian returning the relevant books you need to answer the query; the brain then processes them to produce the answer. I use the following imports from the transformers library:
1. AutoModelForCausalLM
- A class that loads the correct model architecture based on the model that you pass (example llama or gemma)
2. AutoTokenizer
- Handles the translation between human text and numerical values that are fed to a machine.
3. pipeline
- Wrapper for the tokenization, model execution, and decoding into one function call.
4. BitsAndBytesConfig
- This helps to compress the model so that it fits in the T4 GPU memory (used in colab).
When this class is initialized, it maps the device (CPU or GPU), sets up quantization (bnb_config), loads the model (using the scaled dot-product attention implementation, shortened to sdpa), and creates the pipeline. 'sdpa' and 4-bit quantization are the "Optimization Duo" that allows a model meant for a $20,000 server to run on a free Google Colab instance.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
import torch

class RailSafetyEngine:
    def __init__(self, model_id="unsloth/Llama-3.2-3B-Instruct"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Ensure a padding token exists
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",
            attn_implementation="sdpa"
        )
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=512,
            temperature=0.1,
            device_map="auto"
        )

    def generate_answer(self, question, context_chunks):
        if not context_chunks:
            return "No relevant safety manual excerpts were found."
        context_text = "\n\n".join(context_chunks)
        # 1. Structure the data as a standard list of messages
        messages = [
            {
                "role": "system",
                "content": "You are a Senior FRA Safety Consultant. Use ONLY the provided context. Cite Page numbers."
            },
            {
                "role": "user",
                "content": f"CONTEXT:\n{context_text}\n\nQUESTION: {question}"
            }
        ]
        # 2. Use the model-agnostic chat template
        # tokenize=False returns a string; add_generation_prompt=True adds the 'assistant' header
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        # 3. Generate
        outputs = self.pipe(
            prompt,
            do_sample=True,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.pad_token_id
        )
        # 4. Extract the response
        # Most tokenizers allow you to split by the prompt itself to get just the new text
        return outputs[0]["generated_text"][len(prompt):].strip()
```
This class has the generate_answer() method, which takes the question and the text returned by the librarian, and returns the answer. Some prompt engineering is involved to make the prompt as specific as possible for the model while keeping it a general wrapper for many different questions. The table below shows some of the iterations I went through to get to my current result.
| Phase | Problem Encountered | Solution | Result |
|-------|---------------------|-----------|--------|
|01: The "Dumb" Retriever | The AI kept talking about removing bridges when I asked about creating new crossings. | Metadata Awareness: We moved from flat text to structured chunks that know their File and Page. | Better citations, but still missing key technical data. |
| 02: The "Blind" Brain | The AI missed the specific numbers on Page 140 (AADT, speeds) because the search didn't find that page. | Increasing $n_{result}$ (Retrieval Depth): I bumped $n_{result}$ from 3 to 15 to ensure the "Librarian" checked more "shelves." | The model finally "saw" the specific thresholds in the manual. |
|03: The "Memory Leak" | The GPU RAM kept climbing to 12GB+ and hanging for 10 minutes. | Quantization & Cache Clearing: We used 4-bit BitsAndBytes and added torch.cuda.empty_cache() to "reset" the brain. | Inference time dropped from 10 mins to under 60 seconds. |
| 04: The "Brittle" Prompt | Using hardcoded tags like <start_header_id> meant we couldn't easily switch to newer models like Gemma. | Chat Templating: Switched to model-agnostic templates to allow for easy model swapping.| Portability across different LLMs.|
| 05: The "Black Box" | The AI couldn't find the author names (Brent Ogden/Kimley-Horn) because they were on the cover. | Strategic Context: We learned that "Admin" questions need different retrieval strategies (like looking at Page 1). |Higher professional accuracy. |
| 06: The Reasoning Gap | The AI's logic engine wasn't strong enough to weigh the "OR" triggers of the prompt | Chain of Thought Prompting | Correct answers to reasoning type questions. |
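For reference, here is a sketch of what the 4-phase Chain-of-Thought system prompt could look like. The wording is illustrative, reconstructed from the phase names that appear in the example answers below; the repo's actual prompt may differ:

```python
# Illustrative 4-phase Chain-of-Thought system prompt (not the repo's exact wording).
COT_SYSTEM_PROMPT = """You are a Senior FRA Safety Consultant.
Answer using ONLY the provided context, and think in four phases:
PHASE 1: CONTEXTUAL AUDIT - identify which manual/chapter the question falls under.
PHASE 2: EVIDENCE MAPPING - list the relevant passages with Source and Page.
PHASE 3: SYNTHESIS - derive the conclusion and explain why alternatives fail.
PHASE 4: VERIFICATION - confirm every claim is backed by a cited passage.
Then give the final ANSWER with page citations."""
```

Forcing the model to emit the phases in order is what closed the "Reasoning Gap" in iteration 06 of the table above.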
### Example queries to test current state of the engine.
#### Query 1
```
What enforcement tools are available when enforcing the hazardous materials regulations?
```
#### Answer
```
[THINKING PROCESS]
PHASE 1: CONTEXTUAL AUDIT
The user question pertains to the enforcement tools available when enforcing the Hazardous Materials Regulations. This falls under the scope of Chapter 10 – Enforcement and Compliance Program in the FRA-Hazardous Materials Compliance Manual.
PHASE 2: EVIDENCE MAPPING
From the provided manuals, we can extract the following relevant information:
* Section 10.3 Enforcement Tools (Source 9, Page 27)
* Section 10.3.1 Defect Notices (Source 9, Page 27)
* Section 10.3.2 Violation with Civil Monetary Penalties (Source 9, Page 27)
* Section 10.3.3 Enhanced Penalties (Source 9, Page 28)
* Section 10.3.4 Railworthiness Directives (Source 9, Page 28)
* Section 10.3.5 Compliance Orders (Source 9, Page 28)
* Section 10.3.6 Emergency Orders (Source 9, Page 28)
PHASE 3: SYNTHESIS
Based on the extracted information, the available enforcement tools when enforcing the Hazardous Materials Regulations are:
1. Defect Notices
2. Violation with Civil Monetary Penalties
3. Enhanced Penalties
4. Railworthiness Directives
5. Compliance Orders
6. Emergency Orders
These tools can be used by FRA inspectors to address noncompliance with the Hazardous Materials Regulations, ranging from less stringent measures like Defect Notices to more severe actions like Emergency Orders.
PHASE 4: VERIFICATION
All the extracted information is backed by specific citations from the provided manuals.
ANSWER
The available enforcement tools when enforcing the Hazardous Materials Regulations are:
1. Defect Notices
2. Violation with Civil Monetary Penalties
3. Enhanced Penalties
4. Railworthiness Directives
5. Compliance Orders
6. Emergency Orders
```
#### Query 2
```
A rural highway-rail grade crossing has these conditions:
- posted highway speed: 60 mph
- AADT: 18,500
- freight trains per day: 28
- maximum authorized train speed: 70 mph
- expected accident frequency with gates: 0.3 per year
- vehicle delay: 35 vehicle-hours per day
- acceptable alternate access exists within 0.8 mile
- closing the crossing would increase the median trip by 2.2 miles
Based on the Handbook’s Chapter 3 guidance, which of the following is best supported?
A. The crossing should primarily be considered for closure only
B. The crossing should primarily be considered for grade separation
C. Neither closure nor grade separation is supported
D. Both closure and grade separation are supported by the listed criteria
```
#### Answer
```
[THINKING PROCESS]
PHASE 1: CONTEXTUAL AUDIT
- The query is related to the decision-making process for highway-rail grade crossings.
- The provided context is from the FRA-HRGC-Handbook-2019.
PHASE 2: EVIDENCE MAPPING
- Source 1: Grade separation is recommended for crossings with AADT > 30,000 or when posted highway speed > 55 mph.
- Source 2: Grade separation is justified if the benefit-to-cost ratio is < 1.0 or if there are no viable alternatives.
- Source 3: Grade separation is considered for crossings with AADT < 1,000, acceptable alternate access, and median trip length not increased by > 2.5 miles.
- Source 4: Grade separation is recommended for crossings with train speeds > 40 mph, AADT > 2,000, and multiple lanes of traffic.
- Source 5: Grade separation is considered for crossings with high accident frequencies, vehicle delays, and limited alternate access.
- Source 6: Closure is the first alternative to consider, followed by grade separation.
- Source 7: Closure provides the highest level of crossing safety.
- Source 8: Closure is considered for crossings with high train speeds (> 125 mph) and limited sight distances.
- Source 9: Vehicles must stop within 50 feet of the crossing before proceeding.
- Source 10: Barrier systems are required for high-speed trains (> 110 mph).
- Source 11: Closure is considered for crossings with insufficient sight distances and limited alternate access.
- Source 12: Sight distances, approach sight distances, and clearing sight distances are critical for determining the need for grade separation.
- Source 13: The NCDOT considers various factors when deciding whether to close or improve a crossing.
- Source 14: Grade separation is considered for crossings with high accident frequencies and limited alternate access.
- Source 15: The NCDOT considers various factors when deciding whether to close or improve a crossing.
PHASE 3: SYNTHESIS
- Based on the provided criteria, the crossing should be considered for grade separation due to the high AADT, posted highway speed, and maximum authorized train speed.
- The presence of freight trains per day and expected accident frequency with gates also support grade separation.
- However, the acceptable alternate access and median trip length increase do not strongly support grade separation.
- The NCDOT's guidelines and the FRA's regulations support the consideration of grade separation for the given conditions.
PHASE 4: VERIFICATION
- All points in the synthesis are supported by specific citations from the provided manuals.
ANSWER
B
```
[TODO: ]{style="color: red;"} Look at Few Shot Prompting (by providing example answer), Self-Correction System (a second pass where the AI reviews its own answer) and Multi-Agent Systems
At this point, retrieval is working (appropriate chunks come back from the vector database) and the engine is generating good results. Still, there is room for improvement: the latency for query 2 was around 2 minutes, and getting the correct answer required a massive system prompt with 15 context chunks. Both could be brought down.
# Phase 3.1 : Fine Tuning
There are several avenues where we could fine tune this system. They include tuning the chunk size, the overlap window, engineering the prompt, or the number of chunks retrieved. We could also tune the embedding model (the retriever), the re-ranker, or the generator LLM.
- Training the embedding model
This involves improving recall@k, precision of first-stage retrieval, robustness to paraphrased questions, and domain-specific search quality
- Training the re-ranker
This looks at relevance, rather than just looking at the distance in the vector space.
- Training the generator LLM
Here we teach the model things like answering using the retrieved context, citing source chunks, refusing when evidence is insufficient, following a desired answer format, and synthesizing across multiple chunks.
I'm starting with training the generator first. To do this, we need to create 'training data'. This would involve a list of queries, the thinking process associated with each query, and their answers. Ideally, it would be created by a panel of experts, but that is a costly endeavor. For this project, I'm making API calls to Gemini 2.5 to do this for me. To do this, I created a RailDataGenerator class, available at [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/generator.py](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/src/generator.py).
```python
import json
import os
import random
import time

import requests


class RailDataGenerator:
    def __init__(self, vault_instance, api_url, api_key):
        self.vault = vault_instance
        self.api_url = api_url  # e.g. "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-09-2025:generateContent"
        self.api_key = api_key

    def _call_teacher(self, system_prompt, user_query):
        payload = {
            "contents": [{"parts": [{"text": user_query}]}],
            "systemInstruction": {"parts": [{"text": system_prompt}]}
        }
        # Increased backoff steps to handle stricter 2026 rate limits
        for delay in [2, 4, 8, 16, 32]:
            try:
                # Added a 30s timeout to prevent the script from hanging forever
                response = requests.post(
                    f"{self.api_url}?key={self.api_key}",
                    json=payload,
                    timeout=30
                )
                if response.status_code == 200:
                    result = response.json()
                    return result.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', "")
                elif response.status_code == 429:
                    print(f"Rate limit hit. Retrying in {delay}s...")
                else:
                    print(f"Teacher API Error {response.status_code}: {response.text}")
            except requests.exceptions.Timeout:
                print("Request timed out. Retrying...")
            except Exception as e:
                print(f"Request Exception: {e}")
            time.sleep(delay)
        return None

    def generate_training_sample(self, chunk_text, file_name, page_num):
        """
        Ask the Teacher to generate a complex question + reasoning process
        for the given chunk from the vault.
        """
        system_prompt = (
            "You are a Senior FRA Rail Safety Expert. Your task is to generate training data "
            "for a student model. Based on the provided manual excerpt, create a challenging "
            "technical question and a perfect response following the 4-Phase Thinking Process.\n\n"
            "PHASE 1: CONTEXTUAL AUDIT\nPHASE 2: EVIDENCE MAPPING\nPHASE 3: SYNTHESIS\nPHASE 4: VERIFICATION\n\n"
            "Output MUST be in valid JSON format: "
            "{'question': '...', 'thinking': '...', 'answer': '...'}"
        )
        user_query = f"MANUAL EXCERPT ({file_name}, Page {page_num}):\n{chunk_text}"
        raw_output = self._call_teacher(system_prompt, user_query)
        if not raw_output:
            return None
        # Clean JSON if model included markdown blocks
        clean_json = raw_output.replace("```json", "").replace("```", "").strip()
        try:
            return json.loads(clean_json)
        except json.JSONDecodeError:
            return None

    def create_dataset(self, num_samples=100, output_path="data/training/rail_dataset.jsonl"):
        samples = []
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        print(f"Starting generation of {num_samples} sample(s)...")
        # Get random chunks (without replacement) from the collection
        all_data = self.vault.collection.get()
        if not all_data or not all_data['documents']:
            return None
        total_chunks = len(all_data['documents'])
        actual_sample_count = min(num_samples, total_chunks)
        indices = random.sample(range(total_chunks), actual_sample_count)
        if num_samples > total_chunks:
            print(f"only {total_chunks} sample(s) available, so creating {actual_sample_count} unique sample(s)")
        for idx in indices:
            chunk_text = all_data['documents'][idx]
            file_name = all_data['metadatas'][idx].get('source', 'Unknown')
            page_num = all_data['metadatas'][idx].get('page', '?')
            sample = self.generate_training_sample(chunk_text, file_name, page_num)
            if sample:
                samples.append(sample)
                # Append to file immediately so you don't lose data if it crashes
                with open(output_path, 'a') as f:
                    f.write(json.dumps(sample) + "\n")
            print(f"Generated {len(samples)}/{actual_sample_count}")
            # MANDATORY COOL-DOWN:
            # 3 seconds between requests helps stay under the 20 RPM limit
            time.sleep(3)
        return output_path
```
Any object of this class is initialized with a RailVectorVault instance. The create_dataset() method of this class creates a JSON Lines file. It retrieves random chunks (sampled without replacement) from the vector DB, along with their metadata, and passes an engineered system prompt in an API call to Gemini (or any other service). A few other ways to generate the training examples include
- Self-Instruct, where you take 5-10 human written examples and ask the model to generate similar examples of its own. This can be tricky if the model keeps repeating the same logic over and over.
- Log-Mining, where you look at the history of how people used the RAG pipeline developed in phase 2, and use the good answers as training data. Here, the system must already be in production.
- Data-Augmentation, where you take high quality Q&A pairs and use an LLM to rewrite it in several different ways by changing tone, adding noise, etc.
If using proprietary data, it is imperative to either mask any PII (or other information that shouldn't leave a closed ecosystem) or use an offline model to generate these samples. Also, if the training data is 100% AI generated, it cannot be copyrighted, so it is a good idea to mix in a percentage of human-in-the-loop data when generating this training data.
Here is an example of a generated training sample:
```json
{"question": "A maintenance technician inspects an active grade crossing warning system and notes that one of the gate arm lights is illuminated but appears significantly dimmer than the other lights on the same gate arm. During daylight hours, this particular light is difficult to discern from the prescribed approach distance of 200 feet, although it is clearly visible at night. The light unit itself and its wiring are observed to be securely fastened to the gate arm.\n\nBased on FRA regulations under Section 234-29 (implied 234.219), which specific defect classification(s) apply to this situation, and what is the overarching intent of this section that is being violated by this condition?", "thinking": "PHASE 1: CONTEXTUAL AUDIT\n1. **Source Document**: FRA-Signal_Train_Control-2012.pdf, Page 124, Section 234-29 (which corresponds to 234.219 for defect codes).\n2. **Core Subject**: Maintenance, visibility, and securement of gate arm lights and wires at active warning systems.\n3. **Key Requirements/Provisions**: Each gate arm light must be properly visible to approaching highway users (and pedestrians), lights and wires must be securely fastened, and maintenance must adhere to design specifications.\n4. **Specific Defect Classifications**: A1 (burned out/missing light), A2 (defective/not visible/missing light unit), A3 (light unit not secured), A4 (wires not secured), A5 (not maintained per design specifications).\n5. **Intent Statement**: \"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\" This is critical.\n6. **Question Goal**: Create a challenging scenario requiring differentiation between similar defect codes and linking to the section's overarching intent.\n\nPHASE 2: EVIDENCE MAPPING\n1. 
**Analyze Scenario Elements**: \n * \"illuminated but appears significantly dimmer\": Rules out A1 ('burned out or missing' literally).\n * \"difficult to discern from the prescribed approach distance of 200 feet during daylight hours\": Directly violates the requirement for lights to be \"properly visible to approaching highway users.\" This strongly points to A2 ('not visible') and A5 ('not maintained per design specifications').\n * \"clearly visible at night\": Confirms the light is functional but performance is compromised under specific conditions (daylight), reinforcing a maintenance/design specification issue.\n * \"light unit itself and its wiring are observed to be securely fastened\": Rules out A3 ('Light unit not securely fastened') and A4 ('Light wires not securely fastened').\n\n2. **Map Scenario to Defect Codes**: \n * **234.219.A1 (Gate arm light burned out or missing)**: *Not applicable* because the light is illuminated.\n * **234.219.A2 (Gate arm light unit defective, not visible, or missing)**: *Applicable*. The light is \"not visible\" from the required distance during daylight, and its dimness suggests the unit is \"defective\" in its performance.\n * **234.219.A3 (Light unit not securely fastened)**: *Not applicable* as it's stated to be securely fastened.\n * **234.219.A4 (Light wires not securely fastened)**: *Not applicable* as wiring is stated to be securely fastened.\n * **234.219.A5 (Gate arm light unit not maintained per design specifications)**: *Applicable*. The diminished brightness and failure to be visible under daylight conditions indicate a failure to meet design specifications for performance and maintenance. This is further supported by the section's stated intent.\n\n3. 
**Identify Overarching Intent**: The excerpt explicitly states: \"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\" The scenario directly demonstrates a failure to meet this, as the light's performance (visibility) has degraded from its specified standard.\n\nPHASE 3: SYNTHESIS\n* **Applicable Defects**: 234.219.A2 (due to being not visible/defective) and 234.219.A5 (due to not meeting design specs for visibility/brightness). \n* **Non-Applicable Defects**: 234.219.A1, A3, A4 (with clear justifications).\n* **Overarching Intent Violation**: Failure to maintain lights in accordance with design specifications, which directly impacts their intended visibility for safety.\n\nPHASE 4: VERIFICATION\n* **Completeness**: All parts of the question (specific classifications, overarching intent) are addressed.\n* **Accuracy**: The classifications and reasoning directly align with the provided text. The distinction between A1 and A2/A5 is carefully made.\n* **Clarity**: The answer is structured logically, explaining why each code applies or doesn't apply.\n* **Challenging Aspect**: The question forces the student to go beyond simple identification (e.g., 'burned out') to interpret 'not visible' and 'not maintained per design specifications' in a nuanced scenario where the light is technically 'on'. It also requires understanding the explicit 'intent' statement.", "answer": "The specific defect classifications applicable to this situation are:\n\n1. **234.219.A2: Gate arm light unit defective, not visible, or missing.**\n * The scenario clearly states that the light, despite being illuminated, is \"difficult to discern from the prescribed approach distance of 200 feet during daylight hours.\" This directly meets the \"not visible\" criterion as required for approaching highway users. 
Furthermore, the description of it being \"significantly dimmer\" strongly indicates that the light unit itself is \"defective\" in its performance, as its output has degraded below acceptable operating levels.\n\n2. **234.219.A5: Gate arm light unit not maintained per design specifications.**\n * The observed diminished brightness and subsequent lack of visibility under daylight conditions demonstrate a failure to maintain the gate arm light unit to its intended design specifications for luminance and conspicuity. The proper functioning and visibility of the light are fundamental design requirements, and their degradation constitutes a failure in maintenance according to those specifications.\n\nThe following defect classifications are *not* applicable:\n\n* **234.219.A1 (Gate arm light burned out or missing)**: This is not applicable because the light is explicitly stated to be \"illuminated,\" meaning it is neither burned out nor physically missing.\n* **234.219.A3 (Light unit not securely fastened to gate arm)**: This is not applicable as the scenario states the \"light unit itself... [is] observed to be securely fastened.\"\n* **234.219.A4 (Light wires not securely fastened to gate arm)**: This is not applicable as the scenario states the \"wiring [is] observed to be securely fastened.\"\n\nThe **overarching intent** of this section that is being violated is precisely stated in the manual excerpt: **\"The intent of this section is that lights and light wires shall be maintained in accordance with design specifications.\"** The observed condition\u2014a significantly dimmer light that is not adequately visible in daylight\u2014directly contravenes this intent by demonstrating a failure to maintain the gate arm light unit's performance to its designed safety and operational standards for ensuring proper warning to approaching highway users."}
```
The training is done in [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/03_fine_tuning.ipynb](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/03_fine_tuning.ipynb). The basic steps involve
1. load a pretrained model (Llama-3.2-3B)
2. attach small trainable adapter layers
3. train only those adapters (the adapter weight matrices)
4. save the newly trained adapters.
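The adapter idea behind steps 2 and 3 can be sketched in a few lines of NumPy. This is a toy illustration of the LoRA math only, not the actual Unsloth internals; all dimensions and names below are made up:

```python
import numpy as np

# LoRA: keep the big pretrained weight W frozen and learn a low-rank
# update B @ A instead. Toy dimensions for illustration.
d_in, d_out, r, alpha = 64, 64, 16, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # small trainable matrix
B = np.zeros((d_out, r))                    # starts at zero, so the adapter
                                            # initially changes nothing

x = rng.standard_normal(d_in)
base_out = W @ x
lora_out = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(base_out, lora_out)      # identical before any training

# Only A and B are trained: 2 * (64 * 16) = 2,048 values versus
# 64 * 64 = 4,096 in W alone; the ratio shrinks fast at real model scale.
```

At the real 3B scale, this is how only ~0.75% of the parameters end up trainable.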
I use Unsloth's FastLanguageModel class for loading and preparing the LLM for efficient fine-tuning. Unsloth is specifically meant to reduce memory usage and make Quantized Low-Rank Adaptation (QLoRA) more practical on smaller hardware. In the from_pretrained() function, the most important parameter is 'load_in_4bit' = True. This compresses the weights of the model aggressively, and thus takes less memory on the GPU (with a tradeoff of slightly lower numerical precision). This function returns the model (the neural network) and the tokenizer (which turns text into tokens).
In the next step, we call get_peft_model (PEFT = Parameter-Efficient Fine-Tuning). This is the step where I turn the base model into a QLoRA fine-tunable model. The 'parameter-efficient' part of PEFT means we do not retrain the large model; only an extra, small set of weights is trained (Trainable parameters = 24,313,856 of 3,237,063,680, i.e., 0.75% trained). The arguments for this function include
- r \
This is the LoRA rank. A smaller rank means less flexibility in the adapter. A rank of 16 is a balanced choice between lighter adapters and ones that would require much more memory.
- target_modules \
This lists the parts of the transformer where the LoRA adapters are inserted. Here, I insert adapters into the attention block and the feed-forward block.
- lora_alpha \
Scaling factor for the LoRA update. 16 is commonly used.
- lora_dropout \
A regularization method. Unsloth recommends 0 for efficiency and speed.
- bias \
Whether to train the bias terms. Here it is disabled so that only the weights are trained. This is a standard LoRA setting.
- [use_gradient_checkpointing]{style="color: red;"} \
Gradient checkpointing saves GPU memory by not storing every intermediate activation during a forward pass.
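Putting those arguments together, the call looks roughly like this. This is a sketch based on Unsloth's published examples; the exact module list and values in my notebook may differ:

```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                         # LoRA rank: capacity vs. memory balance
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention block
        "gate_proj", "up_proj", "down_proj",      # feed-forward block
    ],
    lora_alpha=16,                # scaling factor for the LoRA update
    lora_dropout=0,               # 0 recommended by Unsloth for speed
    bias="none",                  # train weights only, no bias terms
    use_gradient_checkpointing="unsloth",  # trade compute for GPU memory
)
```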
In the next step, the JSONL file is loaded. This file is structured with 3 columns: 'question', 'thinking', and 'answer'. I wrote a function that takes each line and turns it into a conversation string. This is done because the base model is an instruction/chat model. Instead of 3 separate fields, I have one long text field containing the full conversation (stored in the variable messages).
```python
def formatting_prompts_func(examples):
    instructions = examples["question"]
    thoughts = examples["thinking"]
    answers = examples["answer"]
    texts = []
    for instruction, thought, answer in zip(instructions, thoughts, answers):
        # We wrap the thinking and answer together as the 'Assistant' response
        full_response = f"[THINKING PROCESS]\n{thought}\n\n[ANSWER]\n{answer}"
        messages = [
            {"role": "system", "content": "You are a Senior FRA Safety Consultant. Use a 4-Phase Thinking Process."},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": full_response},
        ]
        # `tokenizer` comes from the enclosing notebook scope
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}
```
The function apply_chat_template() takes structured chat messages and converts them into the exact string format expected by the model. tokenize=False returns the response as a string (not tokens), and add_generation_prompt=False means that no additional text is appended (it is already present in messages). I don't tokenize here; I pass the text to the SFTTrainer (specified in the next step). Unsloth uses its specialized fast kernels, which handle padding more efficiently than PyTorch.
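Conceptually, the chat template flattens the structured messages into one training string. Here is a toy stand-in to show the idea; the delimiters below are invented for illustration, and the real Llama-3 template uses its own special tokens:

```python
def toy_chat_template(messages):
    # Made-up delimiters purely for illustration; NOT the real Llama-3 format
    return "".join(f"<|{m['role']}|>\n{m['content']}\n<|end|>\n" for m in messages)

messages = [
    {"role": "system", "content": "You are a Senior FRA Safety Consultant."},
    {"role": "user", "content": "What is a defect notice?"},
    {"role": "assistant", "content": "[THINKING PROCESS]\n...\n[ANSWER]\n..."},
]
flat = toy_chat_template(messages)
print(flat.startswith("<|system|>"))  # True
```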
In the next step, I use the SFTTrainer and SFTConfig classes from TRL (Transformer Reinforcement Learning), which are high-level training helpers (SFT = Supervised Fine-Tuning). In this step, we show the model prompts and their desired responses (the training data) and teach it to imitate them. The SFTTrainer:
1. prepares the dataset
2. tokenizes the text
3. creates batches
4. runs the forward pass
5. calculates the loss
6. runs backpropagation
7. updates the trainable parameters
SFTConfig stores the configuration settings of the trainer. One important argument is the data_collator. This component takes a list of dataset examples and turns them into one batch. However, when the SFTTrainer tokenized the data, it also kept the human-readable text field, which needs to be removed before batching. The DropTextCollator acts as a gatekeeper and throws away the human-readable text.
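The gatekeeper idea can be sketched like this. This is a minimal stand-in; the actual DropTextCollator in the notebook may differ:

```python
class DropTextCollator:
    """Wraps a base collator and strips the human-readable 'text' field
    from each example before the batch is built."""

    def __init__(self, base_collator):
        self.base_collator = base_collator

    def __call__(self, features):
        cleaned = [{k: v for k, v in f.items() if k != "text"} for f in features]
        return self.base_collator(cleaned)

# Toy demonstration with an identity "collator" standing in for the real one
collator = DropTextCollator(lambda feats: feats)
batch = collator([{"input_ids": [1, 2, 3], "text": "human readable"}])
print(batch)  # [{'input_ids': [1, 2, 3]}]
```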
Once the trainer is initialized, trainer.train() is called to complete the training. Below is the output from the training process.
Unsloth - 2x faster free finetuning | Num GPUs used = 1 \
Num examples = 101 | Num Epochs = 5 | Total steps = 60 \
Batch size per device = 1 | Gradient accumulation steps = 8 \
Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8 \
Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained) \
Here is an explanation of this output:
- Num examples = 101\
The number of training examples sent
- Num Epochs \
The number of full passes through the entire dataset. With 60 steps and a total batch size of 8, that is approximately 5 passes ($\frac{60 \times 8}{101} \approx 4.75$)
- Total steps \
This is the number of times the model updated its weights
- Total batch size = 8 \
$$\text{Batch Per Device (1)} \times \text{Accumulation (8)} \times \text{GPUs (1)} = 8$$
- Trainable parameters\
Only the adapter matrices are trained which involves ~24M parameters (as opposed to the 3B)
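The training-log numbers above can be cross-checked with a bit of arithmetic:

```python
# Reproducing the arithmetic from the Unsloth training log above
batch_per_device = 1
grad_accum_steps = 8
num_gpus = 1
num_examples = 101
total_steps = 60

total_batch_size = batch_per_device * grad_accum_steps * num_gpus
epochs = total_steps * total_batch_size / num_examples
trainable, total = 24_313_856, 3_237_063_680

print(total_batch_size)            # 8
print(round(epochs, 2))            # 4.75 -- reported as ~5 epochs
print(f"{trainable / total:.2%}")  # 0.75%
```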
A few other optimization tweaks include:
1. small batch size (per_device_train_batch_size = 1)
2. gradient accumulation (gradient_accumulation_steps = 8)
3. 8 bit optimizer (optim = "adamw_8bit")
4. disabled packing (packing=False) to prevent VRAM spikes.
## Quick Evaluation
I tried a new prompt that was not present in the training data to see if the model performed well.
#### Evaluation Query
```
A rail yard manager is overseeing the movement of a tank car containing
liquefied petroleum gas (LPG). The car has a small leak at the top valve.
The manager wants to move the car to a remote siding for repair without
a special permit, arguing it's safer to move it than leave it near the main office.
Based on the Hazardous Materials Compliance Manual, is this allowed?
```
#### Answer
```
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, file page 28, Chapter 4, § 4.6.1-1.
2. Core Subject: Movement of hazardous materials in rail yards.
3. Key Manual Points: rail yards are subject to the HMR; rail yards are considered a rail-yard operation; movement of hazardous materials is subject to the permit system; the permit system applies even if the movement is within the yard.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The car contains LPG, a listed hazardous material.
2. The leak is on the car, not the yard.
3. The proposed movement is within the yard.
4. The page explicitly says the permit system applies to all rail-yard operations involving hazardous materials.
PHASE 3: SYNTHESIS
- Conclusion: the movement requires a special permit, regardless of the yard location.
- Why alternatives are wrong: they treat the yard as a non-permit operation.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the file page number as metadata.
ANSWER
No. The manual states that rail yards are subject to the permit system, even if the movement is within the yard. It also states that the permit system applies to all rail-yard operations involving hazardous materials. Therefore, the manager must obtain a special permit before moving the car, even if it's within the yard.
```
Note that:
- I didn't prompt the model the way I did earlier
- The answer adheres to the format in which I trained the model
- However, it hallucinates, citing page 28, Chapter 4.6.1-1 (which doesn't exist)
Fine-tuning a 3B model teaches it Reasoning, not Knowledge Retrieval. Even after tuning, the model will hallucinate citations if context isn't provided. This proves that Fine-Tuning is not a replacement for RAG, but a way to make RAG more intelligent. In the next phase, I added chunks to the context to see if it helped solve the hallucination problem.
# Phase 3.2 : Improving the retriever
I passed the question "A rail yard manager wants to move a leaking LPG tank car to a remote siding for repair without a permit. Is this allowed?" and tested using n_results = 5, comparing the results with the source filter on and off. (For this particular question, the answer is in the Hazardous Materials Compliance Manual.)
### With source filter:
```
Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (Preview: Materials Guidance (HMG)-127 available for review on FRA’s website.
11.2.2 COFC/TOFC Approval
A...)
Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL
30
CHAPTER 11 – NOTIFICATIONS AND APPROVALS
11.1 Notifica...)
Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 10 (Preview: transportation. Section 172.820, Additional planning requirements for transportation by rail,
requi...)
Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 36 (Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL
31
11.2.3 Cylinders in COFC/TOFC Service
Cylinders may...)
Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when the
violation is kno...)
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, Chapter 11.2.2 One-Time Movement Approvals.
2. Core subject: One-time movement approval for non-complying bulk packages.
3. Key manual points: FRA may allow movement of non-complying bulk packages for a special purpose (e.g., repair) under specified conditions ensuring the safety of the rail movement; the approval does not relieve a party from any liability applicable to such movements.
4. Question goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The page specifically addresses non-complying bulk packages.
3. The cited approval process is for special purposes like repair.
PHASE 3: SYNTHESIS
- Conclusion: the movement is allowed under the one-time movement approval process for repair purposes.
- Why alternatives are wrong: they assume the repair is not a special purpose.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the fact pattern to test precise application of the manual to the scenario.
ANSWER
No. The page says that FRA may allow movement of non-complying bulk packages for a special purpose (e.g., repair) under specified conditions ensuring the safety of the rail movement. The cited approval process is for repair purposes, not for transportation. The page also says that the approval does not relieve a party from any liability applicable to such movements.
```
### Without source filter:
```
Chunk 1: Source: FRA-Signal_Train_Control-2012.pdf, Page: 27 (Preview: arrangements for a tentative inspection.
While an unaccompanied inspection is permitted, such ins...)
Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 279 (Preview: The provision of exception (2) does not apply to maintenance-of-way work equipment. Such
maintenan...)
Chunk 3: Source: FRA-Track_Structures_CM_2026_Vol_III_Ch1_Railroad_Workplace_Safety_General.pdf, Page: 8 (Preview: switches and bridges that are moveable via a control machine (by a train dispatcher or
control oper...)
Chunk 4: Source: FRA-Signal_Train_Control-2012.pdf, Page: 278 (Preview: except signaled sidings, exceeds 20 mph, each hand-operated switch must comply with this
section. ...)
Chunk 5: Source: FRA-Signal_Train_Control-2012.pdf, Page: 48 (Preview: the requirement.
§ 236.16 When an inspector finds a main track electric lock releasing circuit t...)
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Signal_Train_Control-2012.pdf, page 27.
2. Core Subject: Limited unaccompanied inspection of signals, switches, and appliances.
3. Key Manual Points: inspectors are limited to visual inspections; no access to tests, relays, circuit plans, etc.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a non-permitted move.
2. The page explicitly limits inspection activity to visual inspections.
3. The page also says such inspection should be arranged at a later date if a railroad officer declines to accompany or fails to provide someone else.
PHASE 3: SYNTHESIS
- Conclusion: the move is not allowed because it involves a non-permitted action.
- Why alternatives are wrong: they ignore the page's explicit limitation on inspection activity and the page's instruction to arrange a later inspection if a permit is not available.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the page's language to resolve the mixed-fact scenario.
ANSWER
No. The page says that on an unaccompanied inspection, the inspector's activities are limited to making visual inspections of signals, switches, and appliances. The page also says that such inspection should be arranged at a later date if a railroad officer declines to accompany or fails to provide someone else to accompany the inspector. The cited page does not permit the unaccompanied movement of a non-permitted item like a leaking LPG tank car.
```
I see that without a source filter, the chunks retrieved are not even from the Hazardous Materials Compliance Manual. This tells me that we need to improve the retriever. To fix this, I updated the RailVectorVault class and added the following two methods.
- hybrid_query()
- rerank_query()
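hybrid_query() combines vector-similarity results with keyword (BM25) results. A common way to fuse the two ranked lists is Reciprocal Rank Fusion, sketched below. This is an illustration of the general technique with made-up chunk ids, not the exact code from the repo:

```python
def reciprocal_rank_fusion(rank_lists, k=60):
    """Fuse several ranked lists of ids. Docs ranked well in multiple
    lists get boosted; docs appearing in only one list sink."""
    scores = {}
    for ranked_ids in rank_lists:
        for position, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]  # semantic search order
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]    # keyword search order
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# chunk_b wins: it appears near the top of BOTH lists
```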
### The Metrics
- Hit@1 : This is the most critical metric. It tells you how often the very first result was the correct chunk. At 0.467, the Reranker is getting the exact right page nearly 47% of the time.
- Hit@3 / Hit@5 : These tell you how often the correct answer was "in the top 3" or "in the top 5." The Reranker has a 72.4% Hit@5. This means that if you feed 5 chunks to the LLM, it has the "truth" in front of it 72% of the time.
- MRR (Mean Reciprocal Rank): This is the average of $1 / \text{Rank}$. If the answer is at #1, you get 1.0. If it's at #2, you get 0.5. The Reranker MRR of 0.567 means that, on average, the correct answer sits between Rank 1 and Rank 2.
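These metrics are cheap to compute once you know the rank of the correct chunk for each evaluation query. A minimal sketch, where `ranks` is made-up illustration data:

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose correct chunk appeared in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank; a miss (None) contributes 0."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)

# Rank of the correct chunk for five hypothetical queries (None = not retrieved)
ranks = [1, 2, None, 1, 3]
print(hit_at_k(ranks, 1))                     # 0.4
print(hit_at_k(ranks, 5))                     # 0.8
print(round(mean_reciprocal_rank(ranks), 3))  # 0.567
```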
<!--|Hybrid (simple union of results from vector and BM25) |0.262| 0.422| 0.529| 0.374637|-->
<!-- |Rerank with higher initial pool | 0.468 |0.646 |0.709 |0.562050 -->
| |hit@1| hit@3|hit@5 |mrr |
|-------|-----|------|-------|----------|
|Vector |0.216| 0.317| 0.378| 0.284|<!--0.284376-->
|Hybrid |0.250 |0.409| 0.508| 0.362| <!--0.361931-->
|Rerank |0.467| 0.663| 0.724| 0.568| <!--0.567983-->
|Rerank with BGE | 0.416| 0.665| 0.734| 0.544| <!--0.544500-->
|Rerank with BGE and rerank instructions | 0.520| 0.720| 0.760| 0.619| <!--0.619367-->
|Rerank with BGE and rerank instructions (increase n_initial to 50) |0.429 | 0.703| 0.781| 0.569| <!--0.568500-->
By moving from simple vector search to a Hybrid-Rerank architecture, we increased our Hit@1 accuracy by 116% (from 0.216 to 0.467). While retrieval takes longer, it ensures the LLM is reasoning over the correct safety regulations.
<!-- The slight drop in Hybrid performance (e.g., $0.262$ to $0.250$ in Hit@1) is a known side effect of Reciprocal Rank Fusion. In the first evaluation, the Hybrid method used a "Best of Either" approach. If the correct chunk was #1 in Vector search but #200 in Keyword search, the system kept it at #1. RRF is a "Consensus" algorithm. It heavily penalizes documents that appear in only one list. If the correct chunk has a high semantic score but weak keyword overlap RRF will "sink" that result because the two methods don't agree. -->
<!-- Why Rerank with new Hybrid was still better? By increasing the initial pool to 25 candidates, we improved the Recall (the likelihood that the right answer is somewhere in the top set). Even if RRF moved the correct answer from #1 to #4, the Reranker (which is far more intelligent than the basic search) was able to "see" that #4 was the best fit and move it back to #1. -->
The metrics for Rerank with BGE model went down. However, when I added a rerank instruction ("Query" and "Passage") the Hit@1 jumped from 0.416 to 0.520. That is a 25% relative improvement just by changing a few characters of text. A Hit@5 of 0.760 means that for 3 out of every 4 safety questions, the LLM will have the correct regulation in its immediate context. An MRR of 0.62 confirms that the correct answer is almost always the very first or second thing the model suggests.
I also tried reranking with an increased n_initial, which reduced the MRR while Hit@5 went UP ($0.760 \rightarrow 0.781$). This shows that increasing n_initial worked as intended for Recall. However, if I increase the number of initially retrieved samples to 50 or 100, we are inviting "Hard Negatives" into the room. Based on this, we can choose the n_initial parameter as follows.
- If the goal is LLM Accuracy: Stick with the pool size that yields the highest Hit@5, because the fine-tuned Llama model is smart enough to ignore the 4 "distractor" chunks if the "Gold" chunk is present.
- If the goal is Retrieval Efficiency: Stick with the n_initial that yields the highest MRR (which seems to be $n=25$ in this case).
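The rerank instruction change is literally just prefixing the strings the cross-encoder sees. A sketch of the idea, where `score_fn` stands in for the real bge-reranker-v2-m3 scorer (the function and parameter names here are illustrative, not the repo's API):

```python
def rerank(query, candidates, score_fn, top_k=5, use_instructions=True):
    """Re-score an initial candidate pool with a cross-encoder-style scorer."""
    scored = []
    for doc in candidates:
        if use_instructions:
            # The "few characters of text" that moved Hit@1 from 0.416 to 0.520
            text = f"Query: {query}\nPassage: {doc}"
        else:
            text = f"{query}\n{doc}"
        scored.append((score_fn(text), doc))
    # Keep only the top_k highest-scoring candidates, best first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

With a real reranker, `score_fn` would run the model on the query-passage pair; the only difference between the two table rows above is the `use_instructions` framing.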
# Phase 4: Reasoning-Augmented RAG
In this step, I retrieve the chunks for each query and add them to the context. This is available at [https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/04_ra_rag.ipynb](https://github.com/jac0bmath3w/rail-safety-ai/blob/main/notebooks/04_ra_rag.ipynb). The function below takes the question (prompt) and the vault (the vector store), along with other arguments: n_results to select how many chunks the query should retrieve, source_filter to restrict which source the chunks come from, and show_context to help with debugging by printing the retrieved contexts.
```python
def run_integrated_audit(question, vault, n_results=10, source_filter=None, show_context=False):
    # Build the query parameters, optionally restricting to one source PDF
    search_params = {"n_results": n_results}
    if source_filter:
        search_params["where"] = {"source": source_filter}

    # Embed the question and query the vector store
    query_vector = vault.embedder.generate_embeddings([question])
    query_list = query_vector.tolist() if hasattr(query_vector, 'tolist') else query_vector
    results = vault.collection.query(
        query_embeddings=query_list,
        **search_params
    )

    if show_context:
        print(f"--- LIBRARIAN REPORT ({len(results['documents'][0])} chunks found) ---")

    # Assemble the retrieved chunks, tagged with their source metadata, into the prompt context
    context_parts = []
    if results['documents']:
        for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
            if show_context:
                print(f"Chunk {i+1}: Source: {meta['source']}, Page: {meta['page']} (Preview: {doc[:100]}...)")
            context_parts.append(f"[SOURCE: {meta['source']}, PAGE: {meta['page']}]\n{doc}")
        if show_context:
            print("-" * 40)
    context = "\n---\n".join(context_parts)

    messages = [
        {"role": "system", "content": "You are a Senior FRA Safety Consultant. Use your 4-Phase Thinking Process. Answer ONLY based on the provided context."},
        {"role": "user", "content": f"CONTEXT FROM MANUALS:\n{context}\n\nQUESTION:\n{question}"},
    ]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

    # Greedy decoding: we want deterministic, factual answers
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=1024,
            use_cache=True,
            temperature=0,
            do_sample=False
        )
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return response
```
<!-- [TODO: ]{style="color: red;"} Complete this -->
Rerunning with the reranker retriever:

```python
test_question = "A rail yard manager wants to move a leaking LPG tank car to a remote siding for repair without a permit. Is this allowed?"
```
### Feeding only 5 retrieved chunks to the LLM
#### With Source Filter
```
--- LIBRARIAN REPORT: Question 1 (5 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_525) Preview: investigation and decision-making efforts on issues related to tank car safety. ...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_572) Preview: applicable to such movements. Additionally, an OTMA approval issued by FRA is on...

Both `max_new_tokens` (=1024) and `max_length` (=131072) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, § 11.2.2 COFC/TOFC Approval.
2. Core Subject: Movement of non-complying bulk packages by rail within the United States.
3. Key Manual Points: COFC/TOFC movement is allowed only under conditions specified in 49 C.F.R. § 174.63 or approved by the Associate Administrator for Safety, FRA.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario lacks a permit.
2. The page says permits are required for such movements.
3. The cited section also says the approval is only applicable to movements within the United States.
PHASE 3: SYNTHESIS
- Conclusion: the movement is not allowed without a permit.
- Why alternatives are wrong: they ignore the cited permit requirement.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the manual's language exactly.
ANSWER
No. The page says that movement of a bulk packaging (e.g., portable tank, IM portable tank, IBC, Large Packaging, cargo tank, or multi-unit tank car tank) containing a hazardous material in container-on-flatcar (COFC) or trailer-on-flatcar (TOFC) service is allowed only under conditions specified in 49 C.F.R. § 174.63 or approved by the Associate Administrator for Safety, FRA. The cited section also says the approval is only applicable to movements of a nonconforming bulk package/packaging by rail within the United States. The cited page does not allow the movement without a permit.
```
#### Without Source Filter
```
--- LIBRARIAN REPORT: Question 1 (5 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 276 (ID: id_2304) Preview: 236-108 points is occupied. The exception to the traffic locking requireme...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Chapter 11, page 35, § 11.2 One-Time Movement Approvals.
2. Core Subject: One-time movement approval for non-complying bulk packages for special purposes.
3. Key Manual Points: FRA may grant one-time movement approval for special purposes under specified conditions; the approval does not relieve liability for the movement; the approval is only for non-complying bulk packages.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The repair is in a remote siding.
3. The page says the approval is for non-complying bulk packages for special purposes.
4. The page also says the approval does not relieve liability.
PHASE 3: SYNTHESIS
- Conclusion: the one-time movement approval is not a blanket permission to move any leaking tank car.
- Why alternatives are wrong: they treat the repair as a general justification for movement, ignoring the page's specific limitation.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the manual's language exactly.
ANSWER
No. The page says one-time movement approval is available for non-complying bulk packages for special purposes under specified conditions, not for any leaking tank car. The approval does not relieve the carrier from liability for the movement. The page specifically limits the approval to non-complying bulk packages.
```
Even though both versions correctly answered "No," the first version was "smarter" (citing § 174.63) while the second version was more "general."
1. The Disappearing "§ 174.63" \
In the Filtered run, the Librarian gave the Analyst five high-quality Hazmat chunks. Chunk 5 (ID 572) was included in the context. Because the Analyst had that specific data point, it could provide a "High-Resolution" answer. In the Unfiltered run, look at Chunk 2 (ID 2304): Source: FRA-Signal_Train_Control-2012.pdf, Page: 276. This Signal Manual chunk is not relevant for the question asked. Because it took up a slot in the context window, it displaced Chunk 5 (ID 572). Without ID 572, the Analyst lost the specific knowledge of § 174.63. It had to rely on the remaining chunks (like Chapter 11.2) to find a reason to say "No."
2. The Success of Phase 3 (Reasoning)
The impressive thing here is that both answers are still correct. Even when the retriever made a mistake and brought a Signal book to the desk, the LLM was trained well enough to:
- Ignore the irrelevant Signal data.
- Synthesize a "No" based on the general OTMA rules it did have.
- Maintain the 4-Phase Thinking Process perfectly.
The fact that the Signal manual chunk (ID 2304) still shows up in the Top 2 of the unfiltered rerank is proof of why we need the source_filter or a stronger Hybrid weight.
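A "stronger Hybrid weight" could be as simple as giving the semantic list a larger vote in the fusion. A hypothetical sketch (the weights, function name, and the scenario where the two retrievers disagree are all assumptions for illustration):

```python
def weighted_rrf(vector_ids, bm25_ids, w_vector=0.7, w_bm25=0.3, k=60):
    """Reciprocal Rank Fusion variant where the vector list outvotes BM25."""
    scores = {}
    for weight, ids in ((w_vector, vector_ids), (w_bm25, bm25_ids)):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical disagreement: BM25 prefers the keyword-dense Signal chunk,
# but the heavier vector vote lets the Hazmat chunk win the tie-break
fused = weighted_rrf(["id_572", "id_2304"], ["id_2304", "id_572"])
```

Of course, weighting only helps when the retrievers disagree; when both rank the wrong chunk first, only a metadata filter saves you.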
Looking at the two interesting chunks, the reason why id_2304 was picked (with no source filter) becomes clearer.
This is id_572:
```
applicable to such movements. Additionally, an OTMA approval issued by FRA is only
applicable to movements of a nonconforming bulk package/packaging by rail within the United
States.
38
The OMTA process also provides an informational database that can identify systemic problems
with a particular series of tank car or valve, as well as documents the root cause and contributing
factors of defective conditions.
The information obtained as a result of the movement approval process may identify facilities
that have procedural problems and require assistance in ensuring that their practices, in regard to
package preparation for transportation, are adequate to ensure regulatory compliance and safety.
Further guidance on the OTMA process is explained in the current version of Hazardous
Materials Guidance (HMG)-127 available for review on FRA’s website.
11.2.2 COFC/TOFC Approval
{'char_count': 905, 'source': 'FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf', 'page': 35, 'type': 'technical_manual'}
```
This is id_2304:
```
236-108
points is occupied.
The exception to the traffic locking requirements (§ 236.405) applies only in instances when a
train is left on the main track while its engine and/or cars move into an adjacent siding or yard
for switching purposes and must, in returning to its train, reverse its direction for a short
distance. It is permissible in such instances to permit such movements to be made with a signal
aspect indicating “Proceed at Restricted Speed” into the occupied block.
CLASSIFICATION OF DEFECTS
236.405.A1 On track signaled for movements in bot h directions, occupancy of track between
opposing signals at adjacent controlled points does not prevent changing the
direction of traffic from that which obtained at the time the track became
occupied. (Note: An exception added January 24, 1966, permits display of an
aspect not less restrictive than that indicating “proceed at restricted speed” by a
{'type': 'technical_manual', 'source': 'FRA-Signal_Train_Control-2012.pdf', 'char_count': 938, 'page': 276}
```
1. [Word Overlap]{.underline}\
The test question was: "A rail yard manager wants to move a leaking LPG tank car to a remote siding for repair without a permit." Now, look at the vocabulary density in the Signal chunk (id_2304): \
- "move into an adjacent siding or yard" (Direct hit on "move", "siding", and "yard").
- "for switching purposes" (Semantically close to the intent of moving a car).
- "permissible in such instances to permit such movements" (Direct hit on "permissible/allowed" and "permit" and "movements").
To a vector model, id_2304 looks like a perfect match because it describes the exact physical action the manager wants to take: moving a car into a siding.
2. [Semantic Gap]{.underline}\
Now, look at the Hazmat chunk (id_572) that we wanted the model to prioritize:\
- It uses terms like "nonconforming bulk package/packaging" instead of "leaking tank car."
- It uses "OTMA approval" instead of "permit."
- It doesn't mention the word "siding" or "yard" at all.
Because id_572 is written in "Regulatory Speak" and your question was written in "Operational Speak," the distance-based search (Vector) and the keyword search (BM25) both found id_2304 more compelling than id_572.
3. [Why the Reranker Struggled]{.underline}\
The Signal chunk was a "Hard Negative." It contains almost all the keywords of the question but belongs to a completely different regulatory universe.
- Vector Search saw: "Move," "Siding," "Yard," "Permit."\
- BM25 saw: "Move," "Siding," "Yard," "Permit."\
- The Result: Both "Librarians" put the Signal chunk at the top of the pile because it had a higher "keyword density" for the operational parts of the test question.
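This can be made concrete with a toy word-overlap score, a crude stand-in for BM25, applied to short excerpts of the two chunks:

```python
def keyword_overlap(query, passage):
    """Count distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


query = ("A rail yard manager wants to move a leaking LPG tank car "
         "to a remote siding for repair without a permit")
signal_excerpt = ("It is permissible in such instances to permit such movements "
                  "to be made ... move into an adjacent siding or yard "
                  "for switching purposes")
hazmat_excerpt = ("an OTMA approval issued by FRA is only applicable to movements "
                  "of a nonconforming bulk package/packaging by rail "
                  "within the United States")

# The operationally-worded Signal chunk out-scores the "Regulatory Speak" one
assert keyword_overlap(query, signal_excerpt) > keyword_overlap(query, hazmat_excerpt)
```

Real BM25 weights rare terms more carefully, but the imbalance is the same: the Hard Negative shares the question's operational vocabulary, and the correct chunk does not.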
### Feeding 10 retrieved chunks to the LLM
This caused a context-window overflow. I updated the max_seq_length of the model to 4096 to be able to pass 10 chunks as context to the LLM.
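A rough back-of-the-envelope check for how many chunks fit in the window (the numbers here are assumptions: ~4 characters per token, and some overhead for the system prompt and chat template):

```python
def max_chunks_that_fit(chunk_chars=1000, chars_per_token=4,
                        max_seq_length=4096, max_new_tokens=1024,
                        overhead_tokens=200):
    """Estimate how many ~chunk_chars-sized chunks fit in the prompt budget."""
    prompt_budget = max_seq_length - max_new_tokens - overhead_tokens
    tokens_per_chunk = chunk_chars // chars_per_token
    return prompt_budget // tokens_per_chunk
```

Under these assumptions a 4096-token window holds roughly eleven 1000-character chunks, so ten chunks fit with little headroom to spare.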
#### With source filter, n_results = 10
```
--- LIBRARIAN REPORT: Question 1 (10 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website.11.2.2 C...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_525) Preview: investigation and decision-making efforts on issues related to tank car safety. ...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_572) Preview: applicable to such movements. Additionally, an OTMA approval issued by FRA is on...
- Chunk 6: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 36 (ID: id_575) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 31 11.2.3 Cylinders in COFC/TOFC Se...
- Chunk 7: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
- Chunk 8: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_523) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 13 3.4 Responsibilities of Tank Car Te...
- Chunk 9: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 11 (ID: id_504) Preview: • Part 174 – Carriage by Rail • Part 178 – Specifications for Packagings • Par...
- Chunk 10: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_524) Preview: organizations for tank car safety initiatives. • Provides subject matter expert...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, page 35, Chapter 11.1 Notifications.
2. Core Subject: Notification requirements for certain tank car movements.
3. Key Manual Points: the shipper must notify FRA of non-received tank cars within 20 days; the notification may be made by email or phone; the movement is only allowed if the car is repaired or replaced within 20 days.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a tank car that is leaking.
2. The scenario involves movement without a permit.
3. The manual specifically addresses notification for non-received tank cars.
PHASE 3: SYNTHESIS
- Conclusion: the movement is not allowed without a permit.
- Why alternatives are wrong: they ignore the notification requirement.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the page number as metadata.
ANSWER
No. The manual says that the shipper must notify FRA of non-received tank cars within 20 days from the date of shipment. The page also says that the movement is only allowed if the car is repaired or replaced within 20 days. The cited page does not allow the movement to proceed without a permit.
```
#### Without source filter, n_results = 10
```
--- LIBRARIAN REPORT: Question 1 (10 chunks via RERANK) ---
- Chunk 1: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_571) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 30 CHAPTER 11 – NOTIFICATIONS AND APPR...
- Chunk 2: Source: FRA-Signal_Train_Control-2012.pdf, Page: 276 (ID: id_2304) Preview: 236-108 points is occupied. The exception to the traffic locking requireme...
- Chunk 3: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 35 (ID: id_573) Preview: Materials Guidance (HMG)-127 available for review on FRA’s website. 11.2.2 C...
- Chunk 4: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 33 (ID: id_564) Preview: penalty schedule amount. FRA may also pursue aggravated maximum penalties when t...
- Chunk 5: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 26 (ID: id_545) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 21 7.2.6 Tank Car Owner Inspection Ta...
- Chunk 6: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 18 (ID: id_523) Preview: HAZARDOUS MATERIALS COMPLIANCE MANUAL 13 3.4 Responsibilities of Tank Car Te...
- Chunk 7: Source: FRA-Signal_Train_Control-2012.pdf, Page: 279 (ID: id_2319) Preview: The provision of exception (2) does not apply to maintenance-of-way work equipme...
- Chunk 8: Source: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Page: 25 (ID: id_543) Preview: (e.g., classification, training, OTMA implementation, package securement). Shipp...
- Chunk 9: Source: FRA-HRGC-Handbook-2019.pdf, Page: 171 (ID: id_1088) Preview: Solutions to Blocked Crossings Appendix C to the Noteworthy Practices Guide add...
- Chunk 10: Source: FRA-Signal_Train_Control-2012.pdf, Page: 278 (ID: id_2313) Preview: except signaled sidings, exceeds 20 mph, each hand-operated switch must comply w...
THINKING PROCESS
PHASE 1: CONTEXTUAL AUDIT
1. Source Document: FRA-Hazardous_Materials_Compliance_Manual_01.07.25.pdf, Chapter 7.2.6 Tank Car Owner Inspection.
2. Core Subject: Tank car owner inspection and repair of leaking tank cars.
3. Key Manual Points: the tank car owner is responsible for ensuring the tank car is in compliance with the functions listed; the inspection includes classification, training, OTMA implementation, package securement, and safety and security; the inspection is comprised of four general regulatory elements.
4. Question Goal: Test precise application of the manual to a mixed-fact scenario.
PHASE 2: EVIDENCE MAPPING
1. The scenario involves a leaking tank car.
2. The page lists four inspection elements.
3. The repair is on a remote siding, not at the facility.
PHASE 3: SYNTHESIS
- Conclusion: the repair must be done under the inspection program, not outside of it.
- Why alternatives are wrong: the page lists inspection as a regulatory function, not repair.
PHASE 4: VERIFICATION
The response relies only on the cited manual page(s) and uses the cited manual points as the basis for the conclusion.
ANSWER
No. The page says the tank car owner is responsible for ensuring the tank car is in compliance with the functions listed, including classification, training, OTMA implementation, package securement, and safety and security. The inspection is comprised of four general regulatory elements. The page does not allow repair outside the inspection program. The cited page says the inspection is comprised of four general regulatory elements: Special Permits, registration, training, document retention, package selection, as well as safety and security. Repair of a leaking tank car is one of the functions the page says the owner must perform.
```
In this case, more data made the model "dumber." In the 5-chunk run, the signal-to-noise ratio was high: the model had fewer options, so it had to look more closely at the text (e.g. the OTMA, One-Time Movement Approval, passage). In the 10-chunk run, the information density dropped.
## Path Forward
1. Revert to n_results = 5: This was the "Sweet Spot." It provided enough evidence for Hit@5 accuracy without overwhelming the model's attention.
2. Trust the Reranker: Since the Reranker Hit@1 is 0.52, we can trust that the top 5 results are significantly better than the next 5. Adding results 6 through 10 clearly introduces "distractor" logic.
3. Use a metadata filter when possible.
# Evaluation
The RA-RAG system is evaluated by a bigger model acting as an LLM-as-a-judge. I used the following rubrics.\
1. FAITHFULNESS (1-5): Is the answer derived ONLY from the context provided in the thinking process? (1 = Hallucinated/Used external knowledge, 5 = Perfectly Grounded)
2. REGULATORY ACCURACY (1-5): Compare the AI Answer to the Ground Truth Reference. Does the logic match? (1 = Dangerous/Incorrect, 5 = Expert accuracy)
3. CITATION QUALITY (1-5): Did the model cite specific Pages/Sections correctly within the final answer as per the Thinking Process?
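A sketch of how the judge prompt could be assembled from these rubrics (the wording, function name, and JSON output format are my own assumptions, not the final implementation):

```python
JUDGE_SYSTEM = "You are a strict evaluator of rail-safety RAG answers. Return JSON only."

RUBRIC = """Score each criterion from 1 to 5:
1. FAITHFULNESS: Is the answer derived ONLY from the provided context?
2. REGULATORY ACCURACY: Does the answer's logic match the ground-truth reference?
3. CITATION QUALITY: Are specific Pages/Sections cited correctly?"""


def build_judge_prompt(question, context, answer, reference):
    """Assemble the user message sent to the larger judge model."""
    return (f"{RUBRIC}\n\n"
            f"QUESTION:\n{question}\n\n"
            f"RETRIEVED CONTEXT:\n{context}\n\n"
            f"AI ANSWER:\n{answer}\n\n"
            f"GROUND TRUTH REFERENCE:\n{reference}\n\n"
            'Respond as JSON: {"faithfulness": n, "accuracy": n, "citations": n}')
```

The judge model then only has to fill in three integers, which makes its output easy to parse and average across the test set.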
[TODO: ]{style="color: red;"} Complete this.
# Summary of Lessons Learned
1. Without RAG, the LLM would depend only on what it was trained on, and would only know information up to its training cutoff date. RAG lets us give the LLM the context it needs to answer questions. Without RAG, an LLM can hallucinate, generating text that merely looks like what it is supposed to look like.
2. Tweaking the retriever gives the biggest benefit: if we can't get the right context to the engine, it will not be able to provide correct answers.
3. Hybrid search and metadata filtering can be very important for professional applications. A metadata filter acts as a hard constraint on which sources can be retrieved.
4. We can train the engine to reason the way we want it to based on training examples. But fine-tuning is for behavior, not facts.
5. Besides the size of the model, the type of information we feed it, and the way we feed it in, can have an impact.
6. More context doesn't necessarily mean a better answer.
# TODO
1. Try different chunk size, overlap. Try MarkdownHeaderTextSplitter, SemanticChunker, SpacyTextSplitter, NLTKTextSplitter, TokenTextSplitter.
2. [DONE] Try bge-reranker-v2-m3 as a reranker model.
3. Fine-tune the retriever using training examples.
4. Use vectorless RAG.
5. Multimodal RAG (what if we have PDFs and images?)
6. Ability to filter through different elements of metadata.