Thesis Falsifier

A tool to aid researchers in assessing whether research papers adhere to scientific best practices. This application uses AI to automatically generate falsification forms, helping researchers verify the scientific robustness of their work across disciplines including social sciences and natural sciences.

The application emphasizes key scientific principles such as:

Falsifiability of hypotheses
Adherence to a clear and rigorous methodology
Conclusions strictly supported by empirical findings

Features
Architecture
Prerequisites
Quick Start
Installation
Usage
OpenRouter API
Docker Setup
Configuration
Development
Evaluation
FAQ
Contributing

Features

PDF Processing: Automatically extracts and analyzes text from research papers
AI-Powered Analysis: Uses large language models to assess scientific rigor
Comprehensive Evaluation: 11-question falsification form covering research questions, methods, results, conclusions, and ethics
Web Interface: User-friendly Gradio interface for easy interaction
GPU Acceleration: Optimized for NVIDIA GPUs with CUDA support
Docker Support: Easy deployment and sharing across different environments
PDF Report Generation: Downloadable assessment reports

Architecture

PDF → Text → Chunks → Embeddings → FAISS Index
                                    ↓
Question → Embedding → FAISS Search → Relevant Chunks

The application follows a RAG (Retrieval-Augmented Generation) architecture:

Document Processing: PDFs are parsed and chunked into manageable pieces
Vector Embedding: Text chunks are converted to vector embeddings using Nomic embeddings
Semantic Search: FAISS index enables fast similarity search for relevant context
AI Generation: LLM generates falsification assessments based on relevant context
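
The four stages above can be sketched with the standard library alone. This is a minimal illustration under loud assumptions, not the application's implementation: word-count vectors stand in for the Nomic embeddings, a brute-force cosine ranking stands in for the FAISS index, and the function names (chunk_text, embed, retrieve, build_prompt) and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a neural embedder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (brute force, not FAISS)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

For each form question, the flow would be retrieve(question, chunk_text(paper_text)) followed by build_prompt, with the resulting prompt sent to the LLM.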

Prerequisites

Python 3.10+
NVIDIA GPU with CUDA support (recommended 8GB+ VRAM)
Docker (optional, for containerized deployment)
Ngrok account (optional, for public sharing)
OpenRouter API key (for LLM access when no GPU is available)

Quick Start

Using Docker (Recommended)

# Build and run (using deployment script)
./start_app.sh

# Or manually:
docker build -t falsification-form -f deployment/Dockerfile .
docker run --gpus all -p 7860:7860 falsification-form

# Access at http://localhost:7860

Using Local Installation

# Install dependencies
uv sync --all-extras

# Run the application
uv run frontend/gradio_ui.py

Installation

Method 1: Docker (Recommended for New Users)

Clone the repository:

git clone <repository-url>
cd falsification_form

Build and run with Docker:

docker build -t falsification-form -f deployment/Dockerfile .
docker run --gpus all -p 7860:7860 falsification-form

Access the application at http://localhost:7860

Method 2: Local Development Setup

Clone the repository:

git clone <repository-url>
cd falsification_form

Install dependencies:
```
uv sync --all-extras
```
Activate the virtual environment (optional, if you prefer not to use uv run):
```
source .venv/bin/activate
```
Run the application:
```
uv run frontend/gradio_ui.py
```

Usage

Basic Usage

Upload a PDF research paper through the web interface
Click "Generate Falsification Report" to start the analysis
Review the generated assessment - the system evaluates 11 different aspects of scientific rigor
Download the PDF report for offline review

Command Line Interface

For batch processing or integration:

# Process a single PDF file
uv run app.py --pdf <path_to_pdf> --out <output_file.json>
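
For batch processing, the single-file command above can be wrapped in a short loop. The sketch below is one possible approach, not part of the repository: it only assumes the --pdf and --out flags shown above, and the helper names (build_command, run_batch) and the choice to name each output after its PDF are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the documented CLI call for one PDF; the output is named after the PDF."""
    out_file = out_dir / f"{pdf.stem}.json"
    return ["uv", "run", "app.py", "--pdf", str(pdf), "--out", str(out_file)]


def run_batch(pdf_dir: Path, out_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run the CLI once per PDF in pdf_dir; with dry_run=True only return the commands."""
    commands = [build_command(p, out_dir) for p in sorted(pdf_dir.glob("*.pdf"))]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=True stops the batch at the first failing paper
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_batch(Path("papers"), Path("reports"), dry_run=True) first is a cheap way to inspect the commands before starting a long run.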

Package Management

To install all dependencies (including dev dependencies):

uv sync --all-extras

To run individual files:

uv run app.py

Or alternatively:

# Activate the virtual environment and run normally
source .venv/bin/activate
python app.py

Docker Setup

Prerequisites

Before running the Docker container, you need to set up Ollama on your host machine:

Install Ollama:

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Start Ollama server:
```
ollama serve
```

Pull required models (in a new terminal):

# Pull the main language model
ollama pull gpt-oss:20b

# Pull the embedding model
ollama pull nomic-embed-text

Quick Start (Local Only)

# Build the Docker image
docker build -t falsification-form -f deployment/Dockerfile .

# Run the application (make sure Ollama is running on host)
docker run --gpus all -p 7860:7860 --network host falsification-form

The application will be available at http://localhost:7860

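Note: The --network host flag lets the Docker container reach Ollama running on the host machine at localhost:11434. In host network mode Docker ignores published ports (-p), so the application is served directly on the host's port 7860.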

Alternative: Port Mapping (if not using host network)

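# -p publishes container ports to the host; it does not let the container reach host Ollama.
# Map the host gateway instead, then point the app at http://host.docker.internal:11434
# (Ollama must listen on a reachable address, e.g. OLLAMA_HOST=0.0.0.0 ollama serve).
docker run --gpus all -p 7860:7860 --add-host=host.docker.internal:host-gateway falsification-form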

Complete Startup Sequence

Start Ollama server (in terminal 1):
```
ollama serve
```

Pull models (in terminal 2):

ollama pull gpt-oss:20b
ollama pull nomic-embed-text

Run Docker container (in terminal 3):

docker run --gpus all -p 7860:7860 --network host falsification-form

Access the application at http://localhost:7860

Shutdown Instructions

To properly shut down the entire system:

Stop the Docker container:

# If running in foreground: Ctrl+C
# If running in background:
docker stop <container_name_or_id>

Stop Ollama server:

# In the terminal where ollama serve is running: Ctrl+C
# Or kill the process:
pkill ollama

Stop ngrok (if running):

# In the terminal where ngrok is running: Ctrl+C
# Or kill the process:
pkill ngrok

Troubleshooting

Connection Error: If you see "Failed to connect to Ollama", ensure:

Ollama is installed and running (ollama serve)
Required models are downloaded (ollama list to check)
Docker container can reach host Ollama (use --network host or proper port mapping)

Model Not Found: If models aren't available:

# Check available models
ollama list

# Pull missing models
ollama pull gpt-oss:20b
ollama pull nomic-embed-text
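
The same two checks (is Ollama reachable, are the models pulled) can be scripted against Ollama's /api/tags endpoint, which lists the installed models. This is a minimal sketch rather than code from the repository: the function names are hypothetical, and the rule that a name without a tag matches its ":latest" entry is an assumption about how the listed names line up with the pull commands above.

```python
import json
import urllib.request

REQUIRED_MODELS = ["gpt-oss:20b", "nomic-embed-text"]


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0):
    """Return the parsed /api/tags payload, or None if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; invalid JSON raises ValueError
        return None


def missing_models(tags: dict, required=REQUIRED_MODELS) -> list[str]:
    """List required models that do not appear in the /api/tags payload."""
    installed = {m.get("name", "") for m in tags.get("models", [])}
    return [
        name for name in required
        if name not in installed and f"{name}:latest" not in installed
    ]
```

If fetch_tags() returns None, the connection checklist above applies; otherwise missing_models(tags) names the models that still need an ollama pull.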

OpenRouter API

Set up a free API key at OpenRouter.
Create a .env file to store the key. The line in the .env file should look like:

OPENROUTER_API_KEY=sk-or-<your_key_here>

Make sure that the dev_settings.yaml file is set to "api"
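
Before starting the application in API mode, it can help to confirm that the key is actually visible. The sketch below is an assumption-laden illustration, not the application's own loader: it parses the .env file by hand with the standard library, and the helper names and the "sk-or-" prefix check (taken from the example line above) are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    values = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openrouter_key(path: str = ".env") -> str:
    """Return the key from the environment or the .env file, or raise ValueError."""
    key = os.environ.get("OPENROUTER_API_KEY") or load_env_file(path).get("OPENROUTER_API_KEY", "")
    if not key.startswith("sk-or-"):
        raise ValueError("OPENROUTER_API_KEY is missing or does not start with 'sk-or-'")
    return key
```

A real environment variable takes precedence over the file here, which is a design choice of this sketch.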

Persistent Deployment

For long-running deployments that persist across SSH disconnections, you can use tmux sessions. This approach keeps your application running even when you close terminal windows or SSH connections.

Prerequisites

Install tmux: sudo apt install tmux (Ubuntu/Debian) or brew install tmux (macOS)
Ollama running: Ensure Ollama is running, either as a systemd service (check with systemctl status ollama) or started manually with ollama serve
Models downloaded: Verify required models are available (ollama list)

Quick Setup with Tmux

Option 1: Use the provided startup script

# Make the script executable
chmod +x start_app.sh

# Start the application
./start_app.sh

Option 2: Manual tmux setup

# Start Docker container in tmux
tmux new-session -d -s falsification-app
tmux send-keys -t falsification-app "docker run --gpus all -p 7860:7860 --network host falsification-form" Enter

# Wait for app to start, then start ngrok
sleep 10
tmux new-session -d -s ngrok
tmux send-keys -t ngrok "ngrok start gradio" Enter
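
The fixed sleep 10 above is a guess at the startup time. As an alternative, a small poller can wait until the app actually answers on port 7860 before ngrok is started. This is a sketch under assumptions: wait_for_http is a hypothetical helper, and any successful HTTP response is treated as "ready".

```python
import time
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll url until it responds successfully or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all mean "not ready yet"
            time.sleep(interval)
    return False
```

For example, wait_for_http("http://localhost:7860") returns True once the Gradio interface is up, and False if it never responds within a minute.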

Managing Tmux Sessions

List and attach to sessions:

# List all sessions
tmux list-sessions

# Attach to view the app session
tmux attach-session -t falsification-app

# Attach to view ngrok session
tmux attach-session -t ngrok

# Detach from a session: Ctrl+B, then D

Check status without attaching:

# Check if app is running
curl http://localhost:7860

# Check ngrok tunnels
curl http://localhost:4040/api/tunnels

# View recent output from sessions
tmux capture-pane -t falsification-app -p
tmux capture-pane -t ngrok -p

Stop sessions:

# Kill sessions when done
tmux kill-session -t falsification-app
tmux kill-session -t ngrok

Deployment Checklist

Before starting persistent deployment:

Ollama is installed and running as a systemd service
Required models are downloaded (ollama pull gpt-oss:20b and ollama pull nomic-embed-text)
Docker image is built (docker build -t falsification-form -f deployment/Dockerfile .)
Ngrok is configured (if using public access)
tmux is installed

Monitoring and Maintenance

Check application health:

# Test application endpoint
curl -f http://localhost:7860 || echo "App is down"

# Check Docker container status
docker ps | grep falsification-form

# Check Ollama status
curl -f http://localhost:11434/api/tags || echo "Ollama is down"

View logs:

# Docker container logs (find the name or ID with docker ps)
docker logs <container_name_or_id>

# View tmux session output
tmux capture-pane -t falsification-app -p

Restart the application:

# Kill and restart the app session
tmux kill-session -t falsification-app
tmux new-session -d -s falsification-app
tmux send-keys -t falsification-app "docker run --gpus all -p 7860:7860 --network host falsification-form" Enter

Public Access with Ngrok (Optional)

If you want to share your application publicly:

Create ngrok account: Go to https://ngrok.com/ and sign up
Get your auth token from the ngrok dashboard
Configure ngrok:

Option A - Run ngrok from inside container:

Ngrok is already installed in the container, so no need to download.

Configure the ngrok file:
- Linux: ~/.config/ngrok/ngrok.yml
- macOS: ~/Library/Application Support/ngrok/ngrok.yml
- Windows: %HOMEPATH%\AppData\Local\ngrok\ngrok.yml

You may copy the deployment/ngrok.yml.template file and fill in the three variables: NGROK_AUTH_TOKEN, USERNAME, and PASSWORD (the last two are written together as USERNAME:PASSWORD).

With that configured, you may now run:
```
# In a new terminal, access the running container
docker exec -it <container_name> bash
   
# Start tunnel
ngrok start gradio   # or simpler (but without authentication): ngrok http 7860
```
Option B - Run ngrok from host machine:

First install ngrok on your host machine. Then configure the ngrok file:
- Linux: ~/.config/ngrok/ngrok.yml
- macOS: ~/Library/Application Support/ngrok/ngrok.yml
- Windows: %HOMEPATH%\AppData\Local\ngrok\ngrok.yml

You may copy the deployment/ngrok.yml.template file and fill in the three variables: NGROK_AUTH_TOKEN, USERNAME, and PASSWORD (the last two are written together as USERNAME:PASSWORD).

With that configured, you may now run:
```
# Run (while container is running)
ngrok start gradio   # or simpler (but without authentication): ngrok http 7860
```

The ngrok dashboard will show you the public URL you can share.

Complete Shutdown with Ngrok:

# 1. Stop ngrok (Ctrl+C in ngrok terminal)
# 2. Stop Docker container (Ctrl+C or docker stop)
# 3. Stop Ollama (Ctrl+C in ollama terminal)
# 4. Verify all processes are stopped:
ps aux | grep -E "(ollama|ngrok|docker)"

Configuration

Model Settings

The application uses configurable AI models. Edit dev_settings.yaml:

model:
  model: "gpt-oss:20b"
  temperature: 0.7  # Controls randomness in generation
  max_tokens: 1024
  n_gpu_layers: -1  # Offload all possible layers to GPU
  n_ctx: 8192       # Context window size

embedder:
  model_name: "nomic-embed-text"
  device: "gpu"

debug:
  verbose: true
  log_prompts: true

When doing this, make sure that the LLM you configure has been pulled with Ollama (ollama pull <model_name>). You can verify which models are downloaded on the host machine with ollama list.

Changing the Model

To use a different model, modify the model field in dev_settings.yaml. The application supports various Ollama-compatible models.
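
After editing the settings, a quick sanity check can catch typos before a long run. The sketch below works on the settings as a plain dictionary, for example as returned by PyYAML's yaml.safe_load (that PyYAML is installed is an assumption). The validate_settings helper and its rules, including the 0.0-2.0 temperature range, are hypothetical and not taken from the application.

```python
# Mirrors the model and embedder sections of the dev_settings.yaml example above.
SETTINGS = {
    "model": {
        "model": "gpt-oss:20b",
        "temperature": 0.7,
        "max_tokens": 1024,
        "n_gpu_layers": -1,
        "n_ctx": 8192,
    },
    "embedder": {"model_name": "nomic-embed-text", "device": "gpu"},
}


def validate_settings(settings: dict) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    model = settings.get("model") or {}
    embedder = settings.get("embedder") or {}

    if not model.get("model"):
        problems.append("model.model is missing")
    temperature = model.get("temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:  # assumed sensible range
        problems.append(f"model.temperature {temperature} is outside 0.0-2.0")
    max_tokens, n_ctx = model.get("max_tokens", 0), model.get("n_ctx", 0)
    if max_tokens <= 0 or n_ctx <= 0 or max_tokens >= n_ctx:
        problems.append("model.max_tokens must be positive and smaller than model.n_ctx")
    if not embedder.get("model_name"):
        problems.append("embedder.model_name is missing")
    return problems
```

The max_tokens check reflects that the generated answer has to fit inside the context window together with the prompt and retrieved chunks.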

Development

Project Structure

├── deployment/       # Deployment scripts and configuration
│   ├── Dockerfile    # Container configuration
│   ├── docker-entrypoint.sh # Container startup script
│   ├── start_app.sh  # Application startup script
│   ├── kill_app.sh   # Application shutdown script
│   └── ngrok.yml.template # Ngrok configuration template
├── frontend/          # Gradio web interface
│   └── gradio_ui.py   # Main web UI
├── models/           # AI model interfaces
│   └── interface.py  # Model abstraction layer
├── rag/             # Retrieval-augmented generation
│   └── retriever.py # FAISS search and chunking
├── utils/           # Utility functions
│   ├── config.py    # Configuration loading
│   ├── pdf_generator.py # PDF report generation
│   └── log_utils.py # Logging utilities
├── prompts/         # AI prompt templates
│   ├── form_questions.json # Evaluation questions
│   ├── system_prompt.txt   # System instructions
│   └── templates/   # Prompt templates
├── assets/          # Static resources and assets
│   ├── fonts/       # Font files for PDF generation
│   └── research_articles/ # Research papers and related documents
├── notebooks/       # Development notebooks
├── cheatsheets/     # Reference materials
├── logs/           # Application logs
├── start_app.sh    # Wrapper script for deployment/start_app.sh
└── kill_app.sh     # Wrapper script for deployment/kill_app.sh

Running Tests

uv run pytest

Code Quality

uv run ruff check .
uv run mypy .

Evaluation Questions

The application evaluates research papers across 11 key questions organized into 3 sections:

Section 1: Research Summary (4 questions)

1.1: Primary experimental hypothesis identification
1.2: Research method and design description
1.3: Key findings summary with statistical results
1.4: Authors' primary conclusion and its relationship to hypothesis

Section 2: Critical Analysis (4 questions)

2.1: Falsifiability assessment of the hypothesis
2.2: Consistency evaluation between hypothesis and research design
2.3: Internal and construct validity critique
2.4: Logical connection evaluation from findings to conclusions

Section 3: Overall Assessment (3 questions)

3.1: Concise holistic summary of the research paper
3.2: Identification of the most significant strength and weakness
3.3: Concrete methodology improvement proposal

Evaluation

The repository includes a comprehensive evaluation framework that leverages DeepEval's LLM-as-a-judge paradigm to assess the quality of generated falsification form answers. The framework supports two distinct evaluation strategies:

One-by-one evaluation: Compares generated answers against golden reference answers for each question individually
All-at-once evaluation: Evaluates answers against broader commentary or critical reviews of research papers

The falsification form consists of 11 questions, and effective evaluation requires ground truth data for comparison. When specific golden answers aren't available, the all-at-once strategy uses commentary or critical reviews as the reference standard, extracting key critiques and assessing whether the generated answers adequately address them.

Evaluation Workflow

1. Generate Answers

Before evaluation, generate answers using the run_generate_answers.py script. This processes a research paper PDF and produces a JSON file with generated responses:

uv run python -m evaluation.run_generate_answers \
  --input_pdf "evaluation/assets/test_cases/gagliano_2014/gagliano_2014.pdf" \
  --output_dir "evaluation/assets/generated_outputs"
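
The evaluation commands in this section take a generated file such as gagliano_2014_generated_20250922_193658.json. A small helper can pick the most recent one instead of copying the name by hand. This is a sketch: it assumes the suffix is a YYYYMMDD_HHMMSS timestamp, which the example file name suggests but the source does not state, and latest_generated is a hypothetical name.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming scheme: <paper>_generated_<YYYYMMDD>_<HHMMSS>.json
PATTERN = re.compile(r"^(?P<stem>.+)_generated_(?P<ts>\d{8}_\d{6})\.json$")


def latest_generated(output_dir, paper_stem: str):
    """Return the newest generated JSON for paper_stem, or None if there is none."""
    best = None
    for path in Path(output_dir).glob(f"{paper_stem}_generated_*.json"):
        match = PATTERN.match(path.name)
        if not match or match["stem"] != paper_stem:
            continue  # skip files that do not follow the assumed scheme
        stamp = datetime.strptime(match["ts"], "%Y%m%d_%H%M%S")
        if best is None or stamp > best[0]:
            best = (stamp, path)
    return best[1] if best else None
```

For example, latest_generated("evaluation/assets/generated_outputs", "gagliano_2014") would return the path to pass as --input_file.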

2. One-by-One Evaluation

This approach requires a golden dataset with correct answers for each question. Run the evaluation with:

uv run python -m evaluation.main --strategy one_by_one \
  --input_file "evaluation/assets/generated_outputs/gagliano_2014_generated_20250922_193658.json" \
  --golden_dataset "evaluation/assets/golden_datasets/gagliano_2014_golden.json"

Golden Dataset Structure:

The golden dataset follows this JSON structure:

[
    {
        "input": "Identify the paper's primary experimental hypothesis. State the prediction in its operational terms, specifying the independent/predictor and dependent/outcome variables, and clarify whether it proposes a causal, descriptive-comparative, or associational relationship.",
        "expected_output": "The paper proposes a causal hypothesis: that Mimosa pudica plants are capable of a simple form of learning called habituation. Operationally, it predicts that repeated physical disturbances (the independent variable: a 15-cm drop) will cause a decrease in the plant's defensive leaf-folding response (the dependent variable: measured as the degree of leaf openness). It further predicts that this learned behavior will be more pronounced and persistent in low-light (energetically costly) environments compared to high-light environments.",
        "retrieval_context": [
            "In Mimosa pudica the sensitive plant-the defensive leaf-folding behaviour in response to repeated physical disturbance exhibits clear habituation, suggesting some elementary form of learning.",
            "Applying the theory and the analytical methods usually employed in animal learning research, we show that leaf-folding habituation is more pronounced and persistent for plants growing in energetically costly environments.",
            "Plants were trained using a custom-designed controlled drop system (Fig. 1c) for administering a standardised stimulus (i.e. a 15-cm fall or drop) that successfully elicits the leaf-folding reflex.",
            "its response was then quantified as the maximum leaf breadth (mm) measured immediately at the end of a train of drops (Fig. 1b) relative to the undisturbed pre-stimulus maximum breadth (Fig. la)."
        ]
    },
    {
        "input": "Describe the research method by identifying the study's design (e.g., true experimental, quasi-experimental). Detail the sampling procedure and key characteristics of the participants. Specify how the key variables were operationalized and measured, and list the primary statistical analyses used to test the hypothesis.",
        "expected_output": "...",
        "retrieval_context": ["..."]
    }
]

Important: The input field in the golden dataset must exactly match the prompts used during answer generation.

For a complete example, see evaluation/assets/golden_datasets/gagliano_2014_golden.json.
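
Because the input field must match the generation prompts exactly, it can be worth checking a golden dataset before running the evaluation. The following sketch validates the three fields shown in the structure above. The validate_golden helper is hypothetical, and the set of prompts is passed in by the caller because the format in which the application stores them is not described here.

```python
REQUIRED_FIELDS = {"input": str, "expected_output": str, "retrieval_context": list}


def validate_golden(entries: list[dict], prompts: set[str]) -> list[str]:
    """Return a list of problems found in a golden dataset; empty means it looks valid."""
    problems = []
    for i, entry in enumerate(entries):
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), expected_type):
                problems.append(f"entry {i}: '{field}' is missing or not a {expected_type.__name__}")
        question = entry.get("input")
        # Exact string comparison: stray whitespace or rewording counts as a mismatch
        if isinstance(question, str) and question not in prompts:
            problems.append(f"entry {i}: input does not exactly match any generation prompt")
    return problems
```

The entries can be loaded with json.load from the golden dataset file; an empty result means every entry has the expected fields and a matching prompt.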

3. All-at-Once Evaluation

This strategy uses commentary files (critical reviews or analyses) instead of golden datasets. The system extracts key critiques from the commentary and evaluates whether the generated answers address these points:

uv run python -m evaluation.main --strategy all_at_once \
  --input_file "evaluation/assets/generated_outputs/gagliano_2014_generated_20250922_193658.json" \
  --commentary_file "evaluation/assets/test_cases/gagliano_2014/biegler_2018.pdf"

This approach is particularly useful when dealing with nuanced critiques or when golden answers are not available but expert commentary exists.

FAQ

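1. How do I change the LLM model from GPT-OSS:20b to another model?

Edit the model field in dev_settings.yaml and pull the new model with ollama pull <model_name>. See Changing the Model under Configuration.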

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests if applicable
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Submit a pull request

License

[Add your license information here]
