# Thesis Falsifier
A tool to aid researchers in assessing whether research papers adhere to scientific best practices. This application uses AI to automatically generate falsification forms, helping researchers verify the scientific robustness of their work across disciplines including social sciences and natural sciences.
The application emphasizes key scientific principles such as:
- The falsifiability of hypotheses
- The adherence to a clear and rigorous methodology
- Drawing conclusions strictly supported by empirical findings
## Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage](#usage)
- [Openrouter API](#openrouter-api)
- [Docker Setup](#docker-setup)
- [Configuration](#configuration)
- [Development](#development)
- [Evaluation](#evaluation)
- [FAQ](#faq)
- [Contributing](#contributing)
## Features
- **PDF Processing**: Automatically extracts and analyzes text from research papers
- **AI-Powered Analysis**: Uses large language models to assess scientific rigor
- **Comprehensive Evaluation**: 19-point falsification form covering research questions, methods, results, conclusions, and ethics
- **Web Interface**: User-friendly Gradio interface for easy interaction
- **GPU Acceleration**: Optimized for NVIDIA GPUs with CUDA support
- **Docker Support**: Easy deployment and sharing across different environments
- **PDF Report Generation**: Downloadable assessment reports
## Architecture
```
PDF → Text → Chunks → Embeddings → FAISS Index
↓
Question → Embedding → FAISS Search → Relevant Chunks
```
The application follows a RAG (Retrieval-Augmented Generation) architecture:
1. **Document Processing**: PDFs are parsed and chunked into manageable pieces
2. **Vector Embedding**: Text chunks are converted to vector embeddings using Nomic embeddings
3. **Semantic Search**: FAISS index enables fast similarity search for relevant context
4. **AI Generation**: LLM generates falsification assessments based on relevant context
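The four steps above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the project's implementation: a toy bag-of-words vector stands in for the Nomic embeddings, a brute-force cosine ranking stands in for the FAISS index, and the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 12, overlap: int = 4) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early so the last window is not fully contained in the previous one.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question (brute force, no index)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"
```

In a real pipeline, `embed` would call the embedding model and `retrieve` would query a prebuilt vector index; the control flow stays the same.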
## Prerequisites
- **Python 3.10+**
- **NVIDIA GPU** with CUDA support (recommended 8GB+ VRAM)
- **Docker** (optional, for containerized deployment)
- **Ngrok account** (optional, for public sharing)
- **OpenRouter API key** (for LLM access when no GPU is available)
## Quick Start
### Using Docker (Recommended)
```bash
# Build and run (using deployment script)
./start_app.sh
# Or manually:
docker build -t falsification-form -f deployment/Dockerfile .
docker run --gpus all -p 7860:7860 falsification-form
# Access at http://localhost:7860
```
### Using Local Installation
```bash
# Install dependencies
uv sync --all-extras
# Run the application
uv run frontend/gradio_ui.py
```
## Installation
### Method 1: Docker (Recommended for New Users)
1. **Clone the repository**:
```bash
git clone <repository-url>
cd falsification_form
```
2. **Build and run with Docker**:
```bash
docker build -t falsification-form -f deployment/Dockerfile .
docker run --gpus all -p 7860:7860 falsification-form
```
3. **Access the application** at `http://localhost:7860`
### Method 2: Local Development Setup
1. **Clone the repository**:
```bash
git clone <repository-url>
cd falsification_form
```
2. **Install dependencies**:
```bash
uv sync --all-extras
```
3. **Activate the virtual environment** (optional - if you prefer not using `uv run`):
```bash
source .venv/bin/activate
```
4. **Run the application**:
```bash
uv run frontend/gradio_ui.py
```
## Usage
### Basic Usage
1. **Upload a PDF research paper** through the web interface
2. **Click "Generate Falsification Report"** to start the analysis
3. **Review the generated assessment** - the system evaluates 19 different aspects of scientific rigor
4. **Download the PDF report** for offline review
### Command Line Interface
For batch processing or integration:
```bash
# Process a single PDF file
uv run app.py --pdf <path_to_pdf> --out <output_file.json>
```
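For batch processing, the single-file command above can be wrapped in a short driver script. The sketch below assumes only the `--pdf` and `--out` flags shown above; the function names and the `<stem>.json` output naming are hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the CLI call for one PDF; the report is written to <stem>.json."""
    out_file = out_dir / f"{pdf.stem}.json"
    return ["uv", "run", "app.py", "--pdf", str(pdf), "--out", str(out_file)]


def run_batch(pdf_dir: Path, out_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run the CLI once per PDF in pdf_dir; with dry_run, only print the commands."""
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [build_command(p, out_dir) for p in sorted(pdf_dir.glob("*.pdf"))]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch at the first failing paper.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_batch(Path("papers"), Path("reports"), dry_run=True)` prints the commands without executing them, which is a cheap way to confirm the paths before a long run.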
### Package Management
To install all dependencies (including dev dependencies):
```bash
uv sync --all-extras
```
To run individual files:
```bash
uv run app.py
```
Or alternatively:
```bash
# Activate the virtual environment created by uv and run normally
source .venv/bin/activate
python app.py
```
## Docker Setup
### Prerequisites
Before running the Docker container, you need to set up Ollama on your host machine:
1. **Install Ollama**:
```bash
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
```
2. **Start Ollama server**:
```bash
ollama serve
```
3. **Pull required models** (in a new terminal):
```bash
# Pull the main language model
ollama pull gpt-oss:20b
# Pull the embedding model
ollama pull nomic-embed-text
```
### Quick Start (Local Only)
```bash
# Build the Docker image
docker build -t falsification-form -f deployment/Dockerfile .
# Run the application (make sure Ollama is running on host)
docker run --gpus all -p 7860:7860 --network host falsification-form
```
The application will be available at `http://localhost:7860`
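**Note**: The `--network host` flag lets the Docker container reach Ollama running on the host machine at `localhost:11434`. In host network mode Docker ignores published ports, so `-p 7860:7860` has no effect there; the app is still reachable on port 7860 because the container shares the host's network stack.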
### Alternative: Port Mapping (if not using host network)
```bash
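# Without host networking, publish only the app port and map the host gateway
# into the container so it can reach Ollama on the host. The app must then be
# pointed at http://host.docker.internal:11434, and Ollama must listen on an
# address the container can reach (e.g. OLLAMA_HOST=0.0.0.0 ollama serve).
docker run --gpus all -p 7860:7860 --add-host=host.docker.internal:host-gateway falsification-form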
```
### Complete Startup Sequence
1. **Start Ollama server** (in terminal 1):
```bash
ollama serve
```
2. **Pull models** (in terminal 2):
```bash
ollama pull gpt-oss:20b
ollama pull nomic-embed-text
```
3. **Run Docker container** (in terminal 3):
```bash
docker run --gpus all -p 7860:7860 --network host falsification-form
```
4. **Access the application** at `http://localhost:7860`
### Shutdown Instructions
To properly shut down the entire system:
1. **Stop the Docker container**:
```bash
# If running in foreground: Ctrl+C
# If running in background:
docker stop <container_name_or_id>
```
2. **Stop Ollama server**:
```bash
# In the terminal where ollama serve is running: Ctrl+C
# Or kill the process:
pkill ollama
```
3. **Stop ngrok** (if running):
```bash
# In the terminal where ngrok is running: Ctrl+C
# Or kill the process:
pkill ngrok
```
### Troubleshooting
**Connection Error**: If you see "Failed to connect to Ollama", ensure:
- Ollama is installed and running (`ollama serve`)
- Required models are downloaded (`ollama list` to check)
- Docker container can reach host Ollama (use `--network host` or proper port mapping)
**Model Not Found**: If models aren't available:
```bash
# Check available models
ollama list
# Pull missing models
ollama pull gpt-oss:20b
ollama pull nomic-embed-text
```
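The same check can be automated against Ollama's HTTP API, which lists installed models at `/api/tags`. The sketch below is an illustration and not part of the repository: the helper names are hypothetical, and the base URL assumes the default `localhost:11434` address mentioned above.

```python
import json
import urllib.request

REQUIRED_MODELS = ["gpt-oss:20b", "nomic-embed-text"]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models that are absent from an /api/tags response."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    # Ollama lists untagged pulls as "<name>:latest", so accept that form too.
    return [
        r for r in required
        if r not in installed and f"{r}:latest" not in installed
    ]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query the local Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With Ollama running, `missing_models(fetch_tags(), REQUIRED_MODELS)` returns an empty list when both models are present; a connection error from `fetch_tags` points to the server not being reachable at all.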
## OpenRouter API
1. **Set up a free API key at OpenRouter**
2. **Create a `.env` file to store the key**
The line in the `.env` file should look like:
```
OPENROUTER_API_KEY=sk-or-<your_key_here>
```
3. **Make sure that the `dev_settings.yaml` file is set to "api"**
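Before starting the app in API mode, it can help to confirm that the key is actually picked up. The following sketch is an assumption-laden illustration, not the project's loader: it parses a `.env` file with the standard library only, and the `sk-or-` prefix check is based solely on the example line above. The function names are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def openrouter_key(env_path: Path = Path(".env")) -> str:
    """Return the key from the environment or the .env file, or raise."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key and env_path.exists():
        key = read_env_file(env_path).get("OPENROUTER_API_KEY")
    # Assumption: keys follow the "sk-or-..." shape shown in the example.
    if not key or not key.startswith("sk-or-"):
        raise RuntimeError("OPENROUTER_API_KEY is missing or malformed")
    return key
```

Calling `openrouter_key()` from the repository root fails fast with a clear message instead of surfacing later as an authentication error from the API.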
## Persistent Deployment
For long-running deployments that persist across SSH disconnections, you can use tmux sessions. This approach keeps your application running even when you close terminal windows or SSH connections.
### Prerequisites
- **Install tmux**: `sudo apt install tmux` (Ubuntu/Debian) or `brew install tmux` (macOS)
- **Ollama running**: Ensure Ollama is running, either as a systemd service (check with `systemctl status ollama`) or started manually with `ollama serve`
- **Models downloaded**: Verify required models are available (`ollama list`)
### Quick Setup with Tmux
**Option 1: Use the provided startup script**
```bash
# Make the script executable
chmod +x start_app.sh
# Start the application
./start_app.sh
```
**Option 2: Manual tmux setup**
```bash
# Start Docker container in tmux
tmux new-session -d -s falsification-app
tmux send-keys -t falsification-app "docker run --gpus all -p 7860:7860 --network host falsification-form" Enter
# Wait for app to start, then start ngrok
sleep 10
tmux new-session -d -s ngrok
tmux send-keys -t ngrok "ngrok start gradio" Enter
```
### Managing Tmux Sessions
**List and attach to sessions:**
```bash
# List all sessions
tmux list-sessions
# Attach to view the app session
tmux attach-session -t falsification-app
# Attach to view ngrok session
tmux attach-session -t ngrok
# Detach from a session: Ctrl+B, then D
```
**Check status without attaching:**
```bash
# Check if app is running
curl http://localhost:7860
# Check ngrok tunnels
curl http://localhost:4040/api/tunnels
# View recent output from sessions
tmux capture-pane -t falsification-app -p
tmux capture-pane -t ngrok -p
```
**Stop sessions:**
```bash
# Kill sessions when done
tmux kill-session -t falsification-app
tmux kill-session -t ngrok
```
### Deployment Checklist
Before starting persistent deployment:
- [ ] Ollama is installed and running as a systemd service
- [ ] Required models are downloaded (`ollama pull gpt-oss:20b` and `ollama pull nomic-embed-text`)
- [ ] Docker image is built (`docker build -t falsification-form -f deployment/Dockerfile .`)
- [ ] Ngrok is configured (if using public access)
- [ ] tmux is installed
### Monitoring and Maintenance
**Check application health:**
```bash
# Test application endpoint
curl -f http://localhost:7860 || echo "App is down"
# Check Docker container status
docker ps | grep falsification-form
# Check Ollama status
curl -f http://localhost:11434/api/tags || echo "Ollama is down"
```
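For unattended monitoring, the same endpoint checks can be scripted. This is a minimal sketch using only the standard library; the `ENDPOINTS` mapping reuses the two URLs from the commands above, while the function names and timeout are hypothetical.

```python
import urllib.error
import urllib.request

ENDPOINTS = {
    "app": "http://localhost:7860",
    "ollama": "http://localhost:11434/api/tags",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP 4xx/5xx responses.
        return False


def health_report(endpoints: dict[str, str]) -> dict[str, bool]:
    """Check every named endpoint and return a name -> up/down mapping."""
    return {name: is_up(url) for name, url in endpoints.items()}
```

`health_report(ENDPOINTS)` returns a dictionary with one boolean per service, which is easy to log or to feed into a restart script.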
**View logs:**
```bash
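# Docker container logs (look the container up by image; "falsification-app" is the tmux session name, not the container name)
docker logs "$(docker ps -q --filter ancestor=falsification-form)"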
# View tmux session output
tmux capture-pane -t falsification-app -p
```
**Restart the application:**
```bash
# Kill and restart the app session
tmux kill-session -t falsification-app
tmux new-session -d -s falsification-app
tmux send-keys -t falsification-app "docker run --gpus all -p 7860:7860 --network host falsification-form" Enter
```
### Public Access with Ngrok (Optional)
If you want to share your application publicly:
1. **Create ngrok account**: Go to https://ngrok.com/ and sign up
2. **Get your auth token** from the ngrok dashboard
3. **Configure ngrok**:
**Option A - Run ngrok from inside container**:
Ngrok is already installed in the container, so no need to download.
Configure the ngrok file:
- **Linux**: `~/.config/ngrok/ngrok.yml`
- **macOS**: `~/Library/Application Support/ngrok/ngrok.yml`
- **Windows**: `%HOMEPATH%\AppData\Local\ngrok\ngrok.yml`
Copy the `deployment/ngrok.yml.template` file to that location and fill in its three placeholders: `NGROK_AUTH_TOKEN`, `USERNAME`, and `PASSWORD` (the last two form the `USERNAME:PASSWORD` pair).
With that configured, you may now run:
```bash
# In a new terminal, access the running container
docker exec -it <container_name> bash
# Start tunnel
ngrok start gradio # or simpler (but without authentication): ngrok http 7860
```
**Option B - Run ngrok from host machine**:
First install ngrok on your host machine. Then configure the ngrok file:
- **Linux**: `~/.config/ngrok/ngrok.yml`
- **macOS**: `~/Library/Application Support/ngrok/ngrok.yml`
- **Windows**: `%HOMEPATH%\AppData\Local\ngrok\ngrok.yml`
Copy the `deployment/ngrok.yml.template` file to that location and fill in its three placeholders: `NGROK_AUTH_TOKEN`, `USERNAME`, and `PASSWORD` (the last two form the `USERNAME:PASSWORD` pair).
With that configured, you may now run:
```bash
# Run (while container is running)
ngrok start gradio # or simpler (but without authentication): ngrok http 7860
```
The ngrok dashboard will show you the public URL you can share.
**Complete Shutdown with Ngrok**:
```bash
# 1. Stop ngrok (Ctrl+C in ngrok terminal)
# 2. Stop Docker container (Ctrl+C or docker stop)
# 3. Stop Ollama (Ctrl+C in ollama terminal)
# 4. Verify all processes are stopped:
ps aux | grep -E "(ollama|ngrok|docker)"
```
## Configuration
### Model Settings
The application uses configurable AI models. Edit `dev_settings.yaml`:
```yaml
model:
  model: "gpt-oss:20b"
  temperature: 0.7  # Controls randomness in generation
  max_tokens: 1024
  n_gpu_layers: -1  # Offload all possible layers to GPU
  n_ctx: 8192  # Context window size
embedder:
  model_name: "nomic-embed-text"
  device: "gpu"
debug:
  verbose: true
  log_prompts: true
```
When changing these settings, make sure the LLM you configure has been pulled with Ollama (`ollama pull <model_name>`). You can list the models downloaded on the host machine with `ollama ls`.
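A few sanity checks on the settings can catch mistakes before the model is loaded. The sketch below is hypothetical and not the project's `utils/config.py`: it takes the settings as a plain dictionary (for example, the result of parsing `dev_settings.yaml` with a YAML library such as PyYAML's `yaml.safe_load`), and the specific rules, such as `max_tokens` having to fit inside `n_ctx`, are assumptions about how these fields interact.

```python
def validate_settings(settings: dict) -> list[str]:
    """Return human-readable problems found in a dev_settings-style mapping."""
    problems: list[str] = []
    model = settings.get("model", {})
    embedder = settings.get("embedder", {})

    if not model.get("model"):
        problems.append("model.model must name an Ollama model")
    if not embedder.get("model_name"):
        problems.append("embedder.model_name must name an embedding model")

    if model.get("temperature", 0.0) < 0:
        problems.append("model.temperature must be non-negative")

    max_tokens = model.get("max_tokens", 0)
    n_ctx = model.get("n_ctx", 0)
    if max_tokens <= 0:
        problems.append("model.max_tokens must be positive")
    elif max_tokens >= n_ctx:
        # Assumption: prompt and completion share the same context window.
        problems.append("model.max_tokens must be smaller than model.n_ctx")

    if model.get("n_gpu_layers", -1) < -1:
        problems.append("model.n_gpu_layers must be -1 (all layers) or >= 0")
    return problems
```

An empty list means the settings passed every check; otherwise each entry names the offending field.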
### Changing the Model
To use a different model, modify the `model` field in `dev_settings.yaml`. The application supports various Ollama-compatible models.
## Development
### Project Structure
```
├── deployment/ # Deployment scripts and configuration
│ ├── Dockerfile # Container configuration
│ ├── docker-entrypoint.sh # Container startup script
│ ├── start_app.sh # Application startup script
│ ├── kill_app.sh # Application shutdown script
│ └── ngrok.yml.template # Ngrok configuration template
├── frontend/ # Gradio web interface
│ └── gradio_ui.py # Main web UI
├── models/ # AI model interfaces
│ └── interface.py # Model abstraction layer
├── rag/ # Retrieval-augmented generation
│ └── retriever.py # FAISS search and chunking
├── utils/ # Utility functions
│ ├── config.py # Configuration loading
│ ├── pdf_generator.py # PDF report generation
│ └── log_utils.py # Logging utilities
├── prompts/ # AI prompt templates
│ ├── form_questions.json # Evaluation questions
│ ├── system_prompt.txt # System instructions
│ └── templates/ # Prompt templates
├── assets/ # Static resources and assets
│ ├── fonts/ # Font files for PDF generation
│ └── research_articles/ # Research papers and related documents
├── notebooks/ # Development notebooks
├── cheatsheets/ # Reference materials
├── logs/ # Application logs
├── start_app.sh # Wrapper script for deployment/start_app.sh
└── kill_app.sh # Wrapper script for deployment/kill_app.sh
```
### Running Tests
```bash
uv run pytest
```
### Code Quality
```bash
uv run ruff check .
uv run mypy .
```
### Evaluation Questions
The application evaluates research papers across 11 key questions organized into 3 sections:
**Section 1: Research Summary (4 questions)**
- 1.1: Primary experimental hypothesis identification
- 1.2: Research method and design description
- 1.3: Key findings summary with statistical results
- 1.4: Authors' primary conclusion and its relationship to hypothesis
**Section 2: Critical Analysis (4 questions)**
- 2.1: Falsifiability assessment of the hypothesis
- 2.2: Consistency evaluation between hypothesis and research design
- 2.3: Internal and construct validity critique
- 2.4: Logical connection evaluation from findings to conclusions
**Section 3: Overall Assessment (3 questions)**
- 3.1: Concise holistic summary of the research paper
- 3.2: Identification of the most significant strength and weakness
- 3.3: Concrete methodology improvement proposal
## Evaluation
The repository includes a comprehensive evaluation framework that leverages DeepEval's LLM-as-a-judge paradigm to assess the quality of generated falsification form answers. The framework supports two distinct evaluation strategies:
- **One-by-one evaluation**: Compares generated answers against golden reference answers for each question individually
- **All-at-once evaluation**: Evaluates answers against broader commentary or critical reviews of research papers
The falsification form consists of 11 questions, and effective evaluation requires ground truth data for comparison. When specific golden answers aren't available, the all-at-once strategy uses commentary or critical reviews as the reference standard, extracting key critiques and assessing whether the generated answers adequately address them.
### Evaluation Workflow
#### 1. Generate Answers
Before evaluation, generate answers using the `run_generate_answers.py` script. This processes a research paper PDF and produces a JSON file with generated responses:
```bash
uv run python -m evaluation.run_generate_answers \
--input_pdf "evaluation/assets/test_cases/gagliano_2014/gagliano_2014.pdf" \
--output_dir "evaluation/assets/generated_outputs"
```
#### 2. One-by-One Evaluation
This approach requires a golden dataset with correct answers for each question. Run the evaluation with:
```bash
uv run python -m evaluation.main --strategy one_by_one \
--input_file "evaluation/assets/generated_outputs/gagliano_2014_generated_20250922_193658.json" \
--golden_dataset "evaluation/assets/golden_datasets/gagliano_2014_golden.json"
```
**Golden Dataset Structure:**
The golden dataset follows this JSON structure:
```json
[
{
"input": "Identify the paper's primary experimental hypothesis. State the prediction in its operational terms, specifying the independent/predictor and dependent/outcome variables, and clarify whether it proposes a causal, descriptive-comparative, or associational relationship.",
"expected_output": "The paper proposes a causal hypothesis: that Mimosa pudica plants are capable of a simple form of learning called habituation. Operationally, it predicts that repeated physical disturbances (the independent variable: a 15-cm drop) will cause a decrease in the plant's defensive leaf-folding response (the dependent variable: measured as the degree of leaf openness). It further predicts that this learned behavior will be more pronounced and persistent in low-light (energetically costly) environments compared to high-light environments.",
"retrieval_context": [
"In Mimosa pudica the sensitive plant-the defensive leaf-folding behaviour in response to repeated physical disturbance exhibits clear habituation, suggesting some elementary form of learning.",
"Applying the theory and the analytical methods usually employed in animal learning research, we show that leaf-folding habituation is more pronounced and persistent for plants growing in energetically costly environments.",
"Plants were trained using a custom-designed controlled drop system (Fig. 1c) for administering a standardised stimulus (i.e. a 15-cm fall or drop) that successfully elicits the leaf-folding reflex.",
"its response was then quantified as the maximum leaf breadth (mm) measured immediately at the end of a train of drops (Fig. 1b) relative to the undisturbed pre-stimulus maximum breadth (Fig. la)."
]
},
{
"input": "Describe the research method by identifying the study's design (e.g., true experimental, quasi-experimental). Detail the sampling procedure and key characteristics of the participants. Specify how the key variables were operationalized and measured, and list the primary statistical analyses used to test the hypothesis.",
"expected_output": "...",
"retrieval_context": ["..."]
}
]
```
> **Important**: The `input` field in the golden dataset must exactly match the prompts used during answer generation.
For a complete example, see `evaluation/assets/golden_datasets/gagliano_2014_golden.json`.
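Because the `input` field must match the generation prompts exactly, a quick structural check of a golden dataset can save a failed evaluation run. The sketch below is an illustration only: the three field names come from the structure shown above, while the function names and the optional `prompts` argument are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

REQUIRED_FIELDS = {"input": str, "expected_output": str, "retrieval_context": list}


def load_golden(path: Path) -> list[dict]:
    """Read a golden dataset file (a JSON array of entries)."""
    return json.loads(path.read_text(encoding="utf-8"))


def validate_golden(entries: list[dict], prompts: Optional[list[str]] = None) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks valid."""
    errors: list[str] = []
    for i, entry in enumerate(entries):
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(entry.get(field), kind):
                errors.append(f"entry {i}: '{field}' is missing or not a {kind.__name__}")
        # Exact string comparison: even trailing whitespace counts as a mismatch.
        if prompts is not None and entry.get("input") not in prompts:
            errors.append(f"entry {i}: 'input' does not exactly match any generation prompt")
    return errors
```

Passing the list of prompts used during answer generation as `prompts` surfaces mismatched questions before the LLM judge is invoked.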
#### 3. All-at-Once Evaluation
This strategy uses commentary files (critical reviews or analyses) instead of golden datasets. The system extracts key critiques from the commentary and evaluates whether the generated answers address these points:
```bash
uv run python -m evaluation.main --strategy all_at_once \
--input_file "evaluation/assets/generated_outputs/gagliano_2014_generated_20250922_193658.json" \
--commentary_file "evaluation/assets/test_cases/gagliano_2014/biegler_2018.pdf"
```
This approach is particularly useful when dealing with nuanced critiques or when golden answers are not available but expert commentary exists.
## FAQ
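**1. How do I change the LLM model from GPT-OSS:20b to another model?**
Set the `model` field in `dev_settings.yaml` to the new model's name and pull that model first with `ollama pull <model_name>`. See [Changing the Model](#changing-the-model) for details.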
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests if applicable
5. Commit your changes (`git commit -m 'Add some amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Submit a pull request
## License
[Add your license information here]