Loading...
Loading...
**Project Name:** location-converter-xlsx-csv-pgtable
# Product Requirements Document (PRD)
## Location Converter: Excel to CSV to PostgreSQL Pipeline
### Project Overview
**Project Name:** location-converter-xlsx-csv-pgtable
**Version:** 1.0
**Date:** December 2024
**Status:** Development Phase
### 1. Executive Summary
This project aims to create a robust data pipeline that processes farmer location data from Excel files, performs data cleaning and analysis, converts the data to CSV format, and finally loads it into a PostgreSQL database table called `bellary_farmers`.
### 2. Project Objectives
#### Primary Goals
- Extract and analyze data from multi-sheet Excel files (Bellary.xlsx)
- Clean and standardize farmer location data
- Convert processed data to CSV format
- Load data into PostgreSQL database for further analysis
- Provide data insights through Jupyter notebook analysis
#### Success Metrics
- Successfully process all 5 sheets from the Excel file
- Achieve 100% data integrity during conversion
- Complete data loading into PostgreSQL without errors
- Generate comprehensive data analysis report
### 3. Data Schema Definition
Based on the provided parameters, the following data structure will be implemented:
#### Source Data Fields
| Field Name | Data Type | Description | Constraints |
|------------|-----------|-------------|-------------|
| S.no | Integer | Serial number/Record ID | Primary identifier |
| Status | String | Current status of farmer record | Enum values |
| Farmer Code | String | Unique farmer identifier | Unique, Not Null |
| Farmer Name | String | Full name of the farmer | Not Null |
| Govt id num | String | Government issued ID number | Unique |
| Primary Contact | String | Phone/contact information | Format validation |
| Farmer Created Date | DateTime | Date when farmer record was created | ISO format |
| Plot Created Date | DateTime | Date when plot was registered | ISO format |
| Sync Date | DateTime | Last synchronization date | ISO format |
| Updation Date | DateTime | Last update timestamp | ISO format |
| Area | Float | Plot area measurement | Positive value |
| GPS Area | Float | GPS-calculated area | Positive value |
| Village | String | Village name | Not Null |
#### Target PostgreSQL Schema
```sql
CREATE TABLE bellary_farmers (
id SERIAL PRIMARY KEY,
serial_number INTEGER,
status VARCHAR(50),
farmer_code VARCHAR(100) UNIQUE NOT NULL,
farmer_name VARCHAR(255) NOT NULL,
govt_id_number VARCHAR(50) UNIQUE,
primary_contact VARCHAR(20),
farmer_created_date TIMESTAMP,
plot_created_date TIMESTAMP,
sync_date TIMESTAMP,
updation_date TIMESTAMP,
area DECIMAL(10,4),
gps_area DECIMAL(10,4),
village VARCHAR(100) NOT NULL,
source_sheet VARCHAR(50),
imported_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
### 4. Technical Requirements
#### 4.1 Environment Setup
- **Python Version:** 3.11+
- **Package Manager:** uv
- **Virtual Environment:** .venv
#### 4.2 Core Dependencies
```
jupyterlab>=4.0.0
pandas>=2.0.0
numpy>=1.24.0
openpyxl>=3.1.0
psycopg2-binary>=2.9.0
sqlalchemy>=2.0.0
python-dotenv>=1.0.0
```
#### 4.3 Development Dependencies
```
pytest>=7.0.0
black>=23.0.0
flake8>=6.0.0
pre-commit>=3.0.0
```
### 5. Functional Requirements
#### 5.1 Data Extraction
- **FR-001:** System shall read Excel files with multiple sheets
- **FR-002:** System shall identify and process exactly 5 sheets from Bellary.xlsx
- **FR-003:** System shall handle different column naming conventions across sheets
- **FR-004:** System shall preserve data integrity during extraction
#### 5.2 Data Analysis & Cleaning
- **FR-005:** System shall provide comprehensive data profiling in Jupyter notebook
- **FR-006:** System shall identify and report data quality issues
- **FR-007:** System shall standardize column names across all sheets
- **FR-008:** System shall validate data types and formats
- **FR-009:** System shall handle missing values appropriately
- **FR-010:** System shall detect and report duplicate records
#### 5.3 Data Transformation
- **FR-011:** System shall merge data from all 5 sheets into a single dataset
- **FR-012:** System shall apply consistent data formatting rules
- **FR-013:** System shall add metadata columns (source_sheet, imported_at)
- **FR-014:** System shall export cleaned data to CSV format
#### 5.4 Database Integration
- **FR-015:** System shall connect to PostgreSQL database
- **FR-016:** System shall create bellary_farmers table if not exists
- **FR-017:** System shall load CSV data into PostgreSQL table
- **FR-018:** System shall handle database connection errors gracefully
- **FR-019:** System shall provide data loading progress feedback
### 6. Non-Functional Requirements
#### 6.1 Performance
- **NFR-001:** Data processing shall complete within 5 minutes for files up to 10MB
- **NFR-002:** Database loading shall handle up to 100,000 records efficiently
- **NFR-003:** Memory usage shall not exceed 2GB during processing
#### 6.2 Reliability
- **NFR-004:** System shall have 99% success rate for data conversion
- **NFR-005:** System shall provide detailed error logging
- **NFR-006:** System shall support transaction rollback on failures
#### 6.3 Usability
- **NFR-007:** System shall provide clear progress indicators
- **NFR-008:** System shall generate comprehensive data quality reports
- **NFR-009:** System shall include detailed documentation and examples
### 7. Project Structure
```
location-converter-xlsx-csv-pgtable/
├── .venv/ # Virtual environment
├── .cursor/ # Cursor IDE configuration
├── data/
│ ├── raw/ # Original Excel files
│ ├── processed/ # Cleaned CSV files
│ └── reports/ # Data quality reports
├── notebooks/
│ ├── 01_data_exploration.ipynb # Initial data analysis
│ ├── 02_data_cleaning.ipynb # Data cleaning process
│ └── 03_data_validation.ipynb # Final validation
├── src/
│ ├── __init__.py
│ ├── data_extractor.py # Excel reading functionality
│ ├── data_cleaner.py # Data cleaning utilities
│ ├── data_validator.py # Data validation rules
│ └── db_loader.py # Database operations
├── scripts/
│ ├── setup_environment.py # Environment setup
│ ├── run_pipeline.py # Main pipeline execution
│ └── create_database.py # Database schema creation
├── tests/
│ ├── test_data_extractor.py
│ ├── test_data_cleaner.py
│ └── test_db_loader.py
├── config/
│ ├── database.yaml # Database configuration
│ └── data_mapping.yaml # Column mapping rules
├── .env.example # Environment variables template
├── .gitignore
├── requirements.txt
├── README.md
├── PRD.md # This document
└── Makefile # Build automation
```
### 8. Implementation Phases
#### Phase 1: Project Setup (Week 1)
- Initialize uv environment
- Install required dependencies
- Set up project structure
- Create basic documentation
#### Phase 2: Data Exploration (Week 1-2)
- Analyze Excel file structure
- Identify data quality issues
- Create data profiling notebook
- Define cleaning strategies
#### Phase 3: Data Processing Pipeline (Week 2-3)
- Implement data extraction module
- Develop data cleaning utilities
- Create validation framework
- Build CSV export functionality
#### Phase 4: Database Integration (Week 3-4)
- Set up PostgreSQL connection
- Create database schema
- Implement data loading pipeline
- Add error handling and logging
#### Phase 5: Testing & Documentation (Week 4)
- Write comprehensive tests
- Create user documentation
- Perform end-to-end testing
- Optimize performance
### 9. Risk Assessment
#### High Risk
- **Data Quality Issues:** Inconsistent data formats across sheets
- **Database Connectivity:** Network or authentication failures
- **Memory Constraints:** Large file processing limitations
#### Medium Risk
- **Column Mapping:** Variations in column names across sheets
- **Data Type Conversion:** Invalid data format handling
- **Performance:** Slow processing for large datasets
#### Low Risk
- **Environment Setup:** Dependency installation issues
- **File Access:** Permission or path-related problems
### 10. Success Criteria
#### Technical Success
- [ ] All 5 Excel sheets successfully processed
- [ ] Data quality report generated with <5% error rate
- [ ] CSV file created with all required columns
- [ ] PostgreSQL table populated with complete data
- [ ] Pipeline execution time under 5 minutes
#### Business Success
- [ ] Farmer data accessible for analysis
- [ ] Data integrity maintained throughout pipeline
- [ ] Reproducible process for future data updates
- [ ] Comprehensive documentation for maintenance
### 11. Deliverables
1. **Working Data Pipeline**
- Complete Python application
- Jupyter notebooks for analysis
- Database schema and loading scripts
2. **Documentation**
- Technical documentation
- User guide
- API documentation
- Data dictionary
3. **Testing Suite**
- Unit tests
- Integration tests
- Data validation tests
4. **Configuration Files**
- Environment setup
- Database configuration
- Data mapping rules
### 12. Maintenance & Support
#### Ongoing Requirements
- Regular data quality monitoring
- Database performance optimization
- Documentation updates
- Bug fixes and enhancements
#### Support Model
- Self-service documentation
- Issue tracking system
- Regular maintenance schedule
---
**Document Version:** 1.0
**Last Updated:** December 2024
**Next Review:** January 2025 SkillSprout is an AI-powered microlearning platform designed to help users learn new skills through bite-sized lessons and adaptive quizzes. The platform leverages Azure OpenAI for content generation, Gradio for user interaction, and Model Context Protocol (MCP) for agent interoperability.
This dashboard is a web-based interface built using **Next.js (or Astro)** and hosted on **Vercel**. It acts as the control center for Joey’s stock intelligence, allowing you to:
Gemini Code Flow is an advanced AI-powered development orchestration platform that adapts RuV's Claude Code Flow for Google's Gemini CLI. It enables developers to leverage multiple AI agents working in parallel to write, test, and optimize code using the SPARC methodology.
**Version: 6.0 (FINAL)**