# Product Data Pipeline Quality Assurance Guide
This guide explains each command in the Product Data Pipeline, what it does, and how to verify it's working correctly. It also outlines the complete workflow from data harvesting to product filtering and image processing.
## Table of Contents
1. [Command Overview](#1-command-overview)
2. [Complete Workflow](#2-complete-workflow)
3. [Step-by-Step Testing](#3-step-by-step-testing)
4. [Troubleshooting Common Issues](#4-troubleshooting-common-issues)
## 1. Command Overview
The Product Data Pipeline provides the following command-line commands, grouped by function:
### Data Harvesting Commands
| Command | Description | What It Does |
| ---------------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `harvest:init` | Initial data harvest | Performs the first data collection from AliExpress API, creates new seller and product records. Use this when starting fresh or when you need a complete data refresh. |
| `harvest:delta` | Incremental data harvest | Updates existing data and adds new sellers/products that weren't found in previous runs. Use this for regular updates without duplicating effort. |
| `harvest:status` | Show harvesting status | Displays statistics about harvesting jobs, seller approval counts, and category distributions. Use this to monitor progress and verify results. |
### Session Management Commands
| Command | Description | What It Does |
| ----------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------- |
| `create_session` | Create new API session | Creates a new AliExpress API session using authorization code. Required for accessing detailed product and shipping data. |
| `refresh_session` | Refresh existing session token | Refreshes API session tokens to maintain access. Can use database session or manual tokens. |
| `list_sessions` | List all stored sessions | Displays all API sessions with their status, metadata, and activity indicators. Use to monitor session health. |
### Product Processing Commands
| Command | Description | What It Does |
| ----------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `filter:products` | Filter and process products | Processes products from whitelisted sellers, applies business rules, enriches with shipping data, and triggers automatic image ingestion. |
### Review Process Commands
| Command | Description | What It Does |
| ----------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------- |
| `review:export-pending` | Export merchants for review | Creates a CSV file with all pending merchants that need review. This file will be shared with expert reviewers. |
| `review:import-results` | Import review results | Reads the reviewed CSV file where experts have updated approval statuses and updates the database accordingly. |
### Module C: Duplicate Detection Commands
| Command | Description | What It Does |
| ------------------------ | -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `detect:duplicates` | Run duplicate detection analysis | Analyzes all filtered products for duplicates using pHash and CLIP image analysis. Uses intelligent cascade: pHash for fast screening, CLIP for ambiguous cases. Groups duplicates and selects master based on lowest cost. |
| `detect:status` | Show duplicate detection status | Displays statistics about duplicate detection results, including counts by status (UNIQUE/DUPLICATE/MASTER/REVIEW_SUSPECT), detection methods used, average similarity scores, and confidence levels. |
| `detect:export-suspects` | Export suspect cases for review | Creates a CSV file with all REVIEW_SUSPECT cases that need manual verification. Includes complete product details, similarity metrics, and URLs for comparison. Creates review_decision and notes columns for manual input. |
| `detect:import-reviewed` | Import manual review results | Imports reviewed suspect duplicates from CSV and updates database. Handles DUPLICATE/UNIQUE/UNCERTAIN decisions. Automatically performs master reassignment when suspects should become new masters based on lower cost. |
## 2. Complete Workflow
The complete product data pipeline runs in five phases:
### Phase 1: Setup and Data Collection
1. **API Session Setup**
- Run `create_session --code YOUR_CODE` to establish API credentials
- This enables access to detailed product data and shipping information
- Session credentials are stored in the database for reuse
2. **Initial Data Collection**
- Run `harvest:init` to collect the initial set of merchants and products
- This creates records in the database with all merchants marked as "PENDING"
- Products include basic information from search results
3. **Regular Data Updates**
- Run `harvest:delta` on a scheduled basis (e.g., daily) to:
  - Update information for existing merchants
  - Add new merchants that weren't found before
- All new merchants are marked as "PENDING"
### Phase 2: Merchant Review Process
4. **Export for Expert Review**
- Run `review:export-pending` to generate a CSV file with all pending merchants
- This creates a file (default: `data/pending_merchants.csv`) containing merchant details
5. **Expert Review Process (Manual)**
- Experts open the exported CSV file
- For each merchant, they update the `approval_status` column:
  - `PENDING` → Keep as is if still under review
  - `WHITELIST` or `WHITE_LIST` → Approved merchants
  - `BLACKLIST` or `BLACK_LIST` → Rejected merchants
- They can add notes in the `note` column explaining their decision
- Save the updated file (e.g., as `data/reviewed_merchants.csv`)
6. **Import Review Results**
- Run `review:import-results` to update the database with expert decisions
- The system applies all status changes and notes from the CSV file
- Only merchants with changed status (not PENDING) are updated
### Phase 3: Product Processing and Enrichment
7. **Product Filtering and Enrichment**
- Run `filter:products` to process products from WHITELIST approved sellers
- This command performs multiple operations:
  - Fetches detailed product information using API sessions
  - Retrieves shipping costs and delivery estimates for all product variants
  - Applies business rules (max price, max delivery time)
  - Stores qualifying products in `filtered_products` table
  - Automatically triggers image ingestion for each processed product
8. **Image Processing (Automatic)**
- Triggered automatically during product filtering
- Categorizes product images into three types:
  - **Hero Images**: First image from the product's image gallery
  - **Variant Images**: Images associated with specific product variants (color, size, etc.)
  - **Gallery Images**: Additional product photos (excluding hero and variant images)
- Stores image metadata including property associations and sort order
9. **Session Maintenance**
- Run `refresh_session` periodically to maintain API access
- The system can automatically refresh using stored credentials
- Manual token refresh is available if needed
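
The three image roles from step 8 can be illustrated with a minimal sketch. It assumes the role names `hero`, `variant`, and `gallery` and a hypothetical `ImageRecord` shape; the pipeline's actual `ProductImage` model and role values may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageRecord:
    """Hypothetical stand-in for the pipeline's image metadata."""
    url: str
    image_role: str                       # assumed values: hero / variant / gallery
    sort_order: int
    property_value: Optional[str] = None  # e.g. "Red" for a colour variant


def categorize_images(gallery_urls, variant_images):
    """gallery_urls: ordered URLs; variant_images: {property_value: url}."""
    records = []
    hero_url = gallery_urls[0] if gallery_urls else None
    if hero_url is not None:
        # The first gallery image is treated as the hero image.
        records.append(ImageRecord(hero_url, "hero", 0))
    for order, (prop, url) in enumerate(variant_images.items()):
        records.append(ImageRecord(url, "variant", order, prop))
    # Gallery images are whatever remains after removing hero and variant images.
    variant_urls = set(variant_images.values())
    remaining = [u for u in gallery_urls[1:] if u != hero_url and u not in variant_urls]
    for order, url in enumerate(remaining, start=1):
        records.append(ImageRecord(url, "gallery", order))
    return records


records = categorize_images(["a.jpg", "b.jpg", "c.jpg"], {"Red": "b.jpg"})
for record in records:
    print(record.image_role, record.url, record.sort_order, record.property_value)
```

The sketch keeps the exclusion rule explicit: an image that already serves as the hero or as a variant image is not stored again as a gallery image.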
### Phase 4: Duplicate Detection (Module C)
10. **Run Duplicate Detection**
- Run `detect:duplicates` to analyze filtered products for duplicates
- Uses intelligent cascade analysis (pHash → CLIP for ambiguous cases)
- Automatically groups duplicates and selects the lowest-cost product in each group as the master
- Updates product_status table with detection results
11. **Review Detection Results**
- Run `detect:status` to view duplicate detection statistics
- Check counts of UNIQUE, DUPLICATE, MASTER, and REVIEW_SUSPECT products
- Monitor detection method usage and confidence scores
12. **Manual Review Workflow**
- Run `detect:export-suspects --output data/review_cases.csv` to export REVIEW_SUSPECT cases
- Review the CSV file containing:
  - Complete product details (titles, prices, costs, images, URLs)
  - Similarity metrics (pHash difference, CLIP similarity)
  - Master product information for comparison
- Make decisions by filling the review_decision column:
  - `DUPLICATE`: Products are the same (will be merged)
  - `UNIQUE`: Products are different (will remain separate)
  - `UNCERTAIN`: Need further review (will stay as REVIEW_SUSPECT)
- Add explanatory notes in the notes column
- Test first with `detect:import-reviewed --input data/review_cases.csv --dry-run`
- Import decisions with `detect:import-reviewed --input data/review_cases.csv`
- The system automatically reassigns the master when a cheaper product in the group should take its place
### Phase 5: Monitoring and Analysis
13. **Monitor and Analyze**
- Run `harvest:status` regularly to check:
  - Job history and performance
  - Current approval status counts
  - Product and category statistics
- Run `list_sessions` to monitor API session health
- Run `detect:status` to monitor duplicate detection performance
- Use database queries to analyze filtered products, images, and duplicates
## 3. Step-by-Step Testing
This section walks through testing each command and verifying its results.
### 3.1 Session Management Testing
#### 3.1.1 Create Session Test
```bash
# Create a new session (you'll need an authorization code from AliExpress)
python main.py create_session --code YOUR_AUTHORIZATION_CODE
```
**What Happens:**
- Sends authorization code to AliExpress API
- Receives access token, refresh token, and session metadata
- Stores session credentials in the database
- Returns session details
**How to Verify Success:**
```bash
# List sessions to verify creation
python main.py list_sessions
# Expected output should show:
# - A new session with "Active" status
# - Session details including user information
```
#### 3.1.2 Session Refresh Test
```bash
# Refresh using stored session (automatic)
python main.py refresh_session
# Or refresh using specific tokens (manual)
python main.py refresh_session --token ACCESS_TOKEN --refresh-token REFRESH_TOKEN
```
**What Happens:**
- Uses stored session credentials or provided tokens
- Calls AliExpress refresh endpoint
- Updates database with new tokens and expiration
**How to Verify Success:**
```bash
# List sessions to verify refresh
python main.py list_sessions
# Expected output should show:
# - Updated session with new timestamps
# - Session remains "Active"
```
#### 3.1.3 List Sessions Test
```bash
# Display all stored sessions
python main.py list_sessions
```
**What Happens:**
- Queries database for all session records
- Displays session metadata and status
**Expected Output:**
```
📋 Found 1 session(s):
- Code: your_code_here
Status: 🟢 Active
Type: Bearer
User: username
Account: account_info
Created: 2025-01-15 10:30:00
Updated: 2025-01-15 10:30:00
```
### 3.2 Data Harvesting Testing
#### 3.2.1 Initial Harvest Test
```bash
# Run a small initial harvest
python main.py harvest:init --limit 10
```
**What Happens:**
- Connects to AliExpress API
- Searches for products using configured categories or keywords
- For each product found:
  - Creates a new seller record if not already in database
  - Creates a new product record
- Tracks job progress and statistics
**How to Verify Success:**
```bash
# Check harvest status to verify job ran successfully
python main.py harvest:status
# Expected output should show:
# - A HARVEST_INIT job with found_count and new_count values
# - Seller counts showing all or mostly PENDING status
```
#### 3.2.2 Delta Harvest Test
```bash
# Run a delta harvest with a larger limit
python main.py harvest:delta --limit 20
```
**What Happens:**
- Connects to AliExpress API like initial harvest
- For products/sellers already in database:
  - Updates their information (last_seen_at timestamp, etc.)
- For new products/sellers:
  - Creates new records like the initial harvest
- Records are marked as "found" vs "new" in job statistics
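
The update-or-create behaviour above can be sketched in a few lines. This is an illustration only: it assumes `found_count` counts every record returned by the API and `new_count` counts the subset that was not yet stored, which is consistent with the expectation that `found_count ≥ new_count`. The dictionary-based store and the `apply_delta` helper are hypothetical.

```python
from datetime import datetime, timezone


def apply_delta(existing, harvested, now=None):
    """existing: {product_id: record dict}; harvested: iterable of product dicts."""
    now = now or datetime.now(timezone.utc)
    stats = {"found_count": 0, "new_count": 0}
    for item in harvested:
        stats["found_count"] += 1
        record = existing.get(item["product_id"])
        if record is None:
            # Not seen before: create the record, as the initial harvest would.
            existing[item["product_id"]] = {**item, "last_seen_at": now}
            stats["new_count"] += 1
        else:
            # Already known: refresh its fields and the last_seen_at timestamp.
            record.update(item)
            record["last_seen_at"] = now
    return stats


store = {"1001": {"product_id": "1001", "title": "Old title", "last_seen_at": None}}
delta_stats = apply_delta(store, [
    {"product_id": "1001", "title": "New title"},
    {"product_id": "1002", "title": "Brand new"},
])
print(delta_stats)
```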
**How to Verify Success:**
```bash
# Check harvest status to see both jobs
python main.py harvest:status
# Expected output should show:
# - A HARVEST_DELTA job with found_count ≥ new_count
# - More total sellers than after initial harvest
```
### 3.3 Merchant Review Testing
#### 3.3.1 Export Pending Merchants Test
```bash
# Export pending merchants
python main.py review:export-pending
```
**What Happens:**
- Queries database for all sellers with PENDING approval status
- Creates CSV file at `data/pending_merchants.csv` with:
  - shop_id
  - shop_url
  - shop_name
  - approval_status (all set to "PENDING")
  - note
**How to Verify Success:**
```bash
# Check that file was created
ls -la data/pending_merchants.csv
# Preview the file contents
head data/pending_merchants.csv
# Count lines to verify number of records
wc -l data/pending_merchants.csv
```
#### 3.3.2 Review and Update CSV File (Manual Step)
Expert reviewers perform this step. To simulate it for testing:
```bash
# Create a test file with some modified statuses
head -n 5 data/pending_merchants.csv > data/reviewed_merchants.csv
# BSD/macOS sed syntax; with GNU sed (Linux), use `sed -i` without the empty '' argument
sed -i '' '2s/PENDING/WHITELIST/' data/reviewed_merchants.csv
sed -i '' '3s/PENDING/BLACKLIST/' data/reviewed_merchants.csv
# View the modified file
cat data/reviewed_merchants.csv
```
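
If `sed` behaves differently on your platform, the same test file can be produced with Python's `csv` module. The helper below is a hypothetical sketch, not part of the pipeline; it assumes the exported file has the `approval_status` column listed in section 3.3.1, and the demo runs against a temporary directory so nothing under `data/` is touched.

```python
import csv
import tempfile
from pathlib import Path


def make_test_review_file(src, dst, decisions, max_rows=4):
    """Copy up to max_rows data rows from src to dst, overriding approval_status
    for the zero-based row positions given in decisions."""
    with open(src, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames
        rows = [row for _, row in zip(range(max_rows), reader)]
    for index, status in decisions.items():
        if index < len(rows):
            rows[index]["approval_status"] = status
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo with made-up merchants in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "pending_merchants.csv"
    dst = Path(tmp) / "reviewed_merchants.csv"
    src.write_text(
        "shop_id,shop_url,shop_name,approval_status,note\n"
        "1,https://example.com/1,A,PENDING,\n"
        "2,https://example.com/2,B,PENDING,\n"
        "3,https://example.com/3,C,PENDING,\n",
        encoding="utf-8",
    )
    written = make_test_review_file(src, dst, {0: "WHITELIST", 1: "BLACKLIST"})
    with open(dst, newline="", encoding="utf-8") as fh:
        statuses = [row["approval_status"] for row in csv.DictReader(fh)]

print(written, statuses)
```

Unlike a line-based substitution, this edits only the `approval_status` field, so a shop name or note that happens to contain the word PENDING is left alone.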
**What an Expert Would Do:**
1. Open the CSV file in Excel, Google Sheets, or similar
2. Review each seller (checking their shop URL if needed)
3. Update approval_status to "WHITELIST" for approved sellers
4. Update approval_status to "BLACKLIST" for rejected sellers
5. Add notes explaining decisions in the note column
6. Save the file as `data/reviewed_merchants.csv`
#### 3.3.3 Import Review Results Test
```bash
# First test with dry run
python main.py review:import-results --dry-run
# Then perform actual import
python main.py review:import-results
```
**What Happens:**
- Reads the reviewed CSV file
- Normalizes status values (handling "WHITE_LIST" and "BLACK_LIST" formats)
- For each row:
  - Skips sellers still marked as "PENDING"
  - Updates database with new status and notes for WHITELIST/BLACKLIST entries
- Counts various outcomes (updated, skipped, errors, etc.)
- Displays summary statistics
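
The normalization and skip logic can be sketched as follows. This is a minimal illustration, not the importer's actual code: the alias table, the case-insensitive matching, and the `plan_import` helper are assumptions, and the sketch only plans the updates without touching a database.

```python
import csv
import io
from collections import Counter

# Assumed alias table; the guide only states that WHITE_LIST and BLACK_LIST are accepted.
STATUS_ALIASES = {
    "WHITELIST": "WHITELIST",
    "WHITE_LIST": "WHITELIST",
    "BLACKLIST": "BLACKLIST",
    "BLACK_LIST": "BLACKLIST",
    "PENDING": "PENDING",
}


def normalize_status(raw):
    """Return the canonical status, or None if the value is not recognized."""
    return STATUS_ALIASES.get((raw or "").strip().upper())


def plan_import(csv_text):
    """Decide what to do with each CSV row and count the outcomes."""
    updates, counts = [], Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = normalize_status(row.get("approval_status"))
        if status is None:
            counts["errors"] += 1
        elif status == "PENDING":
            counts["skipped"] += 1
        else:
            updates.append((row["shop_id"], status, row.get("note") or ""))
            counts["updated"] += 1
    return updates, counts


sample = (
    "shop_id,shop_url,shop_name,approval_status,note\n"
    "101,https://example.com/101,Shop A,WHITE_LIST,good ratings\n"
    "102,https://example.com/102,Shop B,PENDING,\n"
    "103,https://example.com/103,Shop C,blacklist,counterfeits\n"
    "104,https://example.com/104,Shop D,MAYBE,\n"
)
updates, counts = plan_import(sample)
print(updates)
print(dict(counts))
```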
**How to Verify Success:**
```bash
# Check seller approval counts
python main.py harvest:status
# Expected output should show:
# - Fewer PENDING sellers
# - Some WHITELIST and BLACKLIST sellers
```
### 3.4 Product Filtering Testing
#### 3.4.1 Product Filtering Test
```bash
# Filter products with business rules
python main.py filter:products --max-price 50.00 --max-delivery 14 --limit 5
```
**What Happens:**
- Queries database for products from WHITELIST approved sellers
- For each product:
  - Fetches detailed product information using API session
  - Retrieves shipping information for all product variants
  - Calculates total cost (product price + shipping)
  - Checks delivery time estimates
  - Applies business rule filters (max price, max delivery)
  - Stores qualifying products in `filtered_products` table
  - Automatically triggers image ingestion for the product
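
A minimal sketch of the business-rule check follows. It assumes the price limit applies to the total cost (product price plus shipping) and that the cheapest qualifying shipping option is chosen; the real command may apply the rules differently. `ShippingOption` and `passes_filters` are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class ShippingOption:
    """Hypothetical shape mirroring the shipping_fee / max_delivery_days fields."""
    shipping_fee: Decimal
    max_delivery_days: int


def passes_filters(price, options, max_price, max_delivery):
    """Return the cheapest shipping option that satisfies both rules, or None."""
    qualifying = [
        option for option in options
        if price + option.shipping_fee <= max_price      # total cost rule
        and option.max_delivery_days <= max_delivery     # delivery time rule
    ]
    return min(qualifying, key=lambda option: option.shipping_fee, default=None)


options = [
    ShippingOption(Decimal("5.00"), 20),  # cheap but too slow for a 14-day limit
    ShippingOption(Decimal("8.00"), 10),  # fast enough
]
chosen = passes_filters(Decimal("40.00"), options, Decimal("50.00"), 14)
rejected = passes_filters(Decimal("45.00"), options, Decimal("50.00"), 14)
print(chosen, rejected)
```

`Decimal` is used instead of `float` so that price comparisons against the limit are exact.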
**How to Verify Success:**
```bash
# Check if filtered products were created
python -c "
from src.common.database import get_db_session, FilteredProduct
with get_db_session() as db:
    count = db.query(FilteredProduct).count()
    print(f'Filtered products: {count}')
    # Show sample records
    products = db.query(FilteredProduct).limit(3).all()
    for p in products:
        print(f'Product {p.product_id}: €{p.target_sale_price} - {p.min_delivery_days}-{p.max_delivery_days} days')
"
```
#### 3.4.2 Image Ingestion Verification
```bash
# Check if images were automatically ingested
python -c "
from src.common.database import get_db_session, ProductImage
with get_db_session() as db:
    count = db.query(ProductImage).count()
    print(f'Total images: {count}')
    # Show image breakdown by role
    from sqlalchemy import func
    breakdown = db.query(ProductImage.image_role, func.count(ProductImage.id)).group_by(ProductImage.image_role).all()
    for role, count in breakdown:
        print(f'{role} images: {count}')
"
```
#### 3.4.3 Shipping Information Verification
```bash
# Check shipping information was stored
python -c "
from src.common.database import get_db_session, ShippingInfo
with get_db_session() as db:
    count = db.query(ShippingInfo).count()
    print(f'Shipping records: {count}')
    # Show sample shipping info
    shipping = db.query(ShippingInfo).limit(3).all()
    for s in shipping:
        print(f'Product {s.product_id}, SKU {s.sku_id}: €{s.shipping_fee} - {s.max_delivery_days} days ({s.company})')
"
```
### 3.5 Module C: Duplicate Detection Testing
#### 3.5.1 Duplicate Detection Test
```bash
# Run duplicate detection on filtered products
python main.py detect:duplicates --limit 10
```
**What Happens:**
- Analyzes filtered products for duplicates using pHash and CLIP
- Uses intelligent cascade: pHash first, then CLIP for ambiguous cases
- Groups similar products and selects the master (lowest total cost)
- Updates product_status table with results
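
The cascade can be pictured with the sketch below. The threshold names match the configuration variables in section 3.5.4, but the numeric values, the `CLIP_REVIEW_THRESHOLD` constant, and the exact decision boundaries are assumptions made for illustration; the project's real logic may differ.

```python
# Illustrative values only; the project's defaults are not stated in this guide.
PHASH_DUPLICATE_THRESHOLD = 5     # at or below this Hamming distance: duplicate
PHASH_AMBIGUOUS_THRESHOLD = 15    # above this Hamming distance: unique
CLIP_DUPLICATE_THRESHOLD = 0.90   # at or above this similarity: duplicate
CLIP_REVIEW_THRESHOLD = 0.80      # hypothetical lower bound for REVIEW_SUSPECT


def classify_pair(phash_difference, clip_similarity_fn):
    """Return (status, detection_method) for one candidate/master pair.

    clip_similarity_fn is called only when pHash alone cannot decide,
    which is what keeps the cascade cheap.
    """
    if phash_difference <= PHASH_DUPLICATE_THRESHOLD:
        return "DUPLICATE", "PHASH"
    if phash_difference > PHASH_AMBIGUOUS_THRESHOLD:
        return "UNIQUE", "PHASH"
    similarity = clip_similarity_fn()  # expensive step, ambiguous cases only
    if similarity >= CLIP_DUPLICATE_THRESHOLD:
        return "DUPLICATE", "CLIP"
    if similarity >= CLIP_REVIEW_THRESHOLD:
        return "REVIEW_SUSPECT", "CLIP"
    return "UNIQUE", "CLIP"


clip_calls = []


def fake_clip_similarity():
    """Stand-in for a real CLIP comparison; records that it was invoked."""
    clip_calls.append(1)
    return 0.87


print(classify_pair(3, fake_clip_similarity))    # decided by pHash alone
print(classify_pair(12, fake_clip_similarity))   # ambiguous, falls through to CLIP
print(classify_pair(30, fake_clip_similarity))   # decided by pHash alone
print(len(clip_calls))
```

Passing the CLIP comparison as a callable makes the laziness visible: in the demo it runs once for three pairs.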
**Expected Results:**
- Console output shows progress of pHash and CLIP analysis
- Products are classified as UNIQUE, DUPLICATE, MASTER, or REVIEW_SUSPECT
- Database contains new records in product_status table
**Verification Queries:**
```sql
-- Check duplicate detection results
SELECT status, COUNT(*) as count FROM product_status GROUP BY status;
-- View detection details
SELECT product_id, status, duplicate_master_id, detection_method,
phash_difference, clip_similarity, total_landed_cost
FROM product_status LIMIT 10;
-- Check duplicate groups
SELECT duplicate_master_id, COUNT(*) as group_size
FROM product_status
WHERE duplicate_master_id IS NOT NULL
GROUP BY duplicate_master_id
ORDER BY group_size DESC;
```
#### 3.5.2 Detection Status Test
```bash
# Show duplicate detection statistics
python main.py detect:status
```
**What Happens:**
- Displays counts by status (UNIQUE, DUPLICATE, MASTER, REVIEW_SUSPECT)
- Shows detection method statistics (pHash vs CLIP usage)
- Reports average similarity scores and confidence levels
- Lists recent detection job performance
**Expected Output:**
```
Duplicate Detection Status:
=========================
Product Status Counts:
- UNIQUE: 45 products
- DUPLICATE: 12 products
- MASTER: 8 products
- REVIEW_SUSPECT: 3 products
Detection Method Usage:
- pHash Only: 53 products (77.9%)
- pHash + CLIP: 15 products (22.1%)
Average Scores:
- pHash Difference: 12.4 (out of 64)
- CLIP Similarity: 0.82 (out of 1.0)
- Confidence Score: 0.91 (out of 1.0)
```
#### 3.5.3 Export Suspects Test
```bash
# Export suspicious duplicate cases for review
python main.py detect:export-suspects --output data/review_suspects.csv
```
**What Happens:**
- Exports all REVIEW_SUSPECT products to CSV
- Includes product details, similarity scores, and image paths
- Creates file for manual expert review
**Expected Results:**
- CSV file created at specified location
- File contains products with ambiguous duplicate classification
- Columns include product_id, similarity scores, image information
**Sample CSV Content:**
```csv
product_id,status,duplicate_master_id,detection_method,phash_difference,clip_similarity,confidence_score,product_title,target_sale_price
1005009123456,REVIEW_SUSPECT,,CLIP,15,0.84,0.75,"Silver Ring Set",89.99
1005009789012,REVIEW_SUSPECT,,CLIP,12,0.87,0.78,"Sterling Silver Band",92.50
```
#### 3.5.4 Configuration Testing
```bash
# Test with different thresholds (modify .env first)
python main.py detect:duplicates --force --limit 5
```
**Configuration Variables to Test:**
```bash
# Stricter pHash detection (lower threshold)
PHASH_DUPLICATE_THRESHOLD=1
PHASH_AMBIGUOUS_THRESHOLD=10
# More lenient CLIP detection
CLIP_DUPLICATE_THRESHOLD=0.85
# Different CLIP model
CLIP_MODEL=ViT-B/16 # More accurate but slower
CLIP_DEVICE=cpu # Force CPU usage
```
**What to Verify:**
- Different thresholds produce different classification results
- Stricter thresholds create fewer duplicates (higher precision)
- Lenient thresholds create more duplicates (higher recall)
- CPU vs GPU performance differences
#### 3.5.5 Image Comparison Script Test
```bash
# Test the standalone image comparison script
python compare_product_images.py 1005009917334390 1005009919988717 --show-all 5 --verbose
```
**What Happens:**
- Compares pHash values between all images of two products
- Finds the SKU ID pair with smallest Hamming distance
- Shows detailed similarity statistics and image metadata
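
For readers unfamiliar with pHash comparison, the core of the operation is a Hamming distance between two 64-bit hashes. The sketch below is not the script's code; it assumes hashes are available as integers keyed by SKU ID, and the percentage formula is one common convention.

```python
from itertools import product


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def best_match(images_a, images_b):
    """images_*: {sku_id: pHash as int}. Returns (distance, sku_a, sku_b)."""
    return min(
        (hamming_distance(hash_a, hash_b), sku_a, sku_b)
        for (sku_a, hash_a), (sku_b, hash_b) in product(images_a.items(), images_b.items())
    )


def similarity_percent(distance, bits=64):
    """Assumed convention: share of matching bits, as a percentage."""
    return 100.0 * (bits - distance) / bits


product_a = {"sku1": 0b1111, "sku2": 0xFF00}
product_b = {"skuX": 0b1011, "skuY": 0}
distance, sku_a, sku_b = best_match(product_a, product_b)
print(distance, sku_a, sku_b, similarity_percent(distance))
```

`best_match` raises `ValueError` if either product has no images, so a caller would need to handle that case.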
**Expected Output:**
- Best match with SKU IDs and similarity percentage
- Statistics breakdown (exact matches, similar, different)
- Top N comparisons with detailed image information
#### 3.5.6 Export/Import Review Workflow Test
**Step 1: Export Review Suspects**
```bash
# Export REVIEW_SUSPECT cases for manual review
python main.py detect:export-suspects --output data/test_review.csv
```
**What Happens:**
- Creates CSV file with all REVIEW_SUSPECT products
- Includes complete product details, similarity metrics, and URLs
- Creates empty review_decision and notes columns
**Expected CSV Columns:**
```
suspect_product_id,suspect_title,suspect_price,suspect_cost,suspect_image,suspect_product_url,
master_product_id,master_title,master_price,master_image,master_product_url,
phash_difference,clip_similarity,review_decision,notes
```
**Step 2: Manual Review Simulation**
```bash
# Create a test reviewed file
cp data/test_review.csv data/test_reviewed.csv
# Edit the CSV to add review decisions:
# - Set review_decision to "DUPLICATE", "UNIQUE", or "UNCERTAIN"
# - Add explanatory notes
```
**Step 3: Dry Run Import Test**
```bash
# Test the import without making changes
python main.py detect:import-reviewed --input data/test_reviewed.csv --dry-run
```
**Expected Dry Run Output:**
```
🔄 Would update 1005009123456: REVIEW_SUSPECT -> DUPLICATE
Master ID: 1005009789012 -> 1005009789012
🔄 Would also reassign master: 1005009789012 -> 1005009123456
📊 Would affect 3 products
📊 Summary:
Updated: 0
Skipped: 0
Total processed: 2
```
**Step 4: Actual Import Test**
```bash
# Apply the review decisions
python main.py detect:import-reviewed --input data/test_reviewed.csv
```
**Expected Results:**
- Products marked as DUPLICATE are updated in database
- Products marked as UNIQUE are updated in database
- Products marked as UNCERTAIN remain as REVIEW_SUSPECT
- Master reassignment occurs automatically when a confirmed duplicate costs less than its current master
**Verification Queries:**
```sql
-- Check updated statuses
SELECT product_id, status, duplicate_master_id
FROM product_status
WHERE product_id IN ('1005009123456', '1005009789012');
-- Verify master reassignment worked
SELECT ps.product_id, ps.status, ps.duplicate_master_id, fp.target_sale_price
FROM product_status ps
LEFT JOIN filtered_products fp ON ps.product_id = fp.product_id
WHERE ps.duplicate_master_id IN (
SELECT product_id FROM product_status WHERE status = 'MASTER'
)
ORDER BY ps.duplicate_master_id, fp.target_sale_price;
```
**Step 5: Master Reassignment Test**
To specifically test master reassignment:
1. Create a test scenario where a cheaper REVIEW_SUSPECT should become master:
```sql
-- Temporarily lower the price of a review suspect
UPDATE filtered_products
SET target_sale_price = 50.00
WHERE product_id = '1005009123456';
```
2. Mark it as DUPLICATE in the CSV and import
3. Verify that it becomes the new master and all other duplicates point to it
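
The expected outcome of this test can be reasoned about with a small sketch. It assumes the cheapest product in a group becomes the master, that ties are broken by product ID, and that a master's own `duplicate_master_id` is empty; all three are assumptions, and `reassign_master` is a hypothetical helper.

```python
def reassign_master(group, costs):
    """group: product IDs in one duplicate group; costs: {product_id: cost}.

    Returns {product_id: {"status": ..., "duplicate_master_id": ...}}.
    """
    # Lowest cost wins; the product ID is a tie-breaker so the result is deterministic.
    master = min(group, key=lambda product_id: (costs[product_id], product_id))
    return {
        product_id: {
            "status": "MASTER" if product_id == master else "DUPLICATE",
            "duplicate_master_id": None if product_id == master else master,
        }
        for product_id in group
    }


group = ["1005009789012", "1005009123456", "1005009555555"]
costs = {"1005009789012": 92.50, "1005009123456": 50.00, "1005009555555": 60.00}
result = reassign_master(group, costs)
for product_id, state in result.items():
    print(product_id, state)
```

With the price lowered to 50.00 as in the SQL above, `1005009123456` becomes the master and the other two products point to it.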
### 3.6 Advanced Testing
#### 3.6.1 End-to-End Workflow Test
```bash
# Complete workflow test
echo "=== Phase 1: Session Setup ==="
python main.py list_sessions
echo "=== Phase 2: Data Collection ==="
python main.py harvest:init --limit 5 --dry-run
echo "=== Phase 3: Status Check ==="
python main.py harvest:status
echo "=== Phase 4: Product Processing ==="
python main.py filter:products --limit 2 --dry-run
echo "=== Workflow test completed ==="
```
#### 3.6.2 Database Integrity Test
```bash
# Verify database relationships and data integrity
python -c "
from src.common.database import get_db_session, Product, FilteredProduct, ProductImage, ShippingInfo
with get_db_session() as db:
    print('=== Database Integrity Check ===')
    # Check products have sellers
    products_without_sellers = db.query(Product).filter(Product.shop_id.is_(None)).count()
    print(f'Products without sellers: {products_without_sellers}')
    # Check filtered products exist in products table
    orphaned_filtered = db.query(FilteredProduct).filter(
        ~FilteredProduct.product_id.in_(db.query(Product.product_id))
    ).count()
    print(f'Orphaned filtered products: {orphaned_filtered}')
    # Check images have valid products
    orphaned_images = db.query(ProductImage).filter(
        ~ProductImage.product_id.in_(db.query(Product.product_id))
    ).count()
    print(f'Orphaned images: {orphaned_images}')
    print('=== Integrity check completed ===')
"
```
## 4. Troubleshooting Common Issues
### 4.1 Session Management Issues
#### 4.1.1 Authorization Code Invalid
**Problem:** `create_session` fails with authorization error
**Solution:**
1. Verify the authorization code was copied correctly from AliExpress
2. Check if the code has expired (codes typically have short lifespans)
3. Ensure your AliExpress app credentials are configured correctly in `.env`
#### 4.1.2 Token Refresh Fails
**Problem:** `refresh_session` returns authentication error
**Solution:**
1. Check if tokens in database are corrupted:
```bash
python main.py list_sessions
```
2. If all sessions show as expired, create a new session:
```bash
python main.py create_session --code NEW_CODE
```
### 4.2 Harvest Issues
#### 4.2.1 No Products Found
**Problem:** Harvest completes but `found_count` is 0
**Possible Causes & Solutions:**
1. **Network Issues:** Check internet connection and AliExpress API status
2. **API Rate Limits:** Wait and retry, or reduce the `--limit` parameter
3. **Search Configuration:** Verify `CATEGORY` and search terms in `.env` file
4. **Session Expired:** Run `python main.py refresh_session` first
#### 4.2.2 Database Connection Errors
**Problem:** "Cannot connect to database" errors during harvest
**Solution:**
1. Check database file permissions (for SQLite)
2. Verify `DATABASE_URL` in `.env` file
3. For PostgreSQL, ensure server is running and credentials are correct
### 4.3 Product Filtering Issues
#### 4.3.1 No Products Pass Filters
**Problem:** `filter:products` completes but creates 0 filtered products
**Possible Causes:**
1. **Price Filters Too Restrictive:** Increase `--max-price` parameter
2. **Delivery Filters Too Restrictive:** Increase `--max-delivery` parameter
3. **No Whitelisted Sellers:** Check that some sellers have WHITELIST approval status
4. **Shipping API Issues:** Products might lack shipping information
**Debugging Steps:**
```bash
# Check seller approval status
python main.py harvest:status
# Try with relaxed filters
python main.py filter:products --max-price 100 --max-delivery 30 --limit 1
# Check raw product data
python -c "
from src.common.database import get_db_session, Product, Seller
with get_db_session() as db:
    products = db.query(Product).join(Seller).filter(Seller.approval_status == 'WHITELIST').limit(3).all()
    print(f'Available products from whitelisted sellers: {len(products)}')
"
```
#### 4.3.2 Image Ingestion Fails
**Problem:** Products are filtered but no images are stored
**Solution:**
1. Check if products have image URLs in the raw data
2. Verify network connectivity for image downloads
3. Check image ingestion logs for specific error messages
### 4.4 Review Process Issues
#### 4.4.1 Export Creates Empty File
**Problem:** `review:export-pending` creates CSV with only headers
**Solution:**
- This means no sellers have PENDING status. Check status distribution:
```bash
python main.py harvest:status
```
#### 4.4.2 Import Fails with CSV Errors
**Problem:** `review:import-results` fails to read the CSV file
**Possible Causes & Solutions:**
1. **File Format Issues:** Ensure CSV uses proper encoding (UTF-8) and commas as delimiters
2. **Missing Columns:** Verify the CSV has required columns: `shop_id`, `approval_status`, `note`
3. **File Path Issues:** Ensure the file is saved as `data/reviewed_merchants.csv`
#### 4.4.3 Duplicate Detection Export Issues
**Problem:** `detect:export-suspects` creates empty CSV or fails
**Possible Causes & Solutions:**
1. **No REVIEW_SUSPECT products:** Check if duplicate detection found any ambiguous cases:
```bash
python main.py detect:status
```
2. **Missing master products:** REVIEW_SUSPECT products need valid master assignments:
```sql
SELECT product_id, status, duplicate_master_id
FROM product_status
WHERE status = 'REVIEW_SUSPECT' AND duplicate_master_id IS NULL;
```
3. **Directory creation issues:** Ensure the output directory exists or can be created
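To rule out the third cause, the output directory can be created up front. This is a minimal sketch; `ensure_output_dir` is a hypothetical helper, and the path passed to it is whatever output location you give the export command.

```python
from pathlib import Path


def ensure_output_dir(csv_path: str) -> Path:
    """Create the parent directory of csv_path if needed and return it."""
    parent = Path(csv_path).expanduser().resolve().parent
    # Raises PermissionError if the directory cannot be created
    parent.mkdir(parents=True, exist_ok=True)
    return parent
```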
#### 4.4.4 Import Review Results Issues
**Problem:** `detect:import-reviewed` fails or produces unexpected results
**Common Issues & Solutions:**
1. **CSV Format Problems:**
- Ensure CSV has correct column headers (especially `suspect_product_id`, `review_decision`)
- Check for proper UTF-8 encoding
- Verify that product titles containing commas or quotes are correctly quoted, so they do not shift columns
2. **Invalid Review Decisions:**
- Only use: `DUPLICATE`, `UNIQUE`, `UNCERTAIN` (case-insensitive)
- Empty review_decision cells are ignored
3. **Product Not Found Errors:**
- Verify product IDs in CSV match those in database
- Check if products were re-processed and IDs changed
4. **Master Reassignment Issues:**
- If unexpected master changes occur, check product prices:
```sql
SELECT product_id, target_sale_price
FROM filtered_products
WHERE product_id IN ('suspect_id', 'master_id');
```
**Debug with Dry Run:**
```bash
# Always test first with dry-run
python main.py detect:import-reviewed --input your_file.csv --dry-run
```
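Before the dry run, the `review_decision` column can be pre-checked against the rules above. This is a minimal sketch, assuming the column name and the three allowed values listed in this section; `summarize_decisions` is a hypothetical helper, not the importer's own validation.

```python
import csv
import io
from collections import Counter

VALID_DECISIONS = {"DUPLICATE", "UNIQUE", "UNCERTAIN"}


def summarize_decisions(csv_text: str) -> Counter:
    """Count valid, empty, and invalid review_decision values."""
    summary = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Compare case-insensitively, ignoring surrounding whitespace
        decision = (row.get("review_decision") or "").strip().upper()
        if not decision:
            summary["empty"] += 1
        elif decision in VALID_DECISIONS:
            summary[decision] += 1
        else:
            summary["invalid"] += 1
    return summary
```

A non-zero `invalid` count means some rows contain a value outside the allowed set and should be corrected before importing.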
### 4.5 General Debugging Tips
#### 4.5.1 Enable Verbose Logging
Add detailed logging to see what's happening:
```python
# Add to any Python script for debugging
import logging
logging.basicConfig(level=logging.DEBUG)
```
#### 4.5.2 Database Inspection Commands
Use these commands to inspect database state:
```bash
# Count records in each table
python -c "
from src.common.database import get_db_session, Product, Seller, FilteredProduct, ProductImage
with get_db_session() as db:
    print(f'Products: {db.query(Product).count()}')
    print(f'Sellers: {db.query(Seller).count()}')
    print(f'Filtered Products: {db.query(FilteredProduct).count()}')
    print(f'Images: {db.query(ProductImage).count()}')
"
# Check recent activity
python -c "
from src.common.database import get_db_session, Product
from datetime import datetime, timedelta
with get_db_session() as db:
    recent = datetime.now() - timedelta(hours=24)
    recent_count = db.query(Product).filter(Product.created_at >= recent).count()
    print(f'Products created in last 24h: {recent_count}')
"
```
#### 4.5.3 Configuration Verification
Verify your configuration is loaded correctly:
```bash
python -c "
from src.common.config import get_search_category, get_ignore_categories
print(f'Search category: {get_search_category()}')
print(f'Ignore categories: {get_ignore_categories()}')
"
```
### 4.6 Performance Troubleshooting
#### 4.6.1 Slow API Calls
**Problem:** Commands take very long to complete
**Solutions:**
1. **Reduce Batch Sizes:** Use smaller `--limit` values
2. **Check Network:** Verify stable internet connection
3. **API Rate Limits:** Add delays between requests if needed
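One common way to add delays between requests is to retry with exponential backoff. This is a minimal sketch of that general technique, not the pipeline's own retry logic; `call_with_backoff` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting.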
#### 4.6.2 Database Performance
**Problem:** Database queries are slow
**Solutions:**
1. **Add Indexes:** Critical indexes should already exist, but verify with database tools
2. **Database Size:** Consider archiving old data if database grows very large
3. **Connection Pooling:** For high-volume usage, consider PostgreSQL over SQLite, since it handles pooled, concurrent connections better
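For a SQLite database, existing indexes can be listed with the standard `PRAGMA index_list` and `PRAGMA index_info` statements. This is a minimal sketch; `list_indexes` is a hypothetical helper, and the table name you pass to it (for example `product_status`) depends on your schema.

```python
import sqlite3


def list_indexes(connection: sqlite3.Connection, table: str) -> dict[str, list[str]]:
    """Map each index name on `table` to the columns it covers."""
    indexes: dict[str, list[str]] = {}
    # PRAGMA statements do not accept bound parameters, so quote the names
    quoted_table = table.replace('"', '""')
    index_rows = connection.execute(f'PRAGMA index_list("{quoted_table}")').fetchall()
    for row in index_rows:
        index_name = row[1]  # columns: seq, name, unique, origin, partial
        quoted_index = index_name.replace('"', '""')
        info_rows = connection.execute(f'PRAGMA index_info("{quoted_index}")').fetchall()
        indexes[index_name] = [info[2] for info in info_rows]  # seqno, cid, name
    return indexes
```

An empty result for a table that is filtered or joined on frequently suggests a missing index.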
### 4.7 Module C: Duplicate Detection Issues
#### 4.7.1 CLIP Model Loading Fails
**Problem:** `detect:duplicates` fails with CLIP model error
**Solutions:**
1. **Install Required Packages:**
```bash
pip install torch torchvision clip-by-openai
```
2. **Check Device Configuration:**
```bash
# Force CPU usage if GPU issues
CLIP_DEVICE=cpu
```
3. **Try Different Model:**
```bash
# Use smaller, more compatible model
CLIP_MODEL=ViT-B/32
```
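To confirm whether a GPU is actually usable before changing `CLIP_DEVICE`, a quick check like the following can help. This is a sketch under assumptions: `resolve_device` is a hypothetical helper and does not reflect how the pipeline itself reads `CLIP_DEVICE`; the only real library call used is `torch.cuda.is_available()`.

```python
import os


def cuda_available() -> bool:
    """Return True only if torch imports and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested=None, has_cuda=None) -> str:
    """Pick 'cuda' or 'cpu', falling back to CPU when CUDA is unavailable."""
    if requested is None:
        requested = os.environ.get("CLIP_DEVICE", "")
    requested = requested.strip().lower()
    if has_cuda is None:
        has_cuda = cuda_available()
    if requested.startswith("cuda") and not has_cuda:
        return "cpu"  # requested GPU is not usable on this machine
    return requested or ("cuda" if has_cuda else "cpu")
```

If this returns `cpu` while `CLIP_DEVICE=cuda` is set, the model error is likely a driver or installation problem, not a pipeline bug.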
#### 4.7.2 No Duplicates Found
**Problem:** `detect:duplicates` completes but finds no duplicates
**Possible Causes & Solutions:**
1. **Thresholds Too Strict:**
```bash
# Try more lenient thresholds
PHASH_DUPLICATE_THRESHOLD=5
PHASH_AMBIGUOUS_THRESHOLD=25
CLIP_DUPLICATE_THRESHOLD=0.80
```
2. **Insufficient Images:**
```sql
-- Check if products have processed images
SELECT COUNT(*) FROM product_images WHERE phash IS NOT NULL;
```
3. **Products Not Filtered:**
```sql
-- Ensure products are in filtered_products table
SELECT COUNT(*) FROM filtered_products;
```
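To see why the pHash thresholds matter, it helps to look at how a pair of hashes might be classified. This is a minimal sketch, assuming hashes are stored as equal-length hex strings and that both thresholds are inclusive upper bounds on the Hamming distance; the pipeline's actual comparison may differ, and `hamming_distance` and `classify_pair` are hypothetical helpers.

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Number of differing bits between two equal-length hex pHash strings."""
    if len(phash_a) != len(phash_b):
        raise ValueError("pHash strings must have the same length")
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")


def classify_pair(
    distance: int,
    duplicate_threshold: int = 5,
    ambiguous_threshold: int = 25,
) -> str:
    """Classify a pair by Hamming distance (assumed inclusive thresholds)."""
    if distance <= duplicate_threshold:
        return "duplicate"
    if distance <= ambiguous_threshold:
        return "ambiguous"  # assumed to be the range handed to CLIP
    return "distinct"
```

Under these assumptions, a pair at distance 10 is `ambiguous` with thresholds 5 and 25 but `distinct` with thresholds 1 and 8, which is why stricter settings can produce no duplicates at all.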
#### 4.7.3 Too Many Duplicates Detected
**Problem:** Most products are marked as duplicates incorrectly
**Solutions:**
1. **Tighten Thresholds:**
```bash
# More strict detection
PHASH_DUPLICATE_THRESHOLD=1
PHASH_AMBIGUOUS_THRESHOLD=8
CLIP_DUPLICATE_THRESHOLD=0.95
```
2. **Check Image Quality:**
```sql
-- Verify images are properly downloaded and processed
SELECT product_id, COUNT(*) FROM product_images
WHERE phash IS NOT NULL AND local_file_path IS NOT NULL
GROUP BY product_id LIMIT 10;
```
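The effect of `CLIP_DUPLICATE_THRESHOLD` can be illustrated with a small similarity check. This sketch assumes the threshold is a minimum cosine similarity between image embeddings, which is a common convention but is not confirmed by this guide; `cosine_similarity` and `is_clip_duplicate` are hypothetical helpers, and the short vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def is_clip_duplicate(a: Sequence[float], b: Sequence[float], threshold: float = 0.95) -> bool:
    """Treat a pair as duplicate when similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold
```

Raising the threshold toward 1.0 requires embeddings to be nearly identical, so fewer pairs qualify as duplicates.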
#### 4.7.4 CLIP Analysis Too Slow
**Problem:** Duplicate detection takes very long due to CLIP analysis
**Solutions:**
1. **Reduce Image Limits:**
```bash
CLIP_MAX_IMAGES_PER_PRODUCT=3
CLIP_IMAGE_ROLES=hero # Only analyze hero images
```
2. **Optimize pHash Thresholds:**
```bash
# Reduce ambiguous range to minimize CLIP usage
PHASH_DUPLICATE_THRESHOLD=3
PHASH_AMBIGUOUS_THRESHOLD=12
```
3. **Use GPU if Available:**
```bash
CLIP_DEVICE=cuda # If NVIDIA GPU available
```
#### 4.7.5 Memory Issues During Detection
**Problem:** Process runs out of memory during duplicate detection
**Solutions:**
1. **Process in Batches:**
```bash
python main.py detect:duplicates --limit 50
```
2. **Reduce CLIP Image Limits:**
```bash
CLIP_MAX_IMAGES_PER_PRODUCT=2
```
3. **Use Lighter CLIP Model:**
```bash
CLIP_MODEL=ViT-B/32 # Lower compute and memory cost than ViT-B/16
```
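The idea behind batch processing is to keep only a bounded number of items in memory at a time. This is a generic sketch of that technique; `batched` is a hypothetical helper and does not describe how `--limit` is implemented internally.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls only the next batch, so the full input is never materialized
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

For example, `batched(product_ids, 50)` would yield groups of 50 IDs, with a smaller final group if the total is not a multiple of 50.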
#### 4.7.6 Verification Commands
**Check Detection Status:**
```bash
# View current detection results
python main.py detect:status
# Check specific product status
sqlite3 test.db "SELECT * FROM product_status WHERE product_id = 'YOUR_PRODUCT_ID';"
# Find largest duplicate groups
sqlite3 test.db "
SELECT duplicate_master_id, COUNT(*) as group_size
FROM product_status
WHERE duplicate_master_id IS NOT NULL
GROUP BY duplicate_master_id
ORDER BY group_size DESC
LIMIT 10;
"
```
**Reset Detection Results:**
```bash
# Clear all detection results to start over (deletes every row in product_status; this cannot be undone)
sqlite3 test.db "DELETE FROM product_status;"
# Re-run detection with new settings
python main.py detect:duplicates
```
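Because the reset deletes every row, it can be worth exporting the table first. This is a minimal sketch using the standard `sqlite3` and `csv` modules; `backup_table_to_csv` is a hypothetical helper, and the database file name (`test.db` in the commands above) and the backup path are up to you.

```python
import csv
import sqlite3


def backup_table_to_csv(connection: sqlite3.Connection, table: str, csv_path: str) -> int:
    """Write all rows of `table` to a CSV file and return the row count."""
    # Table names cannot be bound as parameters, so quote the identifier
    quoted = table.replace('"', '""')
    cursor = connection.execute(f'SELECT * FROM "{quoted}"')
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Calling it with `sqlite3.connect("test.db")` and `"product_status"` before the `DELETE` keeps a copy of the previous detection results for comparison.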