
NLTK Dataset Generator: Create 10,000 Rows of Custom NLP Training Data from Any Topic

Author: Claude Directory

Date: December 5, 2025


Generate a 10,000-row, NLTK-compatible dataset for NLP tasks by providing a single topic. Intended for data scientists and researchers who need large-scale synthetic text data quickly for model training and analysis.

Prompt

You are an expert NLTK Dataset Generator. Your task is to create a large, high-quality dataset of exactly 10,000 rows optimized for use with the Natural Language Toolkit (NLTK), based on a user-provided topic. The dataset must be in CSV format so it can be loaded with pandas and then processed with NLTK.

Follow these numbered steps precisely:

1. **Analyze the Topic**: Receive the user's topic (e.g., 'two Reddit users becoming e-friends'). Understand it deeply to generate diverse, realistic text data relevant to NLP tasks like tokenization, sentiment analysis, POS tagging, or corpora building.

2. **Define Dataset Structure**: Output a CSV with these exact columns:
   - `id`: Sequential integer from 1 to 10000.
   - `text`: A short paragraph or sentence (20-100 words) on the topic.
   - `label`: A category or sentiment (e.g., 'positive', 'negative', 'neutral', 'question', 'story') fitting the text.
   - `source_type`: Simulate origin like 'reddit_post', 'comment', 'tweet', 'article_snippet'.

3. **Ensure Diversity and Quality**:
   - Vary language: Mix formal/informal, questions/statements, emotions.
   - Realistic content: Natural, error-free English (unless topic specifies otherwise).
   - Balance labels: Roughly 30% positive, 30% negative/neutral, 20% questions, 20% stories.
   - No duplicates: Each row unique.
   - NLP-friendly: Include varied sentence structures, vocabulary, punctuation for robust training.

4. **Generate the Dataset**:
   - Produce EXACTLY 10,000 rows.
   - Start output with: '```csv\n' followed by headers, then data rows, end with '\n```'.
   - Make it downloadable/copy-paste ready.

5. **Validation**: Before final output, internally verify row count = 10,000, columns correct, content relevant.

User topic: [INSERT YOUR TOPIC HERE]

Generate the dataset now.

How to Use

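Copy the prompt into ChatGPT or Claude, replace '[INSERT YOUR TOPIC HERE]' with your topic (e.g., 'climate change debates'), and run it. The model returns the dataset as CSV inside a fenced block. Because 10,000 rows is a very long output, check the row count before relying on it; a response may be cut short and need to be continued. Save the CSV to a file, load it in Python with pandas.read_csv(), then process the text column with NLTK for tokenization, stemming, or model training.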
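
The checks the prompt asks the model to run internally (row count, columns, no duplicates) can also be repeated on your side before training on the output. The following is a minimal sketch under a few assumptions: the model's reply is available as a string, it follows the column layout from step 2, and the helper names (`strip_csv_fence`, `validate_dataset`, `tokenize_texts`) are hypothetical, not part of NLTK or pandas. Validation uses only the standard library; the tokenization helper imports NLTK lazily and uses `wordpunct_tokenize`, which does not require downloading extra NLTK data.

```python
import csv
import io
from collections import Counter

# Column layout requested in step 2 of the prompt.
EXPECTED_COLUMNS = ["id", "text", "label", "source_type"]


def strip_csv_fence(raw: str) -> str:
    """Remove the surrounding code fence the prompt asks the model to emit."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)


def validate_dataset(raw: str, expected_rows: int = 10_000) -> dict:
    """Report on columns, row count, id sequence, duplicates, and label shares."""
    reader = csv.DictReader(io.StringIO(strip_csv_fence(raw)))
    columns_ok = reader.fieldnames == EXPECTED_COLUMNS
    rows = list(reader)
    texts = [row.get("text") or "" for row in rows]
    ids = [row.get("id") for row in rows]
    labels = Counter(row.get("label") for row in rows)
    return {
        "columns_ok": columns_ok,
        "row_count": len(rows),
        "row_count_ok": len(rows) == expected_rows,
        "ids_sequential": ids == [str(i) for i in range(1, len(rows) + 1)],
        "duplicate_texts": len(texts) - len(set(texts)),
        "label_share": {k: n / len(rows) for k, n in labels.items()} if rows else {},
    }


def tokenize_texts(raw: str) -> list[list[str]]:
    """Tokenize the text column with NLTK (requires `pip install nltk`)."""
    from nltk.tokenize import wordpunct_tokenize  # regex-based tokenizer

    reader = csv.DictReader(io.StringIO(strip_csv_fence(raw)))
    return [wordpunct_tokenize(row["text"]) for row in reader]


# Tiny illustrative reply (made-up rows, not real model output).
sample = (
    "```csv\n"
    "id,text,label,source_type\n"
    '1,"Hi there, friend!",positive,comment\n'
    "2,How did you two meet?,question,reddit_post\n"
    "```"
)
report = validate_dataset(sample, expected_rows=2)
print(report)
```

If `row_count_ok` is false or `duplicate_texts` is above zero, the reply was probably truncated or repetitive, and it may be worth requesting the remaining rows in batches rather than using the file as is. The `label_share` values can be compared against the rough 30/30/20/20 balance the prompt requests.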
