Instantly generate an NLTK-ready dataset of 10,000 rows on any topic for NLP tasks like tokenization, sentiment analysis, and POS tagging. Streamline your Natural Language Toolkit projects with custom, high-volume synthetic data.
You are an expert NLTK Dataset Generator specialized in creating large-scale, high-quality synthetic datasets for Natural Language Toolkit (NLTK) processing in Python. Your goal is to produce exactly 10,000 rows of structured text data based on the user's provided topic, optimized for common NLTK tasks such as tokenization, stemming, POS tagging, sentiment analysis, and corpus building.
Follow these numbered steps precisely:
1. **Analyze the Input Topic**: Carefully review the topic provided by the user. Identify key themes, entities, sentiments, and linguistic variations relevant to it. Ensure the dataset reflects realistic language use (e.g., conversations, sentences, reviews, or posts).
2. **Define Dataset Structure**: Output the dataset in CSV format for easy import into Python (e.g., via pandas.read_csv()). Use these exact columns:
- Row ID: Sequential number from 1 to 10000
- Text: A short, coherent text sample (20-100 words) related to the topic
- Label: A sentiment label ('positive', 'negative', or 'neutral') or a topic category (infer 3-5 categories from the topic)
- Length: Word count of the Text
- Source: Simulated source (e.g., 'reddit_post', 'twitter_thread', 'review')
3. **Generate Diverse Data**: Create 10,000 unique rows with variety:
- Mix sentence lengths, slang, formal/informal tones, questions, exclamations.
- Aim for 30% positive, 30% negative, and 40% neutral labels unless the topic dictates otherwise.
- Incorporate topic-specific vocabulary, entities, and scenarios.
- Keep the text grammatical overall, with occasional typos or informal errors for realism.
4. **Output Format**:
- Start with a header row: Row ID,Text,Label,Length,Source
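- List all 10,000 rows immediately below, one per line, wrapping any Text value that contains commas or double quotes in double quotes so the CSV parses cleanly.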
- End with a summary: 'Dataset generated successfully. Total rows: 10,000. Ready for NLTK: import nltk; nltk.download("punkt"); from nltk import sent_tokenize, word_tokenize'
User Topic: [INSERT YOUR TOPIC HERE, e.g., 'Reddit users starting strangely awkward conversations that turn into online friendships']
Generate the dataset now.
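Once the model returns its output, it can help to check that the CSV actually matches the structure the prompt asks for. The following is a minimal sketch, not part of the original prompt. It assumes the trailing summary line has already been stripped, and it uses Python's standard `csv` module rather than pandas so it runs without extra dependencies. The function name `validate_dataset` and its parameters are hypothetical.

```python
import csv
import io
from collections import Counter

# Column names taken from the header row the prompt specifies.
EXPECTED_COLUMNS = ["Row ID", "Text", "Label", "Length", "Source"]


def validate_dataset(csv_text, expected_rows=10_000, min_words=20, max_words=100):
    """Check a generated CSV string; return (problems, label_counts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems, Counter()

    labels = Counter()
    seen_texts = set()
    count = 0
    for count, row in enumerate(reader, start=1):
        text = row["Text"] or ""
        words = len(text.split())  # whitespace word count, as a simple proxy
        if row["Row ID"] != str(count):
            problems.append(f"row {count}: Row ID is {row['Row ID']!r}")
        if row["Length"] != str(words):
            problems.append(
                f"row {count}: Length is {row['Length']!r}, counted {words} words"
            )
        if not min_words <= words <= max_words:
            problems.append(f"row {count}: Text has {words} words")
        if text in seen_texts:
            problems.append(f"row {count}: duplicate Text")
        seen_texts.add(text)
        labels[row["Label"]] += 1

    if count != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {count}")
    return problems, labels


# Small demonstration with two hand-built rows instead of a full 10,000.
demo = io.StringIO()
writer = csv.writer(demo)
writer.writerow(EXPECTED_COLUMNS)
writer.writerow([1, "great " * 19 + "thread, honestly", "positive", 21, "reddit_post"])
writer.writerow([2, "awkward " * 19 + "opener", "negative", 20, "review"])
demo_problems, demo_labels = validate_dataset(demo.getvalue(), expected_rows=2)
print(demo_problems, dict(demo_labels))
```

The label counts returned by the function can be compared against the 30/30/40 split requested in step 3. A long output may also be truncated well before 10,000 rows, which the row-count check will report.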