#Wikipedia to JSONL Training Data Project

1 messages Β· Page 1 of 1 (latest)

stiff owl
#

Transform Wikipedia articles into high-quality AI training datasets with automatic subtitle detection, bulk processing, and dual format support.

WikiAI Converter is a powerful, free, and open-source tool that converts Wikipedia articles into structured JSONL (JSON Lines) format for AI model training. Perfect for creating instruction-following datasets, chat training data, and fine-tuning language models.

✨ Features

πŸ”§ Core Functionality

🎯 Multiple Input Methods: Single URL, bulk URL processing, or HTML file upload

πŸ“Š Dual Format Support: Instruction format (Alpaca/Vicuna) and Chat format (ChatGPT/Claude)

🏷️ Automatic Subtitle Detection: Intelligently combines main title with subsections

🧹 Smart Text Cleaning: Removes citations, edit links, and normalizes whitespace

πŸ“¦ Bulk Processing: Process multiple Wikipedia URLs simultaneously

πŸ—œοΈ ZIP Archive Support: Download multiple files in a convenient ZIP package

🌟 Advanced Features

πŸ” Metadata Enrichment:

Source URL tracking
Language detection
Category extraction
Publication dates
Author information
Page type classification
Extraction timestamps
βš™οΈ Configurable Processing:
Include/exclude list items
Citation removal options
Edit link filtering
Custom title override

πŸŒ™ Modern UI: Dark/light theme with responsive design

πŸ‘€ Live Preview: See converted content before processing

πŸ”„ Version History

v3.0 (Current)
✨ Multi-URL bulk processing
πŸ” Metadata enrichment options
πŸŒ™ Dark/light theme support
πŸ“¦ ZIP archive downloads
🎨 Modern responsive UI

v2.0
🎯 Dual format support (Instruction/Chat)
🧹 Enhanced text cleaning
πŸ‘€ Live preview functionality

v1.0
πŸš€ Initial release
πŸ“„ Basic Wikipedia to JSONL conversion
πŸ”§ File upload support

restive yew
#

1 question....when training a model, dont you need input and reply? if you train off just data....how does it know what to reply with if there is no conversation/prediction. I am probably wrong but its just a thought....