A CKAN extension for automatically extracting text and metadata from datasets.
ckanext-extractor automatically extracts text and metadata from your resources and adds them to the search index so that they can be used to find your data.
The extension uses Apache Solr’s text extraction capabilities to extract full text and metadata from a variety of file formats, including PDF and Office documents. The extracted content is then searchable through CKAN’s search interface.
Key Features:
- Automatic text and metadata extraction from uploaded resources
- Background job processing to avoid blocking the web server
- Configurable file format support (PDF, Office documents, etc.)
- Configurable metadata field indexing
- Admin API for managing extraction tasks
- Paster commands for batch operations
- Plugin interfaces for custom post-processing
- Support for custom authentication headers during file download
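The two "configurable" features above (file format support and metadata field indexing) can be pictured with a minimal sketch. This is not the extension's code: the names `INDEXED_FORMATS`, `INDEXED_FIELD_PATTERNS`, `should_extract`, and `select_fields` are hypothetical, and the wildcard matching and example values are assumptions made for illustration. Consult the extension's configuration documentation for the actual option names and semantics.

```python
from fnmatch import fnmatchcase

# Hypothetical settings, chosen only for this illustration.
INDEXED_FORMATS = {"pdf", "docx"}
INDEXED_FIELD_PATTERNS = ["fulltext", "author", "dc:*"]


def should_extract(resource):
    """Return True if the resource's format is in the configured set."""
    fmt = (resource.get("format") or "").strip().lower()
    return fmt in INDEXED_FORMATS


def select_fields(extracted):
    """Keep only the extracted fields whose names match a configured pattern.

    Matching is case-insensitive; the original field names are preserved.
    """
    return {
        name: value
        for name, value in extracted.items()
        if any(fnmatchcase(name.lower(), pattern)
               for pattern in INDEXED_FIELD_PATTERNS)
    }


if __name__ == "__main__":
    resource = {"format": "PDF", "url": "http://example.com/report.pdf"}
    extracted = {
        "fulltext": "Annual report text",
        "Author": "Jane Doe",
        "producer": "SomePDFWriter",
        "dc:title": "Annual Report",
    }
    if should_extract(resource):
        print(select_fields(extracted))
```

In this sketch, a resource whose format is not listed is skipped entirely, and metadata fields that match no pattern (such as `producer`) are dropped before anything would be sent to the search index.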
Requirements:
- CKAN 2.6 or later
- Apache Solr with text extraction libraries enabled
- A background job worker (built into CKAN 2.7 and later; use ckanext-rq with earlier versions)
The extension plugs into CKAN’s existing search infrastructure. New and updated resources are processed automatically, and manual tools are provided for processing existing content.