Extension Extractor

Title	Extractor
Name	ckanext-extractor
Type	Public extension
Description	A full text and metadata extractor for CKAN that automatically extracts text and metadata from resources and adds them to the search index.
CKAN versions	~2.6,~2.7,~2.8. Exactly matched CKAN versions: 2.10.0 2.10.1 2.10.2 2.10.3 2.10.4 2.10.5 2.10.6 2.10.7 2.10.8 2.11.0 2.11.1 2.11.2 2.11.3 2.12.0 2.6.0 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.6.6 2.6.7 2.6.8 2.6.9 2.7.0 2.7.1 2.7.10 2.7.11 2.7.12 2.7.2 2.7.3 2.7.4 2.7.5 2.7.6 2.7.7 2.7.8 2.7.9 2.8.0 2.8.1 2.8.10 2.8.11 2.8.12 2.8.2 2.8.3 2.8.4 2.8.5 2.8.6 2.8.7 2.8.8 2.8.9 2.9.0 2.9.1 2.9.10 2.9.11 2.9.2 2.9.3 2.9.4 2.9.5 2.9.6 2.9.7 2.9.8 2.9.9
Download-Url (zip)	https://github.com/stadt-karlsruhe/ckanext-extractor.git#egg=ckanext-extractor
Download-Url commit date	2018-11-21
Url to repo	https://github.com/stadt-karlsruhe/ckanext-extractor
Category	Data Management & Quality

Description (long)	A CKAN extension for automatically extracting text and metadata from datasets. ckanext-extractor automatically extracts text and metadata from your resources and adds them to the search index so that they can be used to find your data.

The extension uses Apache Solr’s text extraction capabilities to extract fulltext and metadata from various file formats including PDF, Office documents, and many others. The extracted content is then made searchable through CKAN’s search interface.

Key Features:
- Automatic text and metadata extraction from uploaded resources
- Background job processing to avoid blocking the web server
- Configurable file format support (PDF, Office documents, etc.)
- Configurable metadata field indexing
- Admin API for managing extraction tasks
- Paster commands for batch operations
- Plugin interfaces for custom post-processing
- Support for custom authentication headers during file download

Requirements:
- CKAN 2.6 or later
- Apache Solr with text extraction libraries enabled
- Background job worker (CKAN 2.7+ built-in, or ckanext-rq for earlier versions)

The extension integrates seamlessly with CKAN’s existing search infrastructure and provides both automatic processing for new/updated resources and manual tools for processing existing content.
Version	0.4.0
Version release date	2018-01-29
Contact name	Stadt Karlsruhe
Contact email	transparenz@karlsruhe.de
Contact Url	(not set)

Configuration hints	Requires Apache Solr with text extraction libraries enabled. Configure solrconfig.xml to load the extraction libraries and schema.xml to define the dynamic fields. Start a background worker for the extraction jobs. Initialize the database tables with the extension's paster command. Set the file formats and metadata fields to extract via the ckan.ini settings.
Plugins to configure (ckan.ini)	extractor
CKAN Settings (ckan.ini)	# Formats for which extraction should be performed (space-separated, case-insensitive, wildcards allowed)
# ckanext.extractor.indexed_formats = pdf txt docx xlsx pptx
# Metadata fields to index after extraction (space-separated, case-insensitive, wildcards allowed)
# ckanext.extractor.indexed_fields = fulltext author title subject creator
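To illustrate how a space-separated, case-insensitive setting with wildcards can be interpreted, here is a minimal Python sketch. It is not the extension's own code: the helper names `parse_patterns` and `is_selected` are hypothetical, and shell-style wildcard matching via the standard `fnmatch` module is an assumption about what "wildcards allowed" means.

```python
from fnmatch import fnmatchcase


def parse_patterns(value):
    """Split a space-separated setting value into lowercase patterns.

    Hypothetical helper, not part of ckanext-extractor.
    """
    return [pattern.lower() for pattern in value.split()]


def is_selected(name, patterns):
    """Return True if `name` matches any pattern, ignoring case.

    Assumes shell-style wildcards (`*`, `?`); the extension's actual
    matching rules may differ.
    """
    lowered = name.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in patterns)


# Example values taken from the commented-out settings above.
formats = parse_patterns("pdf txt docx xlsx pptx")
fields = parse_patterns("fulltext author title subject creator")
```

Under these assumptions, `is_selected("PDF", formats)` is true because both sides are lowercased before comparison, while a pattern list such as `doc* pdf` would select both `doc` and `docx` resources.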
DB migration to be executed	(not set)