Extension Extractor


Extension Basics

Title
Extractor
Name
ckanext-extractor
Type
Public extension
Description
A full text and metadata extractor for CKAN that automatically extracts text and metadata from resources and adds them to the search index.
CKAN versions
Download-Url (zip)
Download-Url commit date
2018-11-21
Url to repo
Category
Data Management & Quality


Background Info

Description (long)

A CKAN extension for automatically extracting text and metadata from datasets.

ckanext-extractor automatically extracts text and metadata from your resources and adds them to the search index so that they can be used to find your data.

The extension uses Apache Solr’s text extraction capabilities to extract full text and metadata from various file formats, including PDF, Office documents, and many others. The extracted content is then made searchable through CKAN’s search interface.
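
To make the Solr-based extraction step more concrete, here is a minimal sketch of how a request URL for Solr's extracting request handler could be assembled. It is an illustration under stated assumptions, not the extension's actual code: the function name `build_extract_url` and the example Solr core URL are hypothetical, the `/update/extract` path is only the name commonly used in Solr example configurations (the real path depends on your solrconfig.xml), and whether ckanext-extractor sends exactly these parameters is not confirmed by this page.

```python
from urllib.parse import urlencode


def build_extract_url(solr_url, extract_format="text"):
    """Build a URL for Solr's extracting request handler (hypothetical helper).

    Assumes the handler is registered at /update/extract, which depends
    on the local solrconfig.xml.
    """
    params = {
        "extractOnly": "true",            # return extracted content, do not index it
        "extractFormat": extract_format,  # "text" or "xml"
        "wt": "json",                     # response format
    }
    return solr_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Example with a hypothetical Solr core URL
print(build_extract_url("http://localhost:8983/solr/ckan"))
```

A client would then POST the resource file to this URL and read the extracted text and metadata from the JSON response.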

Key Features:
- Automatic text and metadata extraction from uploaded resources
- Background job processing to avoid blocking the web server
- Configurable file format support (PDF, Office documents, etc.)
- Configurable metadata field indexing
- Admin API for managing extraction tasks
- Paster commands for batch operations
- Plugin interfaces for custom post-processing
- Support for custom authentication headers during file download

Requirements:
- CKAN 2.6 or later
- Apache Solr with text extraction libraries enabled
- Background job worker (CKAN 2.7+ built-in, or ckanext-rq for earlier versions)

The extension plugs into CKAN’s existing search infrastructure. New and updated resources are processed automatically, and manual tools are provided for processing existing content.

Version
0.4.0
Version release date
2018-01-29
Contact name
Stadt Karlsruhe
Contact email
Contact Url
(not set)


Installation Guide

Configuration hints

Requires Apache Solr with text extraction libraries enabled. Configure solrconfig.xml to load the extraction libraries and schema.xml to define the dynamic fields. Start a background worker to run the extraction jobs. Initialize the database tables with the extension's paster command. Set the file formats and metadata fields to extract in ckan.ini.

Plugins to configure (ckan.ini)
extractor
CKAN Settings (ckan.ini)
# Formats for which extraction should be performed (space-separated, case-insensitive, wildcards allowed)
# ckanext.extractor.indexed_formats = pdf txt docx xlsx pptx
# Metadata fields to index after extraction (space-separated, case-insensitive, wildcards allowed)
# ckanext.extractor.indexed_fields = fulltext author title subject creator
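
The settings above are described as space-separated, case-insensitive, and wildcard-enabled. The following sketch shows one way such a value could be interpreted, using shell-style wildcards from Python's standard `fnmatch` module. The helper names `parse_patterns` and `is_indexed` are hypothetical, and the extension's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase


def parse_patterns(setting):
    """Split a space-separated setting value into lowercase patterns."""
    return [pattern.lower() for pattern in setting.split()]


def is_indexed(value, patterns):
    """Return True if value matches any pattern, ignoring case."""
    value = value.lower()
    # fnmatchcase avoids platform-dependent case handling; case is
    # normalized explicitly on both sides instead.
    return any(fnmatchcase(value, pattern) for pattern in patterns)


formats = parse_patterns("pdf txt DOC*")
print(is_indexed("PDF", formats))   # True
print(is_indexed("docx", formats))  # True
print(is_indexed("csv", formats))   # False
```

Under this interpretation, a value of `*` would match every format or field name.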
DB migration to be executed
(not set)