Extension Remote Data Harvesting


Extension Basics

Title
Remote Data Harvesting
Name
ckanext-harvest
Type
Public extension
Description
Remote harvesting system for importing datasets from external data sources and catalogs automatically.
CKAN versions
Download-Url (zip)
Download-Url commit date
2020-06-18
Url to repo
Category
Data Management & Quality


Background Infos

Description (long)

The Remote Data Harvesting extension provides a framework for automatically importing datasets from external data sources, remote catalogs, and third-party APIs, so that a CKAN instance can federate data held in distributed systems. It supports multiple harvesting protocols, including CKAN API, CSW (Catalogue Service for the Web), WAF (Web Accessible Folder), and DCAT-RDF, as well as custom harvester implementations for specialized data sources. Harvesting runs on a schedule with configurable intervals, and incremental updates with change detection minimize processing overhead while keeping data fresh.

Advanced features include data transformation pipelines, metadata mapping and enrichment, validation workflows, and conflict resolution for duplicate datasets. The extension supports hierarchical harvesting configurations, multi-source aggregation, and distributed harvesting across multiple CKAN instances for large-scale data federation. Administrative tools cover harvesting status monitoring, job scheduling, error handling with retries, and logging for troubleshooting. Integration options include webhook notifications, API endpoints for external triggering, and connections to data quality assessment tools. For large-scale harvesting operations, the extension offers batch processing, parallel job execution, and resource management controls.

The extension is suited to data portals that aggregate content from multiple sources, government platforms implementing open data federation, research networks sharing datasets across institutions, and organizations that need automated data synchronization from diverse external systems.
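
The incremental updates and change detection mentioned in the description can be illustrated with a small standalone sketch. It is not the extension's own code: the function names `fingerprint` and `plan_harvest`, the idea of storing one hash per dataset, and the sample records are all assumptions made for illustration. The sketch hashes each remote dataset's metadata and compares it with the hash recorded at the previous run to decide what to create, update, or delete.

```python
import hashlib
import json


def fingerprint(dataset: dict) -> str:
    """Return a stable SHA-256 hash of a dataset's metadata (hypothetical helper)."""
    # sort_keys makes the hash independent of the key order in the source record.
    canonical = json.dumps(dataset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_harvest(remote: dict, known: dict) -> dict:
    """Compare remote records with the hashes stored at the previous run.

    remote: dataset id -> metadata dict fetched from the remote source
    known:  dataset id -> fingerprint recorded by the previous harvest
    """
    plan = {"create": [], "update": [], "delete": [], "unchanged": []}
    for dataset_id, dataset in remote.items():
        if dataset_id not in known:
            plan["create"].append(dataset_id)      # new on the remote side
        elif known[dataset_id] != fingerprint(dataset):
            plan["update"].append(dataset_id)      # metadata changed
        else:
            plan["unchanged"].append(dataset_id)   # skip, nothing to do
    # Anything recorded earlier but no longer present remotely is a deletion.
    plan["delete"] = [i for i in known if i not in remote]
    for ids in plan.values():
        ids.sort()
    return plan


# Illustrative data only.
previous = {
    "a": fingerprint({"title": "A"}),
    "b": fingerprint({"title": "B"}),
    "c": fingerprint({"title": "C"}),
}
remote = {"a": {"title": "A"}, "b": {"title": "B v2"}, "d": {"title": "D"}}
print(plan_harvest(remote, previous))
```

Only the datasets listed under create, update, and delete would need any further processing, which is how unchanged records avoid reprocessing on every run.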

Version
Latest
Version release date
2020-06-18
Contact name
Datopian Team
Contact email
Contact Url
(not set)


Installation Guide

Configuration hints

Supports multiple harvesting protocols with scheduled job management

Plugins to configure (ckan.ini)
harvest ckan_harvester
CKAN Settings (ckan.ini)
# ckanext.harvest.mq.type = redis
# ckanext.harvest.mq.hostname = localhost
# ckanext.harvest.mq.port = 6379
# ckanext.harvest.mq.redis_db = 1
# ckanext.harvest.user_agent = CKAN Harvester
# ckanext.harvest.status_mail.errored = true
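
As a rough illustration of how the plugin list and settings above fit together once uncommented, the sketch below places them in an `[app:main]` section, the section CKAN's ini file normally uses for application settings, and reads them back with Python's standard `configparser`. The values are the ones shown in this listing. The function name `read_harvest_settings` and the keys of the returned dictionary are hypothetical, and CKAN itself loads its configuration through its own machinery rather than through code like this.

```python
import configparser

# Settings from the listing, uncommented and placed under [app:main].
INI_TEXT = """
[app:main]
ckan.plugins = harvest ckan_harvester
ckanext.harvest.mq.type = redis
ckanext.harvest.mq.hostname = localhost
ckanext.harvest.mq.port = 6379
ckanext.harvest.mq.redis_db = 1
ckanext.harvest.user_agent = CKAN Harvester
ckanext.harvest.status_mail.errored = true
"""


def read_harvest_settings(ini_text: str) -> dict:
    """Parse the harvest-related options into typed Python values (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    main = parser["app:main"]
    return {
        "plugins": main.get("ckan.plugins", "").split(),
        "mq_type": main.get("ckanext.harvest.mq.type"),
        "mq_hostname": main.get("ckanext.harvest.mq.hostname"),
        "mq_port": main.getint("ckanext.harvest.mq.port"),
        "mq_redis_db": main.getint("ckanext.harvest.mq.redis_db"),
        "user_agent": main.get("ckanext.harvest.user_agent"),
        "mail_on_error": main.getboolean("ckanext.harvest.status_mail.errored"),
    }


settings = read_harvest_settings(INI_TEXT)
print(settings["plugins"], settings["mq_type"], settings["mq_port"])
```

A check like this can be handy before starting the harvester processes, for example to confirm that both plugins are listed and that the queue port is a valid integer.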
DB migration to be executed
harvester initdb