Extension Archiver


Extension Basics

Title
Archiver
Name
ckanext-archiver
Type
Core extension
Description
Downloads and archives all CKAN resources to provide cached copies, monitor broken links, and enable file analysis by other extensions.
CKAN versions
Download-Url (zip)
Download-Url commit date
2025-04-07
Url to repo
Category
Cloud Infrastructure & Storage


Background Infos

Description (long)

The CKAN Archiver Extension downloads all of a CKAN instance’s resources, for three purposes:

  1. To offer the user a ‘cached’ copy, in case the link becomes broken
  2. To tell the user (and publishers) if the link is broken, both on the dataset/resource and in a ‘Broken Links’ report
  3. To let other extensions, such as ckanext-qa or ckanext-packagezip, analyse the downloaded file

When a resource is archived, information about the archival (whether it failed, the filename on disk, the file size and so on) is stored in the Archival table and added to the dataset during the package_show call, so it is also available over the API.

Other extensions can subscribe to the archiver’s IPipe interface to be notified when datasets are archived. For example, ckanext-qa detects each archived file’s type and gives it an openness score, and ckanext-packagezip creates a zip of the files in a dataset.
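The subscribe-and-notify idea can be illustrated with a small, self-contained sketch. This is not the archiver's actual IPipe interface: the class `ArchivalNotifier`, the callback signature, the operation name `package-archived` and the `package_id` key are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical callback type: (operation name, parameters) -> None.
# The real IPipe interface is defined by ckanext-archiver and may differ.
Subscriber = Callable[[str, Dict[str, str]], None]


class ArchivalNotifier:
    """Keeps a list of subscribers and tells each one about an archival."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def notify(self, operation: str, params: Dict[str, str]) -> int:
        """Call every subscriber in order; return how many were notified."""
        for callback in self._subscribers:
            callback(operation, params)
        return len(self._subscribers)


received: List[str] = []


def qa_like_subscriber(operation: str, params: Dict[str, str]) -> None:
    # A ckanext-qa-style consumer would inspect the archived file here;
    # this stand-in only records what it was told.
    received.append(f"{operation}:{params['package_id']}")


notifier = ArchivalNotifier()
notifier.subscribe(qa_like_subscriber)
count = notifier.notify("package-archived", {"package_id": "example-dataset"})
```

In a real CKAN plugin the subscription would go through CKAN's plugin system rather than an explicit `subscribe` call; the sketch only shows the direction of the data flow.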

Archiver works on Celery queues (CKAN 2.6 and earlier) or background jobs (CKAN 2.7+), so when Archiver is notified of a dataset/resource being created or updated, it puts an ‘update request’ on a queue. You can start Celery/jobs with multiple processes, to archive in parallel.

By default, two queues are used:

  1. ‘bulk’ for regular archival of all the resources
  2. ‘priority’ for when a user edits an individual resource

This means that the ‘bulk’ queue can run slowly, working through large quantities of resources, such as re-archiving every single resource once a week. Meanwhile, a new resource put into CKAN can be downloaded straight away via the ‘priority’ queue.
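A minimal sketch of the routing described above, using in-memory queues in place of Celery or CKAN background jobs. The queue names ‘bulk’ and ‘priority’ come from the description; the function `enqueue_update_request`, the `user_edit` flag and the resource ids are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict

# Queue names come from the description; everything else is illustrative.
queues: Dict[str, Deque[str]] = {"bulk": deque(), "priority": deque()}


def enqueue_update_request(resource_id: str, user_edit: bool) -> str:
    """Put an update request on the matching queue and return its name."""
    queue_name = "priority" if user_edit else "bulk"
    queues[queue_name].append(resource_id)
    return queue_name


# A scheduled re-archive pushes every resource onto 'bulk' ...
for rid in ["res-1", "res-2", "res-3"]:
    enqueue_update_request(rid, user_edit=False)

# ... while a resource a user has just edited goes to 'priority'.
enqueue_update_request("res-new", user_edit=True)
```

Because the two queues are separate, a worker dedicated to ‘priority’ can pick up `res-new` immediately, however long the ‘bulk’ backlog is.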

Features:

  - Automatic resource caching and archiving
  - Broken link detection and reporting
  - IPipe interface for integration with other extensions
  - Configurable file size limits and formats
  - Queue-based processing with priority levels
  - Comprehensive broken links report
  - Web server integration for serving cached files
  - SNI support for HTTPS resources
  - Support for both Celery (pre-2.7) and CKAN jobs (2.7+)

Version
2.0.0
Version release date
2015-11-01
Contact name
Open Knowledge / Cabinet Office
Contact email
Contact Url
(not set)


Installation Guide

Configuration hints

Requires the ckanext-report extension to be installed. Requires a Celery queue backend (Redis or RabbitMQ) for CKAN < 2.7, or background jobs for CKAN 2.7+. Configure the web server to serve archived files from archive_dir. The database tables must be initialized with paster commands. SNI support may be needed for HTTPS resources. Set up a cron job for nightly report generation.

Plugins to configure (ckan.ini)
archiver report
CKAN Settings (ckan.ini)
# Path to directory where archived files will be saved
# ckanext-archiver.archive_dir = /www/resource_cache

# URL where cached files are publicly served
# ckanext-archiver.cache_url_root = http://mysite.com/resource_cache

# Maximum size in bytes of files to archive (default 50MB)
# ckanext-archiver.max_content_length = 50000000

# User agent string for archiver requests
# ckanext-archiver.user_agent_string = CKAN-Archiver/2.0

# Whether to verify HTTPS connections
# ckanext-archiver.verify_https = true

# CKAN site URL for API access (required)
# ckan.site_url = http://mysite.com

# Internal network name if different from site_url
# ckan.site_url_internally = http://internal-ckan-server
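To show how a size limit like ckanext-archiver.max_content_length could be applied, here is a small sketch using Python's standard configparser. It assumes the settings sit uncommented in the `[app:main]` section of the ini file. The helper names `max_content_length` and `within_limit` are hypothetical, and whether the extension itself treats the limit as inclusive is not stated in the source.

```python
import configparser

# Example ini content with the setting uncommented (an assumption; the
# listing above shows the settings commented out).
EXAMPLE_INI = """
[app:main]
ckanext-archiver.archive_dir = /www/resource_cache
ckanext-archiver.max_content_length = 50000000
"""

# Default taken from the settings listing (50MB expressed in bytes).
DEFAULT_MAX_CONTENT_LENGTH = 50000000


def max_content_length(ini_text: str) -> int:
    """Read the size limit from ini text, falling back to the default."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint(
        "app:main",
        "ckanext-archiver.max_content_length",
        fallback=DEFAULT_MAX_CONTENT_LENGTH,
    )


def within_limit(content_length: int, limit: int) -> bool:
    """Hypothetical check: treat the limit as inclusive."""
    return content_length <= limit


limit = max_content_length(EXAMPLE_INI)
small_ok = within_limit(1024, limit)
too_big_ok = within_limit(limit + 1, limit)
```

A check like this would typically run against the Content-Length reported by the remote server before the download starts, so that oversized files are skipped early.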
DB migration to be executed
(not set)