The CKAN Archiver Extension downloads all of a CKAN instance’s resources, for three purposes:
- Offer it to the user as a ‘cached’ copy, in case the link becomes broken
- Tell the user (and publishers) if the link is broken, both on the dataset/resource and in a ‘Broken Links’ report
- Let other extensions, such as ckanext-qa or ckanext-packagezip, analyse the downloaded file
When a resource is archived, the information about the archival (whether it failed, the filename on disk, the file size, etc.) is stored in the Archival table and added to the dataset during the package_show call, so the information is also available over the API.
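Because the archival information travels with the package_show result, an API client can read it straight from the returned dataset dict. The following is a minimal sketch, assuming the information sits under an `archiver` key on each resource and includes an `is_broken` flag. Both names are assumptions for illustration, so check an actual response from your CKAN instance before relying on them.

```python
def broken_resources(dataset, key="archiver"):
    """Return (resource id, url) pairs for resources flagged as broken.

    `dataset` is a dict shaped like a package_show result. The `key` name
    and the `is_broken` field are assumptions, not confirmed field names.
    """
    broken = []
    for res in dataset.get("resources", []):
        archival = res.get(key) or {}  # resource may not be archived yet
        if archival.get("is_broken"):
            broken.append((res.get("id"), res.get("url")))
    return broken


# Hand-written sample data (not real API output).
sample = {
    "name": "my-dataset",
    "resources": [
        {"id": "r1", "url": "http://example.com/a.csv",
         "archiver": {"is_broken": False}},
        {"id": "r2", "url": "http://example.com/b.csv",
         "archiver": {"is_broken": True}},
        {"id": "r3", "url": "http://example.com/c.csv"},
    ],
}

print(broken_resources(sample))
```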
Other extensions can subscribe to the archiver’s IPipe interface to hear about datasets being archived. For example, ckanext-qa will detect the file type of each archived resource and give it an openness score, and ckanext-packagezip will create a zip of the files in a dataset.
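The subscription pattern can be illustrated without CKAN installed. This is only a sketch of the idea: the `Notifier` class, the `receive_data` method name, its arguments, and the `package-archived` operation string are all hypothetical stand-ins here, and a real extension would implement the interface through CKAN's plugin system instead.

```python
class Notifier:
    """Hypothetical stand-in for the point where the archiver notifies subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, subscriber):
        self._subscribers.append(subscriber)

    def notify(self, operation, queue, **params):
        # Pass the event to every subscriber; each decides whether it cares.
        for subscriber in self._subscribers:
            subscriber.receive_data(operation, queue, **params)


class ArchiveListener:
    """Hypothetical subscriber that records datasets once they are archived."""

    def __init__(self):
        self.seen = []

    def receive_data(self, operation, queue, **params):
        if operation == "package-archived":  # assumed operation name
            self.seen.append((params.get("package_id"), queue))


notifier = Notifier()
listener = ArchiveListener()
notifier.subscribe(listener)
notifier.notify("package-archived", "bulk", package_id="my-dataset")
notifier.notify("some-other-event", "bulk")  # ignored by the listener
```

A subscriber such as ckanext-qa would do its real work (for example, scoring the file) at the point where this sketch merely appends to a list.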
Archiver works on Celery queues (CKAN 2.6 and earlier) or background jobs (CKAN 2.7+), so when Archiver is notified of a dataset/resource being created or updated, it puts an ‘update request’ on a queue. You can start the Celery or jobs workers with multiple processes to archive in parallel.
By default, two queues are used:
1. ‘bulk’ for a regular archival of all the resources
2. ‘priority’ for one-off archival when a user edits a resource
This means that the ‘bulk’ queue can happily run slowly, working through large quantities of resources, such as re-archiving every single resource once a week. Meanwhile, if a new resource is put into CKAN, it can be downloaded straight away via the ‘priority’ queue.
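The effect of keeping the two queues separate can be sketched with plain in-memory queues. This is not how Archiver talks to Celery or CKAN background jobs; the `request_update` helper and the `user_edit` flag are hypothetical names used only to show the routing idea.

```python
from collections import deque

# One in-memory queue per name, standing in for the real task queues.
queues = {"bulk": deque(), "priority": deque()}


def request_update(resource_id, user_edit=False):
    """Put an update request on the appropriate queue and return its name."""
    name = "priority" if user_edit else "bulk"
    queues[name].append(resource_id)
    return name


# A weekly re-archive fills the 'bulk' queue with a long backlog.
for i in range(1000):
    request_update("res-%d" % i)

# A user adds a resource; the request goes to the 'priority' queue instead.
request_update("new-res", user_edit=True)

# A worker serving only 'priority' reaches the new resource immediately,
# without waiting behind the 1000 bulk requests.
first_priority = queues["priority"].popleft()
print(first_priority, len(queues["bulk"]))
```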
Features:
- Automatic resource caching and archiving
- Broken link detection and reporting
- IPipe interface for integration with other extensions
- Configurable file size limits and formats
- Queue-based processing with priority levels
- Comprehensive broken links report
- Web server integration for serving cached files
- SNI support for HTTPS resources
- Support for both Celery (pre-2.7) and CKAN jobs (2.7+)