Requirements:
- CKAN 2.6+
- Python 2.6 or 2.7
- Network access to remote CKAN instances
- Remote portals must have public package_search API
Installation:
Activate CKAN virtualenv:
. /usr/lib/ckan/default/bin/activate
Install extension:
pip install ckanext-searchfed
Or from source:
git clone https://github.com/DataShades/ckanext-searchfed.git
cd ckanext-searchfed
python setup.py develop
Install dependencies:
pip install -r dev-requirements.txt
Add plugin to ckan.plugins in production.ini:
ckan.plugins = … searchfed …
Configure remote portals (see Configuration below)
Restart CKAN:
sudo service apache2 reload
Configuration:
Required Settings:
Remote Portal Configuration:
Define labels and URLs for remote CKAN instances.
Format: label URL [label URL …]
Example with multiple remote portals:
ckan.search_federation = data.brisbane.gov.au https://data.brisbane.qld.gov.au/data data.gov.au http://data.gov.au data.sa.gov.au https://data.sa.gov.au
Label will appear next to dataset titles in search results:
“FROM DATA.BRISBANE.GOV.AU”
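The `label URL` format can be read as alternating whitespace-separated tokens. The following is a minimal sketch of how such a value might be parsed; `parse_federation` is a hypothetical helper written for illustration, not the extension's own function.

```python
def parse_federation(value):
    """Split 'label URL label URL ...' into a {label: url} dict."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected label/URL pairs, got an odd number of tokens")
    # Even positions hold labels, odd positions hold URLs.
    return dict(zip(tokens[0::2], tokens[1::2]))


portals = parse_federation(
    "data.brisbane.gov.au https://data.brisbane.qld.gov.au/data "
    "data.gov.au http://data.gov.au"
)
```

A value with a label but no URL raises `ValueError` here, which is one way to surface a malformed setting early.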
Optional Settings:
Filter Label (for excluding harvested content):
Labels used to filter out already-harvested datasets.
Default: empty string
ckan.search_federation.label = data.sa.gov.au
When set, creates filter query: -harvest_portal:data.sa.gov.au
Extra Filter Keys:
Field names used for filtering remote queries.
Default: ‘harvest_portal’
ckan.search_federation.extra_keys = harvest_portal search_federation_portal
Prevents showing datasets already harvested from remote portals.
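To illustrate how the filter label and the extra keys might combine, here is a sketch that builds one negative filter clause per key. `build_exclusion_fq` is a hypothetical name, and joining the clauses with spaces is an assumption; the extension may assemble its filter query differently.

```python
def build_exclusion_fq(label, extra_keys=("harvest_portal",)):
    """Return a negative filter clause for each key, or '' if no label is set."""
    if not label:
        return ""
    # Assumption: one '-key:label' clause per configured key, space-separated.
    return " ".join("-{0}:{1}".format(key, label) for key in extra_keys)


fq = build_exclusion_fq(
    "data.sa.gov.au", ("harvest_portal", "search_federation_portal")
)
```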
Use Remote Facets:
Use facets from remote search instead of local facets.
Default: false
ckan.search_federation.use_remote_facet_results = false
If true and remote results include ‘search_facets’, they replace local facets.
Minimum Search Results Threshold:
Trigger federation when the number of local results is below this value.
Default: 20
Set to -1 to always run federation regardless of local results.
ckan.search_federation.min_search_results = 3
With this value, remote results are fetched only when fewer than 3 local results are found.
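The threshold rule can be written as a small predicate. This is a sketch of the documented behaviour only; `should_federate` is a hypothetical name.

```python
def should_federate(local_count, min_search_results=20):
    """Decide whether remote portals should be queried."""
    if min_search_results == -1:
        return True  # -1 means always federate
    return local_count < min_search_results
```

With `min_search_results = 3`, a search returning 2 local datasets triggers federation and one returning 3 does not.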
API Federation Control:
Include remote datasets in API search results.
Default: false
ckan.search_federation.api_federation = false
If true: API searches include remote datasets
If false: Only web UI shows federated results
Source Facet Field:
Facet field identifying source portal for each dataset.
Default: empty string
ckan.search_federation.source_facet_field = vocab_source_portal
Used for merging facet counts and building “Source” filter.
Source Extras Key:
Key in search_params["extras"] carrying selected source portals.
Default: empty string
ckan.search_federation.source_extras_key = source_portal
Searchfed checks this key to decide if remote portal should be queried.
If user hasn’t selected this source, remote call is skipped.
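A sketch of the source check follows. It makes two assumptions that the text above does not settle: that the value under the key is either a single label or a list of labels, and that a missing or empty selection means no restriction. `portal_selected` is a hypothetical helper.

```python
def portal_selected(search_params, source_extras_key, label):
    """Return True if the remote portal `label` should be queried."""
    if not source_extras_key:
        return True  # setting not configured: query every portal
    selected = search_params.get("extras", {}).get(source_extras_key)
    if not selected:
        return True  # assumption: no selection means no restriction
    if isinstance(selected, str):
        selected = [selected]  # assumption: a single label may arrive as a string
    return label in selected
```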
Configuration Examples:
Basic Federation (always supplement with 2 portals):
ckan.plugins = … searchfed …
ckan.search_federation = data.brisbane.gov.au https://data.brisbane.qld.gov.au/data data.gov.au http://data.gov.au
ckan.search_federation.min_search_results = -1
Conditional Federation (only when < 5 local results):
ckan.plugins = … searchfed …
ckan.search_federation = data.gov.au http://data.gov.au
ckan.search_federation.min_search_results = 5
Advanced with Filtering:
ckan.plugins = … searchfed …
ckan.search_federation = data.brisbane.gov.au https://data.brisbane.qld.gov.au/data data.gov.au http://data.gov.au
ckan.search_federation.label = data.brisbane.gov.au
ckan.search_federation.extra_keys = harvest_portal search_federation_portal
ckan.search_federation.min_search_results = 10
ckan.search_federation.api_federation = true
ckan.search_federation.source_facet_field = vocab_source_portal
ckan.search_federation.source_extras_key = source_portal
Usage:
Automatic Federated Search:
- User performs search query
- CKAN executes local search
- If local results < min_search_results:
  - Extension queries configured remote portals
  - Filters out already-harvested datasets
  - Checks user’s source portal selection
  - Merges remote results with local results
- Remote results appear below local results with source labels
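The merge step above can be sketched as follows, assuming each response has the `count` and `results` keys of a package_search result. The `source_label` key is hypothetical and stands in for however the extension marks remote datasets.

```python
def merge_results(local, remotes):
    """Append remote datasets below local ones, tagging each with its portal.

    local   -- package_search-style dict with 'count' and 'results'
    remotes -- list of (label, package_search-style dict) pairs
    """
    merged = {"count": local["count"], "results": list(local["results"])}
    for label, remote in remotes:
        for dataset in remote.get("results", []):
            tagged = dict(dataset)  # copy so the remote response is untouched
            tagged["source_label"] = "FROM " + label.upper()  # hypothetical key
            merged["results"].append(tagged)
        merged["count"] += remote.get("count", 0)
    return merged
```

Local datasets keep their original order and carry no label, so a template can tell the two groups apart by the presence of the key.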
Search Result Display:
Local Results (top):
- Dataset 1 (local)
- Dataset 2 (local)
- Dataset 3 (local)
Federated Results (below, with labels):
- Dataset 4 FROM DATA.BRISBANE.GOV.AU
- Dataset 5 FROM DATA.GOV.AU
- Dataset 6 FROM DATA.BRISBANE.GOV.AU
Source Filtering:
If source_facet_field is configured, users can filter by source:
Facet: “Source Portal”
- Local Portal (15)
- data.brisbane.gov.au (8)
- data.gov.au (12)
Selecting a source filters to show only those datasets.
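Counts such as those in the facet above could be produced by summing facet items by name. The sketch below assumes items shaped like the entries of CKAN's `search_facets` (`name` and `count`); `merge_facet_items` is a hypothetical helper, not the extension's code.

```python
def merge_facet_items(local_items, remote_items):
    """Sum facet counts by item name, keeping first-seen order."""
    merged = {}
    order = []
    for item in list(local_items) + list(remote_items):
        name = item["name"]
        if name in merged:
            merged[name]["count"] += item["count"]
        else:
            merged[name] = dict(item)  # copy so the inputs are not mutated
            order.append(name)
    return [merged[name] for name in order]
```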
API Behavior:
With api_federation = false (default):
- Web UI: Shows federated results
- API calls: Return only local results
With api_federation = true:
- Web UI: Shows federated results
- API calls: Include remote results in response
Development:
Clone repository:
git clone https://github.com/DataShades/ckanext-searchfed.git
cd ckanext-searchfed
Install for development:
python setup.py develop
pip install -r dev-requirements.txt
Create test.ini from template
Run tests:
pytest --ckan-ini test.ini
Troubleshooting:
No remote results appearing:
- Verify remote portal URLs are accessible
- Check min_search_results threshold
- Test remote API: curl "https://remote-portal/api/3/action/package_search?q=test"
- Review CKAN logs for connection errors
- Verify ckan.search_federation is configured
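The same API check can be scripted. This sketch uses only the standard library and an explicit timeout; `package_search_url` and `check_portal` are hypothetical helpers, and the 5-second default is an arbitrary choice.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import urlencode  # Python 2
    from urllib2 import urlopen


def package_search_url(base_url, query, rows=1):
    """Build the package_search URL for a portal base URL."""
    params = urlencode([("q", query), ("rows", rows)])
    return "{0}/api/3/action/package_search?{1}".format(
        base_url.rstrip("/"), params
    )


def check_portal(base_url, query="test", timeout=5):
    """Return the remote result count, raising on network or API errors."""
    response = urlopen(package_search_url(base_url, query), timeout=timeout)
    try:
        payload = json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
    if not payload.get("success"):
        raise RuntimeError("remote API reported failure")
    return payload["result"]["count"]
```

Calling `check_portal` with each URL from ckan.search_federation shows which portals respond within the timeout.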
Duplicate results (harvested + federated):
- Configure ckan.search_federation.label
- Set ckan.search_federation.extra_keys
- Verify harvest_portal field populated on harvested datasets
- Check filter query is being applied
Performance issues:
- Reduce number of remote portals
- Increase min_search_results threshold
- Set api_federation = false for API performance
- Consider caching remote results
- Monitor remote API response times
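Caching of remote results is not described as a feature of the extension. If it is added in a wrapper or a fork, a small time-based cache along these lines is one option. `TTLCache` is a hypothetical class, and keying entries by portal label and query string is an assumption.

```python
import time


class TTLCache(object):
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A key such as `("data.gov.au", "q=water")` lets repeated searches reuse a remote response until it expires.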
Facet merging problems:
- Verify remote portals return search_facets
- Check use_remote_facet_results setting
- Ensure facet field names match
- Review facet configuration on remote portals
Source filtering not working:
- Verify source_facet_field matches actual field
- Check source_extras_key is correct
- Ensure facet is configured in both local and remote
- Review search parameters being sent
Performance Considerations:
Network Latency:
- Remote API calls add latency to searches
- Consider timeout settings
- Monitor remote portal availability
- Use CDN/caching for frequently accessed data
Result Limit:
- Federated results count towards total
- Balance local vs. remote result proportions
- Consider pagination implications
API Load:
- Frequent searches generate many remote API calls
- Implement rate limiting if needed
- Cache popular query results
- Consider batch/async queries
Best Practices:
Configuration:
- Start with min_search_results = 10 (don’t always federate)
- Use harvest_portal filtering to avoid duplicates
- Set api_federation = false unless specifically needed
- Configure source filtering for user control
Remote Portal Selection:
- Choose reliable, fast remote portals
- Verify APIs are public and stable
- Test thoroughly before production
- Monitor remote portal health
User Experience:
- Clearly label remote results
- Explain source of federated datasets
- Provide source filtering options
- Consider separate tabs for local/remote
Data Quality:
- Validate remote result formats
- Handle missing metadata gracefully
- Filter inappropriate/irrelevant results
- Monitor result quality
Use Cases:
Regional Data Portals:
- City portal federates with state/national portals
- Users discover relevant datasets from all levels
- Seamless cross-jurisdiction search
Thematic Networks:
- Health data portal federates with related portals
- Research data across institutions
- Collaborative data ecosystems
Partner Organizations:
- Organization federates with partner portals
- Shared data discovery
- Collaborative projects
Development Status: Beta (4)
License: AGPL v3.0 or later
Keywords: CKAN, search, federation, distributed, remote, API
Related Extensions:
- ckanext-harvest: Metadata harvesting
- ckanext-spatial: Spatial search
- ckanext-cloudstorage: Remote storage integration
- ckanext-scheming: Schema compatibility