How to set up a DMA external node
This is a short explanation of how to set up an external node of DMA (Data Market Austria).
The figure below displays the main concept: TODO add figure
In the following, the separate parts are explained.
Mapping Builder - Create Mapping
To translate your metadata into DMA metadata you have to create a mapping. Each of your sources (not each dataset or file, but each metadata structure you use) needs a separate mapping. In most cases all your metadata will have the same structure/source, so one mapping is enough. The mapping can be created using the Mapping Builder, which produces a .rml.ttl file. Download this file and keep it for the next steps.
Crawler - Harvester
The crawler is the connection between your metadata and DMA. The basic idea is to fetch your metadata and push it to the deduper. This is the part where you have to write some code yourself. For an example see crawler_template or the EODC implementation crawler_eodc.
Code your own crawler
```bash
git clone https://gitlab.com/datamarket/crawler_template.git
```
- Rename sources.default.json to sources.json and change its content
- For each source a map_ID must be specified (in sources.json)
- One source corresponds to one metadata structure (so if all your metadata uses the same structure one source is enough)
- map_ID is the ID of the metadata schema specified in the metadata mapper service
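A sources.json might look like the following. This is an illustrative sketch only: the exact keys are defined by sources.default.json in the template, and the source name and map_ID below are made-up placeholders.

```json
{
  "my_source": {
    "map_ID": "INSERTED_MAPPINGS_MYMAPPING"
  }
}
```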
get_metadatas() does the actual work: it fetches and then yields your metadata. An example can be found in example_source.py in the crawler_sources directory.
- Copy example_source.py and rename it to one of your map_IDs (there has to be one file per map_ID - corresponding to a source - inside crawler_sources)
- Inside the file there must be a function get_metadatas() which yields your metadata
  - First fetch your metadata (e.g. via POST requests)
  - Then yield the metadata one record after the other (the metadata is then pushed to the deduper; this is handled inside crawler.py and already implemented)
- The metadata which is yielded needs to have certain characteristics:
  - It must have the exact same structure as the metadata used to create the RML mapping, so all the fields specified in the Mapping Builder must be given
  - The crawler as implemented in the template only supports metadata in JSON or dict format. If you use for instance XML, some minor changes have to be made in crawler.py (which pushes the metadata to the deduper)
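A minimal sketch of such a source file might look as follows. This is an assumption-laden illustration, not the actual template code: the endpoint URL is a placeholder, and the metadata is assumed to be a JSON list of dicts.

```python
# Hypothetical crawler_sources/<map_ID>.py - a sketch, not the real template code.
import json
from urllib.request import urlopen

METADATA_URL = "https://example.org/api/datasets"  # placeholder endpoint (assumption)


def fetch_records(url=METADATA_URL):
    """Fetch the raw metadata; here assumed to be a JSON list of dicts."""
    with urlopen(url) as resp:
        return json.load(resp)


def get_metadatas(fetch=fetch_records):
    """Yield one metadata record (dict) at a time.

    Each yielded dict must have exactly the structure that was used to
    create the RML mapping; crawler.py pushes each record to the deduper.
    """
    for record in fetch():
        yield record
```

Making the fetch step a separate function keeps the generator itself trivial to test with stubbed data.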
Setup crawler instance
To use your crawler set it up as a docker container:
Navigate into crawler repository
```bash
cd /path/to/crawler
```
Build and run docker
```bash
docker build -t crawler .
docker run -e ENV_VAR1='foo' -e ENV_VAR2='bar' crawler
```
There are several important environment variables that need to be set:
X_DMA_CRAWLER_INGEST_URL: the endpoint of the deduper (e.g. https://dedup.datamarket.at). As an external node you also have to set up your own deduper, which needs a URL; add this URL here.
CAS_CLIENT_ID: OAuth2 client_id. Your service needs to be registered in CAS, where you get the client_id and client_secret.
CAS_CLIENT_SECRET: OAuth2 client_secret
CAS_AUTH_TOKEN_URL: the OAuth2 service's auth token URL
Add them as shown in the example
New metadata should be pushed continuously, so set up the crawler as a cron job that runs on a regular basis. See the example in crawler_template.
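Such a schedule could look like the following crontab entry. This is only a sketch: the schedule, image name, and URL are placeholders, and the actual example lives in crawler_template (you would also pass the CAS_* variables shown above).

```bash
# Hypothetical crontab entry: run the crawler container once a day at 02:00
0 2 * * * docker run --rm -e X_DMA_CRAWLER_INGEST_URL='https://your-dedup.example.org' crawler
```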
Deduper
The deduper receives metadata from the crawler. It has two main purposes:
- Makes sure no data is inserted twice into DMA, but datasets are updated if they changed
There are three scenarios:
1. Dataset is new - there is no corresponding entry in DMA
2. Dataset is already in DMA and did not change
3. Dataset is already in DMA but was updated
To achieve this the deduper uses a database (MySQL) storing all datasets and their hashes. Each time the deduper receives a dataset (more specifically, the metadata of a dataset) from the crawler, it checks whether this dataset is already in the database.
1. Scenario 1 applies if there is no corresponding entry in the database - the dataset needs to be inserted
2. Scenario 2 applies if there is a corresponding entry in the database and the hash is the same - the dataset can be dismissed
3. Scenario 3 applies if there is a corresponding entry in the database but the hash is different - the dataset needs to be updated
- Calls the Metadata Mapper to translate/map your metadata to DMA metadata using the previously created mapping (explained in the next section)
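The hash-based decision described above can be sketched as follows. This is a simplified illustration with an in-memory dict standing in for the MySQL table; the function names and the choice of SHA-256 are assumptions, not the actual deduper code.

```python
import hashlib
import json


def metadata_hash(metadata):
    """Stable hash over a metadata dict (sorted keys make it deterministic)."""
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedup_action(dataset_id, metadata, db):
    """Decide what to do with an incoming dataset.

    db maps dataset_id -> stored hash (stands in for the MySQL table).
    Returns 'insert', 'dismiss' or 'update' and keeps db up to date.
    """
    new_hash = metadata_hash(metadata)
    if dataset_id not in db:            # scenario 1: dataset is new
        db[dataset_id] = new_hash
        return "insert"
    if db[dataset_id] == new_hash:      # scenario 2: unchanged
        return "dismiss"
    db[dataset_id] = new_hash           # scenario 3: updated
    return "update"
```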
Create MySQL database
```bash
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=XXX -e MYSQL_USER=XXX -e MYSQL_PASSWORD=XXX -e MYSQL_DATABASE=dedupdb -d mysql:5.7
```
Choose a meaningful root password, username, and user password. Keep them; they will be needed in the settings file.
Clone deduper and navigate into the cloned repository:
```bash
git clone https://gitlab.com/datamarket/dedup.git
cd /path/to/dedup
```
Create/Modify settings file
Inside the dedup repository is the folder deployment_extras, which contains a file called settings_template.cfg. Create a folder settings at the root level of the dedup repository and copy this file into the new folder.
```bash
mkdir settings
cp deployment_extras/settings_template.cfg settings/
```
Inside this file several settings have to be modified. TODO: More explicit explanation to come.
Build and run the docker image
```bash
docker build -t dma/dedup:0.1 .
docker run --name dedup -p 8000:8000 -it dma/dedup:0.1
```
Metadata Mapper
This component maps your metadata to the DMA metadata format (based on DCAT-AP). It is connected to the deduper.
Set up metadata mapper
Clone Metadata Mapper and navigate into the cloned repository:
```bash
git clone https://gitlab.com/datamarket/dma-simple-RML-mapper.git
cd /path/to/dma-simple-RML-mapper
```
Create volume to store mappings:
```bash
docker volume create mappingsvol
```
Build and run docker images:
```bash
docker build --rm -t dma/metadata_mapper .
docker run --name=mappercontainer -v mappingsvol:/mapping_files -p 8080:8080 -it dma/metadata_mapper
```
You can also specify a user (no root access), e.g. by passing `--user` to `docker run`.
The container is now running at http://localhost:8080/RMLMapper
Insert your mappings
Now the .rml.ttl file(s) created in the Mapping Builder section are used. Insert these files into the Metadata Mapper:
- http://localhost:8080/RMLMapper/mapper/insertMapping/?temp_id=MAPPINGNAME (POST)
- Add the mapping in the POST body
- MAPPINGNAME is the name of your mapping; the resulting map_ID (needed for the crawler) is then INSERTED_MAPPINGS_MAPPINGNAME
- Example insert using curl:
```bash
curl -d "@rmlfile.ttl" -X POST http://localhost:8080/RMLMapper/mapper/insertMapping/?temp_id=MAPPINGNAME
```
Test your mapper
To check that everything worked, try to map a file with the inserted mapping:
- http://localhost:8080/RMLMapper/mapper/mappingLibrary/MAPPINGNAME?format=FMT&temp_id=TEMPID (POST)
- Add the file which should be mapped in the body of the POST request
- FMT is the file format, either xml or json
- TEMPID is the id of the file (for testing you can choose anything)
- Example mapping test:
```bash
curl -d "@jsonfile.json" -X POST "http://localhost:8080/RMLMapper/mapper/mappingLibrary/INSERTED_MAPPINGS_MAPPINGNAME?format=json&temp_id=ID_of_this_file" -H 'Content-Type: application/x-www-form-urlencoded'
```
Note that the URL must be quoted, otherwise the shell interprets the `&` as a background operator.
For more examples see the documentation of the Metadata Mapper.