
How to set up a DMA external node

This is a short explanation of how to become/set up an external node of DMA (Data Market Austria).

The figure below displays the main concept: TODO add figure

The separate parts are explained in the following sections.

Mapper Builder - Create Mapping

To translate your metadata into DMA metadata you have to create a mapping. Each of your sources (not each dataset or file, but rather each metadata structure you use) needs a separate mapping. In most cases all your metadata will have the same structure/source, so one mapping is enough. The mapping can be created using the Mapping Builder. This creates a .rml.ttl file. Download this file and save it for the next steps.

Crawler - Harvester

The crawler is the connection between your metadata and DMA. The basic idea is to fetch your metadata and push it to the deduper. This is the part where you have to code something. For an example see crawler_template or the eodc implementation crawler_eodc.

Code your own crawler

  1. Clone crawler_template

    ```bash
    git clone https://gitlab.com/datamarket/crawler_template.git
    ```

  2. Setup sources

    • Rename sources.default.json to sources.json and change its content
    • For each source a map_ID must be specified (in sources.json)
      • One source corresponds to one metadata structure (so if all your metadata uses the same structure one source is enough)
      • map_ID is the ID of the metadata schema specified in the metadata mapper service
  3. Create get_metadatas()

    get_metadatas() does the actual work: it fetches and then yields your metadata. An example can be found in example_source.py in the crawler_sources directory. A minimal sketch of such a source module follows below this list.

    • Copy example_source.py and rename it to one of your map_IDs (there has to be one file per map_ID - corresponding to a source - inside crawler_sources)
    • Inside the file there must be a function get_metadatas() which yields your metadata:
      • First fetch your metadata (e.g. via POST requests)
      • Then yield the metadata one record after the other (the metadata is then pushed to the deduper; this is handled inside crawler.py and already implemented)
    • The yielded metadata needs to have certain characteristics:
      • It must have exactly the same structure as the metadata used to create the RML mapping, so all the fields specified in the mapping builder must be present
      • The crawler as implemented in the template only supports metadata in JSON or dict format. If you use, for instance, XML, some minor changes have to be made in crawler.py (which pushes the metadata to the deduper)
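
As an illustration (not the template's actual code), a source module for a hypothetical map_ID could look roughly like the sketch below; the endpoint URL, pagination scheme, and response fields are assumptions made for demonstration only:

```python
# crawler_sources/<your_map_ID>.py - illustrative sketch, not part of crawler_template
import requests

def get_metadatas():
    """Fetch metadata records from your own catalogue and yield them one by one.

    The endpoint, pagination, and response structure below are assumptions.
    Each yielded record must be a dict/JSON object with exactly the structure
    that was used to create the RML mapping in the Mapping Builder.
    """
    page = 1
    while True:
        # Hypothetical metadata endpoint of your own catalogue
        resp = requests.get("https://example.org/api/records", params={"page": page})
        resp.raise_for_status()
        records = resp.json().get("records", [])
        if not records:
            break
        for record in records:
            yield record  # crawler.py (already implemented) pushes each record to the deduper
        page += 1
```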

Set up a crawler instance

To use your crawler, set it up as a Docker container:

  1. Navigate into crawler repository

    ```bash
    cd /path/to/crawler
    ```

  2. Build and run docker

    ```bash
    docker build -t crawler .
    docker run -e ENV_VAR1='foo' -e ENV_VAR2='bar' crawler
    ```

    There are several important environment variables that need to be set:

    • X_DMA_CRAWLER_INGEST_URL: the endpoint of the deduper (e.g. https://dedup.datamarket.at). As an external node you also have to set up your own deduper, which needs a URL; add that URL here.
    • CAS_CLIENT_ID: OAuth2 client_id. Your service needs to be registered in CAS, which provides the CAS_CLIENT_ID, CAS_CLIENT_SECRET and CAS_AUTH_TOKEN_URL.
    • CAS_CLIENT_SECRET: OAuth2 client_secret
    • CAS_AUTH_TOKEN_URL: the OAuth2 service's auth token URL
    • optional: X_DMA_CRAWLER_DEBUG: set to True for debug output

    Add them to the docker run command as shown in the example above; a sketch of how these variables might be used inside the crawler is shown after this list.

  3. Run continuously

    New metadata should be pushed continuously, therefore set up the crawler as a cron job that runs on a regular basis. See the example in crawler_template.
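
The crawler_template already implements the authentication against CAS and the push to the deduper; purely as an illustration of how these environment variables come into play, a minimal sketch (OAuth2 client-credentials flow, then a POST to the ingest URL) could look like this. The ingest payload shape is an assumption, not the template's actual format:

```python
import os
import requests

def get_access_token():
    """Fetch an OAuth2 token from CAS using the client-credentials grant."""
    resp = requests.post(
        os.environ["CAS_AUTH_TOKEN_URL"],
        data={"grant_type": "client_credentials"},
        auth=(os.environ["CAS_CLIENT_ID"], os.environ["CAS_CLIENT_SECRET"]),
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def push_to_deduper(metadata, map_id):
    """Push one metadata record to the deduper ingest endpoint (illustrative)."""
    resp = requests.post(
        os.environ["X_DMA_CRAWLER_INGEST_URL"],         # your own deduper's URL
        json={"map_id": map_id, "metadata": metadata},  # assumed payload shape
        headers={"Authorization": f"Bearer {get_access_token()}"},
    )
    resp.raise_for_status()
```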

Deduper

The deduper receives metadata from the crawler. It has two main purposes:

  1. Makes sure that no data is inserted twice into DMA, while datasets are updated if they have changed

    There are three scenarios:

      1. The dataset is new - there is no corresponding entry in DMA
      2. The dataset is already in DMA and did not change
      3. The dataset is already in DMA but was updated

    To achieve this the deduper uses a MySQL database storing all datasets and their hashes. Each time the deduper gets a dataset (more specifically, the metadata of a dataset) from the crawler, it checks whether this dataset is already in the database (a minimal sketch of this check follows after this list):

      • Scenario 1 applies if there is no corresponding entry in the database - the dataset needs to be inserted
      • Scenario 2 applies if there is a corresponding entry in the database and the hash is the same - the dataset can be dismissed
      • Scenario 3 applies if there is a corresponding entry in the database but the hash is different - the dataset needs to be updated
  2. Calls the Metadata Mapper to translate/map your metadata to DMA metadata using the previously created mapping (explained in the next section)
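
Purely as an illustration of this hash check (not the deduper's actual implementation), the decision logic could be sketched like this, with the MySQL table reduced to a plain dict:

```python
import hashlib
import json

known_hashes = {}  # dataset_id -> hash of the metadata seen last time (stands in for the MySQL table)

def handle_dataset(dataset_id, metadata):
    """Decide whether an incoming dataset is new, unchanged, or updated."""
    digest = hashlib.sha256(
        json.dumps(metadata, sort_keys=True).encode("utf-8")
    ).hexdigest()

    if dataset_id not in known_hashes:
        known_hashes[dataset_id] = digest
        return "insert"   # scenario 1: no corresponding entry yet
    if known_hashes[dataset_id] == digest:
        return "dismiss"  # scenario 2: entry exists and hash is unchanged
    known_hashes[dataset_id] = digest
    return "update"       # scenario 3: entry exists but hash differs
```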

Set up the deduper

  1. Create a MySQL database

    ```bash
    docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=XXX -e MYSQL_USER=XXX -e MYSQL_PASSWORD=XXX -e MYSQL_DATABASE=dedupdb -d mysql:5.7
    ```

    Add a meaningful root password, username, and user password. Keep them; they will be needed in the settings file.

  2. Clone deduper and navigate into the cloned repository:

    ```bash
    git clone https://gitlab.com/datamarket/dedup.git
    cd /path/to/dedup
    ```

  3. Create/Modify settings file

    Inside the dedup repository is the folder deployment_extras. Inside this folder is a file called settings_template.cfg. Create a folder settings at the root level of the dedup repository and copy this file into this new folder.

    ```bash
    mkdir settings
    cp deployment_extras/settings_template.cfg settings/
    ```

    Inside this file several settings have to be modified. TODO: More explicit explanation to come.

  4. Build and run the Docker image

    ```bash
    docker build -t dma/dedup:0.1 .
    docker run --name dedup -p8000:8000 -it dma/dedup:0.1
    ```

Metadata Mapper

This component maps your metadata to the DMA metadata format (based on DCAT-AP). It is connected to the deduper.

Set up the metadata mapper

  1. Clone Metadata Mapper and navigate into the cloned repository:

    ```bash
    git clone https://gitlab.com/datamarket/dma-simple-RML-mapper.git
    cd /path/to/dma-simple-RML-mapper
    ```

  2. Create volume to store mappings:

    ```bash
    docker volume create mappingsvol
    ```

  3. Build and run docker images:

    ```bash
    docker build --rm -t dma/metadata_mapper .
    docker run --name=mappercontainer -v mappingsvol:/mapping_files -p 8080:8080 -it dma/metadata_mapper
    ```

    You can also specify a user (no root access) with, e.g., --user 9265065:9265065

The container is now running at http://localhost:8080/RMLMapper

Usage

  1. Insert your mappings

    Now the .rml.ttl file(s) created in the Mapper Builder section are used. Insert these files into the Metadata Mapper:

    • http://localhost:8080/RMLMapper/mapper/insertMapping/?temp_id=MAPPINGNAME (POST)
    • Add the mapping in the POST body
    • MAPPINGNAME is the name of your mapping; the resulting map_ID (needed for the crawler) is then INSERTED_MAPPINGS_MAPPINGNAME
    • Example insert using curl:

    ```bash
    curl -d "@rmlfile.ttl" -X POST http://localhost:8080/RMLMapper/mapper/insertMapping/?temp_id=MAPPINGNAME
    ```

  2. Test your mapper

    To check that everything worked fine, try to map a file with the inserted mapping:

    • http://localhost:8080/RMLMapper/mapper/mappingLibrary/MAPPINGNAME?format=FMT&temp_id=TEMPID (POST)
    • Add the file which should be mapped in the body of the POST request
    • FMT is the file format; it can be either xml or json
    • TEMPID is the ID of the file (for testing you can choose anything)
    • Example mapping test (note the quotes around the URL, since it contains an &):

    ```bash
    curl -d "@jsonfile.json" -X POST "http://localhost:8080/RMLMapper/mapper/mappingLibrary/INSERTED_MAPPINGS_MAPPINGNAME?format=json&temp_id=ID_of_this_file" -H 'Content-Type: application/x-www-form-urlencoded'
    ```
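
If you prefer scripting these calls instead of using curl, a small Python sketch equivalent to the two examples above could look like this (MAPPINGNAME and the file names are placeholders):

```python
import requests

BASE = "http://localhost:8080/RMLMapper/mapper"
MAPPING_NAME = "MAPPINGNAME"  # placeholder - the name you choose for your mapping

# Insert the RML mapping created with the Mapper Builder
with open("rmlfile.ttl", "rb") as f:
    resp = requests.post(f"{BASE}/insertMapping/",
                         params={"temp_id": MAPPING_NAME},
                         data=f.read())
resp.raise_for_status()

# Test the inserted mapping with one of your metadata records (JSON in this case)
with open("jsonfile.json", "rb") as f:
    resp = requests.post(f"{BASE}/mappingLibrary/INSERTED_MAPPINGS_{MAPPING_NAME}",
                         params={"format": "json", "temp_id": "ID_of_this_file"},
                         data=f.read(),
                         headers={"Content-Type": "application/x-www-form-urlencoded"})
resp.raise_for_status()
print(resp.text)  # the mapped metadata, if everything worked
```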

For more examples, see the documentation of the Metadata Mapper.