
Static inventory

Static data sets are data items which can be packaged as a downloadable entity and provided via different transfer methods, such as Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP) or BitTorrent. In contrast to open data sets, such as current weather data, which are continuously changed or extended, static data sets are closed and do not change.

A static data set package consists of the data itself and the metadata related to it. On the one hand, the data can comprise different representations, i.e. the same data in different formats, such as a data table in Comma Separated Values (CSV) and Hierarchical Data Format (HDF5), for example. On the other hand, the metadata contains basic information about the data set, such as the title, description, publisher, contact point, topic, keywords, etc. Additionally, there can be other metadata, e.g. regarding the provenance, i.e. when the data set was created or changed and by whom, or regarding digital preservation, i.e. what actions have been taken to ensure long-term accessibility.

The structure of the data set package could be presented as follows:

,-------------------------------------------------------.
| Data Set {Persistent Unique Identifier}               |
|-------------------------------------------------------|
| - metadata/                                           | <--- Descriptive metadata
|     - metadata.xml                                    |
|-------------------------------------------------------|
| - distributions/                                      | <--- Distributions
|     - hf5_distribution/                               |
|         - data/                                       |
|             - file1.hf5                               |
|             - file2.hf5                               |
|         - metadata/                                   |
|     - csv_distribution/                               |
|         - data/                                       |
|             - file1.csv                               |
|             - file2.csv                               |
|         - metadata/                                   |
`-------------------------------------------------------'
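
As an illustration, the following Python sketch creates an empty package skeleton with this layout (the directory and file names are taken from the diagram above; Conduit's actual packaging code may differ):

from pathlib import Path

def create_package_skeleton(base_dir, distributions):
    """Create the directory skeleton of a static data set package.

    distributions is a list of distribution names, e.g. ["csv_distribution"].
    """
    package = Path(base_dir)
    # Descriptive metadata for the whole data set
    (package / "metadata").mkdir(parents=True, exist_ok=True)
    (package / "metadata" / "metadata.xml").touch()
    # One sub-directory per distribution, each with its own data and metadata
    for name in distributions:
        dist = package / "distributions" / name
        (dist / "data").mkdir(parents=True, exist_ok=True)
        (dist / "metadata").mkdir(parents=True, exist_ok=True)
    return package

# Example: a package with an HDF5 and a CSV distribution
create_package_skeleton("87fb7f9889f623a1b8b4f9ebec17c76888869d3e",
                        ["hf5_distribution", "csv_distribution"])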

The data set has a Persistent Unique Identifier (PUID) which is used to uniquely identify the information resource in the catalogue. This identifier will not change during the lifecycle of the data set. The following is an example of a DMA identifier:

info:dma/org:87fb7f9889f623a1b8b4f9ebec17c76888869d3e

It is a URI from the "info" scheme [1] with the (unregistered) namespace "dma", followed by the organisation-specific namespace "org" and an alphanumeric identifier "87fb7f9889f623a1b8b4f9ebec17c76888869d3e" which is unique within that namespace.
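
A minimal Python sketch of how such an identifier could be parsed (the pattern below is derived from the example and is an assumption, not the authoritative DMA identifier grammar):

import re

# info:dma/<namespace>:<identifier>, derived from the example above
PUID_PATTERN = re.compile(r"^info:dma/(?P<namespace>[a-z]+):(?P<id>[0-9a-f]{40})$")

def parse_puid(puid):
    """Split a DMA persistent unique identifier into its components."""
    match = PUID_PATTERN.match(puid)
    if match is None:
        raise ValueError("Not a valid DMA identifier: " + puid)
    return match.groupdict()

print(parse_puid("info:dma/org:87fb7f9889f623a1b8b4f9ebec17c76888869d3e"))
# {'namespace': 'org', 'id': '87fb7f9889f623a1b8b4f9ebec17c76888869d3e'}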

The static data set package contains a metadata file for the descriptive information which covers the minimum information needed to publish a data set in the DMA. It is inspired by the W3C Data Catalog Vocabulary (DCAT) [2] metadata standard.
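
As a rough illustration of what such DCAT-inspired descriptive metadata could contain, the following sketch builds a minimal dcat:Dataset description with the rdflib library (the selected properties and values are examples only, not the exact DMA core specification):

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
# The data set is identified by its persistent unique identifier
dataset = URIRef("info:dma/org:87fb7f9889f623a1b8b4f9ebec17c76888869d3e")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example data set")))
g.add((dataset, DCTERMS.description, Literal("A static data set with CSV and HDF5 distributions.")))
g.add((dataset, DCTERMS.publisher, Literal("Example Organisation")))
g.add((dataset, DCAT.contactPoint, Literal("contact@example.org")))
g.add((dataset, DCAT.keyword, Literal("example")))

print(g.serialize(format="turtle"))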

Conduit Data Management

Conduit [3] is the DMA project's reference implementation of a data repository.

On the one hand, the reference implementation is installed as a central instance and allows making use of the DMA's data management functionality.

On the other hand, its purpose is to demonstrate the implementation of the interfaces and core functions which are required to participate in the DMA, which are basically the following:

  • Authentication using the DMA user management service (OAuth).
  • Creation of DCAT metadata compliant with the DMA's metadata core specification [5].
  • ResourceSync [6] interface which represents the change list of the data sets managed by a repository (see the sketch after this list).
  • Use of the Blockchain API [7] to:
    • Retrieve persistent identifiers for commercial data sets.
    • Create offers for commercial data sets published in the DMA.
  • Use of the Authentication Gateway [8] for registering data access points for downloading data set distributions.
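
As an illustration of the ResourceSync change list mentioned above, the following Python sketch generates a minimal change list document with the standard library (the capability name and namespaces follow the ResourceSync framework; the URLs and change entries are invented for the example):

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)

def build_changelist(changes):
    """Build a ResourceSync change list from (url, change, timestamp) tuples."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    # Declare this document as a change list
    ET.SubElement(urlset, "{%s}md" % RS_NS, capability="changelist")
    for loc, change, timestamp in changes:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}md" % RS_NS, change=change, datetime=timestamp)
    return ET.tostring(urlset, encoding="unicode")

# Example: one data set distribution was created, another was updated
print(build_changelist([
    ("http://example.org/datasets/dataset1.tar", "created", "2018-03-01T10:00:00Z"),
    ("http://example.org/datasets/dataset2.tar", "updated", "2018-03-02T12:30:00Z"),
]))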

Local test instance

Conduit is a Python/Django-based web application which uses a MySQL database for storing information about data sets and a Celery/Redis backend for asynchronous task processing. A docker-compose configuration file [9] allows easily setting up a local instance of the application to try out the data set creation, packaging, and storage functionality.

Get the Conduit repository from DMA's Gitlab instance:

git clone https://gitlab.com/datamarket/conduit.git

Use the settings file for docker deployment:

cp settings/settings.cfg.docker settings/settings.cfg

Build the docker containers:

docker-compose build

Start the docker containers:

docker-compose up

Open the web page:

http://localhost:8000

Central instance

A central instance of the Conduit repository is currently deployed for testing purposes at:

http://conduit-sven.apps.dma-cloud.catalysts.cc

You will be redirected to the login screen where you can log in with your user credentials as a data provider.

After logging in, navigate to "Submissions" / "New submission" to start a dataset submission.

Enter a label for the dataset (only alphanumeric ASCII characters, dot ('.') and hyphen ('-') are allowed). Optionally, an external identifier can reference an existing identifier of the dataset.
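
A small sketch of this label restriction as a regular expression (an illustration of the rule described above, not necessarily the exact validation used by Conduit):

import re

# ASCII letters, digits, dot and hyphen only
LABEL_PATTERN = re.compile(r"^[A-Za-z0-9.-]+$")

print(bool(LABEL_PATTERN.match("weather-data.2018")))  # True
print(bool(LABEL_PATTERN.match("weather data 2018")))  # False (space not allowed)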

Click on "Proceed" to go to the next step.

Provide a title and a description of the dataset. Additionally, at least one tag needs to be provided by typing the first two characters of a tag into the "Tags" field and then selecting it.

Click on "Continue" to go to the next step.

Provide a contact point with the corresponding e-mail address, a theme (select from suggested themes by typing the first characters), a publisher with the corresponding e-mail address, the main language of the dataset and the general access modality.

Click on "Continue" to go to the next step.

It is possible to finalize the dataset submission at this step. In this case, only the metadata of the dataset will be published, without any data in the form of dataset distributions.

In order to add data to the dataset, define the label, access rights and description of the first dataset distribution.

Select data by drag & drop or use the "Browse" button to select data from your local directory. Click on the "Upload" button to upload the selected data into the current distribution.

Click on "Create another distribution" to add a new distribution and proceed by defining basic information for the distribution and uploding data as done previously.

Click on "Finalize submission" to finish the creation of the dataset.

Click on the link next to "Show working directory" to review the data and metadata included in the dataset. The content of the dataset directory will be displayed.

Click on "Continue with ingest" and then on "Start ingest" to start the publishing of the dataset.

Once the ingest of the dataset is finished, the dataset has a unique identifier (here: dma:2981afc2685d276f7995f88f8459e300e7d731a9) which identifies the dataset in the catalogue.

Each distribution is packaged as a TAR file.
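
A minimal sketch of how a distribution directory could be packaged as a TAR file with Python's standard library (Conduit's actual packaging step may add checksums or further metadata):

import tarfile
from pathlib import Path

def package_distribution(distribution_dir, output_tar):
    """Package a distribution directory (data/ and metadata/) into a TAR file."""
    with tarfile.open(output_tar, "w") as tar:
        # Store paths inside the archive relative to the distribution directory
        tar.add(distribution_dir, arcname=Path(distribution_dir).name)

# Example: package the CSV distribution of a data set
package_distribution("distributions/csv_distribution", "csv_distribution.tar")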

[1] https://tools.ietf.org/html/rfc4452

[2] https://www.w3.org/TR/vocab-dcat/

[3] https://gitlab.com/datamarket/conduit

[5] https://datamarket.at/wp-content/uploads/2018/03/DMA-Core-Overview-1.pdf

[6] http://www.openarchives.org/rs/toc

[7] https://smart-contract-ui.datamarket.at/swagger.yaml

[8] https://gitlab.com/datamarket/authorization-gateway

[9] https://gitlab.com/datamarket/conduit/blob/master/docker-compose.yml