Key Stuffing Archive: Data Europe Guidelines

Data.europa.eu Data Quality Guidelines

August 2021

This document was prepared for the European Commission, however it only reflects the views of the authors. Neither the European Commission nor any person acting on its behalf is liable for any consequence stemming from the reuse of this publication or the information contained therein, or for the content of the external sources, including external websites, referenced in this publication.

For more information:
OP.C.4
Publications Office of the European Union
2, rue Mercier
L-2985 Luxembourg
LUXEMBOURG
OP-DATA-EUROPA-EU@publications.europa.eu

The European Commission is not liable for any consequence stemming from the reuse of this publication.

Luxembourg: Publications Office of the European Union, 2021

The reuse policy of European Commission documents is implemented by Commission Decision 2011/833/EU of 12 December 2011 on the reuse of Commission documents (OJ L 330, 14.12.2011, p. 39). Unless otherwise noted, the reuse of this document is authorised under a Creative Commons Attribution 4.0 International (CC-BY 4.0) licence (https://creativecommons.org/licenses/by/4.0/). This means that reuse is allowed provided appropriate credit is given and any changes are indicated.

This publication is intended for information purposes only. It must be accessible free of charge.

This publication was developed as part of the 'Data quality guidelines for the publication of data sets in the EU Open Data Portal' project carried out by Fraunhofer FOKUS and financed by the ISA² programme.

Print ISBN 978-92-78-42572-2 doi:10.2830/879764 OA-09-21-196-EN-C
PDF ISBN 978-92-78-42573-9 doi:10.2830/79367 OA-09-21-196-EN-N
HTML ISBN 978-92-78-42491-6 doi:10.2830/935433 OA-09-21-196-EN-Q

Contents

Introduction
1. Recommendations for providing high-quality data
Introduction
1.1. General recommendations
1.1.1. Findability
1.1.1.1. Describe your data with metadata to improve data discovery
1.1.1.2. Mark null values explicitly as such
1.1.2. Accessibility
1.1.2.1. Publish data without restrictions
1.1.2.2. Provide an accessible download URL
1.1.3. Interoperability
1.1.3.1. Formatting of date and time
1.1.3.2. Formatting of decimal numbers and numbers in the thousands
1.1.3.3. Make use of standardised character encoding
1.1.4. Reusability
1.1.4.1. Provide an appropriate amount of data
1.1.4.2. Consider community standards
1.1.4.3. Remove duplicates from your data
1.1.4.4. Increase the accuracy of your data
1.1.4.5. Provide information on byte size
1.2. Format-specific recommendations
1.2.1. CSV
1.2.1.1. Use a semicolon as a delimiter
1.2.1.2. Use one file per table
1.2.1.3. Avoid white space and additional information in the file
1.2.1.4. Insert column headers
1.2.1.5. Ensure that all rows have the same number of columns
1.2.1.6. Indicate units in an easily processable way
1.2.2. XML
1.2.2.1. Provide an XML declaration
1.2.2.2. Escape special characters
1.2.2.3. Use meaningful names for identifiers
1.2.2.4. Use attributes and elements correctly
1.2.2.5. Remove program-specific data
1.2.3. RDF
1.2.3.1. Use HTTP URIs to denote resources
1.2.3.2. Use namespaces when possible
1.2.3.3. Use existing vocabularies when possible
1.2.4. JSON
1.2.4.1. Use suitable data types
1.2.4.2. Use hierarchies for grouping data
1.2.4.3. Only use arrays when required
1.2.5. APIs
1.2.5.1. Use correct status codes
1.2.5.2. Set correct headers
1.2.5.3. Use paging for large amounts of data
1.2.5.4. Document the API
2. Recommendations for data standardisation (with EU controlled vocabularies) and data enrichment
Introduction
2.1. Reuse unambiguous concepts from controlled vocabularies
2.2. Harmonise the tables
2.3. Dereference the translation of a label
2.4. Linking and augmenting your data
3. Recommendations for documenting data
Introduction
3.1. Publish your documentation
3.2. Use schemas to specify data structure
3.2.1. How to specify JSON data structures
3.2.2. How to specify XML data structures
3.2.3. How to specify CSV data structures
3.2.4. How to specify RDF data structures
3.2.5. How to specify APIs
3.3. Document the semantics of data
3.4. Document data changes
3.4.1. Adopt a data set release policy
3.4.2. Differentiate between a major and a minor release of a data set
3.4.3. Indicate a data set's version (release) number
3.4.4. Describe what has changed
3.4.5. Release one data set per table
3.4.6. Deprecate old versions
3.4.7. Link versions of a data set
4. Recommendations for improving the openness level
Introduction
4.1. Five-star model
4.2. Use structured data (one → two stars)
4.3. Use a non-proprietary format (two → three stars)
4.4. Use URIs to denote things (three → four stars)
4.5. Use linked data (four → five stars)
4.6. File formats and their achievable openness level
Glossary
Overview of quality indicators and metrics
Checklist for publishing high-quality data
List of figures
List of tables
Bibliography
List of topics (section number in brackets)

Introduction

Data quality is fast becoming a hot topic, as demand for high-quality data continues to grow with a focus on data that is publicly available and can be easily reused for different purposes. Poor quality is a major barrier to data reuse. Some data cannot be interpreted due to ill-defined, inaccurate elements such as missing values, mismatches, missing data types, lack of documentation about the structure or format availability (HTML, GIF or PDF). Users find poor-quality data harder to understand and may use it less often. The data provider may even appear less reliable as a result. For data to be easily reusable, data publishers must make sure it is easy to discover, analyse and visualise.
Reusers must understand what the data is about and how it is defined or structured, and should preferably get the data in the format they need.

Data quality covers different aspects, for example consistency, conformity, completeness or documentation. The FAIR guiding principles for scientific data management and stewardship (1) provide a framework for grouping the different aspects of data quality. The framework consists of four dimensions - findability, accessibility, interoperability and reusability - and provides concrete metrics for each dimension. Data publishers should become acquainted with the FAIR principles before publishing data. It is also helpful to develop a data management plan (DMP) that outlines how data should be handled. A DMP addresses questions such as where to publish data, where to store metadata, which format to use and which standard to follow. This sort of plan will make publication easier.

Data needs to be carefully prepared before publication. Preparation is an interactive and agile process used to explore, combine, clean and transform raw data into curated, high-quality data sets. This process consists of six different phases (see Figure 1).

Figure 1. Data preparation process (phases: profiling, distilling, enriching, documenting, validating, publishing)

(1) https://www.go-fair.org/fair-principles/

By ensuring data of the highest quality along with data consistency, conformity and completeness, data providers help reusers to easily discover, reuse, analyse, visualise or process data for analytics and business intelligence and to contribute to increasing the transparency of EU data.
For these reasons, in 2019 the Publications Office of the European Union (the Publications Office) launched the 'Data quality guidelines for the publication of data sets in the EU Open Data Portal (2)' project, aimed at analysing major quality issues and providing a set of recommendations for data providers from the EU and its Member States concerning the quality of data resources available through the EU Open Data Portal (EU ODP). The project (3) was carried out by Fraunhofer FOKUS (acknowledgements to Lina Bruns, Benjamin Dittwald and Fritz Meiners for their contributions) and consisted of the following three parts.

- Data profiling. Analysis of the data published by the EU institutions and bodies to identify the most common data quality issues. This part consisted of two major steps. First, all metadata was assessed in an automated way against a set of criteria using the FAIR principles. This step was used to identify data sets of poor quality, which were analysed in depth in the second step. The second step was carried out manually and involved the analysis of 50 distributions from selected data sets. In contrast to step one, the second step focused on analysing the actual data. The data was checked for encoding issues, accessibility, compliance with standards and proper presentation of numbers and dates. For more information about this part of the project please contact: OP-DATA-EUROPA-EU@publications.europa.eu

- Data quality indicators and metrics. Identification of data quality dimensions, indicators and metrics to indicate how data quality can be measured. This part consisted of two main tasks: firstly, identifying data quality indicators and metrics appropriate for assessing data quality, and secondly, developing mock-ups for a future data quality dashboard. The first task led to the identification of 12 relevant indicators for data quality across the four FAIR dimensions (see Figure 2).
(2) On 21 April 2021 the EU Open Data Portal and the European Data Portal were consolidated into one single service and became data.europa.eu.
(3) The project was financed by the ISA² programme.

Figure 2. Overview of quality indicators grouped by FAIR dimensions (findability, accessibility, interoperability, reusability). Indicators shown: completeness, findability, accessibility/availability, conformity/compliance, machine readability/processability, openness, timeliness, consistency, accuracy, relevance, understandability, credibility.

Metrics were also assigned to each indicator to show how the quality indicators can actually be measured and quantified. In total, 42 metrics were described and illustrated with real data mostly taken from the EU ODP (4) (see Table 6). For more information about this part of the project please contact: OP-DATA-EUROPA-EU@publications.europa.eu

- Recommendations for delivering high-quality data. A set of recommendations for data providers from the EU and its Member States.

The current document is based on the outcome of Parts 1 and 2 and on a literature review. The recommendations are addressed to data providers to support them in preparing their data, developing their data strategy and ensuring data quality. It is composed of the following four parts.

1. Recommendations for providing high-quality data. The recommendations cover general aspects of quality issues regarding the findability, accessibility, interoperability and reusability of data (including specific recommendations for common file formats like CSV, JSON, RDF and XML).
2. Recommendations for data standardisation (with EU controlled vocabularies) and data enrichment.
3. Recommendations for documenting data.
4. Recommendations for improving the 'openness level'.
At the end of the publication the reader will find a glossary, a table with the overview of quality indicators and metrics, a checklist with the most important steps for improving the quality of data and metadata and a list of literature.

(4) Please note that the interface has changed after the consolidation of EU Open Data Portal and the European Data Portal into data.europa.eu.

1. Recommendations for providing high-quality data

Introduction

The aim of this section is to provide quick and practical recommendations for data providers, allowing them to prepare and publish high-quality data sets. It presents a set of best practices for data preparation, especially covering aspects of the data preparation process phase 'validating' (see Figure 3).

Figure 3. Data preparation process - Validating

An overview of universally applicable recommendations is given in Section 1.1, followed by format-specific recommendations in Section 1.2 addressing commonly used and open-data-appropriate (machine-readable and non-proprietary) file formats.

1.1. General recommendations

This section provides general recommendations to consider when publishing data. These recommendations apply to all kinds of data, regardless of the file format they are published in. The recommendations are grouped by the FAIR dimensions (5) of findability (Section 1.1.1), accessibility (Section 1.1.2), interoperability (Section 1.1.3) and reusability (Section 1.1.4). Each recommendation includes a description, screenshots and a reference to the respective metric, as well as helpful information about tooling and/or linkage to further relevant sources of information.
File-format-specific recommendations for the machine-readable formats CSV, XML, RDF, JSON and APIs are covered in Section 1.2.

Before going on to the recommendations, there are two things you should consider in general if you are interested in publishing high-quality data: (i) make use of tooling, (ii) create a DMP.

(5) https://www.go-fair.org/fair-principles/

(i) Make use of tooling

Data preparation is an ongoing, iterative and repetitive process. Most of the steps which should be performed within the data preparation process (see Figure 3) can be automated and supported with tools. If you are publishing data periodically, it might be worth investing in an 'extract, transform, load' (ETL) tool and related tools that support you in preparing and publishing high-quality data sets.

There are plenty of commercial tools that can help you prepare your data following the data preparation process (see Figure 3). A large number of solutions are available and the data preparation functions they offer are heterogeneous, so finding the right one might seem daunting. Data preparation functions are, for example, transforming, cleansing, blending, modelling and enriching data. Gartner Research has analysed 16 tools available from common vendors and classified them in a magic quadrant, identifying 'leaders', 'challengers', 'niche players' and 'visionaries' (see Figure 4), with strengths and cautions for each vendor (6). This assessment may help you to find the most appropriate tool for the task at hand. Gartner Research has also published the 'Market guide for data preparation tools' (7), in which the market is analysed and several products are introduced. Another report that lists and compares data preparation solutions is 'The Forrester Wave™: Data preparation solutions' (8).

Figure 4. Magic quadrant for data quality tools
Source: Gartner Research (2019a).
(6) Gartner Research (2019a), 'Magic quadrant for data quality tools' (https://www.gartner.com/en/…).
(7) Gartner Research (2019b), 'Market guide for data preparation tools' (https://www.gartner.com/en/…).
(8) Little, C. (2018), 'The Forrester Wave™: Data preparation solutions', Forrester (https://www.forrester.com/report/…).

There are also some useful open-source tools that mostly focus on one concrete aspect of data preparation or that specialise in data quality issues within a certain file format, such as CSVLint (9) for CSV files or JSONLint (10) for JSON files. Another open-source tool is OpenRefine (11), which helps clean messy data and transform and extend data. Talend's Open Studio Line (12) is another open-source suite licensed under Apache. It is made up of components covering (big) data preparation and integration and data quality and uses machine-learning technology to perform data preparation tasks.

(ii) Create a data management plan

A DMP outlines how data is to be handled. It should establish where to publish data, where to store metadata, which format to use and which standard to follow. Answering these questions beforehand will make the publication process easier as it will be homogeneous and formalised. There is also a common standard for machine-actionable DMPs (13), and the FAIRification process provides some useful information you may wish to consider in your DMP (14).

1.1.1. Findability

1.1.1.1. Describe your data with metadata to improve data discovery

Dimension: Findability
Indicator: Completeness
Metrics:
· Number of empty fields in metadata
· Keywords assigned
· Categories assigned
· Temporal information given
· Spatial information given

Metadata is descriptive data.
Take for example an audio track: information regarding the artist and album is considered metadata, since this information is not part of the actual file. It is, however, very important when trying to find the file among others. Similarly, if a text document was missing its title, it would be very hard for users to discover the document. Complete and updated metadata is therefore vital for finding and using data. In addition, metadata can help users identify whether the information retrieved matches their request. A library of books would be of little use if the books were missing their key metadata information: author, title and ISBN. The same applies to data published online.

(9) CSVLint: https://csvlint.io
(10) JSONLint
(11) OpenRefine
(12) Talend Open Studio
(13) Common standard for machine-actionable DMPs
(14) FAIRification process

Often, when publishing your data in a catalogue, some metadata fields are set as mandatory, which means that they have to be filled in before the data can be published. However, it is recommended that metadata fields that are not set as mandatory also be filled in. For the data publisher it does not take much effort to fill in these fields, and for data users complete metadata can be very beneficial. The more information given about data, the easier it is for users to find and to get a first understanding of, which in turn increases the chances that they will reuse it.

The following metadata information should be provided in order to increase the findability of data:
· title
· description
· keywords
· categories
· temporal information
· spatial information.

When filling in this metadata information, data publishers should make sure that the information given is as precise, accurate and helpful as possible.
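
The 'number of empty fields in metadata' metric can be approximated with a few lines of code. The following Python sketch is only an illustration: the dictionary layout, the field names and the sample record are assumptions made for this example and do not correspond to any particular catalogue schema.

```python
# Recommended findability fields, named after the list above.
# The key names are hypothetical, not taken from a catalogue schema.
RECOMMENDED_FIELDS = [
    "title", "description", "keywords",
    "categories", "temporal", "spatial",
]


def empty_fields(metadata):
    """Return the recommended fields that are missing or empty."""
    missing = []
    for field in RECOMMENDED_FIELDS:
        value = metadata.get(field)
        # None, "", [] and whitespace-only strings all count as empty.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Hypothetical record with only two fields filled in.
record = {
    "title": "Example data set",
    "description": "",
    "keywords": [],
    "categories": ["Economy and finance"],
}

print(empty_fields(record))
```

A check of this kind could be run before publication so that optional fields left blank are noticed early; it says nothing about whether the filled-in values are precise or helpful.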
Keep in mind that a potential user has probably never seen your data before and needs to get a clear understanding of what your data is about.

Good example
This screenshot shows that a detailed description is given for the 'Production in industry - manufacturing' data set. This helps potential users to get an overview of what to expect in the data set.

Production in industry - manufacturing
Description: The industrial production index shows the output and activity of the industry sector. It measures changes in the volume of output on a monthly basis. Data are compiled according to the Statistical classification of economic activities in the European Community (NACE Rev. 2, Eurostat). Industrial production is compiled as a "fixed base year Laspeyres type volume-index". The current base year is 2015 (Index 2015 = 100). The index is presented in calendar and seasonally adjusted form. Growth rates with respect to the previous month (M/M-1) are calculated from calendar and seasonally adjusted figures while growth rates with respect to the same month of the previous year (M/M-12) are calculated from calendar adjusted figures.

Bad example
In this example, the description of the data set is very similar to the data set's title and does not provide any helpful information. A user would have a hard time getting a grasp of what the 'Interest rates - monthly data' data set may contain.

Interest rates - monthly data
Description: Interest rates - monthly data
EuroVoc domains: Economy and finance, Regions and cities

Helpful links and tools
Title: What is metadata and why is it as important as data itself?
Description: An online article from opendatasoft that provides helpful information about metadata (e.g. definition, purpose).
Link: https://www.opendatasoft.com/blog/2016/08/25/what-is-metadata-and-why-is-it-important-data

1.1.1.2. Mark null values explicitly as such

Dimension: Findability
Indicator: Findability
Metrics:
· Number of null values

Sometimes, data is simply not complete. However, a missing value is no reason for not publishing the data in question. In order to avoid confusion, the data provider should clearly mark missing values as null values. Users that are not familiar with the data can thus recognise that the data was not simply forgotten, because the null value serves as a special marker indicating that the value does not exist. In other words, a null value is a visual representation of a missing value. There are several ways of indicating a null value, for example by marking the missing value with 'NULL' or a similar explicit marker. However, if you notice that a high percentage of the values within one row or column are null values, you should consider deleting the respective column or row as it probably does not bring any added value to data users.

The example below shows a CSV table with data about page visits. In the table labelled 'bad example', missing values are indicated by simply leaving fields empty. This is ambiguous and may lead to errors during further processing. In contrast, the table labelled 'good example' shows the same data, but with missing values clearly marked as such.

Bad example
Year; Visitors, Viewing time
2014;768954;00:03:18
2013;;00:02:59
2013;822101;00:02:59
2011;721519;
2010;707402;00:03:50

Good example
Year; Visitors; Viewing time
2014;768954;00:03:18
2013;null;00:02:59
2012;792967;00:02:52
2011;721519;null
2009;429430;00:03:16

1.1.2. Accessibility

1.1.2.1. Publish data without restrictions

Dimension: Accessibility
Indicator: Accessibility
Metrics:
· Downloadable without registration

One of the core principles of open data is its accessibility: data should be accessible and available to the widest range of users possible to avoid limiting its potential reuse. To allow easy consumption and further processing, no access restrictions should be in place, regardless of whether these require manual intervention (e.g. registration) or can be bypassed automatically (e.g. providing credentials). This also applies to the files themselves, for example encrypted archives. Keep in mind that any access restriction limits the number of potential data users and so, if possible, should be avoided.

Good example
This screenshot shows a data set which is directly downloaded when the user clicks on 'download'. No registration or password is needed.

COVID-19 Coronavirus data - daily (up to 14 December 2020)
COVID-19 cases worldwide - daily (DOWNLOAD)
Description: Data on the geographic distribution of COVID-19 cases worldwide
Format: CSV

Bad example
This example shows a data set which cannot be downloaded without a password. This hampers its reuse and is not in line with open data principles.

code: "authentication_required"
error: true
message: "You must be logged in to access this resource"

Helpful links and tools
Title: Ten principles for opening up government information
Description: Description of the core open data principles. Pay attention to Principle 4 'Ease of physical and electronic access'.
Link: https://sunlightfoundation.com/policy/documents/ten-open-data-principles/

1.1.2.2. Provide an accessible download URL

Dimension: Accessibility
Indicator: Accessibility/availability
Metrics:
· Download URL given
· Download URL accessible

Data can only be reused by others if it is accessible.
Typically, the main point of access is a download URL, which must be set in the metadata and be accessible, i.e. reachable via a browser. This means the data publisher must ensure that when a user clicks on the download URL provided, this URL functions properly and the user can directly download the data.

Good example
This screenshot shows three download URLs given for a data set, each pointing to a different file format. The download begins directly when the user clicks on the download button.

Resources
· Download dataset in TSV format (unzipped) - tsv
· Download dataset in TSV format - ZIP
· Download dataset in SDMX-ML format - zip

Bad examples
These screenshots show a download URL which redirects the user to another web page instead of initiating a file download.

Screenshot: a 'download' link of type html that opens the page 'Dissemination database - Consumer Conditions Scoreboard' rather than a file.

Helpful links and tools
Title: HTTP status codes
Description: This site provides a list of all status codes and their meanings (open source).
Link: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml

1.1.3. Interoperability

1.1.3.1. Formatting of date and time

Dimension: Interoperability
Indicator: Conformity/compliance
Metrics:
· Conformity of date formats

Data (and metadata) often contains dates and times. Depending on the regional conditions, there are different ways of stating dates, which can lead to confusion.
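
To make this kind of confusion concrete, the short Python sketch below reads one and the same string under a day-first and a month-first convention and prints each reading in the unambiguous year-month-day order of ISO 8601. The sample value and the two format patterns are assumptions chosen purely for illustration.

```python
from datetime import datetime

raw = "03/04/2021"  # hypothetical value, not taken from a real data set

# The same string interpreted under two regional conventions.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first.isoformat())    # 2021-04-03
print(month_first.isoformat())  # 2021-03-04
```

Both calls succeed without any warning, which is why software cannot detect the ambiguity on its own: only the publisher knows which reading was intended.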
The following example highlights the issue with ambiguous date formats: 01/02/2020 could mean either 1 February 2020 or 2 January 2020, depending on a country's customs.

Therefore, date and time should always be encoded as ISO 8601 (YYYY-MM-DD hh:mm:ss). If applicable, the time zone used should be stated. The time zone is always derived from Coordinated Universal Time (UTC).

The examples below show a CSV table with data about page visits. In the bad examples, the time format does not follow a consistent schema, making it very hard to process correctly. In contrast, the good examples show the same data with all timestamps formatted using ISO 8601 encoding.

Bad example
Year; Visitors, Viewing time
2014;768954;3:18
2013;822101;00:02:59
2012;792967;0:02:52
2011;721519;03:44
2010;707402;3m:50s
2009;429430;3:16

Good example
Year; Visitors; Viewing time
2014;768954;00:03:18
2013;822101;00:02:59
2012;792967;00:02:52
2011;721519;00:03:44
2010;707402;00:03:50
2009;429430;00:03:16

Bad example
Start Date; End Date
01.01.2014; 31.03.2014
01.01.2014; 30.06.2016

Good example
Start Date; End Date
2014-01-01; 2014-03-31
2014-01-01; 2016-06-30

Helpful links and tools
Title: ISO standard for date and time
Description: An introduction to ISO 8601 for date and time formats (open source / commercial).
Link: https://www.iso.org/iso-8601-date-and-time-format.html

Title: DenCode
Description: ISO date and time generator, encoder and decoder. This tool helps you to convert your data into ISO 8601 formats (open source).
Link: https://dencode.com/date/iso8601

1.1.3.2. Formatting of decimal numbers and numbers in the thousands

Dimension: Interoperability
Indicator: Conformity/compliance
Metrics:

Data often contains numbers. In this section we do not want to give detailed information on how to handle different numeric types (integer, float, double), but rather recommendations on how to deal with numbers in a more general sense. For example, a comma is often used to separate whole numbers from decimals. This might cause problems, for example in a CSV file when the separator between the values is set as a comma. To avoid the unintended interpretation of a comma separating a whole number from a decimal, a dot should be used instead. When dealing with large numbers, sometimes a thousand separator is used, for example a dot or white space. Again, this can lead to misinterpretation, especially when the data is being processed automatically, and might mean the user has to clean the data before they can reuse it. Thousand separators should therefore not be used.

Bad example
0,53
789.654
789 654
25.026,8

Good example
0.53
789654
789654
25026.8

1.1.3.3. Make use of standardised character encoding

Dimension: Interoperability
Indicator: Conformity/compliance
Metrics:
· Character encoding issues

In order to make sure that characters are displayed correctly, and to ensure the greatest possible compatibility with applications processing data, a standardised character encoding should always be used. Typically, UTF-8 is the encoding of choice on the web. UTF-8 is a character encoding for Unicode, an international standard for the representation of all meaningful characters. With this, all characters, whether Latin alphabet or Japanese characters, are displayed correctly.
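
As a rough sketch of what using UTF-8 means in practice, the Python snippet below checks whether a sequence of bytes is valid UTF-8 and re-encodes data from a known legacy encoding. The function names and sample strings are hypothetical, and the source encoding has to be known in advance: the code does not attempt to guess it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample values for illustration only.
sample = "Straße; 日本語".encode("utf-8")
legacy = "Straße".encode("latin-1")  # 'ß' is stored as the single byte 0xDF

print(is_valid_utf8(sample))                      # True
print(is_valid_utf8(legacy))                      # False
print(is_valid_utf8(to_utf8(legacy, "latin-1")))  # True
```

Note that a successful UTF-8 decode is a necessary check rather than proof that the text is displayed as intended; text that was mis-decoded earlier and saved again can still be valid UTF-8.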
To ensure that your data can be blended and reused with other data from international sources, and to avoid problems during machine processing, it is helpful to use an internationally recognised and widely used character encoding from the outset. However, in general you should avoid using any special characters in your data, even if they are part of UTF-8. Doing so improves backward compatibility with older systems.

Depending on the program you are using, UTF-8 must be activated explicitly in the 'Save As' dialogue. In Microsoft Excel and in LibreOffice Calc, for example, you can select the character encoding explicitly when saving a CSV file.

If a character encoding other than UTF-8 is used in your data, it is essential to specify this in the metadata. DCAT-AP does not specify a dedicated field for this information. However, Inspire suggests adding this type of information to the 'media type' description.

Bad example
This screenshot shows a data set which does not use UTF-8, as you can see in the text highlighted in yellow.
    10 Trends Transforming Education as We Know It
    Back in the Game â€“ Reclaiming Europeâ€™s Digital Leadership
    Video Explainer
    2017-11-14T00:00:00+01:00
    2017-11-13T00:00:00+01:00
    Nord Stream 2 â€“ Divide et Impera Again? Avoiding a Zero-Sum Game
    2017-10-27T00:00:00+02:00

Good example
This screenshot shows the same data set, this time encoded in UTF-8.
    10 Trends Transforming Education as We Know It
    Back in the Game – Reclaiming Europe's Digital Leadership
    Video Explainer
    2017-11-14T00:00:00+01:00
    2017-11-13T00:00:00+01:00
    Nord Stream 2 – Divide et Impera Again? Avoiding a Zero-Sum Game
    2017-10-27T00:00:00+02:00

Helpful links and tools

Title: UTF-8 validator
Description: This online tool helps you check your input for valid UTF-8 encoding (open source).
Link: https://onlineutf8tools.com/validate-utf8
[Screenshot: the UTF-8 validator applied to a CSV excerpt describing the publication '10 Trends Shaping Democracy in a Volatile World'.]

Title: CSVLint
Description: You can use this tool to check whether your CSV file contains any encoding issues. If the tool detects that your CSV is encoded in UTF-8 but contains invalid characters, you will get an error message (open source).
Link: https://csvlint.io
[Screenshot: CSVLint message 'Structural problem: Invalid Encoding on row 1. Your CSV appears to be encoded in UTF-8, but invalid characters were found. This can often be caused by copying and pasting data from a different source.']

1.1.4. Reusability

1.1.4.1. Provide an appropriate amount of data

Dimension: Reusability
Indicator: Relevance
Metrics: Appropriate amount of data

Depending on the data to be published, the meaning of the term 'appropriate' can differ greatly. It is important to publish all relevant data, but care should be taken not to blindly publish all available data without considering its usefulness. On the other hand, data publishers have to make sure that a sufficient amount of data is published, so that there is enough context and users can derive value from it. It would be rather useless for data users to find a CSV file with only two lines. However, there is no clear indication of what an appropriate amount of data is, as this is highly dependent on the purpose a user has in mind.

To find a good balance, you could start by asking yourself whether all the data you are about to publish really provides value to others. If not, and the amount of data seems large, you could think about reducing it. Conversely, you could ask yourself whether the amount of data you want to publish is sufficient for users to make sense of it and to add value, or whether you should add more data or context.

Bad example
    Name                      Size
    traffic_2010-2015.csv     976,563 KB
The file in the screenshot contains fictitious traffic data aggregated over the course of 6 years. In total, the file is nearly 1 GB in size.
If users are only interested in data for 1 year, they still have to download the entire file.

Good example
    Name                 Size
    traffic_2010.csv     97,662 KB
    traffic_2011.csv     297,833 KB
    traffic_2012.csv     228,536 KB
    traffic_2013.csv     165,139 KB
    traffic_2014.csv     39,164 KB
    traffic_2015.csv     144,886 KB
In contrast, this screenshot shows the same data split by year. This way, the file size remains reasonable and users can download exactly the files they need. Each file should be published in a separate data set.

1.1.4.2. Consider community standards

Dimension: Reusability
Indicator: Consistency
Metrics: Compliance with community standards

Community standards are a powerful tool for ensuring conformity across files and formats of a common domain. Using community standards makes it easier to reuse data, as all data following the same standard looks similar: for example, it is organised in a standardised way, the documentation follows a common template or a common vocabulary is used. Many different community standards exist, for example standards for specific domains such as climate and forecast, astrophysics or statistical data. There are also non-domain-specific standards, such as DCAT-AP, a standard for storing data catalogue metadata. Depending on the use case, there may be validators that help to check files against such a standard. Ensuring that files comply with community standards greatly helps reusability and eases further processing. To make sure that your data is being reused, you should consider using community standards.

Bad example
This screenshot shows a message from a SHACL validation which produced an error against the DCAT-AP community standard. More precisely, the value attached to the property dcterms:publisher was not of the required type.
[Screenshot: validation result for 'EMIS - List of Web services': property http://purl.org/dc/terms/publisher, message 'Value does not have class http://xmlns.com/foaf/0.1/Agent'.]
Good example
This screenshot shows a data set with an XML resource that conforms to its schema.
[Screenshot: resource list of a data set offering the 'Consolidated Financial Sanctions File' in versions 1.0 and 1.1 as CSV and XML, plus PDF, HTML and RSS resources; the documentation section links the matching XSD schemas 1.0 and 1.1.]

Helpful links and tools

Title: FAIR list of community standards
Description: List of community standards for various domains (open source).
Link: https://www.go-fair.org/fair-principles/r1-3-metadata-meet-domain-relevant-community-standards/

Title: SHACL validator
Description: This online tool allows you to validate your RDF files against a given standard (open source).
Link: https://shacl.org/playground/

1.1.4.3. Remove duplicates from your data

Dimension: Reusability
Indicator: Consistency
Metrics: Freeness from duplicates

Each piece of data should be unique. Duplicate data is of no additional value. Instead, it lowers the quality of the data, as it might cause errors during further processing. For example, a data user performing analytics on the data will receive biased results if some of the data are duplicates.

Examples
The table labelled 'bad example' shows a CSV file in which some rows are duplicates. In contrast, the rows in the table labelled 'good example' are all distinct, and no row carries the same information as another one.
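One way such duplicates could be detected before publishing is sketched below in Python. This is only an illustration under the assumption that two rows count as duplicates when all of their cells are identical; the function name drop_duplicate_rows is hypothetical, and the sample rows mirror the duplicated rows used in this section's tables.

```python
import csv
import io


def drop_duplicate_rows(text, delimiter=";"):
    """Split CSV rows into first occurrences and exact repeats, keeping order."""
    seen = set()
    unique, duplicates = [], []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        key = tuple(row)  # lists are not hashable, tuples are
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates


sample = """Year;Visitors;Viewing time
2014;768954;00:03:18
2013;822101;00:02:59
2013;822101;00:02:59
2011;721519;00:03:44
2010;707402;00:03:50
2010;707402;00:03:50
"""

unique, duplicates = drop_duplicate_rows(sample)
for row in duplicates:
    print("duplicate row:", ";".join(row))
```

Rows that differ only in formatting (for example 0:02:59 and 00:02:59) are not caught by this exact comparison and would need to be normalised first.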
Bad example
    Year;Visitors;Viewing time
    2014;768954;00:03:18
    2013;822101;00:02:59
    2013;822101;00:02:59
    2011;721519;00:03:44
    2010;707402;00:03:50
    2010;707402;00:03:50

Good example
    Year;Visitors;Viewing time
    2014;768954;00:03:18
    2013;822101;00:02:59
    2012;792967;00:02:52
    2011;721519;00:03:44
    2010;707402;00:03:50
    2009;429430;00:03:16

Helpful links and tools
Most ETL tools provide functions for detecting missing data and handling null values.

1.1.4.4. Increase the accuracy of your data

Dimension: Reusability
Indicator: Accuracy
Metrics: Percentage of accurate cells

Accuracy can be measured in many dimensions. What accuracy means specifically, how it is measured and what result is deemed acceptable always depend on the specific use case. For example, in CSV files, each cell of a column could be checked for accuracy against an encoding format, for example ISO 8601 for dates. The ratio between accurate and inaccurate cells could then give users a first impression of what to expect from the data and how difficult processing may be. Higher accuracy is typically an indicator of higher-quality data.

Examples
When evaluating the conformity of the 'Viewing time' column against ISO 8601 encoding, the table labelled 'bad example' would score an accuracy rating of 50 %, since half of the cells follow this time format. In contrast, the table labelled 'good example' would yield an accuracy score of 100 %, since all timestamps are correctly encoded.

Bad example
    Year;Visitors;Viewing time
    2014;768954;3:18
    2013;822101;00:02:59
    2012;792967;0:02:52
    2011;721519;03:44
    2010;707402;3m:50s
    2009;429430;3:16

Good example
    Year;Visitors;Viewing time
    2014;768954;00:03:18
    2013;822101;00:02:59
    2012;792967;00:02:52
    2011;721519;00:03:44
    2010;707402;00:03:50
    2009;429430;00:03:16

1.1.4.5. Provide information on byte size

Dimension: Reusability
Indicator: Accuracy
Metrics: Content size accuracy

When publishing data, it is good practice to also provide information on the byte size of the distributions. This information helps users and automated processes to anticipate what to expect before downloading the actual file. It also enables filtering by size.

Bad example
This screenshot shows a distribution without the dcat:byteSize property set.
[RDF/XML snippet: a distribution with the dct:title 'Sample Title' and no dcat:byteSize property; markup lost in extraction.]

Good example
This screenshot shows a distribution for which the dcat:byteSize property is set.
[RDF/XML snippet: the same distribution with the dct:title 'Sample Title' and a dcat:byteSize of 12168; markup lost in extraction.]

1.2. Format-specific recommendations

1.2.1. CSV

Please check the general recommendations in Section 1.1, which also apply to CSV files.

1.2.1.1. Use a semicolon as a delimiter

Dimension: Interoperability
Indicator: Machine readability/processability
Metrics: Processability of file format and media type

Even though the name 'CSV' (comma-separated values) implies the use of commas as separators between values, we recommend using semicolons instead. Commas are often used in the values themselves (for example in decimal numbers). To prevent such a comma from being interpreted as a separator, it would need to be masked. Masking is not a problem in itself, but it can be a source of error if you overlook a comma that needs to be masked. Semicolons are used less often within the actual values and should thus be used as delimiters in CSV files.

The delimiter is always set between two values, and the last value in a line is not followed by a delimiter, as depicted in the examples. Make sure that there are no spaces or tabs on either side of the delimiters in a row.
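The delimiter rules above can be sketched in Python as follows. This is a hedged illustration rather than a reference implementation: write_rows and find_delimiter_issues are hypothetical helper names, and the check uses a naive split that assumes values are not quoted.

```python
import csv
import io


def write_rows(rows, delimiter=";"):
    """Serialise rows with a semicolon delimiter and no trailing delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def find_delimiter_issues(text, delimiter=";"):
    """Return (line number, problem) pairs for two common delimiter mistakes."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.endswith(delimiter):
            issues.append((number, "trailing delimiter"))
        # Naive split: assumes that values are not quoted.
        if any(value != value.strip() for value in line.split(delimiter)):
            issues.append((number, "whitespace next to delimiter"))
    return issues


bad_sample = (
    "Year; Visitors; Viewing time;\n"
    "2013; 822101;00:02:59;\n"
    "2012;792967;00:02:52;\n"
)
issues = find_delimiter_issues(bad_sample)

clean = write_rows([["Year", "Visitors", "Viewing time"],
                    [2013, 822101, "00:02:59"]])
print(clean)
```

Using the csv module for writing has the side benefit that any value containing the delimiter is quoted automatically.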
Bad example
    Year; Visitors; Viewing time;
    2013; 822101;00:02:59;
    2012;792967;00:02:52;
    2011; 721519;00:03:44;
    2010;707402;00:03:50;
    2009;429430;00:03:16;

Good example
    Year;Visitors;Viewing time
    2013;822101;00:02:59
    2012;792967;00:02:52
    2011;721519;00:03:44
    2010;707402;00:03:50
    2009;429430;00:03:16

Helpful links and tools

Title: CSVLint
Description: This online tool helps you to detect white space between delimiters and values (open source).
Link: https://csvlint.io
[Screenshot: CSVLint message 'Structural problem: Unexpected whitespace on row 43', followed by the offending row of a product catalogue.]

Good example
The good example below shows a cleaned version of the same data. All additional information has been removed.
[Table: contents not recoverable from the source.]

1.2.1.5. Ensure that all rows have the same number of columns

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: Data following a given schema; Processability of file format and media type

It is very important that each row has the same number of columns and thus follows the structure of a CSV. This means that each row should have the same number of delimiters. If one row is missing a value, this usually gets interpreted as 'null'. This can lead to erroneous processing of the data.
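A check of this kind could look roughly like the following Python sketch. It assumes that the header row defines the expected number of columns; the function name rows_with_unexpected_width and the sample data are illustrative only.

```python
import csv
import io


def rows_with_unexpected_width(text, delimiter=";"):
    """Return (row number, column count) for rows that differ from the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])  # assumption: the header row is correct
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]


sample = (
    "Year;Visitors;Viewing time\n"
    "2014;768954;00:03:18\n"
    "2013;822101\n"
    "2012;792967;00:02:52\n"
    "2010;00:03:50\n"
)
problems = rows_with_unexpected_width(sample)
for number, width in problems:
    print(f"row {number} has {width} columns")
```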
If your CSV contains rows with a different number of columns, you should check whether values have been escaped incorrectly (e.g. a value contains a semicolon which is not masked and thus gets interpreted as a delimiter).

Bad example
    Year, Visitors
    2014;768954;00:03:18;
    2013;822101
    2012;792967;00:02:52;
    2011;721519;00:03:44;
    2010;00:03:50
    2009;429430;00:03:16;

Good example
    Year;Visitors;Viewing time
    2014;768954;00:03:18
    2013;822101;00:02:59
    2012;792967;00:02:52
    2011;721519;00:03:44
    2010;707402;00:03:50
    2009;429430;00:03:16

Helpful links and tools

Title: GoodTables
Description: GoodTables is a tool that validates tabular data and checks, for example, whether all rows have the same number of columns (open source).
Link: https://frictionlessdata.io/tooling/goodtables/#a-simple-example

Title: CSVLint
Description: This online tool helps to detect rows that contain a different number of columns (open source).
Link: https://csvlint.io

1.2.1.6. Indicate units in an easily processable way

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: Data following a given schema; Processability of file format and media type

Numeric values should follow the general recommendations given in Section 1.1. A value's unit should be stated in the relevant column header so that the unit is clear to the user. Additionally, the unit of measurement used in the data can be referenced in the corresponding stat:dcat metadata. If the unit varies, a dedicated column for the unit should be used. Putting the unit directly after the numeric value in the same cell makes it harder for users to process the data. Ideally, the corresponding values from the controlled vocabulary of measurement units should be used.
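Where existing data mixes values and units in one cell, a clean-up step could split them into separate columns. The following Python sketch assumes cells of the form '16g' or '2 mg' with a dot as decimal separator; the function name split_value_and_unit and the accepted unit characters are assumptions made for illustration.

```python
import re

# A number (dot as decimal separator) followed by a unit made of letters or %.
VALUE_WITH_UNIT = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*$")


def split_value_and_unit(cell):
    """Split a cell such as '16g' into a numeric value and a unit string."""
    match = VALUE_WITH_UNIT.match(cell)
    if match is None:
        raise ValueError(f"cannot separate value and unit in {cell!r}")
    number, unit = match.groups()
    return float(number), unit


for cell in ["16g", "2mg"]:
    value, unit = split_value_and_unit(cell)
    print(f"{value};{unit}")
```

Cells that do not match the assumed pattern raise an error instead of being guessed at, so that they can be reviewed manually.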
Bad example
    Ingredient       Amount
    Carbohydrates    16g
    Magnesium        2mg

Good example
    Ingredient       Amount    Unit
    Carbohydrates    16        g
    Magnesium        20        mg

Controlled vocabulary of measurement units: https://op.europa.eu/en/web/eu-vocabularies/at-dataset/-/resource/dataset/measurement-unit

1.2.2. XML

Please check the general recommendations in Section 1.1, which also apply to XML files.

1.2.2.1. Provide an XML declaration

Dimension: Reusability
Indicator: Consistency
Metrics: Compliance with community standards

Each XML file should have a complete XML declaration. The declaration contains metadata regarding the structure of the document and is important for applications to process the file properly. For example, the XML version and the character encoding are typically stated in the declaration.

Bad example
This screenshot shows an XML file without a declaration.
[XML snippet: a fruit record (Apple, Germany, true) without a declaration; markup lost in extraction.]

Good example
This screenshot shows the same XML with a properly formatted declaration.
[XML snippet: fruit records (Apple, Germany, true; Grape, Italy, false) preceded by an XML declaration; markup lost in extraction.]

1.2.2.2. Escape special characters

Dimension: Reusability
Indicator: Consistency

When special characters are used in XML files, they need to be escaped. This ensures a sound file structure and prevents applications that process the file from misinterpreting the data. Characters are escaped by replacing them with the equivalent XML entities. An overview of the characters is shown in Table 1.

Table 1. Characters that need escaping in XML
    Character    Entity
    &            &amp;
    <            &lt;
    >            &gt;
    "            &quot;
    '            &apos;

Bad example
This screenshot shows an XML file without escaping.
[XML snippet: element content 'Apple <> Germany' and '"Very tasty!"' written with unescaped special characters; markup lost in extraction.]

Good example
This screenshot shows the same XML with properly escaped characters.
[XML snippet: the same content written as 'Apple &lt;&gt; Germany' and '&quot;Very tasty!&quot;'; markup lost in extraction.]

Helpful links and tools

Title: XML Escape / Unescape
Description: Online tool that escapes special characters in text so that they can be used in XML (open source).
Link: https://www.freeformatter.com/xml-escape.html

1.2.2.3. Use meaningful names for identifiers

Dimension: Reusability
Indicator: Consistency
Metrics: Compliance with community standards

All identifiers, whether tags or attributes, should have meaningful names and should ideally not be used twice. There are no official recommendations regarding the spelling of identifiers, so you can use, for example, camelCase or PascalCase. However, different forms should not be mixed. Furthermore, special characters should not be used in identifiers.

Bad example
This example shows XML with the 'fairtrade' identifier (i.e. the element's name) not written in PascalCase or camelCase, making it harder for humans to read and thus prone to processing errors.
[XML snippet: a fruit record (Apple, Germany, true, true) using the element name 'fairtrade'; markup lost in extraction.]

Good example
This screenshot shows XML with an identifier consisting of two words concatenated via camelCase.
[XML snippet: the same fruit record using a camelCase element name; markup lost in extraction.]

Helpful links and tools

Title: Title Case
Description: This tool converts phrases consisting of multiple words into various case formats (open source).
Link: https://titlecase.com/

1.2.2.4. Use attributes and elements correctly

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: Data following a given schema; Processability of file format and media type

While there is no binding rule as to whether data should be encoded in elements or attributes, it has become established best practice that information that is part of the actual data should be represented by elements. Metadata that contains additional information should instead be implemented as attributes. For example, in the snippet labelled 'good example', the 'id' is part of the metadata and thus an attribute of a 'fruit' element. In the snippet labelled 'bad example', information has been encoded in attributes for which elements should have been used instead.

Bad example
This screenshot shows XML in which data has been encoded using attributes where elements would have been more suitable.
[XML snippet: a fruit record whose data (e.g. Germany) is stored in attributes; markup lost in extraction.]

Good example
This screenshot shows XML in which data and metadata have been encoded using elements and attributes correctly.
[XML snippet: a fruit record (Apple, Germany, true) with the data stored in elements and the 'id' stored as an attribute; markup lost in extraction.]

Helpful links and tools

Title: XML specification
Description: W3C recommendation for XML (open source).
Link: https://www.w3.org/TR/2006/REC-xml11-20060816/

1.2.2.5. Remove program-specific data

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: Data following a given schema; Processability of file format and media type

XML, like any open format, should always be independent of the specific programs or tools used to process the files. This allows users to choose the tool they prefer for processing the data without having to sanitise it first.

Bad example
This screenshot shows XML which contains the version number of a hypothetical program that has been used to create or process the file. This information does not add anything to the data and should thus be removed.
[XML snippet: a fruit record (Apple, 'Very tasty') with an additional entry naming the program 'myXmlTool'; markup lost in extraction.]

1.2.3. RDF

Please check the general recommendations in Section 1.1, which also apply to RDF files.

1.2.3.1. Use HTTP URIs to denote resources

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: Data following a given schema; Processability of file format and media type

Resource IDs should be HTTP URIs, since ideally these allow direct access to the resource in question. They also make resources indexable by search engines, which enhances their findability. This only applies, however, if these identifiers are persistent and do not contain volatile information, for example credentials.

Bad example
This screenshot shows a resource in RDF/XML which is not denoted via an HTTP URI.

Good example
This screenshot shows a resource in RDF/XML which is denoted via an HTTP URI.

1.2.3.2. Use namespaces when possible

Dimension: Reusability
Indicator: Consistency
Metrics: Compliance with community standards

While namespaces are not required for processing RDF, they reduce verbosity and file size.
Similarly to the recommendations regarding plain XML, identifiers for classes should be written in PascalCase, while identifiers for properties are typically written in camelCase.

Bad example
RDF without namespaces and identifier conventions applied can be harder to read.
[RDF snippet: a resource with the title 'Sample' written without namespaces; markup lost in extraction.]

Good example
This screenshot shows the use of namespaces as well as conventions for class and property identifiers, which improves readability.
[RDF snippet: the same resource with the title 'Sample' written with namespaces; markup lost in extraction.]

Helpful links and tools

Title: Ontotext
Description: Tool that allows the import of structured data and its conversion to RDF data. Namespaces can be defined during the import (commercial/open source).
Link: https://www.ontotext.com/products/ontotext-platform/

Title: Anzo
Description: Platform that allows the transformation of structured and semi-structured data into RDF graphs. The data can then be queried and analysed on the graph (commercial/open source).
Link: https://www.cambridgesemantics.com/product/

Title: OpenRefine
Description: OpenRefine is a refinement tool for cleaning data. It features a built-in exporter to generate RDF files (open source).
Link: https://openrefine.org/

Title: Trifacta Wrangler
Description: Trifacta Wrangler is a suite of data preparation tools. It allows the transformation of different formats, thereby cleaning and merging data. RDF is among the supported formats (commercial).
Link: https://www.trifacta.com/products/wrangler-editions/#wrangler

1.2.3.3. Use existing vocabularies when possible

Dimension: Interoperability
Indicator: Conformity/compliance; Machine readability/processability
Metrics: DCAT-AP compliance of metadata; Conformity of file formats and licences; Conformity to access property values; Data following a given schema; Usage of controlled vocabularies

Existing vocabularies should be reused whenever possible. The Publications Office provides such vocabularies for use with DCAT-AP.

Bad example
This screenshot shows the licence of a data set referenced without using the controlled vocabulary.
This makes further processing much harder and is prone to spelling errors.

For further processing it is important to use suitable data types. For example, numbers should be encoded using the number type, and Boolean values using the Boolean type. This prevents errors stemming from encoding prohibited values, for example a value other than 'true', 'false' or 'null' for Boolean fields.

Bad example
This screenshot shows a JSON file with various data types. All information has been encoded using strings, regardless of the underlying data type.
    {
      "type": "apple",
      "fairTrade": "true",
      "amount": "5"
    }

Good example
This screenshot shows the same JSON file, this time with dedicated data types where applicable.
    {
      "type": "apple",
      "fairTrade": true,
      "amount": 5
    }

Helpful links and tools

Title: JSONLint
Description: This online tool checks whether your input is valid JSON (open source).
Link: https://jsonlint.com

1.2.4.2. Use hierarchies for grouping data

Dimension: Interoperability
Indicator: Machine readability/processability
Metrics: Processability of file format and media types

Instead of attaching all fields to the root JSON object, data should be grouped semantically. This improves readability for humans and can enhance performance when processing the file. Also, many tools allow objects and arrays to be collapsed, which lets users navigate quickly to the desired information.

Bad example
This screenshot shows a JSON file without grouped data. All information has been attached to the root object. For objects with a larger number of fields, this can quickly reduce readability.
    {
      "type": "apple",
      "calcium": 6.0,
      "magnesium": 5.0,
      "zinc": 0.0
    }

Good example
This screenshot shows the same JSON file with semantically grouped data.
    {
      "type": "apple",
      "nutrients": {
        "calcium": 6.0,
        "magnesium": 5.0,
        "zinc": 0.0
      }
    }

1.2.4.3. Only use arrays when required

Dimension: Interoperability
Indicator: Machine readability/processability
Metrics: Processability of file format and media types

Data should only be encoded into arrays if the size of the list is dynamic, i.e. not known beforehand or subject to change. If this is not the case, using explicit fields makes further processing easier. In addition, it cannot be guaranteed that the values in an array are always provided in the same order, which makes the data prone to erroneous interpretation.

Bad example
This screenshot shows a JSON file that uses an array, but it is unclear which nutrients the values refer to. Dedicated fields would have been more useful in this scenario.
    {
      "type": "apple",
      "nutrients": [6.0, 5.0, 0.0]
    }

Good example
This screenshot shows the same data encoded with dedicated fields instead of an array.
    {
      "type": "apple",
      "nutrients": {
        "calcium": 6.0,
        "magnesium": 5.0,
        "zinc": 0.0
      }
    }

1.2.5. APIs

Please check the general recommendations in Section 1.1, which also apply to APIs.

1.2.5.1. Use correct status codes

Dimension: Accessibility
Indicator: Accessibility/availability
Metrics: Access URL accessible; Download URL accessible; Downloadable without registration

APIs are typically available via URLs, which should be publicly accessible without credentials. These URLs can be called via various methods defined in HTTP. In addition to the actual payload, each server also sends a status code when answering requests from clients. These codes indicate whether the request was served successfully. For example, 200 indicates no problems, whereas 404 indicates that a resource was not found. An overview of the available methods and typically used status codes is shown in Table 2.

Table 2. Overview of methods and status codes
    Method    Description                                                              Success codes    Error codes
    GET       Retrieves a resource without altering it.                                200              404
    POST      Uploads a new resource to the server.                                    201              400, 401, 403
    PUT       Replaces an existing resource with a new complete resource.              200, 204         400, 401, 403
    PATCH     Replaces selected parts of a resource without replacing it entirely.     200, 204         400, 401, 403
    DELETE    Deletes an existing resource from a server.                              200, 204         400, 401, 403

Bad example
This screenshot shows a GET request on a resource. However, contrary to the HTTP standard, the status code '202 Accepted' is returned.
[Screenshot: GET request for my_resource on example.org; response '202 Accepted'.]

Good example
This screenshot shows a GET request on a resource. As intended by the HTTP standard, the correct status code '200 OK' is returned.
[Screenshot: GET request for my_resource on example.org; response '200 OK'.]

Helpful links and tools

Title: HTTP status codes
Description: This site provides a list of all status codes and their meanings (open source).
Link: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml

1.2.5.2. Set correct headers

Dimension: Reusability
Indicator: Accuracy
Metrics: File format accuracy; Content size accuracy

In addition to status codes, the HTTP standard allows metadata to be encoded via headers. These are not part of the actual payload (i.e. website or resource) that is requested. However, information of interest to the consumers of the data can be encoded here. Appropriate headers must be used, and the metadata encoded must be accurate and match the payload. A list of typical headers is shown in Table 3.

Table 3. Typical headers that are used in conjunction with APIs
    Content-Type: Indicates the payload's MIME type (see https://www.iana.org/assignments/media-types/media-types.xhtml).
    Content-Length: Indicates the size of the payload in bytes.
    [Header name lost in extraction]: Indicates the checksum of the payload. A checksum allows the user to check whether the payload has been downloaded in its entirety and has not been corrupted or changed during transmission.
    Accept: If an endpoint offers a payload in multiple formats, this header can be used by the client to indicate the desired format. As with the Content-Type header, a MIME type must be specified.

Bad example
This screenshot shows the two headers 'Content-Length' and 'Content-Type' returned for a GET request on a resource. However, the 'Content-Type' header is incorrect, since JSON has been returned instead of plain text.
[Screenshot: response '200 OK' with the header 'Content-Type: text/plain' and a JSON body containing the fields "code": 200 and "description": "Ok".]

Good example
This screenshot shows the two headers 'Content-Length' and 'Content-Type' returned for a GET request on a resource. Note that the 'Accept' header has been sent with the request, indicating the desired format of the resource.
[Screenshot: GET request with the header 'Accept: application/json'; response '200 OK' with the headers 'Content-Length: 152' and 'Content-Type: application/json; charset=utf-8' and a JSON body.]

Helpful links and tools

Title: HTTP headers
Description: W3C RFC about HTTP headers (open source).
Link: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

1.2.5.3. Use paging for large amounts of data

Dimension: Reusability
Indicator: Relevance
Metrics: Appropriate amount of data

Requesting large amounts of data can easily create high loads on the server. In some cases, not all data is required, or not all at once. To reduce this load and improve response times, pagination should be used where applicable. This means that slices of data are served instead of an entire data set.
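Serving slices can be sketched in a few lines of Python. The sketch below assumes the widely used parameter names offset and limit; the helper names paginate and page_url as well as the base URL are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlencode


def paginate(items, offset=0, limit=10):
    """Return the slice of items that starts after `offset` entries."""
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must not be negative")
    return items[offset:offset + limit]


def page_url(base_url, offset, limit):
    """Build a request URL carrying the two pagination parameters."""
    return f"{base_url}?{urlencode({'limit': limit, 'offset': offset})}"


results = list(range(1, 11))                 # ten results, numbered 1 to 10
page = paginate(results, offset=5, limit=3)  # skip five results, then take three
print(page)
print(page_url("https://example.org/api/items", offset=5, limit=3))
```

With an offset of five and a limit of three, the sketch returns the results 6, 7 and 8. A request beyond the end of the data simply yields an empty list.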
The client can state in the request which slice to retrieve, as well as its size. This is typically achieved using the parameters shown in Table 4.

Table 4. Pagination using offset and limit parameters
    offset: Specifies the resource from which to start counting.
    limit: Specifies how many resources shall be retrieved.

Good example
This screenshot shows an exemplary call to an API supporting pagination. The offset is five and the limit is three, therefore the results 6, 7 and 8 are returned.
    // https://demo.ckan.org/api/3/action/package_list?limit=3&offset=5
    {
      "help": "https://demo.ckan.org/api/3/action/help_show?name=package_list",
      "success": true,
      "result": [
        "02a8c314-e726-44fb-88da-2e535e788675",
        "02b-expenditure-over-25k-apr-19",
        "02b-expenditure-over-25k-feb-20"
      ]
    }

Helpful links and tools

Title: Postman
Description: Tool for making HTTP requests (commercial/open source).
Link: https://www.postman.com/

1.2.5.4. Document the API

Dimension: Reusability
Indicator: Understandability
Metrics: Description of data given; Documentation of data given

APIs should be specified as thoroughly as possible. This includes available paths, returned formats and status codes. If an API allows file uploading, the expected payload should also be stated. Examples help potential users to use APIs. One standard used to describe APIs is OpenAPI. It allows either JSON or YAML to be used for describing APIs.

Example
An example of an OpenAPI specification for an API serving data about fruit can be seen in the screenshot below.
    openapi: 3.0.0
    info:
      version: "1.0"
      title: fruit-service
    paths:
      /fruit/{type}/{format}:
        get:
          summary: Returns statistics about the requested fruit
          operationId: getStatistics
          parameters:
            - name: type
              in: path
              description: Type of fruit
              required: true
              schema:
                type: string
          responses:
            '200':
              description: JSON
              content:
                application/json:
                  schema:
                    $ref: '#/fruitSchema'
            '404':
              description: Resource not found

Helpful links and tools

Title: OpenAPI Specification
Description: Specification of the OpenAPI format (commercial/open source).
Link: https://swagger.io/specification/

Title: Swagger editor
Description: An online editor for creating and validating OpenAPI specifications (commercial/open source).
Link: https://swagger.io/tools/swagger-editor/

Title: Swagger UI
Description: An online visualiser for displaying OpenAPI specifications (commercial/open source).
Link: https://swagger.io/tools/swagger-ui/

2. Recommendations for data standardisation (with EU controlled vocabularies) and data enrichment

Introduction

With the ever-increasing volume of data on the web, standardisation is becoming more and more relevant. Data which must be converted to a common format before processing hinders further usage. Data standardisation increases processability.

Enrichment is the concept of linking data from external sources to existing data sets. This data can come from, among others, public authorities or open knowledge bases. Linking data can increase its value by creating new relationships and thus allowing new kinds of analysis. For example, if the database of a car authority containing licence plates and models is enriched and made interoperable with data about where the cars are registered, insights can be gained into which manufacturers are preferred in certain parts of the country. Standardisation and enrichment are both part of the enriching process (see Figure 7).
[Figure: the data preparation process as a sequence of six steps: Profiling (understanding data, exploring data values), Distilling (refining, structuring and cleansing data), Enriching (augmenting data), Documenting (data usage recommendations, versioning), Validating (assessing data quality) and Publishing (publishing in open, machine-readable formats).]
Figure 7. Data preparation process - Enriching

The aim of this section is to give data providers actionable recommendations which enable them to publish data sets with a high level of standardisation and enrichment. Section 2.1 contains recommendations on how to reuse concepts from controlled vocabularies. Another aspect of reusing controlled vocabularies is the harmonisation of labels, which is introduced in Section 2.2. Section 2.3 focuses on recommendations regarding dereferencing label translations. Finally, Section 2.4 gives recommendations on how to link and augment data.

2.1. Reuse unambiguous concepts from controlled vocabularies
This section covers recommendations on achieving a high level of standardisation and on enriching data. A higher level of data standardisation can be achieved by integrating RDF vocabularies such as lists of authorities, taxonomies, classifications or terminologies into the data. These controlled vocabularies describe, identify and organise the concepts unambiguously in their area of expertise and can be reused to harmonise or augment the data.

In RDF vocabularies, each concept is identified by a unique resource identifier (URI), enabling any system to refer to it unambiguously. This is important, as it allows these concepts to be referenced from anywhere once they have been published on the web. These references then form a web of linked data, i.e. the semantic web (19). Using URIs that are only valid and/or unique within a certain namespace would fail to achieve this.
Example
The European multilingual classification of skills, competences, qualifications and occupations (ESCO) works as a dictionary, describing, identifying and classifying professional occupations, skills and qualifications relevant for the EU labour market and education and training. These concepts are reused by different online platforms, which use ESCO for services such as matching jobseekers to jobs on the basis of their skills or suggesting training to people who want to reskill or upskill.

The Publications Office maintains a number of EU Vocabularies and Authority Tables used in data.europa.eu in order to standardise the metadata (extension of DCAT-AP), as can be seen in the screenshots below.

[Screenshot: EU Vocabularies website, page 'DCAT-AP-OP', listing the controlled vocabularies used in the EU Open Data Portal in order to standardise the metadata (extension of DCAT-AP).]

[Screenshot: EU Vocabularies website, page 'Authority tables'. The page states that, in order to harmonise and standardise the codes and the associated labels used in the Publications Office and for inter-institutional data exchange (i.e. between the institutions involved in the legal decision-making process gathered in the Interinstitutional Metadata Maintenance Committee (IMMC)), a number of Named Authority Lists (NALs) have been defined.]

(19) https://www.w3.org/standards/semanticweb/
[Screenshot, continued: list of the NALs maintained in the Metadata Registry, for example Access right, Corporate body, Country, File type, Frequency, Language, Licence and Resource type.]

2.2. Harmonise the tables
Instead of hardcoding labels into data, these labels can be referenced by unique identifiers, i.e. URIs. This means that if those labels change, the reference does not need to be adjusted, reducing the burden of maintenance for data providers.
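Replacing hardcoded codes or labels with concept URIs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the record layout, the field name home_country and the helper harmonise_country are hypothetical, and the lookup table only covers three entries of the Publications Office Country authority table.

```python
# Hypothetical lookup table: two-letter codes found in the source data, mapped
# to concept URIs from the Publications Office Country authority table.
COUNTRY_URIS = {
    "ES": "http://publications.europa.eu/resource/authority/country/ESP",
    "DE": "http://publications.europa.eu/resource/authority/country/DEU",
    "FI": "http://publications.europa.eu/resource/authority/country/FIN",
}


def harmonise_country(record, field):
    """Return a copy of `record` with the code in `field` replaced by its URI."""
    code = record[field]
    if code not in COUNTRY_URIS:
        # Fail loudly instead of publishing an unmapped value.
        raise KeyError(f"no URI known for country code {code!r}")
    harmonised = dict(record)  # leave the input record untouched
    harmonised[field] = COUNTRY_URIS[code]
    return harmonised


# Illustrative record; the field names are assumptions.
row = {"student_id": "X1", "home_country": "ES", "age": 21}
print(harmonise_country(row, "home_country"))
```

Because the data now carries only the identifier, a later change to a country label in the authority table requires no change to the published data.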
Example
The example below shows a sample of data from Erasmus statistics:

STUDENT_ID  HOME_INSTITUTION_CDE  HOME_INSTITUTION_CTRY_CDE  STUDENT_AGE_VALUE  STUDENT_GENDER_CDE  STUDENT_NATIONALITY
X1          E ALICANT01           ES                         21                 F                   ES
X3          D KOLN07              DE                         25                 F                   DE
X6          SF TURKU01            FI                         21                 F                   …

The value provided in the 'student nationality' or 'home institutions' fields can be standardised based on the Country table. Instead of encoding the country code (here: ES, DE or FI), the corresponding unique identifiers for these countries can be provided and additional data can be derived from the country identifier, such as the country label or the country ISO code (two or three letters).

HOME_INSTITUTION_CTRY_CDE  Unique identifier (Country table)                              Label (Country table)  ISO code, 2 letters (Country table)  ISO code, 3 letters (Country table)
ES                         http://publications.europa.eu/resource/authority/country/ESP  Spain                  ES                                   ESP
DE                         http://publications.europa.eu/resource/authority/country/DEU  Germany                DE                                   DEU
FI                         http://publications.europa.eu/resource/authority/country/FIN  Finland                FI                                   FIN

2.3. Dereference the translation of a label
Once the labels are indicated by the unique identifiers from the controlled vocabularies, the URIs can be dereferenced. This allows the label to be resolved in any language supported by the controlled vocabulary.

Example
The example illustrates the 'meter' concept in the Measurement Unit table. The 'meter' concept is represented by different preferred labels (prefLabel) in the different EU official languages. Assigning the 'meter' concept URI to your data set enables the automatic dereferencing of the different language versions and offers enhanced access to your data. In addition, even if one translation is updated in the Authority Table, there is no need to update the data that reference the 'meter' concept. The URI will automatically dereference to the right value from the table.
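The label lookup described above can be sketched as follows in Python, using only the standard library. This is an illustration under assumptions: SAMPLE_RDF is a hand-written excerpt standing in for the RDF/XML that a client would obtain by dereferencing the concept URI, and pref_label is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MTR = "http://publications.europa.eu/resource/authority/measurement-unit/MTR"

# Illustrative excerpt only; a real client would fetch this document by
# dereferencing the concept URI instead of embedding it.
SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="{MTR}">
    <skos:prefLabel xml:lang="da">meter</skos:prefLabel>
    <skos:prefLabel xml:lang="de">Meter</skos:prefLabel>
    <skos:prefLabel xml:lang="el">μέτρο</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def pref_label(rdf_xml, concept_uri, lang):
    """Return the skos:prefLabel of `concept_uri` in `lang`, or None if absent."""
    root = ET.fromstring(rdf_xml)
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        if concept.get(f"{{{RDF}}}about") != concept_uri:
            continue
        for label in concept.findall(f"{{{SKOS}}}prefLabel"):
            if label.get(XML_LANG) == lang:
                return label.text
    return None


print(pref_label(SAMPLE_RDF, MTR, "de"))
```

The data set itself stores only the concept URI; the language of the label is chosen when the data is displayed.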
The two screenshots below show the 'meter' concept in RDF (top) from the corresponding Measurement Unit table from the EU Vocabularies website (bottom).

<skos:Concept rdf:about="http://publications.europa.eu/resource/authority/measurement-unit/MTR" at:deprecated="false">
  <skos:inScheme rdf:resource="http://publications.europa.eu/resource/authority/measurement-unit"/>
  <!-- further elements carry the code MTR, the symbol m and the start date 1952-07-23 -->
  <skos:prefLabel xml:lang="bg">метър</skos:prefLabel>
  <skos:prefLabel xml:lang="da">meter</skos:prefLabel>
  <skos:prefLabel xml:lang="de">Meter</skos:prefLabel>
  <skos:prefLabel xml:lang="el">μέτρο</skos:prefLabel>
  <!-- preferred labels in the remaining languages follow -->

[Screenshot: EU Vocabularies website, concept scheme 'Measurement unit', Authority Table URI http://publications.europa.eu/resource/authority/measurement-unit. The table lists the code, label, validity dates and definition of each concept; the first entries are shown below.]

Code  Label      Valid since
2N    decibel    1952-07-23
3C    manmonth   1952-07-23
AD    byte       1952-07-23
AMP   ampere     1952-07-23
BAR   bar        1952-07-23
BIT   bit        1952-07-23
BQL   becquerel  1952-07-23
C34   mole       1952-07-23
C45   nanometre  1952-07-23
CDL   candela    1952-07-23

2.4. Linking and augmenting your data
Consistent use of unique identifiers also allows linkage and augmentation with external data.
This adds value to existing data by linking to new concepts or aspects of existing data. Optimal usage of controlled vocabularies can be achieved using a four-star data format such as RDF or JSON-LD.

Example
The screenshot below shows a data set which lists the names of common cosmetic ingredients with their corresponding Chemical Abstracts Service registry number (CAS number).

[Screenshot: table of cosmetic ingredients with the columns Reference number, Chemical name / INN, Name of Common Ingredients Glossary, CAS Number and EC Number. The first row lists benzoic acid and its sodium salt (BENZOIC ACID; SODIUM BENZOATE) with the CAS numbers 65-85-0 / 532-32-1.]

Linking the CAS number with the corresponding value in the Chemical Entities of Biological Interest dictionary (ChEBI) would augment the data set with new derived data (synonyms, standardised identifiers and cross references). ChEBI is a freely available dictionary of molecular entities focused on 'small' chemical compounds and is shown in the screenshot below.
[Screenshot: ChEBI entry for benzoic acid, showing its position in the ChEBI ontology and its cross references to other databases.]

For example, CAS number 65-85-0 (benzoic acid) has the identifier CHEBI:30746 represented by the URI (indicated by the red arrow).

Another example is illustrated by two data sets published in JSON-LD - 'Pesticide (New)' and 'Pesticide-EPPO'. They contain data from the collection of the single active substances and their maximum residue levels (MRLs) related to foodstuffs intended for human or animal consumption in the European Union. The data set, which is shown in the screenshot below, corresponds to the foodstuff product 'Orange'.
{
  "@id": "http://data.europa.eu/dph/id/pesticides/product/0110020",
  "@type": "http://data.europa.eu/dph/def/pesticides#Product",
  "hasParentProduct": "http://data.europa.eu/dph/id/pesticides/product/0110000",
  "productType": "http://data.europa.eu/dph/id/pesticides/productType/4",
  "label": [
    { "@language": "cs", "@value": "Pomeranče" },
    { "@language": "mt", "@value": "Larinġ" },
    { "@language": "pt", "@value": "Laranjas" },
    { "@language": "pl", "@value": "Pomarańcze" },
    { "@language": "nl", "@value": "Sinaasappelen" },
    { "@language": "lv", "@value": "Apelsīni" },
    { "@language": "it", "@value": "Arance dolci" },
    { "@language": "lt", "@value": "Apelsinai" },
    { "@language": "hr", "@value": "Naranča" },
    { "@language": "fr", "@value": "Oranges" },
    { "@language": "en", "@value": "Oranges" },
    { "@language": "el", "@value": "Πορτοκάλια" },
    { "@language": "fi", "@value": "Appelsiinit" },
    { "@language": "es", "@value": "Naranjas" },
    { "@language": "et", "@value": "Apelsinid" },
    { "@language": "de", "@value": "Orangen" },
    { "@language": "la", "@value": "Citrus sinensis" },
    { "@language": "da", "@value": "Appelsiner" },
    { "@language": "da", "@value": "Appelsiner" },
    { "@language": "hu", "@value": "Narancs" },
    { "@language": "bg", "@value": "Портокали" },
    { "@language": "sk", "@value": "pomaranče" },
    { "@language": "ro", "@value": "Portocale" },
    { "@language": "sv", "@value": "Apelsiner" },
    { "@language": "sl", "@value": "Pomaranče" }
  ]
}

This foodstuff product contains 'Fenoxicarb pesticide residues' identified by the URI http://data.europa.eu/dph/id/pesticides/substance/299. This relationship is shown in the screenshot of the MRL data set below (the 'Orange' foodstuff is highlighted in blue).
{
  "@id": "http://data.europa.eu/dph/id/pesticides/mrlList/100283",
  "@type": "http://data.europa.eu/dph/def/pesticides#MaximumResidueLevel",
  "applicationDate": "2008-09-01T00:00:00",
  "mrlValue": "2",
  "product": "http://data.europa.eu/dph/id/pesticides/product/0110020",
  "residue": "http://data.europa.eu/dph/id/pesticides/substance/299"
}

All information regarding this pesticide can be retrieved by dereferencing the URI given in the residue property (highlighted in green in the screenshot). This yields the data shown in the screenshot below:

{
  "@id": "http://data.europa.eu/dph/id/pesticides/substance/299",
  "@type": [
    "http://data.europa.eu/dph/def/pesticides#Residue",
    "http://data.europa.eu/dph/def/pesticides#Substance"
  ],
  "hasSubstanceCode": "S029900",
  "isLegislatedBy": [
    "http://data.europa.eu/dph/id/pesticides/legalResource/publication-473",
    "http://data.europa.eu/dph/id/pesticides/legalResource/publication-518",
    "http://data.europa.eu/dph/id/pesticides/legalResource/publication-1"
  ],
  "isMemberOf": "http://data.europa.eu/dph/id/pesticides/substance/299",
  "label": [
    { "@language": "pl", "@value": "Fenoksykarb" },
    { "@language": "it", "@value": "Fenoxicarb" },
    { "@language": "es", "@value": "Fenoxicarb" },
    { "@language": "ro", "@value": "Fenoxicarb" },
    { "@language": "bg", "@value": "Феноксикарб" },
    { "@language": "pt", "@value": "Fenoxicarbe" },
    { "@language": "fi", "@value": "Fenoksikarbi" },
    { "@language": "lt", "@value": "Fenoksikarbas" },
    { "@language": "sk", "@value": "Fenoxykarb" },
    { "@language": "lv", "@value": "Fenoksikarbs" },
    { "@language": "cs", "@value": "Fenoxykarb" },
    { "@language": "sv", "@value": "Fenoxikarb" },
    { "@language": "hu", "@value": "Fenoxikarb" },
    { "@language": "de", "@value": "Fenoxycarb" },
    { "@language": "hr", "@value": "Fenoxycarb" },
    { "@language": "nl", "@value": "Fenoxycarb" },
    { "@language": "mt", "@value": "Fenoxycarb" },
    { "@language": "fr", "@value": "Fenoxycarb" },
    { "@language": "da", "@value": "Fenoxycarb" },
    { "@language": "en", "@value": "Fenoxycarb" },
    { "@language": "et", "@value": "Fenoksiikarb" },
    { "@language": "sl", "@value": "Fenoksikarb" },
    { "@language": "el", "@value": "Φενοξυκάρμπ (Fenoxycarb)" }
  ]
}

The 'Pesticide-EPPO' data set contains cross references between the entities contained in the 'Pesticides (New)' data set and the items in the EPPO Global Database. More specifically, the data set links the instances of the 'Pesticides (New) Product' class to the possible corresponding EPPO Global Database items, which enables a five-star ranking.

Helpful links and tools
Title: EU Vocabularies and Authority Tables
Description: EU Vocabularies and Authority Tables have been developed for the Publications Office in order to facilitate the exchange of data between the different information systems of the EU institutions (legislation, calls for tender, etc.) and describe data sets (open source).
Link: https://op.europa.eu/en/web/eu-vocabularies/authority-tables
Link: https://op.europa.eu/en/web/eu-vocabularies/dcat-ap-op

Title: OpenRefine
Description: Tool for cleaning and extending data from external sources (open source).
Link: https://openrefine.org/
Link: https://openrefine.org/documentation.html
Link: https://github.com/OpenRefine/OpenRefine

Title: Cleaning data with OpenRefine
Description: This article describes how to discover inconsistencies in data and how to diagnose the accuracy of data with OpenRefine (20).
Link: https://doaj.org/article/3ccd075407a4481c85c0d00d65a003c0

Title: OntoRefine (OntoText)
Description: Data transformation tool that can be used for converting tabular data into RDF (commercial / open source).
Link: http://graphdb.ontotext.com/documentation/free/loading-data-using-ontorefine.html#ontorefine-overview-and-features

(20) De Wilde, M., van Hooland, S. and Verborgh, R. (2013), 'Cleaning data with OpenRefine', The Programming Historian, 1 August 2013, Editorial Board of the Programming Historian, United Kingdom.

3. Recommendations for documenting data

Introduction
More and more data is published on the web every day. However, in order to improve interoperability and ease further processing, this data has to be documented. This way users can know what to expect with regard to both syntax (i.e. structure) and semantics (i.e. content). In addition to improving data quality for users, documentation can enhance the value of data, as misinterpretation of data becomes less likely when context is provided.

This section covers aspects relevant to step four of the data preparation process, which is shown in Figure 8.

[Figure: the data preparation process as a sequence of six steps: Profiling (understanding data, exploring data values), Distilling (refining, structuring and cleansing data), Enriching (augmenting data), Documenting (data usage recommendations, versioning), Validating (assessing data quality) and Publishing (publishing in open, machine-readable formats).]
Figure 8. Data preparation process - Documenting

The aim of this section is to give actionable recommendations covering the tasks involved in documenting data, aided by tools. This includes documenting structure and meaning, as well as proper versioning.
A general recommendation on where to publish documentation is given in Section 3.1. Section 3.2 contains recommendations on using schemas to document data structures. In addition to the structure, the meaning of data should also be documented, which is covered in Section 3.3. Section 3.4 contains recommendations on the various aspects of documenting data changes.

3.1. Publish your documentation
The topics covered in this section include describing data structures, i.e. the internal representation of files, and tracking changes of data.

Developing a data management plan (DMP) before publishing data is crucial for achieving a coherent data structure. The plan should cover aspects such as expected/targeted data models, whether raw data will be used and how the data will be processed. Section 1.1 contains more information about DMPs.

Regardless of the format or file type used for documenting data, it is vital that this documentation be published alongside the data, ideally in a separate distribution. This distribution should then be linked to the data itself via the dct:conformsTo property of data sets / distributions specified by the DCAT-AP (21) standard.

Example
This screenshot shows a distribution using the dct:conformsTo property to link to another distribution containing a schema specifying the data's structure.

[Screenshot: distribution 'Truck parking static data'.]

3.2. Use schemas to specify data structure
Although format standards specify the internal structure with regard to syntax and permitted keywords and identifiers, the data publisher can choose the way data is written to a file (i.e. serialised). For further processing, however, this serialisation must be made known to the user by means of data schemas. Instead of expecting the user to download and analyse the data, the serialisation schema can also be specified separately, often in a dedicated format.
The following sections provide descriptions of these schema languages and an overview of schema specifications for the most commonly used formats, namely JSON, XML, CSV and RDF.

3.2.1. How to specify JSON data structures
The schema language used for JSON files is called JSON Schema (22). The schemas are JSON files themselves, but contain information describing a data structure that can be represented as JSON. Data providers should publish a JSON schema that specifies the JSON structure along with their data.

(21) https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe/releases
(22) https://json-schema.org/

Example
An example of a JSON Schema describing employee data can be seen in the screenshot below.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "Employee data",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "number", "minimum": 0, "maximum": 100 }
  }
}

Helpful links and tools
Title: JSON Schema
Description: Vocabulary that allows users to annotate and validate JSON documents (open source).
Link: https://json-schema.org/

Title: JSON schema generator
Description: Online tool that generates a schema from existing JSON data (open source).
Link: https://jsonschema.net

3.2.2. How to specify XML data structures
There are multiple schema languages for specifying the structure of XML files, for example RELAX NG (23) and Schematron (24). XSD (XML Schema Definition Language) is recommended by the W3C and thus also endorsed in this document. An XSD file itself consists of XML. It is made up of two parts: structures (25) and data types (26). As the names suggest, the former defines the structural part of XSD whereas the latter defines data types that can be used in XSD. Overall, XSD specifies exactly which elements/attributes are allowed and what data type the content must have.
It is also possible to specify patterns to check for the correctness of data formats, such as postal codes, during validation. Data providers should publish XSD schemas that specify the XML structure alongside their data.

(23) https://relaxng.org/
(24) http://schematron.com/
(25) https://www.w3.org/TR/…
(26) https://www.w3.org/TR/…

Example
The screenshot shows an XSD schema which specifies the structure of sample data from the fruit domain. For example, it states that the 'drupe' value can either be true or false, not unknown.

Helpful links and tools
Title: Liquid Studio
Description: XML schema editor which allows generation of XSD files from existing XML (commercial).
Link: https://www.liquid-technologies.com/xml-schema-editor

Title: XMLFox
Description: XML editor that features XSD validation (commercial / open source).
Link: https://www.xmlfox.com/

3.2.3. How to specify CSV data structures
Frictionless Data (27) have developed a CSV table schema expressible in JSON. This means that the structure to which a CSV file must conform is described in a JSON file. At the time of writing, dedicated tooling support for creating Frictionless Data schemas is only available as libraries for various programming languages. However, since Frictionless Data is specified using JSON, any text editor with JSON support can be used for this task.

(27) https://…

The UK National Archives (28) have also published a CSV schema language (29) that can be used to describe the content of CSV files. It can be used to specify, among other things, the number of columns, whether values are mandatory or optional and what data range applies. Data providers should publish schemas in either the Frictionless Data or National Archives formats which specify the CSV table alongside the data.

Example
This screenshot shows the Frictionless Data schema for fictional employee data.
Note the restrictions: department IDs must consist of one to four capital letters, and employees are either retired or not.

{
  "fields": [
    { "name": "name", "type": "string", "description": "Employee's name" },
    {
      "name": "department",
      "type": "string",
      "description": "Department ID",
      "constraints": { "pattern": "[A-Z]{1,4}" }
    },
    { "name": "retired", "type": "boolean", "description": "Employee status" }
  ]
}

(28) https://www.nationalarchives.gov.uk/
(29) http://digital-preservation.github.io/csv-schema/csv-schema-1.1.html

Example
This screenshot shows the National Archives CSV schema for fictional employee data. It contains the same restrictions as in the previous example.

version 1.1
@separator ";"
@totalColumns 3
name: unique notEmpty
department: regex("[A-Z]{1,4}")
retired: is("yes") or is("no")

Helpful links and tools
Title: CSV Validator
Description: Cross-platform desktop application, command-line utility and programming library suitable for validating CSV files against the National Archives CSV schema.
Link: https://digital-preservation.github.io/csv-validator/

[Screenshot: CSV Validator desktop application with input fields for the CSV file and the CSV schema file, and a validation report flagging an error in the @totalColumns directive at line 2, column 1.]

3.2.4. How to specify RDF data structures
The prime way of defining the structure of RDF graphs is using ontologies. The structure of RDF can also be specified using schemas. SHACL (30) (Shapes Constraint Language) is a powerful concept that allows validation against these schemas. It specifies a syntax that can be used to define conditions that incoming RDF must conform to. Data providers should publish SHACL shape files that specify the RDF structure in addition to the actual data.

(30) http…

Example
This screenshot shows the SHACL shape file that specifies personal data. Note the constraint that the date of birth must be earlier than the date of death.

@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:givenName ;
        sh:datatype xsd:string ;
        sh:name "given name" ;
    ] ;
    sh:property [
        sh:path schema:birthDate ;
        sh:lessThan schema:deathDate ;
        sh:maxCount 1 ;
    ] .

If the sample data is validated against this SHACL file, the following report is generated. As expected, the mismatch between birth date and death date is detected as a violation.

[
    a sh:ValidationResult ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:LessThanConstraintComponent ;
    sh:sourceShape _:n703 ;
    sh:focusNode <…> ;
    sh:resultPath schema:birthDate ;
    sh:value "1971-07-07" ;
    sh:resultMessage "Value is not < value of schema:deathDate" ;
] .

The examples in this section are adapted from those provided by SHACL Playground (31).

(31) https://shacl.org/playground/

Helpful links and tools
Title: TopBraid Composer
Description: Standalone SHACL validator (commercial).
Link: https://www.topquadrant.com/products/topbraid-composer/

Title: SHACL Playground
Description: Web-based SHACL validation tool (open source).
Link: https://shacl.org/playground/

3.2.5. How to specify APIs
APIs are not files themselves, but serve data on the web, accessible by URL. For users to be able to easily use an API, it must be thoroughly documented. Data providers should document not only the structure of served data, but also how this data can be accessed on the web. Depending on the API's protocol, different documentation methods may be applicable. HTTP APIs should be documented according to the OpenAPI (32) standard.
This allows, among other operations, specification of URLS, HTTP status codes and structure of payloads (i.e. what the served data looks like). OpenAPl specifications can be written in eitherjSON or YAML. Recommendations on good APl design are given in Section 1.2. The following aspects of an APl should be specified: · URLS and endpoints; · the protocol(s) of the endpoints (e.g. HTTP, FTP); · access methods (e.g. HTTP methods, status codes); · ways to alter results (eg. query parameters, HTTP headers). Additionally, the semantic meaning of the served data should be explained. (") http ·: . .1.·. i , [.i . .i 1 72 3. Recommendations for documenting data W Example This screenshot shows a truncated snippet of an OpenAPl specification that defines the EU ODP'S APl for retrieving a data set ("). Aside from ensuring a sound structure with all mandatory fields set, the specification should be complete and exhaustive with regard to the aspects mentioned above. Meaningful descriptions and summaries help grasp the semantic meaning of the data served. paths: '/data set/{data set1d}.rdf': get: summary: Retrieve data set in RDF/XML description: > Return the full content of the data set in RDF operationld: getData set parameters: - name: data setld in: path description: data set identifier required: true style: simple explode: false schema: type: string responses: '2OO': description: OK content: application/rdf+xm1: schema: type: string '404': description: not found content: text/html: schema: type: string m (") https://app.swaggerhub.' ' Iii . ['I I ' '-' '[ " 'I -[ ". . -F' i . I "i -' [:' 'i _'kta _ portal/0.8.0 73 :113. Recommendations for documenting data Helpfullinks and tools Title Description Link Swagger Tooling that aids in editing and validating https://swagger.io/ 0penAPlspeci&ations(commercid/ open source). OpenAPl speci19cation Standard that de&es OpenAPl (commer- https://swagger.io/speci&ation/ cial / open source). 3.3. 
Document the semantics of data Depending on its complexity, publishing a schema is not always sufficient. While a schema describes the syntax and structure, it does not explain the semantics of data. A description ofthe individual properties of a data structure helps users interpret and reuse data correctly and in the way intended by the data provider. SExamPle The first screenshot shows a data set which links both the schema and the semantic description of its data. Note that all three link to a dedicated distribution, which contains links for accessing the files. The second screenshot shows a snippet of the HTML document of the semantic documentation. Documentation ,L DOWNLOAD ± DOWNLOAD G VISIT PAGE Consolidated Financial Sanctions File (XSD schema 1.0) xml schema Consolidated Financial Sanctions File (XSD schema 7. I) xml schema Foreign Policy Instruments website (Sanctions) I html_ 74 3. Recommendations for documenting data 4. Sanctions Schema Information The folbMng table pruvides a detaled desaip¢ion of the fields availabk in tm response of pnedelined query for Samtions Regrster. Field Name sn _sanctionEsmalD sn ,entityEsma[0 sn_ ncaCodeFdlName sr) _entltyLega|Fram~rkStr sn >andionLegalFramewcxkName sn _expiration0ate sn_date Field Type vnt text,genera1 string string string date date Comment Esm ID which uniquely idenl8fies a sanction, Enuty Esma ID uniWe|y binds a sanction to the autkxised entity. Sanction NCA FuN Name. ESMA registered entity's (Authonsed and registered entities) Legal Framemc Sanction's Legal Framework. Sanction Expiration Date. Sanction Date. Helpful links and tools Title Sphinx Readthe Docs Description Link Tool forcreating docu mentation. https://www.sphinx-doc.org/en/master/ Supports, among others, HTML, PDF and index.html plain text formats (open source). Open-source hosting service for docu- https://readthedocs.org/ mentation, for example those generated using Sphinx (open sou rec). 3.4. 
Document data changes

Data is likely to change over time. For example, schedules for public transport may be updated during roadworks, and if a new politician is elected their name may be added to the list of elected representatives. It is important to document all such changes. More precisely, users must know that data has changed, what has changed and where to find other versions of the data. This section contains recommendations covering all three aspects.

3.4.1. Adopt a data set release policy

When you have to update your data, it is important to consider the following questions.
· What constitutes a change in the data set?
· What is the impact of the new release: is it a major or minor change in the data?
· What is the importance of the change from the reuser's perspective?

The data set release policy can be defined in the DMP. Steps include, among others, defining the file naming convention, release number and update frequency. Section 1.1 contains recommendations for creating DMPs.

3.4.2. Differentiate between a major and a minor release of a data set

If a new instance of a data set differs from its predecessor in a way that affects reusers, it can be considered a new major release, and it is recommended that a new entry for this data set be created in the data catalogue. If the change in the data is minor and does not impact the reuser, it is recommended that the existing data set description be updated in the data catalogue.

Example
Eurobarometer studies monitor public opinion in the European Union Member States and candidate countries. The survey results are regularly published in official reports. Each data set is part of a collection (Eurobarometer) and results in a succession of generated data sets. Each data set in the collection is identified and versioned.
[Screenshot: data sets of the Eurobarometer collection, each published as a ZIP file]
· Flash Eurobarometer 451: Business perceptions of regulation
· Flash Eurobarometer 349: Flash Eurobarometer on the introduction of the euro (non eurozone)
· Standard Eurobarometer 63
· Standard Eurobarometer 67

Example (minor change)
In the example below, the modified date has been updated after the data were updated.

[Screenshot: CORDIS - EU research projects under FP7 (2007-2013), distribution 'FP7 projects (individual XML files)']
Description: A zip file containing the full individual XML files of all FP7 projects existing in the CORDIS database
Format: ZIP
Access URL: https://cordis.europa.eu/data/cordis-fp7projects-xml.zip
Status: completed
Release Date: 2015-09-25
Modified Date: 2021-03-16
Resource Type: Downloadable file
Example (major change)
In the screenshot below, a new version of a data set has been added under the resources (version 1.1) as well as under the documentation (XSD schema 1.1).

[Screenshot: Resources]
· Consolidated Financial Sanctions File 1.0 (CSV)
· Consolidated Financial Sanctions File 1.0 (XML)
· Consolidated Financial Sanctions File 1.1 (CSV)
· Consolidated Financial Sanctions File 1.1 (XML)

[Screenshot: Documentation]
· Consolidated Financial Sanctions File (XSD schema 1.0), XML schema
· Consolidated Financial Sanctions File (XSD schema 1.1), XML schema

3.4.3. Indicate a data set's version (release) number

There are many conventions concerning when and how to increment version numbers. In the spirit of standardisation, it is advisable to adhere to a commonly used specification when choosing version numbers. One such standard is semantic versioning (https://semver.org/). It states that a version number must consist of three non-negative integers separated by dots, for example '1.2.3'. The first number declares the major version, the second the minor version and the last the patch version.

Other methods of versioning exist, for example using digital object identifiers (DOIs). A new DOI is assigned for each version of a document. DOIs are generated and maintained by central authorities in order to guarantee the uniqueness of the identifiers.

The owl:versionInfo property should be used to indicate the version of a data set. Additionally, the dct:modified property should be used to state the date of the latest modification of the data set or distribution.

Example
The screenshot shows the same data set with the owl:versionInfo and dct:modified properties set. The former is specified using semantic versioning.

[Screenshot: metadata record of the 'Honorific Named Authority List' with owl:versionInfo 1.2.1, dct:modified 2020-01-08T08:39:53 and a further date, 2016-11-18]

3.4.4.
Describe what has changed

As stated earlier, it should be indicated not only that data has changed, but also what has changed. This is ideally recorded in a separate document, which should be linked via the foaf:page property of a data set or distribution.

Example
This screenshot shows a data set with the foaf:page property set.

[Screenshot: metadata record of the 'Honorific Named Authority List' with version 1.2.1, dcterms:modified 2020-01-08T08:39:53 and a further date, 2016-11-18]

This screenshot shows the properties adms:identifier, dct:modified and adms:versionNotes.

[Screenshot: OpenFoodTox: EFSA's chemical hazards database]
Description: In food safety, hazard identification and hazard characterisation aim to determine safe levels of exposure for substances ("reference values") to protect human health, animal health or the environment. Such reference values are most often derived for the relevant species by applying an uncertainty factor on the "reference point" determined from the pivotal toxicological study. Since its creation in 2002, EFSA scientific panels and staff have produced risk assessments for more than 4,400 substances in over 1,650 scientific opinions, statements and conclusions through the work of its scientists. OpenFoodTox is a structured database summarising the outcome of hazard characterisation for human health and, depending on the relevant legislation and intended uses, animal health and the environment. For each individual substance, the data model of OpenFoodTox has been designed using the OECD Harmonised Template as a basis to collect and structure the data in a harmonised manner. OpenFoodTox reports the substance characterisation, EFSA outputs, reference points, reference values and genotoxicity. In order to disseminate OpenFoodTox to a wider community, two sets of data can be downloaded: 1. Five individual spreadsheets extracted from the EFSA microstrategy tool providing for all compounds: a.
substance characterisation, b. EFSA outputs, c. reference points, d. reference values and e. genotoxicity. 2. The full database. OpenFoodTox contributes actively to EFSA's 2020 Science Strategy and to the aim of widening EFSA's evidence base and optimising access to its data as a valuable open source database that can be shared with all scientific advisory bodies and stakeholders with an interest in chemical risk assessment. In addition, OpenFoodTox has been submitted to the OECD's Global Portal to Information on Chemical Substances (eChemPortal) so that individual substances can be searched as part of the national and international databases. Further description and associated references are described in the EFSA Journal editorial (Dorne et al., 2017).
eurovoc domains: Agriculture, fisheries, forestry and food
Resources: Access to the database; Download full database (Excel XLS)
Documentation: Information about the database (HTML)
URI: http://data.europa.eu/88u/dataset/openfoodtox-efsa-s-chemical-hazards-database
Identifier: 10.5281/zenodo.780543
DOI: 10.5281/zenodo.780543
Landing Page: https://doi.org/10.5281/zenodo.780543
Release Date: 2017-09-27
Modified Date: 2020-04-15
Version: Version 3 10.5281/zenodo.3693783
Version notes: This version replaces version 2 to include EFSA Opinions, Statements and Conclusions up to November 2019

If there are multiple versions of a data set, the landing page should point to the latest version of the data.

Example
The screenshot below shows a data set with the dcat:landingPage property set.

[Screenshot: EU Veterinary Medicinal Product Database]
Description: The EU Veterinary Medicinal Product Database is intended to be a source of information on all medicinal products for veterinary use that have been authorised in the European Union and the European Economic Area.
The database is hosted by the European Medicines Agency.
eurovoc domains: Science and technology
Resources: EU Veterinary Medicinal Product Database (HTML)
URI: http://data.europa.eu/88u/dataset/eu-veterinary-medicinal-product-database
Landing Page: http://vet.eudrapharm.eu/vet/
Release Date: 2016-11-21
Modified Date: 2019-01-09
Geographical coverage: Romania, Slovakia, Slovenia, Sweden, Malta, Netherlands, Poland, Portugal, Belgium, Austria, Cyprus, Bulgaria, Germany, Czechia, Spain, Denmark, Finland, Estonia, United Kingdom, France, Croatia, Greece, Ireland, Hungary, Lithuania, Italy, Latvia, Luxembourg
Language: English
Version: 1.2.0.0

Even if data sets are expressed in different file formats, they are still manifestations of the same work. A new data format of a data set should be released by adding a new distribution to the data set and incrementing the minor version number. For any changes to the data itself, a new major version of the data set should be created. In either case, it is important to update the dct:modified and owl:versionInfo properties.

Example
The screenshot below shows a data set with data being published in multiple formats.

[Screenshot: Quiet areas in Europe]
Description: The quietness suitability index (QSI) provides the overview with the highest (QSI 1) and lowest (QSI 0) proportion of potential quiet areas in Europe.
eurovoc domains: Environment
Resources: ESRI File Geodatabase (zipped), octet stream; GeoTIFF (zipped), octet stream; older version - 2019-04-02 (HTML)
URI: http://data.europa.eu/88u/dataset/DAT-209-en
Identifier: DAT-209-en
Landing Page: https://...

The changes made to a data set can be documented using a changelog - a text file that lists the changes made between versions of a file (or multiple files) in a structured and chronologically ordered way. Keywords like 'added', 'changed' and 'removed' help distinguish the types of changes made.
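As a rough illustration of such a structured, chronologically ordered changelog, the following Python sketch renders a list of releases as Markdown-style text. The `Release` class, the `render_changelog` function and the sample entries are hypothetical and only serve to show the idea; they are not part of any portal tooling.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Release:
    """One changelog entry; all names here are illustrative assumptions."""
    version: str
    released: date
    # Maps a keyword such as 'Added', 'Changed' or 'Removed' to change notes.
    changes: dict[str, list[str]] = field(default_factory=dict)


def render_changelog(releases: list[Release]) -> str:
    """Render releases as Markdown-style text, newest release first."""
    lines = ["# Changelog"]
    for release in sorted(releases, key=lambda r: r.released, reverse=True):
        lines.append("")
        lines.append(f"## {release.version} ({release.released.isoformat()})")
        for keyword, notes in release.changes.items():
            lines.append(f"**{keyword}:**")
            lines.extend(f"* {note}" for note in notes)
    return "\n".join(lines)


# Hypothetical sample data, invented for this sketch.
releases = [
    Release("1.0.0", date(2020, 1, 15), {"Added": ["Initial release of the table"]}),
    Release("1.1.0", date(2020, 3, 6), {"Changed": ["Updated population figures"],
                                        "Removed": ["Obsolete column 'notes'"]}),
]
changelog_text = render_changelog(releases)
print(changelog_text)
```

Keeping the entries as structured records and generating the text from them makes it harder to forget a date or to mix up the order of releases, although a hand-written text file works just as well for small data sets.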
One standard for structuring a changelog is called Keep a Changelog (https://keepachangelog.com/en/1.0.0/), which uses Markdown (https://daringfireball.net/projects/markdown/) for formatting. A command line tool is available for managing changelogs in this way (https://github.com/churchtools/changelogger).

Example
The screenshots below show an example of a changelog formatted using Markdown. The raw text file is depicted on the left. On the right, the Markdown has been rendered using the Dillinger online tool (https://dillinger.io/). The example uses the semantic versioning mentioned above.

    # ChangeLog
    ## 1.1.1 (2020-03-08)
    **Fixed:**
    * Hotfix for OpenApi Yaml file load
    * Close models and data sets after use
    **Removed:**
    * Deprecated API endpoints

[Screenshot: the rendered changelog shows the same 1.1.1 entry, followed by an earlier entry, 1.1.0 (2020-03-06), with the item 'Added: Configuration of verticle instances and worker pool size']

Helpful links and tools
Title | Description | Link
Dillinger | Online Markdown editor with preview (open source). | https://dillinger.io/
DoltHub | Version control for databases (commercial / open source). | https://www.dolthub.com/

3.4.5. Release one data set per table

For tabular data, each sheet should be published as a separate data set. This maintains a clear distinction between data and makes the data easier to process. Some formats, like CSV, do not even support the concept of multiple tables per file.

Example
A statistical data publisher's policy is to publish one (major) data set per table. Data sets are updated twice a day, at 11.00 and 23.00. As statistics are updated on a continuous basis, the publisher provides only one access URL referring to the last update of the data set.
The same data set is expressed in different file formats (manifestations) without any difference between their actual content. Each data set:
· is identified by a unique identifier;
· is supplemented by reference metadata describing the statistical concepts and methodologies used to collect and generate the data and providing information about data quality;
· has machine-readable (SDMX) and human-readable (HTML) documentation;
· provides a link to the landing page of the product data set on the data provider's website.

The screenshot below shows a data set with a unique identifier, machine- and human-readable documentation and a landing page.

[Screenshot: Overcrowding rate by income quintile - total population - EU-SILC survey]
Description: Overcrowding rate by income quintile - total population - EU-SILC survey
eurovoc domains: Health, Education, culture and sport, Population and society
Resources: Download dataset in TSV format (unzipped) (TSV); Download dataset in TSV format (ZIP); Download dataset in SDMX-ML format (ZIP)
Documentation: ESMS metadata (Euro-SDMX Metadata structure) HTML; ESMS metadata (Euro-SDMX Metadata structure) SDMX; More information on Eurostat website
URI: http://data.europa.eu/88u/dataset/v1eTqgw2eZCm65KQZo0xUg
Identifier: ilc_lvho05q
Landing Page: http://ec.europa.eu/eurostat/web/products-datasets/-/ilc_lvho05q

3.4.6. Deprecate old versions

If new, updated versions of data are published, the older versions should be marked as deprecated and each deprecated version should link to the new version. This allows users to quickly identify old data and subsequently find the newest data.
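The deprecation step described above can be sketched as follows. This is a minimal illustration in Python, assuming that catalogue records are held as plain dictionaries; the field names (`title`, `description`, `superseded_by`), the `deprecate` helper and the example URI are hypothetical and do not correspond to a specific portal API.

```python
def deprecate(old: dict, new_uri: str, new_title: str) -> dict:
    """Return a copy of `old` marked as deprecated and linking to its successor."""
    marker = "[DEPRECATED]"
    updated = dict(old)  # shallow copy, so the original record stays untouched
    if updated["title"].startswith(marker):
        return updated  # already deprecated; do not add a second note
    updated["title"] = f"{marker} {updated['title']}"
    note = (f"NOTE: This data set has been deprecated and is superseded by "
            f"{new_title}: {new_uri}")
    # Put the note first so users see it before the regular description.
    updated["description"] = f"{note}\n\n{updated['description']}"
    updated["superseded_by"] = new_uri
    return updated


# Hypothetical records and URI, invented for this sketch.
old_record = {"title": "2018 Example Dataset", "description": "Statistics for 2018."}
deprecated = deprecate(old_record, "http://example.org/dataset/2019", "2019 Example Dataset")
print(deprecated["title"])
print(deprecated["description"])
```

Marking the title and placing the note at the top of the description are design choices of this sketch; the essential point is that the old record both signals its status and carries a resolvable link to the new version.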
Example
PREDICT includes statistics on ICT industries and their research and development in Europe since 2006. It is published on a yearly basis, with one data set per year. As soon as the latest version is published the previous version is deprecated, and a link referring to the updated data set is added in the description, as shown in the screenshot below.

[Screenshot: [DEPRECATED] 2018 PREDICT Dataset]
Description: NOTE: The 2018 PREDICT Dataset has been deprecated, and it is now superseded by its latest edition - 2019 PREDICT Dataset http://data.europa.eu/89h/6c6f7ce7-893b-48e9-b074-2baaa4b6c7d8
PREDICT includes statistics on ICT industries and their R&D in Europe since 2006. The project covers major world competitors including 40 advanced and emerging countries - the EU28 plus Norway, Russia and Switzerland in Europe, Canada, the United States and Brazil in the Americas, China, India, Japan, South Korea and Taiwan in Asia, and Australia. The dataset provides indicators in a wide variety of topics, including value added, employment, labour productivity and business R&D expenditure (BERD), distinguishing fine grain economic activities in ICT industries (up to 22 individual activities, 14 of which at the class level, i.e. at 4 digits in the ISIC/NACE classification), media and content industries (15 activities, 11 of them at 4 digit level) and at a higher level of aggregation for all the other industries in the economy. It also produces data on Government financing of R&D in ICTs, and total R&D expenditure. Nowcasting of more relevant data in these domains is also performed until a year before the reference date, while time series go back to 1995. ICTs determine competitive power in the knowledge economy. The ICT sector alone originates almost one fourth of total Business expenditure in R&D (BERD) for the aggregate of the 40 economies under scrutiny in the project. It also has a huge enabling role for innovation in other technological domains.
This is reflected at the EU policy level, where the Digital Agenda for Europe in 2010 was identified as one of the seven pillars of the Europe 2020 Strategy for growth in the Union; and the achievement of a Digital Single Market (DSM) is one of the 10 political priorities set by the Commission since 2015.

3.4.7. Link versions of a data set

New versions or adaptations of a data set should use the dct:isVersionOf property to link to other versions of the data set. However, the property dct:source should be used to link to the original data set. Since this relationship is bidirectional, the original data set can use the dct:hasVersion property to link to the new data set.

Example original data set
This screenshot shows a data set referencing a different version using the property dct:hasVersion. Note the use of the adms:versionNotes property giving a description of the current version.

[Screenshot: metadata record titled 'Government Data' with the version note 'Initial release']

Example derived data set
This screenshot shows a data set referencing the original version it has been derived from using the property dct:isVersionOf.

[Screenshot: metadata record titled 'General Government Data']

Example
This screenshot shows a data set which links to the original source and parent data set. This is done in both the description and the metadata properties.

[Screenshot: Average Revenue per User (ARPU) in the Retail Mobile Market]
Description: Total retail mobile revenues divided by number of active SIM cards
Original source: Electronic communications market indicators collected by Commission services, through National Regulatory Authorities, for the Communications Committee (COCOM) - January and July reports: http://ec.europa.eu/digital-agenda/about-fast-and-ultra-fast-internet-access
Parent dataset: This dataset is part of another dataset: http://digital-agenda-data.eu/datasets/digital_agenda_scoreboard_key_indicators
eurovoc domains: Science and technology, Economy and finance
URI: http://data.europa.eu/88u/dataset/NaUjDKauIkIWXOYFtDz86Q
Identifier: mob_arpu
Alternative Title: Average Revenue per User (ARPU) in the Retail Mobile Market
Landing Page: http://semantic.digital-agenda-data.eu/codelist/indicator/mob_arpu
Type of Dataset: Statistical
Release Date: 2014-05-22
Modified Date: 2015-07-27
Temporal Coverage: From 2010-01-01
Geographical Coverage: Slovakia, Slovenia, Sweden, Netherlands, Poland, Portugal, Romania, Belgium, Austria, Cyprus, Bulgaria, Germany, Czechia, Spain, Denmark, Finland, Estonia, United Kingdom, France, Hungary, Greece, Italy, Ireland, Luxembourg, Lithuania, Malta, Latvia
Language: English
Catalogue: European Union Open Data Portal
source: http://ec.europa.eu/digital-agenda/about-fast-and-ultra-fast-internet-access
is part of: http://data.europa.eu/88u/dataset/digital-agenda-scoreboard-key-indicators

Helpful links and tools
Title | Description | Link
Data Versioning WG | Research Data Alliance working group (open source). | https://www.rd-alliance.org/groups/data-versioning-wg
Research Data Alliance best practices | Principles and best practices in data versioning for all data sets, big and small (open source). | https://www.rd-alliance.org/group/data-versioning-wg/outcomes/principles-and-best-practices-data-versioning-all-data-sets-big

4. Recommendations for improving the openness level

Introduction

The objective of this section is to help data publishers achieve the highest possible openness level for their data, with a special emphasis on the publishing phase of the data preparation process (see Figure 9).
[Figure 9. Data preparation process - Publishing. Phases: Profiling, Distilling, Enriching, Documenting, Validating, Publishing. The publishing phase covers versioning and publishing in open, machine-readable formats.]

In Section 4.1 the five-star model for measuring the openness of data is introduced. The following sections contain recommendations on how to achieve each level of the model.

4.1. Five-star model

Openness is of particular importance when publishing data. It directly affects users' ability to reuse and process data, and thus the value of the data. In this section, openness is discussed with regard to file formats. Tim Berners-Lee's five-star model, which was proposed in 2010, is an attempt to provide a scale for measuring the openness of data. Data can achieve a maximum of five stars, indicating the highest level of openness. The ranks are cascading, meaning that in order to comply with a certain rank, the criteria of the preceding ranks must also be met.

Regardless of actual data quality, the first star is awarded for using an open licence. If data usage is restricted by a proprietary licence, the quality of the data is rendered meaningless. In order to achieve a second star, the chosen file format must be (semi-)structured. A table stored as CSV is much easier to process than an image in which a table is depicted. Next, usage of non-proprietary formats is required for a three-star rating. Using URIs as identifiers for resources is required for a four-star rating. The decisive characteristic for achieving the full five stars is linking data together to provide context. An illustration of this hierarchy is shown in Figure 10. The following sections contain recommendations for acquiring all five stars.
Figure 10. Cascading steps of the five-star model with exemplary file formats
Source: https://5stardata.info/en/

4.2. Use structured data (one to two stars)

As mentioned above, the first star is awarded for using an open licence. To achieve a two-star rating, data must be structured. Table 5 in Section 4.6 gives an overview of the common formats and indicates whether they are machine readable or not. Based on this, the recommended formats for data publishers are RDF, XML, JSON and CSV. Section 1.2 describes how to achieve well-structured data in these formats. Recommendations are given on how to construct well-formed files, as well as an overview of tooling support.

Example
This screenshot shows a data set which contains both PDF and XLS files. PDF is a format suitable for human reading. However, data publishers should make sure that they also publish their data in a machine-readable format to enable others to easily process the data. To achieve a two-star rating, data must be published in a machine-readable format (or any other structured data format).

[Screenshot: She Figures 2015 - Gender in Research and Innovation]
Description: She Figures 2015 investigates the level of progress made towards gender equality in research & innovation (R&I) in Europe. It is the main source of pan-European, comparable statistics on the representation of women and men amongst PhD graduates, researchers and academic decision-makers. The data also sheds light on differences in the experiences of women and men working in research - such as relative pay, working conditions and success in obtaining research funds. It also presents for the first time the situation of women and men in scientific publication and inventorships, as well as the inclusion of the gender dimension in scientific articles.
This compendium is produced in cooperation with Member States, Associated Countries, and Eurostat. Further data sources are: Web of Science, European Research Area Survey 2014.
eurovoc domains: Health, Education, culture and sport, Science and technology, Population and society
Resources: She Figures 2015 (PDF); She Figures 2015 - data file (Excel XLS)

Helpful links and tools
Title | Description | Link
PDFtoXLS | Free online tool for extracting tables from PDF into XLS files (open source). | https://pdftoxls.com/
PDFTables | Paid online tool with an API for extracting tables from PDF into XLS, CSV, XML or HTML files (commercial). | https://pdftables.com/
CoEnterprise Tableau | ETL suite that supports PDF content extraction into CSV files during the data preparation phase (commercial). | https://www.coenterprise.com/solutions/data-analytics/

4.3. Use a non-proprietary format (two to three stars)

Using a machine-readable format is key to achieving a high openness level. However, some formats, like XLS, are proprietary, which means that a certain piece of software - in this case Microsoft Excel - is needed to fully process the file. Often, this kind of software is not freely available. As accessibility for everyone is a core principle of open data, proprietary file formats are not the correct choice. Thus, to receive the third star, a non-proprietary file format such as ODS must be used. Table 5 in Section 4.6 gives an overview of which formats are non-proprietary.

Example
This screenshot shows tabular data in ODS format, opened in the non-proprietary application LibreOffice.

City | Population size
Berlin | 3,669,491
London | 8,908,081
Paris | 2,187,526

Example
This screenshot shows tabular data in CSV, an open text-based format.

    City;Population size
    Berlin;3669491
    London;8908081
    Paris;2187526

Helpful links and tools
Title | Description | Link
LibreOffice | Open-source office suite supporting OpenDocument formats (open source). | https://www.libreoffice.org/
OpenOffice | Open-source office suite supporting OpenDocument formats (open source). | http://www.openoffice.org/
Microsoft Office | Proprietary office suite which supports OpenDocument formats from the 2013 version (commercial). | https://www.office.com/
OnlyOffice | Desktop and web-based collaborative office suite (commercial). | https://www.onlyoffice.com/en/
Recommended formats | List of open formats recommended by the UK Data Service (open source). | https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats

4.4. Use URIs to denote things (three to four stars)

Three-star data is easily processable, but isolated and hard for others to reference. In order to achieve a four-star rating, URIs must be used to denote things. Of course, the file itself should also be resolvable by a URI. The recommendation in this section focuses on using URIs in the data itself. 'Things' refers to resources or concepts within the data. For example, a city would be a concept that could be denoted by a URI instead of the plain identifier 'Berlin'. In contrast, numbers, such as a population size, do not need to be denoted as URIs. Things not considered a resource are called 'literals', the difference being that literals only acquire meaning when used in conjunction with resources. Numbers, Boolean values (true and false) and dates have little meaning on their own and are thus literals. RDF graphs are made up of triples, consisting of a subject, predicate and object.
Subjects and predicates must always be resources, whereas objects can be either resources or literals. In order to replace identifiers with URIs, a first step can be looking at existing controlled vocabularies and knowledge bases to see if the concepts already have widely adopted URIs. These are covered in the next section. If none exist, the authority publishing the data can publish its own ontology in order to define concepts that have not been specified elsewhere.

Example
The first triple (yellow) consists of only resources, whereas the second triple (green) contains a literal (the population number). They could be read as 'Berlin is in Germany' and 'Berlin has the population size 3 669 491' respectively.

The triples that make up RDF graphs are stored in dedicated databases called triple stores. They can then be queried using SPARQL, a query language similar to SQL. The use of URIs is not limited to RDF files, though. All formats in which resources and concepts are denoted by an identifier can make use of URIs.

Example
This screenshot shows the city population CSV file from earlier. Here, the city names have been replaced with referenceable URIs.

    City;Population size
    http://cities.org/Berlin;3669491
    http://cities.org/London;8908081
    http://cities.org/Paris;2187526

URIs should be unique on the web. This means that if two pieces of data have the same URI, they mean the same thing. Additionally, using URIs allows other data providers to link to the data, which is required for achieving the five-star rating covered in the next section.

Example
This screenshot shows the same data as in the CSV example above, albeit as RDF. Note that all referenceable data is denoted with a URI (yellow boxes). The only exceptions are the population numbers, which are literals (red boxes) and are not referenceable (and do not need to be).
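To make the distinction between resources and literals concrete, the sketch below converts a small semicolon-separated table into N-Triples lines using only the Python standard library. The `http://example.org/...` URIs and the `to_ntriples` helper are hypothetical placeholders chosen for this illustration; a real publication would preferably reuse URIs from an established vocabulary or knowledge base.

```python
import csv
import io

# Hypothetical namespace and property URI; assumptions for illustration only.
CITY_NS = "http://example.org/city/"
POPULATION = "http://example.org/property/populationSize"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

CSV_TEXT = """City;Population size
Berlin;3669491
London;8908081
Paris;2187526
"""


def to_ntriples(csv_text: str) -> list[str]:
    """Turn each row into one triple: city URI, property URI, integer literal."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    triples = []
    for row in reader:
        # Subject and predicate are resources, so they are written as URIs.
        # (Names with spaces or special characters would need URL encoding.)
        subject = f"<{CITY_NS}{row['City']}>"
        predicate = f"<{POPULATION}>"
        # The population is a literal: a typed value, not a referenceable thing.
        obj = f'"{int(row["Population size"])}"^^<{XSD_INTEGER}>'
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


triples = to_ntriples(CSV_TEXT)
for line in triples:
    print(line)
```

Each printed line is one triple: the city and the property can be referenced by others through their URIs, while the number remains a plain typed value.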
[Screenshot: RDF listing in which each city is denoted by a URI and the population sizes appear as the literals 3669491, 8908081 and 2187526]

Helpful links and tools
Title | Description | Link
EU Vocabularies and Authority Tables | EU Vocabularies and Authority Tables have been developed for the Publications Office in order to facilitate the exchange of data between the different information systems of the EU institutions (legislation, calls for tender, etc.) and describe data sets (open source). | https://op.europa.eu/en/web/eu-vocabularies/authority-tables and https://op.europa.eu/en/web/eu-vocabularies/dcat-ap-op
DBpedia | Linked data version of Wikipedia contents (open source). | https://wiki.dbpedia.org/
OpenRefine | With an RDF plugin this tool can import data in formats like CSV, JSON and XML and map this data to an existing ontology (open source). | https://openrefine.org/
Cleaning data with OpenRefine | This article describes how to discover inconsistencies in data and how to diagnose the accuracy of data with OpenRefine (De Wilde, M., van Hooland, S. and Verborgh, R., 'Cleaning data with OpenRefine', The Programming Historian, 1 August 2013, Editorial Board of the Programming Historian, United Kingdom, 2013). | https://doaj.org/article/3ccd075407a4481c85c0d00d65a003c0

4.6. File formats and their achievable openness level

The table below shows a list of commonly used formats along with information on whether they are machine readable and proprietary. The right-hand column indicates the number of stars that can be obtained when using this format for data publishing. The formats were selected based on the analysis performed in the data profiling phase. Ideally, the formats highlighted in green should be used. If this is not possible, formats from the yellow section should be used. Resorting to formats highlighted in red should be avoided, as only a one-star rating can be achieved with these.

Table 5.
File formats and their achievable openness level Format Non-proprietary Machine readable Achievable stars RDF Yes Yes OOQOQ XML Yes Yes OOOQ JSON Yes Yes O O O Q CSV Yes Yes O O O O ODS Yes Predominantly XLSX Yes Predominantly XLS No Predomina ntly TXT Yes Predominantly HTML Yes Predominantly PDF Yes No DOCX Yes No DDT Yes No PNG Yes No GIF No No JPG/JPEG No No TIFF No No DOC No No OQQO OOQO O O O O O Q O Q O O O O d 99 |@' PGlossary Glossary ,M Accessibility The degree to which required data can be accessed by data users, possibly including authentication and authorisation. API (application programming interface) An APl is a programming interface. It is provided by a software system and allows other programs to communicate with this system. APls are often provided by data publishers and allow programs or apps to read the data directly overthe web.To do this,the app sends a queryto the APl forthe required data.The advantage of providing data via an APl is that the entire data set does not need to be downloaded - it is possible to provide only the required data. This also ensures that the data is up to date. Array Arrays are list-like types of objects that represent a collection of elements that can be selected by corresponding indices. Attribute In the XML description language, an attribute represents a name-value pair that is part of a day. An attribute can onlyoccuronce perdayand can only contain individual values. Backward compatibility Backward compatibility is the capacity of a hardware or software to interact with data and interfaces from earlier versions of the system or with other systems. Boolean (values/type) Boolean is a data type that can only contain one of the two possible values 'true' and 'false'. 

camelCase
Spaces and special characters can hinder the automated processing of data. Therefore, it is advisable to group identifiers consisting of multiple words into one. In camelCase typography, the first character of each word is capitalised, except the first one. This is independent of the type of word.

Character encoding
Character encoding translates between characters and bytes through an encoding system.

Client
A client may be understood as an instance consuming data and can be a person or a computer. Typically, the client requests resources from a server. For example, a browser loading a website would be considered a client, with the website being provided by the server.

CSV (comma-separated values)
CSV is a standard format for structured data. Because of its simplicity, openness and machine readability, CSV is often used for publishing open data.

data.europa.eu
The official portal for European Union data providing a single point of access to open data from international, EU, national, regional, local and geo data portals (https://data.europa.eu/en).

Data blending
Data blending is the process of merging data from different sources into one functioning data set.

Data catalogue
A data catalogue combines metadata with data management and search tools to improve data findability and to serve as an inventory and overview of possible uses for data.

Data cleansing
Data cleansing or data cleaning is the process of detecting and removing incorrect and/or inconsistent data from a record set.

Data preparation
Data preparation is the process of collecting, cleaning and consolidating data to create a consistent data set that can be used for analysis.

Data provider
The data provider is defined as the entity that provides content via a platform accessible to users. Decisions on the publication, terms of use and formats reside with the data provider.

Data set
A data set is a quantity of data that is related in content. A data set usually contains one or more resources, for example covering different formats, and metadata describing the content of the resources.

Data user
Data users are natural or legal persons who are entitled to use the data provided by the data provider for their own purposes and who are responsible for doing so in accordance with the conditions of use.

DCAT-AP (Data Catalogue Vocabulary Application Profile for Data Portals in Europe)
DCAT-AP is a standard based on the DCAT developed by the W3C and used for defining and structuring metadata for data sets from public authorities. It defines metadata fields and ranks them by importance, i.e. mandatory, recommended and optional. For example, data sets must have a title, but providing a version is optional. For the greatest level of compatibility with users this standard should be followed as closely as possible.

DMP (data management plan)
A DMP is a written document that specifies what data is expected to be produced or acquired in a research project, how large the data set will be, how it will be analysed and described, how it will be stored and how it will be published and preserved.

Element
In XML, an element is a field containing data. An element is defined using tags and can also contain attributes.

Endpoint
An endpoint is a remote computing device that interacts with a network to which it is connected. Examples of endpoints are desktops, laptops and smartphones. Endpoints are vulnerable to cybercriminal activity.

Escaping
Escaping means making characters usable in data that are otherwise reserved for formatting. It is done by replacing the characters with specific codes. Without escaping, these characters would be interpreted as markup, which could break syntax validity.

EU ODP (European Union Open Data Portal)
Up until 21 April 2021 (when the EU Open Data Portal and the European Data Portal were consolidated to become data.europa.eu - see glossary entry above), the EU ODP provided, via a metadata catalogue, a single point of access to data from the EU institutions, agencies and bodies for anyone to reuse.

Findability
The degree to which metadata and data are easy to find for humans and computers.

FAIR principles
The FAIR principles for scientific data management and stewardship published in Scientific Data aim at enhancing the findability, accessibility, interoperability and reuse of digital assets.
Source: Wilkinson, M., Dumontier, M., Aalbersberg, I. et al., 'The FAIR guiding principles for scientific data management and stewardship', Scientific Data, Vol. 3, Article No 160018, Macmillan Publishers Limited, 2016.

GET request
In HTTP, a GET request is a method for requesting a resource from a server.

Header
The term header refers to supplementary information of a file or protocol. For example, in CSV files a header line indicates variable names (and type/format if applicable) to be found in each column. In HTTP, headers allow a client or server to transmit supplementary information with a request.

HTTP (hypertext transfer protocol)
HTTP is one of the core technologies of the internet. It defines methods and status codes used for sending data between clients and servers.

ID
An ID is a unique identifier for a related set of data. Consecutive numbering is often used for this purpose. A URI is also a kind of ID.

Inspire
The infrastructure for spatial information in the European Community (Inspire) is an initiative of the European Commission that aims to create a European spatial data infrastructure for the purposes of a common environmental policy.

Interoperability
The degree to which data can be integrated with other data and interoperates with applications or workflows for analysis, storage and processing.

JSON
JSON is a powerful format that is well suited to data exchange between different applications. It can handle complex data structures, is easy to read for both humans and machines and is independent of platform and programming language.

Literal
In the context of RDF, a literal denotes a simple data value. Only RDF objects may be literals. Unlike RDF resources, these are not encoded with a URI and thus cannot be referenced from outside their 'own' triple. Literals are often used for data that loses its meaning outside its own triple, for example people's names.

Machine readability
In principle, all data that can be interpreted by software is machine readable. In the context of open data this usually means data formats that enable further processing. The underlying data structure and corresponding standards must be publicly available and should be fully published and available free of charge.

Masking
Masking means hiding characters in data that may otherwise be interpreted incorrectly. For example, if commas were used as separators in a CSV file, commas in the data themselves would need to be masked.

Metadata
Metadata is used for the acquisition and description of a data set in a structured form. For example, metadata contains information about the content, title or format of a record. In short, metadata is data about data or references to the actual data. Metadata usually follows a certain schema which provides mandatory and optional information about the data set.

Namespace
Namespaces are used to prevent name conflicts in projects by ensuring that objects have unique identifiable names.

Null value
A null value indicates the complete absence of data. This should not be confused with an empty character string or the numeric value 0, since these contain actual information. A null value is therefore rather to be understood as an unknown value.

Payload
A payload is the transmitted data that contains the actual content. Metadata and HTTP headers (if applicable) are not part of the payload.

PascalCase
Spaces and special characters in identifiers can complicate data processing. If identifiers consist of several words, it is recommended that words be combined into one. In PascalCase notation the initial letters of each word are capitalised to facilitate human readability. This happens irrespective of word class, i.e. even verbs and adjectives begin with a capital letter.

RDF (resource description framework)
RDF is a model for storing data and metadata. It stores linked data in the form of triples.

Resource
In the context of RDF, a resource is defined as a data unit that can be related to other resources. A resource is usually unambiguously referenceable. The subject and predicate are resources and the object can be either a resource or a literal.

Resource ID
A resource ID or resource identifier is typically a string of characters used to reference and identify a resource.

Reusability
The degree to which data is optimised to be reused for replication and/or combination in a different setting. Reusability is achieved through well-specified metadata and data.

Server
A server provides data. Clients can send a request to the server, upon which the requested data is sent back to the client. For example, a website residing on a server on the internet can be loaded by a browser, i.e. the client.

Status code, HTTP
An HTTP status code is a standardised numeric value that provides information about the success of an HTTP request. All values within certain number ranges have a similar meaning, while the concrete numbers give a more precise differentiation. All codes in the range from 400 to 499 indicate errors on the client side. For example, code 403 shows that the request was not authorised, while 404 indicates that a resource is not available.

String
A string is a data type that is used to represent text. It includes characters and can include spaces and numbers.

Tag
In XML, a tag is the designation of a data unit. A keyword enclosed in angle brackets marks the opening tag (<keyword>). The same keyword preceded by an angle bracket and a slash and closed by an angle bracket marks the closing tag (</keyword>).

Triple
In RDF, a triple is the combination of a subject, a predicate and an object. This combination represents a unit of meaning. In RDF, data is always stored in the form of triples. The corresponding database is called a triplestore.

URI (uniform resource identifier)
A URI is a unique reference to a resource. It can consist of letters and/or numbers; spaces are not allowed. A URI can point directly to the location of the resource, for example when using a network address (URL).

URL (uniform resource locator)
A URL is a subtype of URI. In contrast to a URI, a URL always points to a resource that can be found, so it is both identifier and address at the same time. Internet addresses or email addresses are URLs, for example.

UTF-8
UTF-8 is a widely used way of representing characters. Especially in connection with special characters, this type of storage ensures the greatest possible compatibility with other programs. It is the encoding of choice on the web.

Validator
A validator checks the syntactical correctness of code.

W3C
The World Wide Web Consortium is an international community for standardisation on the World Wide Web.

XML (Extensible Markup Language)
XML is a file format used for storing hierarchically structured data. It was designed to be machine readable and readable by humans.
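
The camelCase and PascalCase conventions defined in the glossary can be applied mechanically to column names or other identifiers. The following is a minimal sketch, assuming that the words of a label are separated by spaces or special characters; the function names are hypothetical and chosen only for this illustration.

```python
import re


def split_words(label):
    """Split a label into words on anything that is not a letter or digit."""
    return [word for word in re.split(r"[^A-Za-z0-9]+", label) if word]


def to_pascal_case(label):
    """Capitalise the first character of every word and join the words."""
    return "".join(w[:1].upper() + w[1:].lower() for w in split_words(label))


def to_camel_case(label):
    """Like PascalCase, except that the first word starts in lower case."""
    pascal = to_pascal_case(label)
    return pascal[:1].lower() + pascal[1:]


print(to_camel_case("population size"))   # populationSize
print(to_pascal_case("population size"))  # PopulationSize
```

Note that this sketch lowercases the remainder of each word, so an acronym such as 'URL' becomes 'Url', and it does not split labels that are already written in camelCase.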

Overview of quality indicators and metrics

Each metric is listed with the following fields, in this order: metric; can be aggregated throughout several data sets?; data/metadata; QN/QL (*); calculation; used by other portal?; relevance ranking. A hyphen marks an empty cell.

FAIR dimension: Findability

Indicator: Completeness
Description: The data is complete if it includes all items needed to represent the entity. Often related to null values in literature. At the metadata level, metadata completeness indicates how much meta information is available for the given data set. Metadata should describe the resource as fully as possible.
Metrics:
· Number of null values; Yes; Data; QN; Percentage; -; Medium
· Number of empty fields; Yes; Metadata; QN; Percentage; -; Medium
· Data set identifier resolves to a digital object; Yes; Metadata; QN; Binary; -; Medium

Indicator: Findability
Description: Data sets should be discoverable for both humans and computers. The findability of a data set depends on the description in the metadata: the better the data is described, e.g. through the usage of controlled vocabularies and keywords, the easier it is for users to find the data.
Metrics:
· Keywords assigned; Yes; Metadata; QN; Binary; EDP; Medium
· Categories assigned; Yes; Metadata; QN; Binary; EDP; Medium
· Temporal information given; Yes; Metadata; QN; Binary; EDP; Medium
· Spatial information given; Yes; Metadata; QN; Binary; EDP; Medium
· Link to other data; Yes; Metadata; QN; Binary; GARDIAN; Low

FAIR dimension: Accessibility

Indicator: Accessibility/availability
Description: Accessibility describes whether the content of the portal or the resources can be retrieved by a human or computer without any errors or access restrictions. Accessibility can be distinguished in two ways. For a human reader, the main issue is cognitive accessibility. For a computer, the main issue is physical accessibility.
Metrics:
· Access URL accessible; Yes; Data; QN; Binary; EDP, GARDIAN; High
· Landing page accessible; Yes; Data; QN; Binary; -; Medium
· Download URL given; Yes; Data; QN; Binary; EDP, GARDIAN; High
· Download URL accessible; Yes; Data; QN; Binary; EDP; High
· Downloadable without registration; Yes; Data; QN; Binary; -; Medium
· Access authorisation information given; Yes; Metadata; QN; Binary; EDP; High
· Usage of controlled access right vocabulary; Yes; Metadata; QN; Binary; EDP; Medium

FAIR dimension: Interoperability

Indicator: Conformity/compliance
Description: The data and metadata conform if they follow accepted standards, e.g. for capture, publication and description. An example could be the conformity of certain metadata values (URLs, emails), but also the overall compliance of the metadata with DCAT-AP. Valid date formats within the data or metadata also indicate conformity.
Metrics:
· DCAT-AP compliance of metadata; Yes; Metadata; QN; Binary; EDP, GARDIAN; Medium
· Conformity of file formats and licences; Yes; Data; QN; Binary; -; Low
· Conformity of access to property values; Yes; Metadata; QN; Binary/percentage; -; Low
· Conformity of date formats; Yes; Both; QN; Binary/percentage; -; Low
· Conformity of email addresses; Yes; Both; QN; Binary/percentage; -; Low
· Conformity of licences; Yes; Metadata; QN; Binary; -; Low
· Character encoding issues; Yes; Data; QN; Percentage; -; Low
· Data following a given schema; Yes; Data; QN; Binary; -; Low

Indicator: Machine readability/processability
Description: This indicator assesses the extent to which the data and metadata are machine interpretable, i.e. the extent to which they can be understood and handled by automated processes.
Metrics:
· Processability of file format and media type; Yes; Data; QN; Binary; EDP; Medium
· Usage of controlled vocabularies; Yes; Both; QN; Binary/percentage; EDP; Medium

Indicator: Openness
Description: The openness of data is of crucial relevance for the concept of open data. Data is considered to be open if the resources are available in a non-proprietary format and can be used under an open licence.
Source: Sunlight Foundation (2017), 'Ten principles for opening up government information' (https://sunlightfoundation.com/policy/documents/ten-open-data-principles/).
Metrics:
· Openness of file format and media type; Yes; Data; QN; Binary; EDP, GARDIAN; Medium
· Licence information given; Yes; Metadata; QN; Binary; EDP, GARDIAN; High
· Openness of licence; Yes; Metadata; QN; Binary; -; Medium
· Correctness of licence; Yes; Metadata; QN; Binary; EDP; Medium

FAIR dimension: Reusability

Indicator: Timeliness
Description: Metadata and data are timely if they are up to date and represent the actual and current situation. This means that as soon as a change occurs in the real world, the data and metadata have to be modified too. However, the assessment of timeliness of data is not trivial as it is hard to automatically understand from the content if it is historical or real-time data. Thus, it is not easy to tell the requirements of timeliness in an automated way.
Metrics:
· Update information given; Yes; Metadata; QN/QL; Binary; -; Medium
· Creation date given; Yes; Metadata; QN/QL; Binary; EDP; Medium
· Modification date given; Yes; Metadata; QN/QL; Binary; EDP; Medium
· Temporal information given; Yes; Metadata; QN/QL; Binary; EDP; Medium

Indicator: Consistency
Description: Data and metadata are consistent if they do not contain any contradictions. Examples of contradictions would be a data set containing multiple and contradictory licence statements or modification dates that are earlier than creation dates. Contradiction might especially occur if data is combined from different sources.
Metrics:
· Number of non-admissible values; Yes; Both; QN; Binary/percentage; -; Low
· Semantic distance; Yes; Metadata; QN; Percentage; -; Low
· Compliance with community standards; Yes; Both; QN; Binary; GARDIAN; Low
· Freeness from duplicates; Yes; Data; QN; Binary/percentage; -; Low

Indicator: Accuracy
Description: Metadata is accurate if the description of the content is as precise as possible, so that potential users get a realistic idea of the data and are able to quickly assess its relevance for their own contexts. Although this depends on the user's perception, there are some metadata values that can be checked automatically in terms of semantic accuracy: information given about the format and content size can be compared with the actual file format of the resource and its real-world size.
Metrics:
· File format accuracy; Yes; Metadata; QN; Binary/percentage; -; Low
· Content size accuracy; Yes; Metadata; QN; Binary/percentage; -; Low
· Percentage of accurate cells; Yes; Data; QN; Percentage; -; Low

Indicator: Relevance
Description: Data is only of use if it is relevant and of interest to the potential user. Thus, the data set should only contain the information necessary to support the task at hand. Relevance describes the extent to which the data is helpful and applicable, and the extent to which the amount of data is appropriate. This indicator is highly dependent on the user's perception and the task at hand.
Metrics:
· Appropriate amount of data; No; Data; QL; -; -; Limited

Indicator: Understandability
Description: Data and metadata are understandable if they are clear and comprehensible to the user. After studying the data and metadata, no ambiguities should remain. This indicator is highly dependent on the user's perception and their expert knowledge in the domain concerned. The understandability rating may increase if certain contextual information is provided, such as a description of the data, a title and keywords. However, in the end it depends on the user whether the data is actually comprehensible or not.
Metrics:
· Description of data given; Yes; Metadata; QN/QL; Binary; -; Low
· Title given; Yes; Metadata; QN/QL; Binary; -; Low
· Keywords assigned; Yes; Metadata; QN/QL; Binary; EDP; Low
· Documentation of data given; Yes; Metadata; QN/QL; Binary; GARDIAN; Low

Indicator: Credibility
Description: Data is considered credible if it is based on trustworthy sources. Credibility describes the extent to which 'data has attributes that are regarded as true and believable by users' (44). Thus, this indicator is highly dependent on the user's perception. Still, the credibility and trustworthiness of the data may increase if certain contextual information is provided, such as information about the original publisher, the contact point and the data set owner.
Metrics:
· Contact point given; Yes; Metadata; QN/QL; Binary; EDP; Low
· Data set publisher given; Yes; Metadata; QN/QL; Binary; EDP; Low
· Data set creator given; Yes; Metadata; QN/QL; Binary; -; Limited

(44) iso25000.com, 'ISO/IEC 25012: Quality of Data Product' (http://iso25000.com/index.php/en/iso-25000-standards/iso-25012?limit=5&limitstart=0).
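
Two of the percentage-based metrics from the overview, 'Number of null values' and 'Freeness from duplicates', can be sketched as follows. This is only an illustration under stated assumptions: the sample data is invented, null values are assumed to be marked explicitly with the string `NULL`, and the guidelines do not prescribe these exact formulas.

```python
import csv
import io

# Assumed explicit marker for null values; a real data set may use another one.
NULL_MARKER = "NULL"

# Hypothetical sample data, invented for this sketch.
SAMPLE = """city;population;country
Berlin;3669491;Germany
London;NULL;United Kingdom
Paris;2187526;NULL
Berlin;3669491;Germany
"""


def load_rows(text):
    """Read semicolon-separated text and return the data rows as tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header line
    return [tuple(row) for row in reader]


def null_value_percentage(rows):
    """Share of cells, in percent, that carry the null marker."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    nulls = sum(1 for cell in cells if cell == NULL_MARKER)
    return 100.0 * nulls / len(cells)


def duplicate_percentage(rows):
    """Share of rows, in percent, that repeat an earlier identical row."""
    if not rows:
        return 0.0
    return 100.0 * (len(rows) - len(set(rows))) / len(rows)


rows = load_rows(SAMPLE)
print(f"{null_value_percentage(rows):.1f}")  # 16.7
print(f"{duplicate_percentage(rows):.1f}")   # 25.0
```

Both values can be aggregated over several data sets, for example by averaging, which matches the 'can be aggregated' column of the overview.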
(*) Quantitative/qualitative

Checklist for publishing high-quality data

Make your data FAIR (see https://www.go-fair.org/fair-principles/)

Define data management plan
· Data characteristics: structure, modelling
· Data documentation
· Data quality assurance
· Data sharing
· Data storage and dissemination

Assess data quality
· Validate and clean your data
  > Remove duplicates
  > Mark null values
· Standardise your data
  > Format date, time, numbers
  > Check character encoding

Document your data
· Document schemas and data models
· Document data changes and versions
· Document semantics of data
· Document data validation report
· Document validation rules

Enrich your data
· Reuse concepts from controlled vocabularies
· Harmonise labels
· Dereference translation of labels
· Link and augment your data

Publish your data
· Offer direct access (accessible download URL)
· Use machine-readable open formats
· Publish under an open licence
· Publish documentation about your data
· Use URIs and linked data

Facilitate data discovery
· Describe your data with rich metadata

List of figures

Figure 1. Data preparation process
Figure 2. Overview of quality indicators grouped by FAIR dimensions
Figure 3. Data preparation process - Validating
Figure 4. Magic quadrant for data quality tools
Figure 5. Blank lines and titles opened in a spreadsheet
Figure 6. Interpretation of blank lines and titles in CSV files
Figure 7. Data preparation process - Enriching
Figure 8. Data preparation process - Documenting
Figure 9. Data preparation process - Publishing
Figure 10. Cascading steps of the five-star model with exemplary file formats

List of tables

Table 1. Characters that need escaping in XML
Table 2. Overview of methods and status codes
Table 3. Typical headers that are used in conjunction with APIs
Table 4. Pagination using offset and limit parameters
Table 5. File formats and their achievable openness level
Table 6. Overview of quality indicators and metrics

Bibliography

Auer, S., Lehmann, J., Maurino, A., Pietrobon, R., Rula, A. and Zaveri, A. (2012), 'Quality assessment for linked data: a survey', Semantic Web 1, IOS Press (http://www.semantic-web-journal.net/system/files/swj773.pdf).

Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009), 'Methodologies for data quality assessment and improvement', ACM Computing Surveys, Vol. 41, No 3, pp. 16-52 (http://dimacs-algorithmic-mdm.wdfiles.com/local--files/start/Methodologies%20for%20Data%20Quality%20Assessment%20and%20Improvement.pdf).

Canova, L., Iemma, R., Morando, F., Orozco Minotas, C., Torchiano, M. and Vetrò, A. (2016), 'Open data quality measurement framework: definition and application to open government data', Government Information Quarterly, Vol. 33, No 2, Elsevier, pp. 325-337 (https://www.sciencedirect.com/science/article/pii/S0740624X16300132).

data.europa.eu, Metadata Assessment Methodology (https://www.europeandataportal.eu/mqa/methodology?locale=en#).

De Wilde, M., van Hooland, S. and Verborgh, R. (2013), 'Cleaning data with OpenRefine', The Programming Historian, Editorial Board of the Programming Historian, United Kingdom (https://doaj.org/article/3ccd075407a4481c85c0d00d65a003c0).

Duval, E. and Ochoa, X. (2009), 'Automatic evaluation of metadata quality in digital repositories', International Journal on Digital Libraries, Vol. 10, pp. 67-91 (https://link.springer.com/article/10.1007/s00799-009-0054-4).

European Commission (2014), Training Module 2.2. - Open data & metadata quality (https://www.europeandataportal.eu/sites/default/files/d2.1.2_training_module_2.2_open_data_quality_en_edp.pdf).

European Commission (2018), Turning FAIR into Reality - Final report and action plan from the European Commission expert group on FAIR data, Publications Office of the European Union, Luxembourg (https://op.europa.eu/en/publication-detail/-/publication/7769a148-f1f6-11e8-9982-01aa75ed71a1).

Gartner Research (2019a), 'Magic quadrant for data quality tools' (https://www.gartner.com/en/documents/3905769/magic-quadrant-for-data-quality-tools).

Gartner Research (2019b), 'Market guide for data preparation tools' (https://www.gartner.com/en/documents/3906957/market-guide-for-data-preparation-tools).

Hare, J. (2016), 'What is metadata and why is it as important as data itself?', opendatasoft (https://www.opendatasoft.com/blog/2016/08/25/what-is-metadata-and-why-is-it-important-data).

iso25000.com, 'ISO/IEC 25012: Quality of data product' (http://iso25000.com/index.php/en/iso-25000-standards/iso-25012?limit=5&limitstart=0).

Kubler, S., Le Traon, Y., Neumaier, S., Robert, J. and Umbrich, J. (2018), 'Comparison of metadata quality in open data portals using the Analytic Hierarchy Process', Government Information Quarterly, Vol. 35, No 1, Elsevier (https://www.sciencedirect.com/science/article/pii/S0740624X16301319).

Little, C. (2018), 'The Forrester Wave: data preparation solutions', Forrester (https://www.forrester.com/report/The+Forrester+Wave+Data+Preparation+Solutions+Q4+2018/-/E-RES141619).

Lněnička, M. and Máchová, R. (2017), 'Evaluating the quality of open data portals on the national level', Journal of Theoretical and Applied Electronic Commerce Research, Vol. 12, No 1, Universidad de Talca (https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0718-18762017000100003).

Neumaier, S. (2015), 'Open data quality: assessment and evolution of (meta-)data quality in the open data landscape', thesis (https://www.data.gv.at/wp-content/uploads/2016/02/Sebastian_Neumaier_MSC_2015.pdf).

Reiche, K. J. (2013), 'Assessment and visualization of metadata quality for open government data', thesis (https://www.inf.fu-berlin.de/inst/ag-se/theses/Reiche13-metadata-quality.pdf).

Strong, D. M. and Wang, R. Y. (1996), 'Beyond accuracy: what data quality means to data consumers', Journal of Management Information Systems, Vol. 12, No 4, Spring, pp. 5-33 (http://mitiq.mit.edu/Documents/Publications/TDQMpub/14_Beyond_Accuracy.pdf).

Sunlight Foundation (2017), 'Ten principles for opening up government information' (https://sunlightfoundation.com/policy/documents/ten-open-data-principles/).

List of topics (section number in brackets)

· Make use of tooling whenever possible (1)
· Develop a data management plan (1)
· Describe your data with metadata to improve data discovery (1)
· Mark null values explicitly as such (1)
· Publish data without restrictions (1)
· Provide an accessible download URL (1)
· Consider ISO standards for formatting date and time (1)
· Use a dot to separate whole numbers from decimals (1)
· Do not use a thousand separator (1)
· Make use of a standardised character encoding (1)
· Provide an appropriate amount of data (1)
· Consider community standards (1)
· Remove duplicates from your data (1)
· Increase the accuracy of your data (1)
· Provide information on byte size (1)
· Make use of controlled vocabularies to standardise data (2)
· Link relevant data sets (2)
· Use knowledge bases for enrichment (2)
· Use schemas to specify data structure (3)
· Document data changes (3)
· Use a machine-readable format (4)
· Use a non-proprietary format (4)
· Consider open standards (4)
· Consider linked data principles (4)

GETTING IN TOUCH WITH THE EU

In person
All over the European Union there are hundreds of Europe Direct information centres. You can find the address of the centre nearest you at: https://europa.eu/european-union/contact_en

On the phone or by email
Europe Direct is a service that answers your questions about the European Union.
You can contact this service:
· by freephone: 00 800 6 7 8 9 10 11 (certain operators may charge for these calls),
· at the following standard number: +32 22999696, or
· by email via: https://europa.eu/european-union/contact_en

FINDING INFORMATION ABOUT THE EU

Online
Information about the European Union in all the official languages of the EU is available on the Europa website at: https://europa.eu/european-union/index_en

EU publications
You can download or order free and priced EU publications from: https://op.europa.eu/en/publications. Multiple copies of free publications may be obtained by contacting Europe Direct or your local information centre (see https://europa.eu/european-union/contact_en).

EU law and related documents
For access to legal information from the EU, including all EU law since 1951 in all the official language versions, go to EUR-Lex at: http://eur-lex.europa.eu

Open data from the EU
The official portal for European data (https://data.europa.eu/en) provides access to datasets from the EU. Data can be downloaded and reused for free, for both commercial and non-commercial purposes.

Mar 20, 2022
