How to ensure datasets are updated
Very little published data stays unchanged over time, and users are usually most interested in current data. Dataset distributions must therefore be updated regularly, at the update frequency set in the dataset record. Either an already published distribution is updated, or a new distribution is created that contains the update (i.e. newly added records or changes to already published records). Updating is essential for the practical usability of the published data. Doing it correctly requires the cooperation of all members of the established team.
Preparing a dataset update essentially means preparing a distribution of the dataset: either creating a new distribution or updating one that has already been published. An update does not have to concern the data itself; often the attributes of the dataset record also need updating. The curator of the dataset is responsible for preparing the update and forwards the updated distribution, along with the prepared dataset record, to the Data Opening Coordinator for formal review and for arranging the update. The coordinator normally arranges the update in cooperation with IT department staff.
Different ways to update the dataset
All datasets change over time and can change in different ways. For example, in an Excel spreadsheet, individual rows may be added or values in existing ones may change. Updates can therefore be handled in different ways depending on the nature of the changes.
- By creating a new data file at each update period that contains only the new entries
- By maintaining a single data file and replacing it with a newly created one at each update period
- By creating multiple files for each update
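The first two approaches above can be illustrated with a minimal Python sketch. It assumes the distribution is a CSV file; the column names, the file names (`dataset-<period>.csv`, `dataset.csv`) and both function names are hypothetical and chosen only for illustration.

```python
import csv
from pathlib import Path

FIELDS = ["id", "value"]  # hypothetical columns of the dataset


def write_csv(path, rows):
    """Write a list of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def publish_increment(directory, period, new_rows):
    """Approach 1: a new file per update period holding only the new entries."""
    path = Path(directory) / f"dataset-{period}.csv"
    write_csv(path, new_rows)
    return path


def publish_full_replacement(directory, all_rows):
    """Approach 2: a single file, replaced by a freshly created one."""
    path = Path(directory) / "dataset.csv"
    tmp = path.with_name(path.name + ".tmp")
    write_csv(tmp, all_rows)  # build the new file next to the old one
    tmp.replace(path)         # then swap it in, so readers never see a partial file
    return path
```

With the first approach, users must combine all published files to obtain the full dataset, whereas the second approach keeps a stable file name that always holds the current state.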
Automated data update
As with creating a new distribution, automation can help significantly. It is in the update process that the value of automating data collection becomes most apparent. Automation brings significant time savings and, most importantly, stability, mainly through low error rates (data consistency) and processing speed. There are now a number of ways to automate data collection and the updating of distributions:
- automatic reports to data files (CSV, XML, JSON), which many information systems allow to be set up,
- BI functions within data warehouses or database systems,
- a repeatable script that creates a data file directly from a database or by querying an API,
- Robotic Process Automation (RPA) technologies, which allow data collection or export from systems that have no API or open database access,
- other possibilities.
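As an illustration of the "repeatable script" option, the following is a minimal sketch that exports the result of a database query to a CSV file. It uses SQLite from the Python standard library purely as a stand-in for whatever database the organization actually runs; the function name, the table `permits` and the output file name are hypothetical.

```python
import csv
import sqlite3


def export_query_to_csv(conn, query, out_path):
    """Run a query and write its result, with a header row, to a CSV file.

    Returns the number of data rows written.
    """
    cursor = conn.execute(query)
    header = [column[0] for column in cursor.description]
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical source data; a real script would connect to the production system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE permits (id INTEGER, issued TEXT)")
    conn.executemany(
        "INSERT INTO permits VALUES (?, ?)",
        [(1, "2024-01-05"), (2, "2024-02-11")],
    )
    written = export_query_to_csv(
        conn, "SELECT id, issued FROM permits ORDER BY id", "permits.csv"
    )
    print(f"exported {written} rows")
```

Because the script has no manual steps, it can be run by a scheduler at the update frequency set in the dataset record, so every update produces the file in the same way.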
Efforts to maximize automation in the design of datasets and in their subsequent updating contribute significantly to the sustainability of the entire catalog.