Archive 1.x:Appending Datasets


Introduction

This page describes an import workflow for medium-sized datasets that uses the structImport and structAppend OSF-Drupal modules. The workflow includes review steps performed by the data maintainer who imports the dataset. It uses the commON instance record serialization format as its example; note that the other formats (RDF+XML, irJSON, RDF+N3, etc.) can follow exactly the same guidelines.

Note: As emphasized below, appending to a dataset starts with creating a new dataset that holds the new or updated records. Once you have confirmed that the import worked as you wished, you append that newly imported set of records to an existing target dataset. At that point, you may optionally delete the import source dataset with its now-duplicate records.

The Goal

The goal of this workflow is to import medium-sized datasets (a few thousand records), or to append a few thousand records to a bigger dataset, using only the OSF-Drupal user interfaces.

Importation Workflow Overview

The workflow proceeds as follows:

  1. Plan the importation strategy
  2. Create the dataset to which all slices will be appended
  3. Import the piece(s) of the dataset using the structImport module. For each piece:
    1. Review what was imported into the node
    2. Once the review is accepted, append the dataset using the structAppend module
    3. Delete the piece (source dataset) that was appended to the full dataset

Importation Workflow Explained

1. Plan the importation strategy

Different strategies can be developed depending on the size of the dataset you want to import into your node. Generally, we consider a medium-sized dataset to be one with a few tens of thousands of records (let's say, between 10 000 and 50 000).

A thousand records can easily be imported into your node using the structImport module alone. However, importing 25 000 of them in one shot through this lightweight user interface would be much trickier. This is why we need a strategy such as the one discussed in this document.

If you only have 1000 or 2000 records to import, just import them directly using structImport and stop reading here. If you have more records than that, we suggest you continue reading.

So the planning process consists of figuring out how to split your medium-sized dataset into multiple slices. In this document, we consider that your dataset is serialized using the commON format.
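
As a concrete illustration, here is a minimal Python sketch that splits a large record file into fixed-size slices. It assumes, purely for illustration, that the first few lines of the file form a header block (dataset metadata plus the attribute row) that must be repeated at the top of every slice, and that each line after that header holds exactly one record; adjust HEADER_LINES and RECORDS_PER_SLICE to match the structure of your actual commON file.

  # split_dataset.py -- split a large record file into fixed-size slices.
  # Assumption: the first HEADER_LINES lines form a header block that every
  # slice must repeat, and each following line holds exactly one record.

  HEADER_LINES = 2          # hypothetical: header lines to copy into every slice
  RECORDS_PER_SLICE = 2000  # records per slice; tune to what structImport handles

  def split_dataset(path):
      with open(path, encoding="utf-8") as f:
          lines = f.read().splitlines()
      header, records = lines[:HEADER_LINES], lines[HEADER_LINES:]
      for i in range(0, len(records), RECORDS_PER_SLICE):
          slice_path = "%s.slice-%03d" % (path, i // RECORDS_PER_SLICE + 1)
          with open(slice_path, "w", encoding="utf-8") as out:
              out.write("\n".join(header + records[i:i + RECORDS_PER_SLICE]) + "\n")
          print("wrote", slice_path)

  if __name__ == "__main__":
      split_dataset("my-dataset.common.csv")  # hypothetical file name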

2. Create Dataset

Before importing the slices of your dataset, you first have to create a dataset shell that will be used as the target dataset to which all the slices you import in this process will be appended.

You will create this dataset shell by using the structDataset OSF-Drupal module. This tool can be found at:

Then click on the "Create a new dataset" link at the bottom of the page. Once you get to the dataset creation page, fill in the following fields (a scripted equivalent is sketched after the list):

  1. Title: give a title to the dataset you are creating
  2. Description: give a description to the dataset you are creating
  3. WSF Address: enter the domain name or IP address of the OSF Web Services instance where you want to create the dataset, without any protocol identifier such as "http://" (we really are just talking about the domain name or IP address here). Normally it is the same one you used to access the dataset creation tool ("your-domain-name.com" above).
  4. Click the "save" button at the bottom of the page
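
If you ever need to script this step rather than use the structDataset form, the shell of the call looks like the sketch below. The endpoint path and parameter names are assumptions modeled on the usual OSF Web Services (structWSF) conventions, not something this page documents; verify them against the Dataset: Create web service reference of your own instance.

  import requests  # third-party HTTP library; pip install requests

  # Hypothetical values -- replace with your own instance and dataset details.
  WSF_ADDRESS = "your-domain-name.com"  # domain name only, no protocol prefix
  DATASET_URI = "http://your-domain-name.com/datasets/my-full-dataset/"

  # Assumed endpoint path and parameter names; verify against the
  # Dataset: Create documentation of your OSF Web Services instance.
  response = requests.post(
      "http://%s/ws/dataset/create/" % WSF_ADDRESS,
      data={
          "uri": DATASET_URI,
          "title": "My full dataset",
          "description": "Target dataset to which all imported slices are appended",
      },
  )
  response.raise_for_status()
  print("dataset shell created:", DATASET_URI)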

3. Import Dataset Slices

The next step is to import all the slices of your dataset, one by one, using the structImport OSF-Drupal module. For each slice you import, you will review it, append it to the target dataset, and then delete it. The import module can be found here:

There are five things you have to specify in order to import a slice (a sketch of the underlying web service call follows below):

  1. Dataset file to import: select the file of the dataset slice you want to import.
  2. Content type: the serialization format of the dataset slice you want to import (in our case, commON).
  3. Dataset name: the name of the dataset you want to import (ex: dataset A, slice X)
  4. Dataset description: the description of the dataset you want to import (ex: slice X)
  5. Save dataset on this network: the network where you want to save the dataset. Normally, the network is "localhost" or "your-domain-name.com". But if this OSF-Drupal web site has access to other networks (OSF Web Services nodes), then slices can be imported into these other nodes.
Note: each time you use the structImport module to import a slice of your dataset, the module automatically creates a dataset in which it indexes all the information about all the records of the slice.
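
Behind the form, structImport ultimately writes the slice's records into the dataset it creates through the OSF Web Services CRUD layer. If you wanted to feed records in yourself, the call would look roughly like the sketch below; the /ws/crud/create/ path, the parameter names, and the commON MIME type shown are assumptions to verify against the CRUD: Create documentation of your instance.

  import requests

  # Hypothetical values -- replace with your own instance and slice details.
  WSF_ADDRESS = "your-domain-name.com"
  SLICE_DATASET_URI = "http://your-domain-name.com/datasets/slice-001/"

  with open("my-dataset.common.csv.slice-001", encoding="utf-8") as f:
      slice_document = f.read()

  # Assumed endpoint, parameters, and MIME type; check the CRUD: Create docs.
  response = requests.post(
      "http://%s/ws/crud/create/" % WSF_ADDRESS,
      data={
          "document": slice_document,      # serialized records of the slice
          "mime": "application/iron+csv",  # assumed MIME type for commON
          "dataset": SLICE_DATASET_URI,    # dataset receiving the records
      },
  )
  response.raise_for_status()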


3.1 Review Imported Slice

Once a slice has been imported, you can review what was imported by browsing and searching the dataset. If you find issues with it, you can modify or delete the erroneous record descriptions. This review process is easier to do with smaller slices.

You can normally access the Browse and Search tools here:

Note: if too many issues exist, you can simply delete the slice's dataset, rework the slice offline, and re-import it once the issues are fixed.

You can delete datasets from the structDataset module here:
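
If you prefer to spot-check a slice programmatically rather than page through the Browse tool, a search call scoped to the slice's dataset can confirm that the expected records are there. As with the other sketches on this page, the endpoint path and parameter names are assumptions modeled on the structWSF Search service; verify them against your instance.

  import requests

  # Hypothetical values -- replace with your own instance and slice details.
  WSF_ADDRESS = "your-domain-name.com"
  SLICE_DATASET_URI = "http://your-domain-name.com/datasets/slice-001/"

  # Assumed endpoint and parameter names; verify against the Search docs.
  response = requests.post(
      "http://%s/ws/search/" % WSF_ADDRESS,
      data={
          "query": "*",                   # match everything in the slice
          "datasets": SLICE_DATASET_URI,  # restrict the search to the slice
          "items": 10,                    # a few records are enough to eyeball
          "page": 0,
      },
      headers={"Accept": "application/json"},
  )
  response.raise_for_status()
  print(response.text)  # inspect the returned record descriptions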

3.2 Append Dataset Slice

Once you are done reviewing the slice you imported, you have to append all of its content to the dataset you created in step 2 above.

To append a dataset (in this case, a slice) to another dataset, you have to use the structAppend OSF-Drupal module:

Appending a dataset to another dataset is a five-step process:

  1. Set Source Dataset
  2. Set Target Dataset
  3. Append Datasets
  4. Appending...
  5. Delete Source Dataset (Optional)

The first step is to select the source dataset: the slice you imported in step 3 above. Then you select the target dataset, which is the dataset you created in step 2 above. Next, you confirm that you want to append the content of the source dataset to that of the target dataset, and the system shows a progress bar for the appending task. Once the append process is finished, it asks whether you want to delete the source dataset (the slice); answer yes.

Note: if you say no at the last step, you can always delete the source dataset later.

Repeat these steps for each slice of your dataset.
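
Put together, the per-slice cycle is just a loop. The sketch below strings the steps into a single pass over all slices; import_slice, review_ok, append_dataset, and delete_dataset are hypothetical stubs standing in for the structImport, Browse/Search review, structAppend, and structDataset steps described above.

  # Hypothetical end-to-end loop; each stub stands in for a UI step above.

  def import_slice(path):
      """Stub for step 3: create a slice dataset and load the file into it."""
      print("importing", path)
      return "http://your-domain-name.com/datasets/%s/" % path  # hypothetical URI

  def review_ok(dataset_uri):
      """Stub for step 3.1: browse/search the slice to confirm a clean import."""
      return True

  def append_dataset(source_uri, target_uri):
      """Stub for step 3.2: copy every record of the source into the target."""
      print("appending", source_uri, "->", target_uri)

  def delete_dataset(dataset_uri):
      """Stub for step 3.3: remove the now-redundant slice dataset."""
      print("deleting", dataset_uri)

  TARGET = "http://your-domain-name.com/datasets/my-full-dataset/"  # hypothetical

  for slice_path in ["my-dataset.common.csv.slice-%03d" % n for n in (1, 2, 3)]:
      slice_uri = import_slice(slice_path)
      if not review_ok(slice_uri):
          delete_dataset(slice_uri)  # discard, fix offline, re-import
          continue
      append_dataset(slice_uri, TARGET)
      delete_dataset(slice_uri)      # optional: drop the duplicate slice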

See also the Adding a New Dataset documentation.