Archive 1.x:Datasets: Import Dataset Files

Once the dataset is prepared, it must then be imported into the framework via structWSF. There are multiple ways to import datasets into a structWSF node, which are explained below. Some datasets will be small (a few hundred to a few thousand records) and others much bigger (millions of records).

It is also possible to append to existing datasets, which is well suited to relatively minor updates of small- to medium-sized datasets.

Initial Imports
Depending on the size of the dataset to import, different methods can be used to manage the process.

Small Datasets
Small datasets can easily be imported into a structWSF node by using the OSF-Drupal Import module. It is just a matter of having the dataset described and serialized in one of the supported formats. Once it gets imported, a new dataset is created in which the record descriptions are indexed.

Medium-sized Datasets
Medium-sized datasets (thousands of records) can still be imported using the OSF-Drupal Import module. However, they should be imported as smaller slices, one by one. Once a slice is imported, it can be manually reviewed by the importer and then appended to the dataset where all slices are merged together. The appending is done using the OSF-Drupal structAppend module.

Large Datasets
Bigger datasets (hundreds of thousands to millions of records) should be handled by dedicated import scripts. These scripts should split the big dataset into slices of a few hundred records. Then, each slice should be sent to the CRUD: Create web service endpoint. A mechanism should be put in place to restart from a specific slice if something goes wrong during the import process, as in the sketch below.
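
As an illustration, here is a minimal Python sketch of such a script. The endpoint URL, dataset URI and the parameter names sent to CRUD: Create (document, mime, dataset) are assumptions for illustration; check them against the web service documentation of your own instance.

    import requests

    WS_ENDPOINT = "http://localhost/ws/crud/create/"           # assumed endpoint URL
    DATASET_URI = "http://localhost/wsf/datasets/my-dataset/"  # assumed dataset URI
    SLICE_SIZE = 200                  # a few hundred records per slice
    CHECKPOINT = "import.checkpoint"  # remembers the last slice successfully sent

    def last_committed():
        """Return the index of the last successfully imported slice, or -1."""
        try:
            with open(CHECKPOINT) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return -1

    def import_dataset(records, serialize):
        """Send records to CRUD: Create in slices, resuming after a failure.

        serialize() must turn a list of records into an RDF+XML document.
        """
        start = last_committed()
        for offset in range(0, len(records), SLICE_SIZE):
            slice_no = offset // SLICE_SIZE
            if slice_no <= start:     # restart support: skip slices already sent
                continue
            resp = requests.post(WS_ENDPOINT, data={
                "document": serialize(records[offset:offset + SLICE_SIZE]),
                "mime": "application/rdf+xml",
                "dataset": DATASET_URI,
            })
            resp.raise_for_status()   # on failure, rerunning resumes at this slice
            with open(CHECKPOINT, "w") as f:
                f.write(str(slice_no))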

Setup Considerations
Importing a big dataset can be challenging in terms of planning and management. Because a structWSF instance is composed of different pieces of software, some of them react better than others in terms of indexing time and impact on a production system.

Ideally, big datasets should be indexed offline and switched over to the production instance once ready. However, this is not always possible, depending on the resources available to the organization.

In that case, we will see how the configuration of a structWSF instance can be changed to minimize the impact on a production server.

Two pieces of software are involved when it comes to record indexing in structWSF. Currently, the two data management systems involved are Virtuoso and Solr: Solr is used by the Browse and Search web service endpoints, while Virtuoso is used by the other web service endpoints.

Virtuoso has no problem indexing records in real time while answering other kinds of queries. Everything is done in parallel, and indexing has no big impact on other queries. New data is committed on a periodic basis (depending on the settings of the instance), but uncommitted data is still available to users.

However, the scenario is not the same for Solr. Indexing in Solr takes much more time than with Virtuoso, while commit time is similar. The big problem with Solr is that uncommitted data is not available to users and committing data takes time. This brings two new scenarios to the table:


 * 1) The initial indexing of a big dataset into the system
 * 2) Subsequent updates to that big dataset

In the first scenario, we don't care if users can't access the data until the full import is done. In the second scenario, we do want to make sure that each time a new record is added, an existing record is updated or a record is deleted, the change immediately appears in the search and browse indexes.
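
For context, what makes pending changes visible in Solr is a commit request against its update handler. A minimal sketch, assuming a default Solr install on localhost, looks like this:

    import requests

    # Ask Solr to commit pending documents so they become visible to searches.
    # The URL assumes a default local Solr install; adjust it to your setup.
    requests.post("http://localhost:8983/solr/update",
                  data="<commit/>",
                  headers={"Content-Type": "text/xml"})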

Generally, for the first use case above, you should set solr_auto_commit to "TRUE" in the data.ini file of the structWSF instance. Then, you should make sure that the <autoCommit> element is enabled in the solrconfig.xml Solr configuration file. In some versions, it won't work if both the <maxDocs> and <maxTime> options are set, but only if one of the two is used, so this is something to watch out for.
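
For reference, the two settings look roughly like the following. The values are examples only; consult your own data.ini and solrconfig.xml for the exact syntax:

    ; data.ini (structWSF): delegate commits to Solr during the bulk import
    solr_auto_commit = "TRUE"

and:

    <!-- solrconfig.xml (Solr): commit automatically every 10,000 documents.
         On some versions, enable only one of <maxDocs> or <maxTime>. -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
    </autoCommit>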

Then, once you are done with the full import, scenario 2 kicks in and you should turn solr_auto_commit back to "FALSE". That way, each time a record gets added, modified or deleted, the change will automatically appear in the search and browse indexes.

So, each time you have a big dataset to import, take care to change this setting for the duration of the import. In some cases, even with solr_auto_commit set to TRUE, you could experience slower query times as the index becomes bigger and bigger. In such cases, consider giving more memory to the Solr instance, at least for the duration of the indexing.
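
How that is done depends on how Solr is launched. For a typical Jetty-based install of that era, raising the JVM heap when starting Solr would look something like this (the heap sizes are examples only):

    java -Xms1g -Xmx2g -jar start.jar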

Performance Checklist
Depending on your use case, make sure that you optimize your import performance by checking these points on the checklist:


 * 1) Can I provide more memory to the running Virtuoso instance?
 * 2) Can I provide more memory to the running Solr instance?
 * 3) Do I feed the CRUD: Create web service endpoint with batches of 100 or 200 records (and not one record at a time)?
 * 4) Is the Solr auto-commit setting set to TRUE?
 * 5) If the Tracker: Create web service endpoint is enabled for create, did I disable it in network.ini before starting the big dataset import (see the sketch below)?
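
On the last point, the exact network.ini keys vary by structWSF version, so the following is only a hypothetical sketch; check your own file for the real section and key names:

    ; network.ini -- hypothetical sketch; real section and key names may differ
    [tracking]
    track_create = "false"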

Doing an Import with OSF-Drupal
See also the Adding a New Dataset documentation.

Dealing with Missing Attributes and Types
If you are using structImport to import a dataset and the option "Check for missing attributes and types in the imported dataset." is enabled, or if your import script supports that functionality, then each time you import a new dataset, the system will tell you which attributes or types used in the dataset are missing from the ontologies currently used by the OSF instance.

Read more about what should be done when such attributes and/or types are detected at import time.

Update Datasets
To append new records to an existing dataset, see the document on how to append datasets.