Archive 1.x:Datasets: Identify

This article deals with how to identify and bound (set the scope of) a given dataset.

Datasets are the basis for organizing structured data within the system. They contain one or more records from a single source representing the same type of instance(s). Datasets may come from the Web or from local sources, and they carry standard metadata.

At minimum, datasets have a simple structure of attribute-value pairs for each instance record (a minimal sketch is given below). Ideally, the records within a dataset share scope (types) and attributes. Datasets may differ by:

 * Types and attributes
 * Source
 * When created, or
 * Access rights.

We discuss each of these in turn.
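
As a rough illustration of that attribute-value structure, here is a minimal sketch in Python; the field and metadata names are assumptions for illustration only, not a prescribed OSF serialization.

```python
# Minimal sketch of a dataset as attribute-value pairs per instance record.
# Field and metadata names are illustrative, not an OSF serialization.
dataset = {
    "metadata": {
        "title": "Staff Directory",
        "source": "HR department",
        "created": "2013-04-01",
    },
    "records": [
        {"id": "emp-001", "name": "Jane Doe", "department": "Engineering"},
        {"id": "emp-002", "name": "John Roe", "department": "Marketing"},
    ],
}

# Every record represents the same type of instance (an employee) and is
# described by simple attribute-value pairs.
for record in dataset["records"]:
    print(record["id"], record["name"], record["department"])
```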

Types and Attributes
A type is a kind of thing, what is known as a class in set theory. However, what counts as a useful type also depends on the granularity of the data. While "animal" is a kind of thing, it might be a good type in a general environment where animals are a minor consideration, but it would likely be too generic for zoologists, who might want granularity at the family (cat, dog, ape) level or even at the genus (domestic cat, wolf, gorilla) level.

Another way to look at type is via attributes, or properties. Similar things tend to have similar characteristics or similar ways in which they can be described; the combination of properties they share helps to scope or define the types at hand.

A population of records that shares the same attributes thus also tends to share the same types. Similarity or sameness of scope can therefore be determined either on the basis of shared attributes or on the basis of similar types.

Though there are no technical reasons to prevent it, in general it is best to bound a dataset to a single type or to mostly shared attributes. This leads to cleaner records and also tends to conform to some of the other kinds of distinctions below.
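
One way to make the shared-attribute test concrete is a simple attribute-overlap measure, sketched below. This is an illustrative heuristic only, not an OSF feature, and the example records are made up.

```python
# Illustrative heuristic (not an OSF feature): gauge whether two records
# share enough attributes to plausibly belong in the same dataset.
def attribute_overlap(record_a: dict, record_b: dict) -> float:
    """Return the Jaccard similarity of the two records' attribute names."""
    keys_a, keys_b = set(record_a), set(record_b)
    return len(keys_a & keys_b) / len(keys_a | keys_b)

cat = {"name": "Felix", "species": "domestic cat", "owner": "Jane"}
wolf = {"name": "Luna", "species": "wolf", "range": "forest"}
invoice = {"number": "INV-42", "amount": 120.0, "due": "2013-05-01"}

print(attribute_overlap(cat, wolf))     # 0.5 -- plausibly one "animal" dataset
print(attribute_overlap(cat, invoice))  # 0.0 -- keep in separate datasets
```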

Source
Another difference that might be encountered in datasets is source. Different sources of data (provenance) may be more complete, more authoritative, or more frequently updated, or may differ in other ways that give them a different value or worth than datasets of alternate provenance. These differences can be important when the data is actually used, with some sources preferred over others and some sources rejected altogether.

It is thus good practice to keep datasets distinguished by source. As with types, there is granularity here as well. Some sources differ by author, while others differ by publisher or access point (such as Wikipedia). Apply the source granularity appropriate to your needs and circumstances.
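
For illustration, source can be recorded as simple provenance metadata attached to each dataset at whatever granularity fits. The field names and the filtering rule below are assumptions, not a prescribed OSF schema.

```python
# Illustrative provenance metadata kept per dataset; granularity might be
# author-level, publisher-level, or access-point-level (e.g., Wikipedia).
datasets = [
    {"id": "animals-wikipedia",
     "source": {"publisher": "Wikipedia", "retrieved": "2013-03-15"}},
    {"id": "animals-zoo-survey",
     "source": {"author": "City Zoo research staff", "retrieved": "2013-02-01"}},
]

# Downstream, provenance can drive preference for, or rejection of, sources.
preferred = [d for d in datasets
             if d["source"].get("publisher") != "Wikipedia"]
print([d["id"] for d in preferred])  # ['animals-zoo-survey']
```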

When Created
Data is also produced or published at various times. Sometimes this reflects usefulness for time-series analysis (decennial Censuses in 1980, 1990, 2000, 2010, etc.). Time differences may also reflect the timeliness or update cycle for the information. Baseball statistics compiled only through 1986 are likely less useful than recent compilations.

On the other hand, some datasets are updated on a frequent and reliable basis. Since the source may be authoritative and updated on a timely basis, there is likely no need to maintain different time slices of that data.

In any event, the time dimension of when data is created, published or updated should be taken into consideration when scoping your datasets.
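
As a small illustration (with made-up identifiers), the time dimension can be encoded in the dataset identifier for time-series slices, while a frequently refreshed, authoritative source is kept as a single dataset with an update stamp.

```python
# Illustrative only: time-sliced datasets kept separately for time-series
# analysis versus a single, frequently refreshed dataset.
census_slices = ["census-1980", "census-1990", "census-2000", "census-2010"]

weather = {"id": "weather-observations", "updated": "2013-04-01"}

print(census_slices[-1])   # most recent slice, earlier slices retained
print(weather["updated"])  # single dataset, only the latest version kept
```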

Access Rights
Another important dimension is access rights. In a company context, for example, access to employee data may differ widely. All employees may be able to access a basic staff directory. Supervisors might be able to see salaries for employees in their department. And HR or authorized supervisors may have access rights to change pay levels or directory records.

Because OSF access rights work at the level of the dataset, it is very important that you reflect such distinctions in how you scope those datasets. In the employee example above, it likely makes sense to split the single database into three different datasets to reflect the access and write distinctions noted.
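
A minimal sketch of such a split is shown below. The dataset names, attributes, and permission labels are placeholders for illustration, not the OSF permissions vocabulary.

```python
# Illustrative split of one employee database into three datasets so that
# dataset-level access rights can mirror the distinctions above.
employee_datasets = {
    "staff-directory": {
        "attributes": ["name", "title", "phone", "office"],
        "read": ["all employees"],
        "write": ["HR"],
    },
    "department-salaries": {
        "attributes": ["name", "salary"],
        "read": ["supervisors (own department)", "HR"],
        "write": ["HR"],
    },
    "hr-records": {
        "attributes": ["name", "pay level", "directory record"],
        "read": ["HR", "authorized supervisors"],
        "write": ["HR", "authorized supervisors"],
    },
}
```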

Other Considerations
For manageability, dataset size is another consideration. At present, export from OSF-Drupal is limited to "slices" of 1000 records at a time. So, while there is no technical upper limit on dataset size, possible use cases might suggest splitting large sets into smaller ones.
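
As a rough sketch of working within that limit, the snippet below chunks a record list into slices of 1000; it only illustrates the slicing idea and does not call the OSF-Drupal export API.

```python
# Split a large record list into export-sized slices of 1000 records.
SLICE_SIZE = 1000

def slices(records, size=SLICE_SIZE):
    """Yield successive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

records = [{"id": n} for n in range(2500)]
for i, chunk in enumerate(slices(records)):
    print(f"slice {i}: {len(chunk)} records")
# slice 0: 1000 records
# slice 1: 1000 records
# slice 2: 500 records
```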

Ultimately, of course, each individual circumstance will differ. An installation with both many and large datasets may opt for more aggregation in order to keep dataset numbers manageable.