4. Database Concepts

The EBAS data model

The model presented here is a very simplified version of the technical design data model of EBAS and shows only the basic concepts in a user-understandable way. Basic knowledge of the data model will help the user in the daily work with metadata and data.

Overview

Figure: overview of the EBAS data model (../../_images/EBAS-3-user.png)

Classes of metadata

Master data

Master data entities are shown in green in the overview figure above.

Master data are general metadata that are referenced by other metadata entities. Master data should not change over time (changes are very rare and are considered corrections, not changes of the metadata). Historic states of master data are not preserved, and historic extracts will always produce the latest state of master data.

A typical example is station metadata, which are referenced by datasets. The station metadata (station name, position, altitude, …) are static and need to be the same for all measurements performed at this station. Changes, e.g. in the station position, would be corrections (if a station was physically moved, a new station needs to be created and referenced). The same applies to organisation metadata.

The vast majority of master data are controlled vocabularies (e.g. Statistics code, Instrument type, Component name, …). Those master data entities are not shown in the overview figure in order to keep the figure simple.

Static metadata

Static metadata entities are shown in blue in the overview figure above.

Some metadata are considered to be immutable. Rather than changing those metadata, the entities have to be deleted and recreated. Examples of static metadata are submissions and the dataset core metadata.

History aware metadata

History aware metadata entities are shown in red in the overview figure above.

History aware metadata keep the full history of changes in the database. See Historic states of data for more information.

Time dependent, history aware metadata

Time dependent, history aware metadata entities are also shown in red in the overview figure above.

In addition to being history aware, those metadata are always valid for a specific data interval of the timeseries and can have different values for different data intervals.

An example could be the detection limit: Different detection limits can be reported in submissions for different years (time dependent). Additionally, the detection limit for one specific year can be changed afterwards and the historic value before the change will still be available in the database (history aware).
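The two time axes in the detection limit example can be illustrated with a minimal Python sketch. This is not the EBAS implementation: the record layout, the field names and all values below are assumptions made for illustration only. Each stored version carries the data interval it applies to (time dependent) and the period during which it was the stored version (history aware).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DetectionLimitRecord:
    """One stored version of a detection limit (hypothetical layout)."""
    value: float
    data_start: datetime   # start of the data interval the value applies to
    data_end: datetime     # end of that data interval (exclusive)
    stored_from: datetime  # when this version entered the database
    stored_to: Optional[datetime] = None  # when it was replaced; None = current


def detection_limit(records, sample_time, state=None):
    """Return the value valid for sample_time, as known at database state `state`."""
    state = state or datetime.max  # no state given: use the latest state
    for rec in records:
        in_data_interval = rec.data_start <= sample_time < rec.data_end
        known_at_state = rec.stored_from <= state and (
            rec.stored_to is None or state < rec.stored_to
        )
        if in_data_interval and known_at_state:
            return rec.value
    return None


# Made-up example values: the 2019 detection limit was changed afterwards.
records = [
    DetectionLimitRecord(0.05, datetime(2019, 1, 1), datetime(2020, 1, 1),
                         stored_from=datetime(2020, 3, 1),
                         stored_to=datetime(2021, 6, 1)),
    DetectionLimitRecord(0.03, datetime(2019, 1, 1), datetime(2020, 1, 1),
                         stored_from=datetime(2021, 6, 1)),
    DetectionLimitRecord(0.02, datetime(2020, 1, 1), datetime(2021, 1, 1),
                         stored_from=datetime(2021, 3, 1)),
]

latest = detection_limit(records, datetime(2019, 6, 1))
historic = detection_limit(records, datetime(2019, 6, 1),
                           state=datetime(2020, 12, 31))
```

In this sketch, `latest` yields the changed value for 2019, while `historic` yields the value as it was stored before the change.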

Entities

Dataset

The central part of the EBAS data model is the dataset. A dataset represents all metadata and data for one specific measurement variable over time.

Homogeneity of datasets:

A dataset is homogeneous in the sense that data from different measurement intervals within the whole dataset are comparable and free of discontinuities caused by changes in instrument configuration or method.

A dataset consists of:

  • Dataset setkey, a unique identifier for the dataset.

  • Dataset core metadata, which define the identity of the dataset.

  • Additional dataset metadata, which are mutable.

  • Time dependent dataset metadata, which are mutable and can have different values for different data intervals.

  • References to station, laboratory, field instrument metadata, laboratory instrument metadata and QA metadata

    Datasets refer to other metadata entities. Those referred entities are uniquely defined (e.g. all station metadata will be the same for all datasets referring to the same station).

  • Measurement data (time series)

Dataset core metadata

Dataset core metadata define the identity of a dataset and are bound to the dataset setkey. Two datasets with identical core metadata would be indistinguishable and therefore must not exist in parallel.

As the core metadata identify the dataset, they may never change in the lifetime of a dataset. Thus the dataset core metadata are implemented as a static metadata entity.
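The identity rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names and the three fields of `CoreMetadata` are hypothetical placeholders, not the actual EBAS core metadata elements, and the setkey numbering is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class CoreMetadata:
    """Hypothetical stand-in for dataset core metadata (illustrative fields)."""
    station: str
    instrument: str
    parameter: str


class DatasetRegistry:
    """Maps core metadata to a setkey and refuses duplicates."""

    def __init__(self):
        self._setkeys = {}
        self._next_setkey = 1

    def create(self, core):
        # Two datasets with identical core metadata must not coexist.
        if core in self._setkeys:
            raise ValueError("a dataset with identical core metadata already exists")
        setkey = self._next_setkey
        self._setkeys[core] = setkey
        self._next_setkey += 1
        return setkey


registry = DatasetRegistry()
first = registry.create(CoreMetadata("station_a", "instrument_a", "parameter_a"))
second = registry.create(CoreMetadata("station_a", "instrument_a", "parameter_b"))
```

Creating a third dataset with the same core metadata as `first` would raise a `ValueError` in this sketch, and assigning to a field of a `CoreMetadata` instance fails because the dataclass is frozen.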

Core metadata are:

Dataset characteristics

Some parameters in EBAS need additional metadata to describe the quality of the variable. These additional metadata are called characteristics, as they describe special characteristics of a parameter.

Dataset characteristics are part of the Dataset core metadata and may not change in the lifetime of a dataset. Thus they are implemented as static metadata.

Examples for characteristics are:

  • Wavelength for nephelometer measurements: The parameter:

    needs one more metadata element to describe the parameter of measurement:

    • Wavelength: Nephelometers measure light scattering at different wavelengths. Some nephelometers measure the scattering at 3 wavelengths. Thus they report 3 variables with the same parameter, but with different characteristics (e.g. Wavelength=450nm, Wavelength=525nm and Wavelength=635nm).
  • Size bin for DMPS measurements: The parameter:

    needs one more metadata element to describe the parameter of measurement:

    • Median size (D) or
    • Minimum (Dmin) and maximum (Dmax) size of the size bin:

    DMPS instruments measure the particle concentration in different size bins. The number concentration in each size bin is reported as one variable. Thus the size bin needs to be specified by the above mentioned characteristics.
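How characteristics distinguish variables that share one parameter can be sketched as below. The `Variable` class, the `make_variable` helper and the placeholder parameter name are assumptions for illustration; only the three wavelength values are taken from the nephelometer example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """A reported variable: parameter plus its characteristics (illustrative)."""
    parameter: str
    characteristics: tuple  # sorted (name, value) pairs, so the object is hashable


def make_variable(parameter, **characteristics):
    """Build a Variable with characteristics in a canonical (sorted) order."""
    return Variable(parameter, tuple(sorted(characteristics.items())))


PARAMETER = "scattering_parameter"  # placeholder, not a real EBAS parameter name

nephelometer_variables = {
    make_variable(PARAMETER, Wavelength="450nm"),
    make_variable(PARAMETER, Wavelength="525nm"),
    make_variable(PARAMETER, Wavelength="635nm"),
}

# Same parameter, three different characteristics: three distinct variables.
variable_count = len(nephelometer_variables)
```

A DMPS size bin could be modelled in the same way by passing the size bin characteristics (for example `D`, or `Dmin` and `Dmax`) as keyword arguments.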

Additional dataset metadata

Additional dataset metadata can change historically through the lifetime of a dataset. Changes are considered as corrections or additions (metadata were not known before).

However, those metadata are not time dependent and need to be constant over the whole time series. A change in those metadata over time would break the continuity criterion of a dataset and call for the creation of a new dataset.

Additional dataset metadata are implemented as history aware metadata.

  • External laboratory (performing the analysis)
  • Data level
  • Standard method
  • Filter medium, coating and/or solution
  • Inlet type
  • Humidity/temperature control
  • The standard conditions the measurements are based on (standard temperature, standard pressure)

New in version 3.01.00: the following attributes were added:

  • Absorption cross section
  • Sensor type

Time dependent dataset metadata

Time dependent dataset metadata are dataset metadata which can have different values for different time intervals of the time series. Additionally they have full history support. Thus they are implemented as a time dependent, history aware metadata entity.

  • Statement about occurrence of zero or negative values
  • Sample preparation
  • Blank correction
  • Detection limit
  • Uncertainty (relative or absolute)
  • Calibration standard ID
  • Inlet description (free text; inlet type is defined in Additional dataset metadata)
  • Humidity/temperature control description (free text; Humidity/temperature control is defined in Additional dataset metadata)
  • Measurement latitude
  • Measurement longitude
  • Measurement altitude
  • Measurement height
  • Orig. time res.
  • Sample duration
  • Comment

New in version 3.01.00: the following attributes were added:

  • Upper range limit
  • Secondary standard ID
  • Inlet tube material
  • Inlet tube outer diameter
  • Inlet tube inner diameter
  • Inlet tube length
  • Maintenance description
  • Zero/span check type
  • Zero/span check interval
  • Flow rate
  • Filter face velocity
  • Exposed filter area
  • Filter description
  • Filter prefiring (prefiring codeword, temperature, time)
  • Filter conditioning (yes/no, temp, RH, time)
  • Artifact correction
  • Artifact correction description
  • Charring correction
  • Water vapor correction
  • Ozone correction

Instrument metadata

The instrument metadata are composed of

Instrument core metadata

Instrument core metadata are composed of:

Note

Instrument naming

Choosing the instrument name is not always straightforward. Especially when changing instruments (e.g. using a new instrument model, or the same model with a different serial number), it can be difficult to decide on the instrument naming.

Generally, this is a question not only of instrument name and instrument identity, but implicitly also of dataset identity and the homogeneity of datasets.

As a general rule, when the measurements are still comparable with the ones done with the old instrument setup, and they show no discontinuities due to the instrument change, the instrument name can be (but does not have to be) the same.

Instrument manufacturer, instrument model and instrument serial number can be specified separately for each reporting period, regardless of the instrument name being used (see also Time dependent instrument metadata). This enables the use of the same instrument name with different instrument models or serial numbers.

If a period of co-located measurements is performed (with the old and the new instrument operating at the same time), a new instrument name needs to be created, otherwise the measurements could not be distinguished.

If the results are not expected to be comparable, a new instrument name must be assigned as well.

A new instrument name will always result in the creation of new datasets.

Example: If the DMPS at Zeppelin mountain has been exchanged with a similar instrument and the measurements are comparable, the lab can still report the measurements with instrument name dmps_no42, but report a different instrument manufacturer, instrument model and instrument serial number for the next reporting period.
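The naming rules in this note can be summarised as a small decision function. This is a sketch of the rules as read from the text, not an EBAS tool; the function and argument names are hypothetical. Treating comparable measurements that nevertheless show discontinuities as requiring a new name is an assumption derived from the homogeneity requirement for datasets.

```python
def needs_new_instrument_name(co_located, comparable, discontinuities):
    """Return True if a new instrument name (and thus new datasets) is required.

    co_located:      old and new instrument operate at the same time
    comparable:      results are expected to be comparable with the old setup
    discontinuities: the instrument change causes discontinuities in the data
    """
    if co_located:
        # Parallel measurements could not be distinguished under one name.
        return True
    if not comparable or discontinuities:
        # Assumption: discontinuities break dataset homogeneity.
        return True
    # The same name can be kept (but does not have to be).
    return False


# Instrument exchanged with a similar one, measurements comparable:
keep_name = not needs_new_instrument_name(
    co_located=False, comparable=True, discontinuities=False
)
```

A result of `False` only means that keeping the name is allowed; the laboratory may still decide to assign a new name.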

Time dependent instrument metadata

Some attributes of instrument metadata may change over time even if the instrument identity (Instrument reference) and the core metadata are the same:

See also the note on instrument naming for details.

Time dependent instrument metadata are implemented as a time dependent, history aware metadata entity.

Analytical instrument metadata

New in version 3.01.00.

The analytical instrument metadata are composed of

Analytical instrument core metadata

Analytical instrument core metadata are composed of:

Note

Analytical instrument naming

The analytical instrument name should be a name used in the lab for referring to an instrument. The data model allows for using the same name even if the physical instrument changes over time (e.g. change of instrument). An analytical instrument is assigned a manufacturer, instrument model and serial number in the time dependent analytical instrument metadata.

Unlike the field instruments (where a new instrument name requires a new dataset), analytical instruments can change over time within one dataset. The reason for this is that laboratories very often use several instruments with the same analytical measurement technique interchangeably (i.e. samples from one site may be analysed on different instruments), while the time series is still considered to be consistent. The relation between dataset and laboratory instruments is defined by the Time dependent analytical instrument employment.

Time dependent analytical instrument metadata

Some attributes of the laboratory instrument metadata may change over time even if the instrument identity (analytical instrument reference) and the core metadata are the same:

See also the note on analytical instrument naming for details.

Time dependent analytical instrument metadata are implemented as a time dependent, history aware metadata entity.

Time dependent analytical instrument employment

Which laboratory instrument was used for a given time series may change over time, even if the dataset is considered to be consistent.

The laboratory instrument can be bound to a dataset for a given valid time interval.

See also the note on analytical instrument naming for details.

Time dependent analytical instrument employment is implemented as a time dependent, history aware metadata entity.

QA Metadata

New in version 3.01.00.

The QA metadata are composed of

  • reference to a dataset
  • reference to a QA measure (which can be an interlaboratory comparison, on-site or off-site intercomparison or an on-site audit)
  • data of the QA measure performed
  • valid time interval (measurement time interval for which the QA is valid)
  • QA specific data:
    • general outcome (pass, no pass, not participated)
    • bias (relative or absolute)
    • variability (relative or absolute)
    • documentation about the QA (document name, date, URL)

QA metadata are implemented as a time dependent, history aware metadata entity.

Submission

The submission entity stores all metadata related to the submitted data file itself.

A submission represents a data file that has been reported to EBAS and ingested into the database. One submission (data file) can contain one or more variables. Each variable relates to one dataset in EBAS. However, one submission contains data for only one submission interval (usually one year), while a dataset usually contains data from multiple submission intervals.

  • Origin of data:
    • Organization which produced the data
    • Data originator and submitter roles
  • Revision information (version, description, revision date)
  • NILU staff who imported the data

Submissions are stored as static metadata. A submission never ceases to exist; it can only be superseded by a new submission, and even then the original submission remains as a historic fact.

Roles

Roles describe the role of persons who contributed to producing the data. There are two types of roles:

Roles are related to data submissions. There must be at least one data originator and one data submitter for each submission.

Roles are stored as static metadata.

Project associations

Project associations associate a certain time interval of data of a dataset with a framework.

Each dataset can be associated with multiple frameworks, even for the same or overlapping time intervals. However, each dataset must be associated with at least one framework for any time interval of its data (no time interval of data may exist without a framework association).
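The coverage rule for project associations can be checked with a simple interval sweep. The sketch below is an illustration only: the function name, the tuple layout, the half-open interval convention and the framework names are assumptions, not part of EBAS.

```python
from datetime import datetime


def uncovered_gaps(data_start, data_end, associations):
    """Return the parts of [data_start, data_end) without any framework association.

    associations: iterable of (framework, start, end) tuples, end exclusive.
    Overlapping associations are allowed.
    """
    gaps = []
    cursor = data_start  # everything before the cursor is known to be covered
    for _, start, end in sorted(associations, key=lambda a: a[1]):
        if end <= cursor:
            continue  # adds no coverage beyond what is already covered
        if start >= data_end:
            break     # starts after the data interval ends
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
        if cursor >= data_end:
            break
    if cursor < data_end:
        gaps.append((cursor, data_end))
    return gaps


# Hypothetical framework names and intervals.
associations = [
    ("FRAMEWORK_A", datetime(2015, 1, 1), datetime(2017, 1, 1)),
    ("FRAMEWORK_B", datetime(2016, 1, 1), datetime(2016, 7, 1)),  # overlaps A
]
gaps = uncovered_gaps(datetime(2015, 1, 1), datetime(2018, 1, 1), associations)
```

Here `gaps` contains the year 2017, which has data but no framework association and would therefore violate the rule above.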

Historic states of data

EBAS keeps the full history of changes in the database. Any historic state of the database can be reproduced. This enables some additional features which will be described in the following sub-chapters.

There are however some restrictions to the history function:

  • History has been supported since release 3.0 of EBAS, which was rolled out in May 2014. Thus the history is available from this date onwards. Older data appear as if inserted on 1 May 2014 (2014-05-01T00:00:00).
  • NRT data are stored without any historic information. All metadata and data are stored in the latest state only.
  • Some rare database maintenance requires changes that are not visible in the history of the database. This is mainly the case when changing master data. Those changes are avoided as much as possible.

Operation with historic database state (Time travel)

All EBAS programs that query data (e.g. ebas_list_ds, ebas_extract, all statistics programs and many more) can query the database as if it was any historic date in the past using the --state argument. The result of the operation will be the same as if the operation had been performed at the historic point in time specified. This can be thought of as a time travel option (unfortunately we can only travel back in time - sorry, no future observations in this version of EBAS).

Differences between two (historic) database states

Another use of the EBAS history is the possibility of restricting EBAS programs to work only on data and metadata that changed between two historic database states. This can be achieved with the --diff argument. Only datasets changed between the database state and the date given with --diff will be processed.

A special case of this feature is the possibility of differential data extracts (see ebas_extract - differential extracts).
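Conceptually, restricting an operation to the differences between two database states amounts to selecting the datasets that have at least one change between the two timestamps. The following sketch illustrates the idea only; it does not reproduce how the --diff argument is implemented. The change log structure, the setkeys and the boundary handling (earlier state exclusive, later state inclusive) are assumptions.

```python
from datetime import datetime


def changed_between(change_log, state_a, state_b):
    """Return the setkeys of datasets changed between two database states.

    change_log: dict mapping a dataset setkey to its change timestamps
                (hypothetical structure).
    """
    earlier, later = sorted((state_a, state_b))
    return sorted(
        setkey
        for setkey, timestamps in change_log.items()
        if any(earlier < ts <= later for ts in timestamps)
    )


# Made-up setkeys and change timestamps.
change_log = {
    101: [datetime(2015, 3, 1)],
    102: [datetime(2015, 3, 1), datetime(2019, 5, 10)],
    103: [datetime(2020, 1, 15)],
}
changed = changed_between(change_log, datetime(2018, 1, 1), datetime(2019, 12, 31))
```

In this example only dataset 102 was changed between the two states, so only that dataset would be processed.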

Near realtime data

Near real time (NRT) data in EBAS are usually available within two hours after the observation.

NRT datasets are handled specially in the database in many respects.

The high frequency of changes to each NRT dataset (usually one change per hour) makes it impossible to keep the history of changes in the database. For NRT data, only the latest state of the data is stored. If a historic state of the data is accessed, the time series appears as it was at the historic timestamp, except that measurement samples up to the current state of the database are reported as missing (rather than as not existing, which would have been correct at the historic state). This is a side effect of not storing the historic changes in the database: data that would have been future data from the perspective of the historic state appear as missing.

Furthermore, the project acronyms associated with NRT datasets always end with _NRT. This is how NRT data are marked for data users. Additionally, data policies will generally be different for NRT data in all frameworks. Thus a different project acronym, implying a different data policy and different access rights, is needed for all projects.

Instrument names and instrument references of NRT data always end with _NRT. This is necessary in order to make the instrument metadata of NRT data completely independent from those of regular (quality assured) data. The submission of quality assured data should in no way change the instrument metadata of stored NRT data of the same (physical) instrument and vice versa. A further complication is that instrument metadata for NRT data should not be history aware and need to be handled differently whenever they are inserted, changed or deleted. Therefore an additional “virtual” instrument is created for NRT data, even though in reality it is the same physical instrument.
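The _NRT naming convention can be expressed with two small helper functions. These helpers are hypothetical and only illustrate the suffix rule described above; they are not part of the EBAS software.

```python
NRT_SUFFIX = "_NRT"


def is_nrt(name):
    """True if an instrument name or project acronym follows the NRT convention."""
    return name.endswith(NRT_SUFFIX)


def nrt_name(name):
    """Derive the name of the "virtual" NRT counterpart of a regular name."""
    return name if is_nrt(name) else name + NRT_SUFFIX


regular = "dmps_no42"          # instrument name used for quality assured data
virtual = nrt_name(regular)    # separate "virtual" instrument for NRT data
```

Keeping the two names distinct means that metadata stored for `virtual` can be handled independently of the metadata stored for `regular`, even though both refer to the same physical instrument.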

All time dependent metadata will only feature one gapless interval for NRT data.