Methods - Data Integration
The process of integrating data from multiple sources into the GHCN-Daily dataset takes place in three steps: screening the source data for stations whose identity is unknown or questionable; classifying each station in a source dataset either as one that is already represented in GHCN-Daily or as a new site; and mingling the data from the different sources. The first two of these steps are performed whenever a new source dataset or additional stations become available, while the actual mingling of data is part of the automated processing that creates GHCN-Daily on a regular basis.
Screening the Source Data
The data record for each station within a source dataset is considered for inclusion in GHCN-Daily if it meets all of the following conditions:
- It can be identified with a name, latitude, and longitude contained in metadata provided as part of the source dataset or in standard station history information
- Its record contains 100 or more values for at least one of the GHCN-Daily elements
- It does not fail the interstation duplicate check, which compares records from all stations within a source dataset to identify cases in which more than 50% of a station's record is identical to the data for another station
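The three screening conditions above can be sketched as a simple filter. This is a minimal illustration, not the operational GHCN-Daily code: the `StationRecord` layout, the function names, and the way the duplicate fraction is computed (share of a station's values that equal another station's value for the same element and date) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StationRecord:
    # Hypothetical container for one station in a source dataset.
    station_id: str
    name: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    # element code -> {date string -> value}
    values: dict = field(default_factory=dict)


def has_identity(rec):
    """Condition 1: a name, latitude, and longitude are all available."""
    return bool(rec.name) and rec.latitude is not None and rec.longitude is not None


def has_enough_data(rec, min_values=100):
    """Condition 2: at least one element has min_values or more values."""
    return any(len(series) >= min_values for series in rec.values.values())


def duplicate_fraction(rec, other):
    """Fraction of rec's values identical to other's for the same element and date.

    Assumption: the source does not specify how the percentage is computed.
    """
    total = 0
    same = 0
    for element, series in rec.values.items():
        other_series = other.values.get(element, {})
        for date, value in series.items():
            total += 1
            if other_series.get(date) == value:
                same += 1
    return same / total if total else 0.0


def screen(records, min_values=100, max_duplicate_fraction=0.5):
    """Return the records that satisfy all three screening conditions."""
    accepted = []
    for rec in records:
        if not has_identity(rec) or not has_enough_data(rec, min_values):
            continue
        # Condition 3: compare against every other station in the same source.
        is_duplicate = any(
            duplicate_fraction(rec, other) > max_duplicate_fraction
            for other in records
            if other is not rec
        )
        if not is_duplicate:
            accepted.append(rec)
    return accepted
```

In this sketch both members of a duplicated pair are rejected, since each fails the check against the other; whether the operational procedure keeps one of them is not stated in the text.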
The next step is to determine, for each station in the source dataset, whether data for the same location are already contained in GHCN-Daily or whether the station represents a new site. Whenever possible, stations are matched on the basis of network affiliation and station identification number. If no such match can be found, existing cross-referenced lists that identify the correspondence of station identification numbers from different networks are consulted. For example, data for Alabaster Shelby County Airport, Alabama, USA, are stored under Cooperative station ID 010116 in NCDC's datasets 3200 and 3206 as well as in the data stream from the High Plains Regional Climate Center. They can therefore be combined into one GHCN-Daily record based on that ID. In dataset 3210 and the various sources for ASOS stations, however, the data for this location are stored under WBAN ID 53864, which must be matched with the corresponding Cooperative station ID using NCDC's Master Station History Record.
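The two identifier-based strategies can be illustrated with a small lookup. This is only a sketch under assumed data structures: `existing` maps a (network, ID) pair to a GHCN-Daily record label, and `cross_reference` stands in for a cross-reference list such as the Master Station History Record. The label "RECORD_A" and the function name are hypothetical; only the two station IDs come from the example above.

```python
from typing import Optional


def classify_station(network, station_id, existing, cross_reference) -> Optional[str]:
    """Return the matching GHCN-Daily record label, or None for a new site."""
    key = (network, station_id)
    # First choice: direct match on network affiliation and station ID.
    if key in existing:
        return existing[key]
    # Second choice: translate the ID through a cross-reference list.
    linked = cross_reference.get(key)
    if linked is not None and linked in existing:
        return existing[linked]
    return None


# Hypothetical record label; the IDs are those quoted in the text.
existing = {("COOP", "010116"): "RECORD_A"}
cross_reference = {("WBAN", "53864"): ("COOP", "010116")}

direct = classify_station("COOP", "010116", existing, cross_reference)
via_xref = classify_station("WBAN", "53864", existing, cross_reference)
unknown = classify_station("WBAN", "99999", existing, cross_reference)
```

Here the Cooperative ID matches directly, the WBAN ID resolves through the cross-reference, and an ID found in neither table is treated as a candidate new site.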
A third approach is to match stations on the basis of their names and location. This strategy is more difficult to automate than the other two approaches because multiple stations within the same city or town may be identified with the same name, and small differences in coordinates can be the result of either differences in accuracy or the existence of multiple stations in close proximity to each other. Consequently, the third approach is employed only when stations cannot be matched on the basis of station identification numbers or cross-reference information. This was the case, for example, for stations outside the United States whose data from the Global Summary of the Day needed to be matched with data from the International Collection.
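The ambiguity described above can be made concrete with a cautious matcher that accepts a name-and-location match only when exactly one existing station qualifies. This is a sketch, not the procedure actually used: the dictionary keys, the name normalization, and the coordinate tolerance of 0.05 degrees are assumptions chosen for illustration, and longitude wrap-around at 180 degrees is ignored.

```python
def normalize_name(name):
    """Upper-case the name and collapse runs of whitespace."""
    return " ".join(name.upper().split())


def match_by_name_and_location(candidate, existing, tolerance_deg=0.05):
    """Return the ID of the single existing station with the same normalized
    name and coordinates within tolerance_deg, or None if there is no such
    station or more than one (an ambiguous case left for manual review)."""
    target = normalize_name(candidate["name"])
    hits = [
        s["id"]
        for s in existing
        if normalize_name(s["name"]) == target
        and abs(s["lat"] - candidate["lat"]) <= tolerance_deg
        and abs(s["lon"] - candidate["lon"]) <= tolerance_deg
    ]
    return hits[0] if len(hits) == 1 else None
```

Returning None for multiple hits reflects the point made in the text: identical names and nearby coordinates may indicate either one station reported with different accuracy or several distinct stations in close proximity.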
The implementation of the above classification strategies yields a list of GHCN-Daily stations and an inventory of the source datasets to be integrated for each station. This list forms the basis for integrating, or mingling, the data from the various sources to create GHCN-Daily. Mingling takes place according to a hierarchy of data sources and in a manner that attempts to maximize the amount of data included while also minimizing the degree to which data from sources with different characteristics are mixed. While precipitation, snowfall, and snow depth are mingled separately, maximum and minimum temperatures are considered together in order to ensure that the temperatures for a particular station and day always originate from the same source. The High Plains Regional Climate Center and Global Summary of the Day, whose observations tend to apply to 24-hour summary periods that differ from those reported by other sources, are used only if no observations are available from any other source for that station, month, and element. Among the other sources, each day is considered individually; if an observation for a particular station and day is available from more than one source, the observation from the most preferred source available is used in GHCN-Daily.
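The mingling rules in this paragraph can be sketched for a single station as follows. The data layout, the function name, and the source labels used in any call are assumptions. In particular, the sketch treats maximum and minimum temperature as one group for both the daily choice and the monthly last-resort test, and it takes only the temperatures reported by the chosen source on a given day, which is one possible reading of the text rather than a confirmed detail.

```python
# Element groups: temperatures are mingled jointly, the others separately.
GROUPS = [("PRCP",), ("SNOW",), ("SNWD",), ("TMAX", "TMIN")]


def mingle(observations, hierarchy, last_resort):
    """Combine observations from several sources for one station.

    observations: {source: {(date "YYYY-MM-DD", element): value}}
    hierarchy:    source names, most preferred first
    last_resort:  set of sources used only in months with no other data
    Returns {(date, element): (value, source)}.
    """
    result = {}
    regular = [s for s in hierarchy if s not in last_resort]
    fallback = [s for s in hierarchy if s in last_resort]
    for group in GROUPS:
        # Days on which each source reports at least one element of the group.
        days = {
            s: {d for (d, e) in observations.get(s, {}) if e in group}
            for s in hierarchy
        }
        # Months (YYYY-MM) in which any regular source reports this group.
        covered_months = {d[:7] for s in regular for d in days[s]}
        all_days = set().union(*days.values())
        for day in sorted(all_days):
            # Last-resort sources are considered only in uncovered months.
            candidates = regular if day[:7] in covered_months else fallback
            chosen = next((s for s in candidates if day in days[s]), None)
            if chosen is None:
                continue
            # Take every element of the group from the one chosen source.
            for element in group:
                if (day, element) in observations[chosen]:
                    result[(day, element)] = (observations[chosen][(day, element)], chosen)
    return result
```

A call might pass a hierarchy such as `["A", "B", "GSOD"]` with `last_resort={"GSOD"}`, where "A" and "B" are placeholder names for more preferred sources. Days in a month already covered by "A" or "B" are then never filled from "GSOD", even if neither regular source reports on that particular day.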
The hierarchy of data sources used in cases of overlap is based on several criteria. In general, data that have received the greatest amount of scrutiny before being integrated into GHCN-Daily are chosen over fully automated, real-time data streams. At stations operated by the United States, sources providing a Cooperative Summary of the Day are given preference over other data streams since they contribute the largest amount of data. For international stations, the official Governmental Exchange Data are preferred over the International Collection when observations from these two sources are present on the same day. Lastly, a new source of data for a particular station may be compared to station data already contained in GHCN-Daily. The new station data are added as an additional source to the relevant GHCN-Daily station record if three conditions hold: the data from the new source match the data for the existing GHCN-Daily station during their common overlap period; the match rate is at least 50% for all elements; and the new station and the existing GHCN-Daily station are within 40 km of one another (based on their respective coordinates).
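The final comparison, a match rate of at least 50% for all elements and a separation of no more than 40 km, can be sketched as below. The function names and data layout are hypothetical. The sketch also assumes that "match" means exactly equal values on the same date, that only elements with overlapping dates are tested, and that distance is the great-circle (haversine) distance on a sphere of radius 6371 km; the source does not specify these details.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_rates(new, existing):
    """Per-element fraction of overlapping dates with identical values.

    new, existing: {element: {date: value}}
    Elements without any overlapping dates are omitted from the result.
    """
    rates = {}
    for element, series in new.items():
        common = series.keys() & existing.get(element, {}).keys()
        if common:
            same = sum(series[d] == existing[element][d] for d in common)
            rates[element] = same / len(common)
    return rates


def is_same_station(new, new_coords, existing, existing_coords,
                    min_rate=0.5, max_km=40.0):
    """True if the new source qualifies as an additional source for the
    existing station under the 50% / 40 km criteria."""
    rates = match_rates(new, existing)
    if not rates:
        return False  # no common overlap period to compare
    close = haversine_km(*new_coords, *existing_coords) <= max_km
    return close and all(rate >= min_rate for rate in rates.values())
```

Coordinates are passed as `(latitude, longitude)` tuples in decimal degrees. Both thresholds are exposed as parameters so the criteria quoted in the text remain visible at the call site.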