Global Historical Climatology Network Daily - Methods

Data Integration

The process of integrating data from multiple sources into the Global Historical Climatology Network (GHCN)-Daily dataset takes place in three steps:

  • Screening the source data for stations whose identity is unknown or questionable
  • Classifying each station in a source dataset either as one that is already represented in GHCN-Daily or as a new site
  • Mingling the data from the different sources

The process performs the first two of these steps whenever a new source dataset or additional stations become available, while the mingling of data is part of the automated processing that creates GHCN-Daily on a regular basis.

Screening the Source Data

A station within a source dataset is considered for inclusion in GHCN-Daily if it meets all of the following conditions:

  • It can be identified with a name, latitude, and longitude contained in metadata provided as part of the source dataset or in standard station history information
  • Its record contains 100 or more values for at least one of the GHCN-Daily elements
  • It does not fail the interstation duplicate check, which compares records from all stations within a source dataset in order to identify cases in which more than 50% of a station's record is identical to the data for another station
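The three screening conditions above can be sketched in Python as follows. This is only an illustration, not the operational GHCN-Daily code: the station dictionary layout, the function names, and the choice to compute the duplicate fraction per element on matching dates are all assumptions made for the example. Only the thresholds (100 values, more than 50% identical) come from the text.

```python
from typing import Dict, List

MIN_VALUES = 100              # from the text: 100 or more values for one element
MAX_DUPLICATE_FRACTION = 0.5  # from the text: more than 50% identical fails

Series = Dict[str, float]     # hypothetical layout: date string -> value


def identical_fraction(a: Series, b: Series) -> float:
    """Fraction of series a whose values are identical in b on the same date."""
    if not a:
        return 0.0
    same = sum(1 for date, value in a.items() if b.get(date) == value)
    return same / len(a)


def passes_screening(station: dict, others: List[dict]) -> bool:
    """Apply the three screening conditions to one station (assumed layout)."""
    # Condition 1: name, latitude, and longitude are all known.
    has_identity = all(station.get(k) is not None for k in ("name", "lat", "lon"))

    # Condition 2: at least one element has 100 or more values.
    records = station.get("records", {})
    has_enough = any(len(series) >= MIN_VALUES for series in records.values())

    # Condition 3: no element is more than 50% identical to another station's.
    # Comparing element by element is an assumption of this sketch.
    is_duplicate = any(
        identical_fraction(series, other.get("records", {}).get(element, {}))
        > MAX_DUPLICATE_FRACTION
        for other in others
        for element, series in records.items()
    )
    return has_identity and has_enough and not is_duplicate
```

A station is passed as a dictionary with `name`, `lat`, `lon`, and a `records` mapping from element code to a date-indexed series; `others` holds the remaining stations of the same source dataset.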

Classifying Stations

The next step is to determine, for each station in the source dataset, whether data for the same location are already contained in GHCN-Daily or whether the station represents a new site. Whenever possible, stations are matched on the basis of network affiliation and station identification number. If there is no such match, existing cross-reference lists that identify the correspondence of station identification numbers from different networks are consulted. For example, data for Alabaster Shelby County Airport, Alabama, USA, are stored under Cooperative station ID 010116 in NCEI's datasets 3200 and 3206 as well as in the data stream from the High Plains Regional Climate Center; they are combined into one GHCN-Daily record based on the ID. In dataset 3210 and the various sources for ASOS stations, however, the data for this location are stored under WBAN ID 53864, which must be matched with the corresponding Cooperative station ID using NCEI's Master Station History Record.

A third approach is to match stations on the basis of their names and locations. This strategy is more difficult to automate than the other two approaches because multiple entries within the same city or town, with the same name and small differences in coordinates, can result either from differences in coordinate accuracy or from the existence of multiple stations in close proximity to each other. Consequently, the third approach is used only when stations cannot be matched on the basis of station identification numbers or cross-reference information. This was the case, for example, when matching stations outside the United States whose data originate from the Global Summary of the Day dataset and from the International Collection.
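The order in which the three matching strategies are tried can be sketched as below. The data layout, the function name `classify_station`, and the coordinate tolerance `tol_deg` are hypothetical and chosen only for illustration; the text does not state how close two coordinates must be for a name-and-location match.

```python
from typing import Dict, Optional, Tuple

StationKey = Tuple[str, str]  # (network, station ID), an assumed layout


def classify_station(
    candidate: dict,
    existing: Dict[StationKey, dict],
    xref: Dict[StationKey, StationKey],
    tol_deg: float = 0.05,  # hypothetical tolerance, not from the text
) -> Optional[StationKey]:
    """Return the key of the matching existing station, or None for a new site."""
    key = (candidate["network"], candidate["id"])

    # 1. Match on network affiliation and station identification number.
    if key in existing:
        return key

    # 2. Match through a cross-reference list between networks.
    mapped = xref.get(key)
    if mapped in existing:
        return mapped

    # 3. Last resort: match on name and location.
    for ex_key, ex in existing.items():
        same_name = ex["name"].strip().lower() == candidate["name"].strip().lower()
        close = (
            abs(ex["lat"] - candidate["lat"]) <= tol_deg
            and abs(ex["lon"] - candidate["lon"]) <= tol_deg
        )
        if same_name and close:
            return ex_key
    return None
```

With the example from the text, a cross-reference entry `{("WBAN", "53864"): ("COOP", "010116")}` would route the ASOS record to the existing Cooperative station in step 2, so step 3 is never reached for it.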

Mingling

The implementation of the above classification strategies yields a list of GHCN-Daily stations and an inventory of the source datasets to be integrated for each station. This list forms the basis for integrating, or mingling, the data from the various sources to create GHCN-Daily. Mingling takes place according to a hierarchy of data sources and in a manner that attempts to maximize the amount of data included while also minimizing the degree to which data from sources with different characteristics are mixed. While precipitation, snowfall, and snow depth are each mingled separately, maximum and minimum temperatures are considered together in order to ensure that the temperatures for a particular station and day always originate from the same source. Data from the Global Summary of the Day dataset, whose observations tend to apply to 24-hour summary periods that differ from those reported by other sources, are used only if no observations are available from any other source for that station, month, and element. Among the other sources, each day is considered individually; if an observation for a particular station and day is available from more than one source, GHCN-Daily uses the observation from the most preferred source available.
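A minimal sketch of this mingling logic for a single station is given below. The source labels, the `YYYY-MM-DD` date strings, and the function names are assumptions made for the example. Pairing maximum and minimum temperature by taking both from the most preferred source that reports either one on a given day is likewise one possible reading of the text, not a confirmed detail of the operational procedure.

```python
from typing import Dict, List, Tuple

GSOD = "GSOD"  # hypothetical label for the Global Summary of the Day source


def mingle_element(
    by_source: Dict[str, Dict[str, object]],
    priority: List[str],
) -> Dict[str, Tuple[object, str]]:
    """Merge one element for one station.

    by_source maps source -> {"YYYY-MM-DD": value}; priority lists the
    non-GSOD sources, most preferred first. Returns date -> (value, source).
    """
    merged: Dict[str, Tuple[object, str]] = {}
    for source in priority:
        if source == GSOD:
            continue  # GSOD is handled separately below
        for date, value in by_source.get(source, {}).items():
            # setdefault keeps the value from the most preferred source.
            merged.setdefault(date, (value, source))

    # GSOD is used only in months with no observation from any other source.
    months_covered = {date[:7] for date in merged}
    for date, value in by_source.get(GSOD, {}).items():
        if date[:7] not in months_covered:
            merged[date] = (value, GSOD)
    return merged


def mingle_temperatures(tmax_by_source, tmin_by_source, priority):
    """Merge TMAX and TMIN as a pair so both come from one source per day."""
    paired = {}
    for source in set(tmax_by_source) | set(tmin_by_source):
        tmax = tmax_by_source.get(source, {})
        tmin = tmin_by_source.get(source, {})
        paired[source] = {d: (tmax.get(d), tmin.get(d)) for d in set(tmax) | set(tmin)}
    return mingle_element(paired, priority)
```

Calling `mingle_element` once per element keeps precipitation, snowfall, and snow depth independent of one another, while `mingle_temperatures` returns `(tmax, tmin)` tuples in which either member may be `None` if the chosen source lacks it.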

Several criteria determine the hierarchy of data sources used in cases of overlap. In general, higher priority is given to sources that received the greatest amount of scrutiny before integration into GHCN-Daily than to fully automated, real-time data streams. For stations within the United States, sources providing Cooperative Summary of the Day data are given preference over other data streams since they contribute the largest amount of data. For international stations, the official Governmental Exchange Data are preferred over the International Collection when observations from these two sources are present on the same day. Lastly, a new source of data for a particular station may be compared to station data already contained in GHCN-Daily. If data from the new source match data for a station already added to GHCN-Daily during their common overlap period, the match rate is at least 50% for all elements, and the new station and the existing GHCN-Daily station are within 40 km of one another (based on their respective coordinates), then the new station data are added as an additional source to the existing GHCN-Daily station record.
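The final comparison (a match rate of at least 50% for all elements and a separation of at most 40 km) can be illustrated with the sketch below. The great-circle (haversine) distance on a spherical Earth is an assumption, since the text says only that the distance is based on the stations' coordinates. Restricting "all elements" to the elements both stations report, and treating an element with no overlapping dates as a non-match, are also assumptions of this example.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def match_rate(new: Dict[str, float], existing: Dict[str, float]) -> float:
    """Share of common dates on which the two series report the same value."""
    common = new.keys() & existing.keys()
    if not common:
        return 0.0
    return sum(new[d] == existing[d] for d in common) / len(common)


def should_merge(new_station: dict, existing_station: dict,
                 min_rate: float = 0.5, max_km: float = 40.0) -> bool:
    """Decide whether new_station is an additional source for existing_station."""
    distance = haversine_km(new_station["lat"], new_station["lon"],
                            existing_station["lat"], existing_station["lon"])
    if distance > max_km:
        return False
    shared = new_station["records"].keys() & existing_station["records"].keys()
    if not shared:
        return False
    return all(
        match_rate(new_station["records"][e], existing_station["records"][e]) >= min_rate
        for e in shared
    )
```

Both thresholds are exposed as parameters so that the effect of stricter or looser criteria can be explored on test data.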

Quality Control

During each reprocessing cycle, the data are first passed through a "format checking program" that looks for problems such as impossible months or days, invalid characters in data fields, and so forth. If such problems are found, the routine sets the offending records to missing. The primary purpose of this program is to ensure that the data integration procedures neither introduce nor retain records that violate the intended and documented GHCN-Daily data format. Next, a comprehensive sequence of fully automated QA procedures identifies daily values that violate one of the quality tests. Described below and in greater detail in Durre et al. (2010), these tests identify a variety of data problems, including the excessive duplication of data records; exceedance of physical, absolute, and climatological limits; excessive temporal persistence; excessively large gaps in the distributions of values; internal inconsistencies among elements; and inconsistencies with observations at neighboring stations. This system flags approximately 0.3% of nearly 2 billion data values. It is estimated that 98-99% of the values flagged are true data errors and only 1-2% are false positives (Durre et al. 2010). This level of performance is achieved through careful selection and evaluation of procedures and test thresholds using the techniques described by Durre et al. (2008). The tests are as follows (see the readme.txt file for a list of the flags assigned when a particular test fails):

  • Trace flag consistency check - Checks for days on which the data measurement flag indicates a trace yet the amount is nonzero. This flag applies to precipitation, snowfall, snow depth, evaporation, water equivalent of snow on the ground, and wind movement.
  • Naught check - Checks for days on which maximum and minimum temperature are both equal to 0°C at stations not operated by the United States or are both equal to -17.8°C (0°F) at United States stations.
  • Duplicate data check - Checks for duplication of the data between entire years, different years in the same calendar month, and different months within the same year. This check applies to air, evaporation pan, and soil temperatures, precipitation, and snowfall.
  • World record exceedance check - Identifies values that fall outside the world extremes, that is, above the highest or below the lowest value ever observed. This check applies to all elements except weather types.
  • Streak check - Checks for unrealistic sequences of identical values in time series of non-missing values (or of non-missing, non-zero values in the case of precipitation). Flags sequences of
    • 20 or more consecutive identical values in time series of non-missing daily maximum, minimum, and observation time air temperature;
    • 20 or more consecutive identical values in time series of non-missing and non-zero precipitation observations;
    • 10 or more consecutive identical non-zero values in time series of non-missing snowfall totals;
    • 90 or more consecutive identical non-zero values in time series of non-missing snow depth values.
  • Frequent-value check (precipitation only) - Checks for clusters of 5-9 identical moderate to heavy daily totals in time series of non-zero precipitation observations.
  • Gap check - Identifies unrealistic breaks in the period-of-record distribution of elements for a particular calendar month. Flags:
    • Maximum/minimum air, evaporation pan, or soil temperatures that are at least 10°C warmer or colder than all other corresponding maximum/minimum temperatures for a given station and calendar month
    • Precipitation values that are at least 300 mm larger than all other precipitation totals for a given station and calendar month
    • Snow depth values that are at least 35 cm larger than all other reported snow depths for a given station and calendar month
  • Z-score-based climatological outlier check - Checks for daily surface air maximum and minimum temperatures that exceed the respective 15-day climatological means by at least six standard deviations.
  • Percentile-based climatological outlier check - Checks for daily precipitation totals that exceed the respective 29-day climatological 95th percentiles by at least a certain factor (9 when the day's mean temperature is above freezing, 5 when it is below freezing).
  • Internal temperature consistency check - Checks for consistency among maximum, minimum, and time of observation temperature within a three-day window. This check applies to air, evaporation pan, and soil temperatures.
  • Temporal consistency check (spike or dip) - Checks whether a daily maximum (minimum) temperature exceeds the maximum (minimum) temperatures on the preceding and following days by more than 25°C.
  • Lagged temperature range check - Identifies maximum temperatures that are at least 40°C warmer than the minimum temperatures on the preceding, current, and following days as well as minimum temperatures that are at least 40°C colder than the maximum temperatures within the three-day window.
  • Consistency check between evaporation pan temperatures and surface air temperatures (flags pan temperature only) - Checks for inconsistencies between:
    • Maximum surface air temperature and minimum evaporation pan temperature;
    • Maximum evaporation pan temperature and minimum surface air temperature;
    • Maximum evaporation pan temperature and maximum surface air temperature plus 10°C;
    • Minimum evaporation pan temperature and minimum surface air temperature minus 10°C.
  • Snow-temperature consistency (warm) check - Checks for non-zero snowfall totals that occur when daily minimum temperatures at the same station are equal to or warmer than 7°C.
  • Snowfall to snow depth increase consistency check - Checks for days on which the increase in snow depth from the previous day to the current day exceeds the current+previous and current+following days' snowfall total by more than 25 mm.
  • Snowfall (or snow depth increase) to precipitation ratio check - Checks for cases in which snowfall (or snow depth increase) is excessively large compared to precipitation. If so, the current day's precipitation and snowfall (or snow depth increase) totals fail the check on the preceding, current, and following days.
  • Spatial consistency check (regression) - Checks for temperatures that differ greatly from a linear-regression-based estimate generated from neighboring values. A target temperature is flagged when the regression-based predicted value differs by more than 8°C from the observed value and the standardized residual of the predicted value exceeds four standard deviations on the target day.
  • Spatial consistency check (corroboration of anomalies) - Checks for temperatures whose anomalies differ by more than 10°C from the anomalies at neighboring stations on the preceding, current, and following days.
  • Spatial consistency check (corroboration of precipitation amounts and percentiles) - Checks for precipitation totals that differ significantly from totals (and percentiles) reported at neighboring stations on the preceding, current, and following days.
  • Spatial consistency check (snow to minimum temperatures) - Checks for snowfall or snow depth increases when all neighboring stations reported a minimum temperature greater than 7°C on the preceding, current, and following days.
  • Mega consistency check - Flags:
    • daily maximum surface air temperatures that are less than the lowest minimum surface air temperature for the respective station and calendar month;
    • daily minimum temperatures that are greater than the highest maximum temperature for the station and calendar month;
    • observation-time temperatures that are higher than the highest maximum temperature or lower than the lowest minimum temperature for the station and calendar month;
    • daily maximum evaporation pan temperatures that are less than the lowest minimum evaporation pan temperature for the respective station and calendar month, less than the lowest minimum surface air temperature for the respective station and calendar month, or more than 10°C above the highest surface air temperature for the respective station and calendar month;
    • daily maximum evaporation pan temperatures that are less than the lowest minimum temperature for the respective station and calendar month;
    • daily minimum evaporation pan temperatures that are greater than the highest maximum evaporation pan temperature, greater than the highest maximum surface air temperature, or 10°C below the lowest minimum surface air temperature for the station and calendar month;
    • daily maximum soil temperatures that are less than the lowest minimum soil temperature for the station, calendar month, ground cover, and depth;
    • daily minimum soil temperatures that are greater than the highest maximum soil temperature for the station, calendar month, ground cover, and depth;
    • nonzero snowfall and snow depth values for stations in calendar months whose lowest reported minimum temperature is 7°C or warmer (the check is applied only if there are at least 140 daily minimum temperatures for the station and calendar month);
    • warm season non-zero snowfall totals at stations where no valid cold season snowfall was ever reported;
    • warm season non-zero snow depths at stations where no valid cold season snow depth was ever reported. (The definition of warm season is May-September in the Northern Hemisphere and October-April in the Southern Hemisphere. The remaining months of the year comprise the cold season).
  • Date-based climatological outlier check for snowfall and snow depth - Flags snowfall and snow depth values that fall outside their respective plausible seasons, as determined from observations at the station and at neighboring stations within 1° latitude of the station. This check is designed to remove non-zero observations in locations and seasons where snow is not plausible but that are not flagged by any other check. Note that this check has a higher false-positive rate (50% for snowfall and 75% for snow depth) than the GHCN-Daily standard of less than 20%. The intent is to improve this check in the future.
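Four of the simpler temperature tests in the list above (streak, gap, z-score, and spike or dip) can be sketched as follows. The thresholds are taken from the list; everything else is a simplification made for illustration. In particular, the gap check here searches outward from the middle of the sorted values, and the z-score check treats "exceed the mean" as an absolute departure computed from a plain sample of climatological values. Neither detail is specified in the text, and the operational procedures are described in Durre et al. (2010).

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence, Set


def streak_flags(values: Sequence[Optional[float]], min_run: int = 20) -> Set[int]:
    """Indices in runs of min_run or more identical consecutive non-missing values."""
    flagged: Set[int] = set()
    run: List[int] = []
    for i, v in enumerate(values):
        if v is None:
            continue  # missing values do not interrupt a streak
        if run and v == values[run[-1]]:
            run.append(i)
        else:
            if len(run) >= min_run:
                flagged.update(run)
            run = [i]
    if len(run) >= min_run:
        flagged.update(run)
    return flagged


def gap_flags(values: Sequence[float], min_gap: float = 10.0) -> List[float]:
    """Values separated from the bulk of the distribution by min_gap or more."""
    s = sorted(values)
    mid = len(s) // 2
    high_cut = low_cut = None
    for i in range(mid, len(s) - 1):      # search upward from the middle
        if s[i + 1] - s[i] >= min_gap:
            high_cut = s[i]
            break
    for i in range(mid, 0, -1):           # search downward from the middle
        if s[i] - s[i - 1] >= min_gap:
            low_cut = s[i]
            break
    return [v for v in values
            if (high_cut is not None and v > high_cut)
            or (low_cut is not None and v < low_cut)]


def zscore_flag(value: float, clim_values: Sequence[float], threshold: float = 6.0) -> bool:
    """True if value departs from the climatological mean by >= threshold std devs."""
    sd = stdev(clim_values)
    return sd > 0 and abs(value - mean(clim_values)) >= threshold * sd


def spike_dip_flag(prev: float, cur: float, nxt: float, threshold: float = 25.0) -> bool:
    """True if cur is more than threshold above (or below) both neighbouring days."""
    spike = cur - prev > threshold and cur - nxt > threshold
    dip = prev - cur > threshold and nxt - cur > threshold
    return spike or dip
```

Each function takes its threshold as a parameter, so the same sketch can be reused with, for example, `min_run=10` for snowfall streaks or `min_gap=300.0` for precipitation gaps in millimetres. The non-zero filtering that the streak check applies to precipitation and snow is not included here.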

Bias Adjustment

Unlike GHCN-Monthly, GHCN-Daily does not contain adjustments for biases resulting from historical changes in instrumentation and observing practices. It should be noted that, historically and in general, the stations providing daily summaries for the dataset were not designed to meet all of the desired standards for climate monitoring. Rather, the stations were deployed to meet the demands of agriculture, hydrology, weather forecasting, aviation, etc. Because GHCN-Daily has not been homogenized to account for artifacts associated with the various eras in reporting practice at any particular station (i.e., for changes in systematic bias), users should consider whether the potential for changes in systematic bias might be important to their application. In addition, prior to the release of GHCN-Monthly version 4, GHCN-Daily and GHCN-Monthly are not internally consistent (i.e., GHCN-Monthly is not necessarily derived from the data in GHCN-Daily).