Below is a summary of the procedures and requirements that apply to archiving projects. Please read these guidelines prior to engaging the data center with an archive request. Note that other requirements specific to the program or supporting data center may apply.
Several factors influence the project schedule: the size and complexity of the data, the level of support requested, the data center's familiarity with the data provider and providing system, compliance with project requirements, and other existing workloads. Potential data providers should contact the data center early in the planning phase to ensure adequate time for archive planning and preparation. End-to-end management of the data for archiving should be planned at the start of research and considered throughout the project lifecycle. Including archiving in the project management plan is a critical step for ensuring adequate preparation for long-term data center support. Coordinating with the data center during data development also allows opportunities for feedback to improve the data and metadata for archiving.
Selection of the appropriate data center should be based on the type of data being supported. More information is available on each of the three NOAA data centers: NCDC, NGDC and NODC. The receiving data center can forward a request to another center as deemed appropriate.
In general, there are two gates to pass through in sequence to get data into the archive. These are 1) the archive appraisal and approval step, and 2) finalizing the data submission agreement. The data provider assists with these activities by providing the necessary information and by reviewing the archive-generated documents. Submitting an archive request to a data center initiates the archive appraisal and the involvement of the data center. (ATRAC provides an archive request form for registered projects under Edit Projects.) The data center uses the information provided in the request and in follow-on conversations to assess the archive value, feasibility and costs. A data center decision on supporting the data is based on an informed recommendation and is documented in a formal approval or disapproval communication from the archive.
For approved archive projects, the data provider and the data center negotiate the details of the data model and the transfer logistics in a data submission agreement. The submission agreement is a charter for both the provider and the data center on how the project will proceed during the operational transfer and through the end of archive support. A finalized submission agreement acts as the second and final gate to the archive. Even with a finalized submission agreement, providers are expected to maintain communications with the data center through the data submission and, if possible, through the life of the data archive.
Documentation and Metadata
The data center requires sufficient information to read, understand and characterize the data in accordance with documentation standards. This documentation is crucial for the ability to use and understand the data independent of external assistance. Data documentation, at a minimum, should allow users to:
- read the encoded data format
- understand the data lineage
- characterize the data quality
- uniquely identify the data
- and trust the data integrity
Metadata on data quality and lineage must be included with the data files. Information on the data lineage back to the source observational data is required for users who want to understand how the data were produced and what inputs were used.
Providers must also supply information sufficient for producing ISO standard metadata for the data collection(s). Data center representatives can provide guidance for this effort, and the metadata form in ATRAC can be used to collect and initiate this type of metadata. Additional requirements that vary by project may include documentation on the algorithms, production source code (if provided), and encoding format for any non-standard file formats.
File Format Standards
Open file formats maintained by a standards organization are strongly recommended as opposed to proprietary or product-specific formats. Self-describing formats that describe the data structure and fields, like netCDF and XML, include valuable metadata and are better suited for long-term information preservation. PDF or PDF for archiving (PDF/A) are most suitable for documents. Variables, attributes, and units should follow standard naming conventions, such as the Climate and Forecast (CF) conventions and the Dataset Discovery conventions.
A common file naming convention is required for all project files. File name fields for file identification should include:
- data type identifier
- data version identifier
- unique date/time stamp
- appropriate file format extension
- other applicable fields such as data source
All file name fields except for the file extension should be delimited by underscores '_'. The order of the fields in the file name should begin with the most static and end with the most dynamic. For example, a file name may begin with the less changing data type identifier field and end with the more changing time stamp field.
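The naming rules above can be sketched as a small helper. This is only an illustration, not a data center tool: the function name `build_file_name`, the example field values, the placement of the optional data source field, and the compact UTC time stamp format are all assumptions; the actual fields and their order are settled with the data center.

```python
from datetime import datetime, timezone


def build_file_name(data_type, version, timestamp, extension, source=None):
    """Join file name fields with underscores, most static field first.

    Hypothetical helper: the field order (data type, optional source,
    version, time stamp) is one reading of the guideline, not a fixed rule.
    """
    fields = [data_type]
    if source:
        fields.append(source)
    fields.append(version)
    # Assumed time stamp format; the timestamp is expected to be in UTC already.
    fields.append(timestamp.strftime("%Y%m%dT%H%M%SZ"))

    for field in fields:
        # An underscore inside a field would make the delimiter ambiguous.
        if not field or "_" in field:
            raise ValueError(f"invalid file name field: {field!r}")

    # The extension is the only field not delimited by an underscore.
    return "_".join(fields) + "." + extension.lstrip(".")


example_name = build_file_name(
    "sst", "v01r00",
    datetime(2012, 3, 15, 6, 0, 0, tzinfo=timezone.utc),
    "nc", source="buoy",
)
print(example_name)  # sst_buoy_v01r00_20120315T060000Z.nc
```

Rejecting underscores inside individual fields keeps the name unambiguous to split later, which is the main benefit of a single delimiter.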
Data files may be aggregated and compressed in archive files for storage depending on the number of files, data volume, and other factors. Sets of files should be organized by common characteristics, including data type, format and temporal coverage. Tar files used to store data cannot contain subdirectories or other tar files. A README describing the tar file contents should be included inside any tar file that holds multiple file types; it is not needed for tar files with homogeneous content. An inventory or sample of the expected files can help the data center assess the most appropriate file names and data organization. Data packaging is usually discussed during the negotiation of the submission agreement.
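The tar file constraints described above could be checked at packaging time with a sketch like the following, built on Python's standard `tarfile` module. The function name `package_flat_tar` is hypothetical, and treating the file name extension as the "file type" is an assumption made for illustration; actual packaging rules come from the submission agreement.

```python
import os
import tarfile


def package_flat_tar(tar_path, file_paths, readme_path=None):
    """Write file_paths (plus an optional README) into one flat tar file."""
    members = list(file_paths)

    # Assumption: "file type" is approximated by the file name extension.
    file_types = {os.path.splitext(p)[1].lower() for p in members}
    if len(file_types) > 1 and readme_path is None:
        raise ValueError("a README is needed when the tar file mixes file types")
    if readme_path is not None:
        members.append(readme_path)

    # Members are stored by base name only, so the tar has no subdirectories.
    names = [os.path.basename(p) for p in members]
    if len(set(names)) != len(names):
        raise ValueError("duplicate base names would collide in a flat tar file")

    for path in members:
        if not os.path.isfile(path):
            raise ValueError(f"not a regular file: {path}")
        if tarfile.is_tarfile(path):
            raise ValueError(f"nested tar file not allowed: {path}")

    # Mode "w" writes an uncompressed tar; "w:gz" would gzip-compress it.
    with tarfile.open(tar_path, "w") as tar:
        for path, name in zip(members, names):
            tar.add(path, arcname=name)
    return names
```

All checks run before the tar file is opened for writing, so a rejected set of inputs leaves no partial archive behind.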
An FTP pull or push is preferred for most data transfers to the data center. Other transfer protocols may be possible depending on the interface. Data providers are required to produce and deliver a 32-digit MD5 checksum value for each submitted file in a submission manifest to ensure the integrity of the data received by the data center. The format of the submitted checksums is discussed through the submission agreement.
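As an illustration of the checksum requirement, the sketch below computes the 32-character hexadecimal MD5 digest of each file with Python's standard `hashlib` module and writes one line per file. The function names and the manifest layout (digest, two spaces, base file name, similar to `md5sum` output) are assumptions; the real manifest format is whatever the submission agreement specifies.

```python
import hashlib
import os


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the 32-character hexadecimal MD5 digest of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(file_paths, manifest_path):
    """Write one '<md5>  <file name>' line per file (assumed layout)."""
    lines = [f"{md5_hex(p)}  {os.path.basename(p)}" for p in file_paths]
    with open(manifest_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return lines
```

The receiving side can recompute `md5_hex` for each delivered file and compare it with the manifest entry to confirm that nothing changed in transit.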
- ISO 19115, Geographic information - Metadata
- NetCDF Attribute Convention for Dataset Discovery
- NetCDF (network Common Data Form)
- NetCDF CF Metadata Conventions
- NOAA Procedural Directives
- Producer-Archive Interface Methodology Abstract Standard (PAIMAS), CCSDS 651.0-M-1
- Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-B-1