NorStore - Norwegian Storage Infrastructure
Mission statement
The objective of the project is to establish and maintain a national infrastructure for the curation of digital scientific data. The infrastructure will provide services for easy and secure access to distributed storage resources, facilitate the creation and use of digital scientific repositories, provide large aggregate capacities for storage and data transfer, and optimize the utilization of the overall storage capacity.
Long-term objectives include:
- to operate a reliable national infrastructure for storage of digital data that is available for Norwegian research
- to deploy and enable the development of services for data curation that add value to the existing e-Infrastructure
- to provide capacity and services for the long-term storage of digital data
- to facilitate the establishment of digital scientific repositories in a broad range of scientific and technological applications
- to enable the Norwegian research community to (automatically) benefit from the advances in storage technologies
- to contribute to the unification of interfaces to storage resources within Norway and abroad.
The sizes of digital data sets in science have increased considerably in recent years, especially in natural sciences such as earth sciences and physics. In certain areas, high-resolution data is collected from real-time instruments (e.g., sensors) and large, complex distributed databases are used. In other cases, large quantities of data are generated during long computer simulations and visualization. In many of these cases, data cannot easily be regenerated and must be stored (archived) over longer periods of time. There is a clear need for a national storage infrastructure that develops strategies and policies for coordinated distributed data management, repositories, and related services.
The survival of digital (scientific) information depends on a hierarchy of constantly shifting technologies: hardware, storage media, operating systems, application software and middleware. It also relies on tacit knowledge that is external to the data. A national infrastructure for digital data must be able to efficiently handle different types of data that originate from different sources, as well as the different ways in which data is stored, retrieved and manipulated. There is a need for temporary storage (for data that is stored locally or only for a short period), permanent storage (for data that is used repeatedly or data that cannot easily be regenerated) and long-term storage (for data that is accessed infrequently, e.g., from completed projects). Each of these types of storage has different performance requirements. Other factors that must be taken into account are the data set sizes and their composition (e.g., granularity), the complexity of data sets, data formats, the value and expiration of data, and data access patterns.
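The three storage types described above can be illustrated with a minimal sketch. This is not part of the NorStore specification. The names `StorageTier`, `DataSet` and `choose_tier`, the two boolean flags, and the order in which the rules are checked are all assumptions made for illustration. Other factors mentioned above, such as data set size, format and value, are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class StorageTier(Enum):
    """The three storage types named in the text (hypothetical encoding)."""
    TEMPORARY = "temporary"
    PERMANENT = "permanent"
    LONG_TERM = "long-term"


@dataclass
class DataSet:
    """Illustrative description of a data set; the fields are assumptions."""
    name: str
    short_lived: bool            # stored locally or only for a short period
    accessed_infrequently: bool  # e.g., data from a completed project


def choose_tier(ds: DataSet) -> StorageTier:
    """Pick a storage type; the precedence of the checks is an assumption."""
    if ds.short_lived:
        return StorageTier.TEMPORARY
    if ds.accessed_infrequently:
        return StorageTier.LONG_TERM
    # Remaining data is used repeatedly and is kept in permanent storage.
    return StorageTier.PERMANENT


if __name__ == "__main__":
    examples = [
        DataSet("simulation scratch files", True, False),
        DataSet("active sensor archive", False, False),
        DataSet("completed project results", False, True),
    ]
    for ds in examples:
        print(f"{ds.name}: {choose_tier(ds).value}")
```

A real policy would probably weigh several of the listed factors together rather than apply a fixed sequence of yes/no checks.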
The project will establish an infrastructure in which data sets are transparently available across the national e-Infrastructure. In addition, the aim is to decouple (permanent and long-term) storage from other resources in the e-Infrastructure (e.g., computers) so that resources can be removed without interrupting access to relevant data. A consequence of introducing such mechanisms is that data that is needed (or generated) on a specific resource may be stored remotely. The mechanisms must be advanced enough that data is transferred reliably throughout the national infrastructure, and that data locality, data replication, and (network) latency hiding are handled properly.
The national infrastructure will enhance data reuse by facilitating and promoting data repositories for geographically dispersed research groups that need to share data sets and databases. Eventually, it will also enhance data reusability by providing a variety of services for data curation, retrieval, (re)location, publishing, formatting, replication, etc.
The project will base its activity primarily on demonstrated user needs.
In 2007, the project aims to address important issues such as the initial specification of the infrastructure, the choice of technologies, international developments, and the appropriate level of curation, thereby accumulating experience. The main activity in 2007 will be the investment in hardware to establish the initial infrastructure. The investments (and specification) must be such that the initial infrastructure can be expanded and upgraded in a cost-efficient manner in the coming years.
The establishment of the infrastructure must be in line with international developments and standards in the area of storage services and data management. The project will involve international competence in the activity. This includes, for example, the competence available in the other Nordic countries and the Nordic Data Grid Facility (NDGF).
