Ekaterina Poymanova, Institute of Information Systems and Information Security Saint-Petersburg State University of Aerospace Instrumentation, St. Petersburg, Russia, e. firstname.lastname@example.org
Tatyana Tatarnikova, Institute of Information Systems and Geotechnology Russian State Hydrometeorological University, St. Petersburg, Russia, email@example.com
Abstract — The article discusses data storage as a separate structure with certain properties and characteristics, as well as a data storage system as a storage management system.
Keywords: data storage system, data storage, storage management system, control algorithms
© The Authors, published by CULTURAL-EDUCATIONAL CENTER, LLC, 2020
This work is licensed under Attribution-NonCommercial 4.0 International
The current level of information technology development, together with the rapid growth in the amount of information, sets new conditions for the data storage problem. On the one hand, there is a wide range of storage media and data storage systems, and users have different storage requirements. On the other hand, the amount of heterogeneous information to be stored is growing rapidly. Moreover, the storage process does not take into account how the characteristics of the stored data correlate with modern storage technologies.
Existing virtualization technologies make it possible to implement various architectural solutions without reference to specific equipment. The data storage system can therefore be considered as a hierarchical structure containing a physical data storage and data access mechanisms, independent of any particular architectural implementation [2, 3].
The article considers the data storage as a separate structure with certain properties and characteristics. Data management and virtualization are based on working with metadata, starting with the physical address of the file on the data media. The physical data storage can contain various types of data media, and the distribution of files among them should depend on certain characteristics of the data.
The article also discusses the issue of forecasting the extension of storage capacity, which allows system administrators to increase capacity in a timely manner.
II. Literature Review and Research Methods
A. The Data Storage System Hierarchical Structure
Given the growing volume of information stored, the necessity of reducing the cost of storing a data bit is obvious. This can be achieved not only by increasing storage time and increasing data density, but also by efficiently utilizing available physical storage resources, such as capacity. It is also advisable to take into account the need for timely extension of storage capacity to ensure data integrity.
Definition of used concepts:
• Data Storage or Physical Data Storage — hardware, which is a set of data media.
• Capacity — the amount of memory that a data file occupies in a data storage system.
Taking into consideration the existing virtualization technologies, the storage system is presented as a hierarchical structure containing a physical data storage (level 1), data access tools such as a Database Management System (DBMS) (level 2), and mechanisms that implement the user’s work with data (level 3). This structure is shown in Fig. 1.
Figure 1. The data storage system hierarchical structure.
As can be seen, the data storage itself (level 1) can include a variety of data media, such as a Redundant Array of Independent Disks (RAID), a streamer, etc., and therefore stores only the corresponding data types.
B. Analysis of Storage Intervals of Information and Available Media
Each medium has its own service life, and different types of information have their own requirements for storage time. Table 1 shows the information storage intervals, which were selected based on manufacturers’ guarantees and are expressed in decimal orders of time. These intervals correspond to the following types of stored data: initial data, backups and archived data, and data for transfer to next generations.
Table 1. The Correspondence of Periods, Data Types, and Storage Media Types
Information storage interval, years | Data type | Used storage media
 | Used for real-time processing. Minimum access time required. | RAID arrays, solid state drive
 | Backups and archived data. It is required to ensure fast data recovery in case of their loss; however, the access time can be significantly longer than the access time for the source data. | Magnetic tape, automated libraries based on optical disks.
 | Data for transfer to next generations | Promising technologies: glass disk, tungsten disk, records in bacterial DNA, etc.
The physical data storage was formalized as a matrix, where each cell is a data media volume or a separate data medium of the corresponding storage tier. Each cell is designed to store files with certain metric characteristics, which are taken from the metadata (Fig. 2).
Figure 2. Matrix storage.
C. Metadata Classification
A distinctive feature of digital storage is, firstly, the need for special equipment to perform data compression, encoding, writing and reading operations, and secondly, the use of information about the stored data, that is, metadata. Metadata can be divided into organizational and specialized.
Definition of used concepts:
• Metadata — a set of additional information necessary for the search and decryption (interpretation) of source information.
• Organizational Metadata — the information about the location of data in the repository, time, attachment to the user, location on the medium, which is formed during storage and is used to find the necessary data.
• Specialized Metadata — the information needed by a file or operating system to search for data on a medium and convert it into a form that is understandable to human perception.
Metadata is generated during the recording and storage of data, and the encapsulation process takes place — the data is “overgrown” with additional information as it goes through the storage stages.
At the first stage, it is assumed that the user has information in analog form. When a message is formed, the meaning is converted into symbols, and when it is written, the symbols are converted into data bits (compression, encoding of information and physical recording on the medium occur). Thus, analog information is converted to digital form. When reading and interpreting, the data is converted back to meaning. At each stage, metadata is generated, which together must identify the stored data by the meaning of the information it contains, the location of the medium itself, and the location of the data on this medium.
Metadata can be classified according to the following criteria:
according to functional characteristics:
• primary — are declared by the user;
• address — contain the address of the file on the medium and the address of the medium in the storage;
• specialized — contain characteristics of data files and characteristics of the information carrier;
according to necessity of the formation:
• obligatory — necessary for identification and reading of information;
• variable — declared by the user;
according to the place of storage:
• stored on media along with data;
• stored in the metadata database.
D. Characteristics of Data During Recording and Storage
Based on the analysis, the following characteristics of the stored data were identified as determining the choice of storage technology: guaranteed storage time, size of the logical data block, file size, and frequency of access to the file.
The guaranteed storage time of the data depends on the type of media. Therefore, if the guaranteed storage time of the medium is shorter than the required storage time of the data, the data storage system must provide for automatic re-recording of the data.
The size of the logical data block determines the number of blocks in a file and, accordingly, the number of accesses to the storage system when recording, reading, re-recording, and backing up, as well as the amount of metadata; all of this affects the consumption of storage resources.
File size is a data characteristic that largely determines the speed of the file system. The file size varies depending on the format (text, graphics, multimedia).
The speed of the file system also depends on the selected size of the logical block: the smaller the logical block, the lower the speed of reading the file, since more time is spent on finding the logical blocks. At the same time, large logical blocks waste disk space because the last block of a file is underfilled; as a result, the same data forms files of different sizes on disk (Fig. 3).
Figure 3. Comparing files with the same amount of data and logical blocks of different sizes.
Obviously, when organizing storage and access to large files, it is advisable to use file systems with a large logical data block size and vice versa — for small files, for example, text files, it is advisable to use file systems with a small logical block size.
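The trade-off described above can be illustrated with a short calculation. This is a minimal sketch, not part of the original study: the function name `allocated_size`, the 10,000-byte file and the three block sizes are hypothetical values chosen only for illustration.

```python
def allocated_size(file_size: int, block_size: int) -> int:
    """Bytes occupied on disk when a file is stored in whole logical blocks."""
    blocks = -(-file_size // block_size)  # ceiling division without floats
    return blocks * block_size


file_size = 10_000  # bytes of actual data (hypothetical example file)
for block_size in (512, 4096, 65536):  # hypothetical logical block sizes
    occupied = allocated_size(file_size, block_size)
    blocks = occupied // block_size
    wasted = occupied - file_size
    print(f"block={block_size:>6} B  blocks={blocks:>3}  "
          f"occupied={occupied:>6} B  wasted={wasted:>6} B")
```

With small blocks the file needs many blocks but wastes little space; with large blocks it needs a single block but most of that block stays empty.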
The frequency of data access is one of the defining characteristics in a storage system. Data that is accessed often requires media with minimal access time. For data that is rarely accessed (for example, archive data), there is no such requirement for the medium.
E. Physical Data Storage Management System
Consider the data storage system as a management system for a physical data storage (Fig. 4).
Figure 4. Physical data storage management system.
The purpose of the management system is the efficient utilization of available storage capacity, as well as its timely extension.
The management object is a data storage.
The control mechanism is a set of algorithms for managing the data storage and forecasting its extension.
The tasks of the management system are to distribute files in the storage so that the available capacity is used with maximum efficiency, and to predict capacity consumption so that the storage can be extended in time.
A storage state is information about the state of a physical data storage, which is necessary for adjusting control algorithms.
The external environment is the data storage system, from which resources are transferred to the control object (the physical data storage); the output of the control object is the physical addresses of the stored data, which are transmitted to the DBMS.
The resources transferred to the control object are divided into physical and informational.
Physical resources (storage capacity) are characterized by the type of record, the time of guaranteed data storage and the file system used.
Information resources, that is, files for recording and storage, are characterized by:
t — the required storage time, in years;
S — logical data block size, in bytes;
f — file size, in bytes;
λ — frequency of access to files, requests per hour.
The criterion for the effectiveness of the storage management system is a coefficient that relates Vfiles to Vdata, where Vfiles is the capacity occupied by files (taking into account empty sectors of the disk) and Vdata is the capacity physically occupied by data.
Capacity management of a physical data storage is based on algorithms for placing and migrating data across tiers of the storage hierarchy.
Algorithms, in turn, are based on the following mechanisms:
• vertical allocation mechanism: the choice of storage tier depending on the storage time;
• horizontal allocation mechanism: the choice of a file system with a certain logical data block size (for the RAID level) or of the type of archive data media;
• dynamic allocation mechanism: the migration of data between tiers depending on the frequency of access to them.
The state of the storage must also be monitored so that it can be extended in time. To this end, a capacity extension forecast must be built and periodically adjusted.
F. File Allocation and Migration Mechanisms
The characteristics considered are the estimated file storage time, file size, and frequency of access to files.
At the first stage, the data storage tier is chosen. This distribution can be based on the analysis of organizational metadata containing information about the data type. In this case, the data type corresponds to the estimated storage time.
For example, the type can be specified in the file name, separated by dots, immediately before the extension (poymanova.bck.txt). Thus, when saving, the file attribute (F) must be specified: ind (initial data), bck (backups) or ngd (next generation data).
In general, three levels of data storage are proposed: RAID for ind data, automated libraries for bck files, and long-term storage media for ngd files.
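The vertical allocation step can be sketched as follows. This is only an illustration under stated assumptions: the function name `storage_tier`, the dictionary `TIER_BY_ATTRIBUTE` and the decision to reject files without a recognized attribute are hypothetical; the attribute values and the tier they map to are taken from the text above.

```python
from pathlib import PurePath

# Attribute-to-tier mapping as described in the text.
TIER_BY_ATTRIBUTE = {
    "ind": "RAID",
    "bck": "automated library",
    "ngd": "long-term storage media",
}


def storage_tier(file_name: str) -> str:
    """Return the storage tier for a name such as 'poymanova.bck.txt'."""
    suffixes = PurePath(file_name).suffixes  # e.g. ['.bck', '.txt']
    if len(suffixes) < 2:
        raise ValueError(f"no file attribute before the extension: {file_name!r}")
    attribute = suffixes[-2].lstrip(".")
    if attribute not in TIER_BY_ATTRIBUTE:
        raise ValueError(f"unknown file attribute {attribute!r} in {file_name!r}")
    return TIER_BY_ATTRIBUTE[attribute]


print(storage_tier("poymanova.bck.txt"))
```

Rejecting files without an attribute is one possible policy; a real system might instead fall back to a default tier.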
The second stage is the horizontal allocation of files. It is proposed to analyze metadata containing information about the size of the stored files and place them on various media or media volumes within each level.
At the RAID level, it is proposed to place files in different RAID volumes; each volume should have its own file system with a certain size of the logical data block. This avoids underfilled logical data blocks on disks.
At the lower levels of the data storage, it is proposed to divide the capacity by type of media, for example, a tape drive, DVD, BD for the level of automated libraries and an M-disk, a glass disk and, in the future, DNA — for long-term storage media.
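For the RAID level, the horizontal allocation step amounts to choosing a volume whose logical block size suits the file size. The sketch below assumes three hypothetical volumes; the size boundaries in `SIZE_BOUNDS` and the block sizes in `BLOCK_SIZES` are illustrative values, not figures from the study.

```python
import bisect

KIB = 1024
MIB = 1024 * KIB

# Hypothetical boundaries between "small", "medium" and "large" files, in bytes.
SIZE_BOUNDS = [64 * KIB, 16 * MIB]
# Hypothetical logical block size of the RAID volume serving each size class.
BLOCK_SIZES = [4 * KIB, 64 * KIB, 1 * MIB]


def raid_volume_block_size(file_size: int) -> int:
    """Pick the logical block size of the volume that should hold the file."""
    size_class = bisect.bisect_right(SIZE_BOUNDS, file_size)
    return BLOCK_SIZES[size_class]


for size in (2_000, 5 * MIB, 700 * MIB):
    print(size, "->", raid_volume_block_size(size))
```

Small files go to the volume with small blocks, which limits underfilled blocks, while large files go to the volume with large blocks, which limits the number of block lookups.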
After the initial distribution of files when recording to the data storage, it is proposed to migrate them between the tiers based on statistics on the frequency of access to the files. Such migration saves resources at the various tiers of the physical data storage.
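A migration rule of this kind can be sketched with two thresholds on the access frequency. The thresholds `promote_above` and `demote_below` (in requests per hour), the function name `migrate` and the one-tier-at-a-time policy are assumptions made for this illustration.

```python
# Tiers ordered from the fastest medium to the slowest, as in the text.
TIERS = ["RAID", "automated library", "long-term storage media"]


def migrate(tier_index: int, requests_per_hour: float,
            promote_above: float, demote_below: float) -> int:
    """Return the tier index a file should occupy after one migration pass."""
    if requests_per_hour > promote_above and tier_index > 0:
        return tier_index - 1  # frequently accessed: move to a faster tier
    if requests_per_hour < demote_below and tier_index < len(TIERS) - 1:
        return tier_index + 1  # rarely accessed: move to a slower tier
    return tier_index          # access frequency is within bounds


# A backup that suddenly receives many requests is moved up to RAID.
print(TIERS[migrate(1, requests_per_hour=50.0, promote_above=10.0, demote_below=0.1)])
```

Moving a file by one tier per pass keeps the rule simple; the thresholds would in practice be tuned from the observed access statistics.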
Fig. 5 presents a general algorithm that combines all of the above mechanisms.
Figure 5. Allocation and migration algorithm.
G. Forecasting Mechanism
A forecast model for the differential storage capacity extension is proposed. This model is based on the analysis of states of the physical data storage and represents a behavior pattern of storage systems based on the forecast of the behavior of each cell of the storage matrix.
When building a model, “zero state” and “current state” storage patterns are built first.
A “zero state” pattern is a data storage structure template that sets limits on the values of the characteristics of its cells.
The “current state” pattern determines the current state of the data storage. Metadata is involved in the construction of the “current state” pattern, indicating the size of the files stored in the data storage system and the estimated time of storage.
Each matrix cell has the following characteristics:
• maximum capacity value (Vmax), which corresponds to the capacity of the medium used (medium volume);
• limit value of capacity Vlim (Vlim<Vmax);
• current capacity of each cell Vcurrent.
In addition, boundary values of the file access frequency (λ) must be set; when these are crossed, the files are migrated between the tiers of the system (Fig. 6).
Figure 6. “Current State” pattern.
Since data accumulates in the storage system, it is necessary to track the accumulated data over time and determine the points at which the limit values of the storage matrix cell capacities are exceeded (Fig. 7).
Figure 7. The points of overcoming the limit values of the storage matrix cells capacity.
The limit and maximum capacity values are presented in equations 1 and 2.
where tlim mn — time to reach limited cell capacity mn;
tmax mn — time to reach maximum cell capacity mn;
f(t) — incoming data function;
T — partition step equal to the unit of the minimum selected time scale.
The forecasting task is to find the timeline point at which the limited capacity and the maximum capacity of each cell are reached.
Analytically, the values of tlim and tmax are difficult to calculate, because the function f(t) does not have a closed-form antiderivative. Therefore, it is proposed to calculate these parameters programmatically by the substitution method.
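One way to read the substitution method is to accumulate the incoming data step by step until a capacity threshold is reached. The sketch below makes several assumptions that are not stated in the text: the accumulated volume is approximated by the sum of f(kT)·T starting from an empty cell, the function name `time_to_reach` is hypothetical, and the example inflow function and capacities are invented for illustration.

```python
from typing import Callable, Optional


def time_to_reach(f: Callable[[float], float], capacity: float,
                  step: float = 1.0, max_steps: int = 100_000) -> Optional[float]:
    """First time k*step at which the accumulated inflow reaches `capacity`.

    The accumulated volume is approximated by summing f(k*step) * step
    (an assumed discretization with partition step T = step).
    """
    accumulated = 0.0
    for k in range(1, max_steps + 1):
        accumulated += f(k * step) * step
        if accumulated >= capacity:
            return k * step
    return None  # threshold not reached within the search horizon


def inflow(t: float) -> float:
    """Hypothetical incoming data function: 2*t units of data per time unit."""
    return 2.0 * t


v_lim, v_max = 12.0, 20.0  # hypothetical limit and maximum cell capacities
print("t_lim =", time_to_reach(inflow, v_lim))
print("t_max =", time_to_reach(inflow, v_max))
```

Running the same search once with Vlim and once with Vmax gives the two points on the timeline that the forecasting task asks for.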
An important objective of the research was also the choice of a forecasting model.
The incoming data stream to the storage system has a number of features. Firstly, it is heterogeneous and incorporates data of various types: multimedia, text, speech, etc.
In addition, the incoming stream exhibits pulsations at different time scales. These pulsations result from the uneven activity of users of a network or data storage system, associated with working hours, weekends, holidays, vacation periods or other events.
When these pulsations form a fractal (self-similar) structure, it is possible to make a long-term forecast of the volume of the incoming data stream at different time scales. This feature makes it possible to plan a differential storage capacity extension.
Self-similarity properties are:
• slowly decaying variance;
• long-range dependence;
• the presence of a distribution with heavy “tails” [10, 11].
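The first of these properties can be checked numerically with the aggregated variance method, a standard estimator of the Hurst exponent H. It is shown here as one possible check, not as the procedure used by the authors. The series is split into blocks of size m, and the variance of the block means is expected to scale as m^(2H-2); values of H noticeably above 0.5 indicate long-range dependence. The function name and the test series are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence


def hurst_aggregated_variance(series: Sequence[float],
                              block_sizes: Sequence[int]) -> float:
    """Estimate the Hurst exponent from the decay of aggregated variance."""
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        block_means = [statistics.fmean(series[i * m:(i + 1) * m])
                       for i in range(n_blocks)]
        log_m.append(math.log(m))
        log_var.append(math.log(statistics.pvariance(block_means)))
    # Least-squares slope of log(variance) against log(block size).
    mean_x = statistics.fmean(log_m)
    mean_y = statistics.fmean(log_var)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(log_m, log_var))
             / sum((x - mean_x) ** 2 for x in log_m))
    return 1.0 + slope / 2.0  # slope = 2H - 2


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]  # uncorrelated test series
h = hurst_aggregated_variance(noise, [1, 2, 4, 8, 16, 32, 64])
print(f"Estimated H for uncorrelated noise: {h:.2f}")
```

For uncorrelated noise the estimate should lie near 0.5, so a measured incoming stream with a clearly larger estimate would be a candidate for the self-similar case.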
Depending on the properties of the incoming data stream to the storage system, two forecasting models were chosen:
• general linear model for the case of an incoming data stream without a self-similar structure;
• autoregressive integrated moving average (ARIMA) model for a self-similar incoming data stream.
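For the case without self-similarity, the simplest instance of a linear model is a straight-line trend fitted to the occupied capacity over time. The sketch below uses ordinary least squares written with the standard library only; the function names and the sample observations are hypothetical, and the ARIMA case is not covered here because it would require a dedicated time series library.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], volumes: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of volume as a function of time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(volumes) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, volumes))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_capacity(slope: float, intercept: float, capacity: float) -> Optional[float]:
    """Time at which the fitted trend reaches `capacity`, or None if it never does."""
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


# Hypothetical observations: occupied capacity (GB) of one cell at months 0..3.
months = [0, 1, 2, 3]
occupied_gb = [100, 150, 200, 250]
slope, intercept = fit_line(months, occupied_gb)
print("months until a limit of 800 GB:", time_to_capacity(slope, intercept, 800))
```

Applying `time_to_capacity` with Vlim and then with Vmax gives the two forecast points for a cell under the linear assumption.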
To implement the proposed mechanisms for allocation and migration, a programmable logic controller is required that works according to the proposed algorithms and automatically distributes and migrates data inside the physical data storage.
This controller is directly connected to the physical data storage and is independent of the logical database structure and the DBMS used (Fig. 8).
To implement the forecast mechanism, an application was developed. This application helps automate the process of predicting the capacity extension of each cell of the storage matrix.
The application must perform the following operations:
• the formation and display on the screen of the “zero state” pattern of the data storage system in accordance with the set characteristics and restrictions;
• the formation and display on the screen of the “current state” pattern of the data storage system in accordance with the metadata about the files stored in the data storage;
• plotting and displaying on the screen graphs of the time series of the incoming data stream in accordance with the specified time scales;
• building a general linear model for forecasting storage capacity growth and calculating the time to reach the limit and maximum capacity when the incoming data stream does not have self-similarity properties;
• plotting and displaying on the screen a graph of the probability distribution function, calculation of self-similarity indicators;
• making correlation and autocorrelation models;
• construction of the ARIMA forecast model based on specified parameters, calculation of the time to reach the limit and maximum capacity based on the constructed model.
Figure 8. Storage structure with controller.
Figure 9. Screenshots of the application.
The discussion of the research results showed that the programmable logic controller must distribute files across the physical storage in real time.
To this end, various machine learning methods will be compared in future work for distributing the file stream to the storage in accordance with a given storage matrix. It is also planned to develop software that implements these functions.
In the future, it might be possible to combine all control mechanisms, including capacity extension forecasting, in a single hardware and software unit.
The results of the research showed that due to the increasing volume of stored data, it was necessary to look at the issue of data storage from a new perspective. It is no longer enough to simply increase the capacity of the physical storage; it is necessary to effectively use the available resources.
The proposed algorithms and mechanisms for placing and migrating files, as well as the forecast model of the differential capacity extension, when implemented using a programmable logic controller, will help to use the available resources of the data storage economically, which will in turn save the financial resources of organizations that carry out storage.
[1] Data Growth, Business Opportunities, and the IT Imperatives. Available at: https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
[2] The SNIA Shared Storage Model. Available at: https://www.snia.org/sites/default/files/SharedStorageModel_v2.pdf
[3] Information Storage and Management. 2nd Edition. New Jersey: John Wiley & Sons Inc., 2016, 544 p.
[4] Poymanova, E. D. and Tatarnikova, T. M. Tiered Data Storage Model, in Wave Electronics and its Application in Information and Telecommunication Systems (WECONF) 3–7 June 2019, DOI: 10.1109/WECONF.2019.8840589
[5] Sovetov, B. Ya., Tatarnikova, T. M., and Poymanova, E. D. Organization of multi-level data storage, in Informatsionno-upravliaiushchie sistemy [Information and Control Systems], 2019, no. 2, pp. 68–75 [In Russian], DOI: 10.31799/1684-8853-2019-2-68-75
[6] Sovetov, B. Ya., Vodyaho, A. I., Dubenetskij, V. A., and Tsehanovskij, V. V. The Architecture of Information Systems, Moscow: Publishing Center «Akademija», 2012, 288 p.
[7] Tatarnikova, T. M. and Poymanova, E. D. Algorithms for Placing Files in Tiered Storage Using Kohonen Map, in Selected Papers of the IV All-Russian scientific and practical conference with international participation «Information Systems and Technologies in Modeling and Control» (ISTMC’2019) Yalta, Crimea, May 21–23, 2019, pp. 193–202
[8] Tatarnikova, T. M. and Poymanova, E. D. Differentiated capacity extension method for system of data storage with multilevel structure, in Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2020, vol. 1, no. 1, pp. 66–73, DOI: 10.17586/2226-1494-2020-20-1-66-73
[9] Poymanova, E. D. and Tatarnikova, T. M. Models and Methods for Studying Network Traffic, in Wave Electronics and its Application in Information and Telecommunication Systems (WECONF) 26–30 Nov. 2018, DOI: 10.1109/WECONF.2018.8604470
[10] Zwart, A. P. Queueing Systems with Heavy Tails, Eindhoven University of Technology, 2001, 227 p.
[11] Kutuzov, O. and Tatarnikova, T. Evaluation and Comparison of Classical and Fractal Queuing Systems, XV International Symposium «Problems of Redundancy in Information and Control Systems», St. Petersburg, Russia, September 26–29, 2016
[12] Poymanova, E. D., Tatarnikova, T. M., and Yagotintseva, N. V. The Forecast Application for Capacity Extension of Data Storage Systems, Computer Registration Certificate RU 2019661945, 12.09.2019. Application for registration № 2019619010 22.07.2019