Computer Technology for Recovering Data Gaps in the Study of Socio-Economic Processes Using Online Surveys

Natalya Martyshenko, Department of International Marketing and Commerce, Vladivostok State University of Economics and Service, Vladivostok, Russia, natalya.martyshenko@vvsu.ru

Abstract: With the proliferation of online polling services, the number of researchers using survey data to explain and predict socio-economic processes has grown significantly. In addition to online surveys, other methods of extracting information from the network have appeared. Multivariate statistical methods are used to process such data, and researchers applying them often encounter data gaps. The paper proposes a method for restoring the values of signs (features) in multidimensional data obtained in the course of socio-economic studies. The sample is assumed to include features measured on a nominal or rank scale, that is, the set of possible feature values is limited. The method is based on the idea of pattern recognition. The training sample is the part of the sample that contains no gaps. The classifying feature is the feature to be recovered. The control sample is the part of the sample that has gaps in the classifying feature. The task is thus reduced to assigning the observations of the control sample to classes of the training sample. Classes are identified by comparing the observations of the control sample with class standards formed from the training sample. To increase recognition accuracy, weights can be calculated for the pairs of co-occurring sign values in the training sample. The presented approach makes it possible to estimate the accuracy of data recovery. The recovery quality estimates are based on a rolling exam (sliding examination).

Keywords: data gaps, pattern recognition, multidimensional sampling, rolling exam

© The Authors, published by CULTURAL-EDUCATIONAL CENTER, LLC, 2020

This work is licensed under Attribution-NonCommercial 4.0 International

I. Introduction

Most researchers who study social and economic processes encounter the problem of data gaps, or non-response, in “object-property” tables [1–2]. This problem is also called data incompleteness. Outliers (ejections) can often be treated as data gaps as well [4, 5]. Outliers are values that clearly contradict the data of the entire sample. Moreover, a value may contradict not only the other values of the same sign, but also the values of other signs within the same observation. In both cases the researcher faces a dilemma: whether to reject the data or to correct the error by some means (to restore the data). Some contradictions (errors) can be revealed and corrected at the preliminary stages of data analysis through a logical analysis of contradictions in multidimensional data. Examples of tools for such logical analysis are presented in [6, 7].

With a large number of signs under study, the number of gaps can be significant. Rejecting data is often undesirable because multidimensional data serve many tasks that use one-dimensional signs (for example, frequency series) or only some of the signs of the multidimensional observations. All signs are seldom used in a single task. In questionnaire data, for instance, the forms may include many questions that serve to classify the data (for example, the socio-demographic portrait of respondents) and are therefore used to select subsets of the data when solving particular tasks.

II. Literature Review and Research Methods

The variety of situations and reasons for the appearance of data gaps has given rise to many studies in this area. Extensive lists of such works can be found in the publications of both Russian and foreign scientists [8, 9]. The large number of methods has required a systematization of approaches and a classification of methods [10–12]. In [13], the classification of data restoration methods is based on the diagram shown in Fig. 1.

Figure 1. Classification of methods to restore gaps.

That work also sets out the basic principles of the most popular data restoration methods. Newly developed methods usually fit into the classification shown in the diagram.

The theory of data gap restoration is constantly developing: new algorithms appear and known ones are modernized [14, 15]. This is because there can be no universal algorithm that provides the best results in all situations. Many researchers, arguing for the advantage of one approach or method over another, demonstrate its merits on a concrete example. But examples are special cases and do not prove the overall superiority of one method over another. Despite the existence of numerous data restoration methods, the widely known statistical packages implement only the simplest algorithms, which in many instances lack the required accuracy. That is, data restoration today remains largely a research task, carried out by specialists who understand, to a greater or lesser degree, how the algorithms they use operate. Assessing the accuracy of the results obtained by restoration algorithms remains an open theoretical problem.

This work proposes a data restoration method that can be used when the best-known methods are not applicable. Most data restoration methods use signs measured on a ratio scale. Studies of social and economic processes, however, yield data measured on different scales. We propose an algorithm that makes it possible to work with signs of different types. Of course, the proposed algorithm does not always guarantee the required accuracy either. Its capabilities are always limited by the available data and their latent structure. The algorithm is accompanied by several procedures for assessing the accuracy of the results, which allow the researcher to decide whether the obtained result is acceptable.

III. Results. Algorithm Based on Development of the Classified Data Samples

The algorithm is based on the assumption that data gaps occur at random in the “object-property” table. This assumption is often abbreviated MCAR (missing completely at random). It is used in most known algorithms. In most cases the assumption is valid, and it can be checked by known statistical methods. The data table may contain data measured on different scales. The data table is given in the form of a sorted table (Fig. 2).

The table includes m + 1 columns. The first m columns X1, ..., Xm contain the values of the signs without gaps. These signs will be called restoration signs. The column Y contains the sign that has gaps. This sign will be called the restored sign. The first n0 rows contain observations without gaps. The following n1 rows have gaps in the sign Y. That is, n1 values of the sign Y must be restored.

The procedure works most effectively when restoring numerical signs, but with a sufficiently large amount of data (at least a thousand observations) one can attempt to restore data of other types. For simplicity, we will assume that the data are numerical. Let us examine the algorithm stage by stage.

First stage. All numerical values of the signs (including the sign Y) are transformed into rank values (the ranking operation). Signs measured on nominal and rank scales are not transformed. Nominal signs should have a small number of values (preferably fewer than 10). Otherwise, nominal signs must undergo preliminary processing that reduces them to a structured form. The procedures for processing qualitative data described in [16] are used for this purpose.

Figure 2. Formalized presentation of the data table.

The ranking procedure consists in partitioning the range of sign values into equal intervals and replacing the original values with ranks (the interval numbers). The number of intervals r should not be very large (5 is recommended); otherwise intervals containing no values may appear, which is undesirable (although the algorithm tolerates this situation). Signs already measured on a rank scale need not be ranked again; their existing system of ranks can be used. The ranked values will be denoted by the same letter with a prime (X’, Y’).
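The ranking operation can be illustrated with a minimal Python sketch. The function name `to_ranks`, the handling of a constant sign, and the example values are assumptions made here for illustration; the paper only specifies equal intervals and a recommended r = 5.

```python
def to_ranks(values, r=5):
    """Replace numeric values by the number (1..r) of the equal-width interval they fall into."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant sign is placed entirely in interval 1.
        return [1] * len(values)
    width = (hi - lo) / r
    ranks = []
    for v in values:
        idx = int((v - lo) / width) + 1   # intervals are closed on the left
        ranks.append(min(idx, r))         # the maximum value goes to the last interval
    return ranks


# Hypothetical numeric sign with values between 0 and 100.
x = [0, 10, 20, 35, 59, 60, 80, 99, 100]
x_ranked = to_ranks(x, r=5)
print(x_ranked)  # [1, 1, 2, 2, 3, 4, 5, 5, 5]
```

With skewed data, equal-width intervals can leave some intervals empty, which is the situation the text warns about when r is chosen too large.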

Next, the sampling (the data table) is divided into two parts, which are then considered independently. The first part, called the “exemplary sampling”, includes the first n0 rows of the data table. The second part, called the “control sampling”, includes the following n1 rows of the data table.
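A short sketch of this division is given below. It assumes that the sign Y is stored in the last column and that a gap is marked with `None`; both conventions, as well as the name `split_table` and the example table, are illustrative and are not taken from the paper.

```python
def split_table(rows):
    """Split rows into those with a known Y (last column) and those with a gap in Y."""
    exemplary = [row for row in rows if row[-1] is not None]
    control = [row for row in rows if row[-1] is None]
    return exemplary, control


# Hypothetical ranked table: two restoration signs and the sign Y' (None marks a gap).
table = [
    [1, 2, 1],
    [3, 3, None],
    [5, 4, 3],
    [2, 1, None],
    [4, 5, 3],
]
exemplary, control = split_table(table)
n0, n1 = len(exemplary), len(control)
print(n0, n1)  # 3 2
```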

Second stage. The “exemplary sampling” is sorted by the ranks of the sign Y’. Let the sign Y’ have k ranks (classes), and let s_t denote the number of observations in class t. The sorted sampling is shown in Fig. 3. The class sizes sum to the volume n0 of the “exemplary sampling”:

$$\sum_{t=1}^{k} s_t = n_0 \qquad (1)$$

It should be noted that in the table in Fig. 3 the column Y’ includes k groups of repeated values.

Third stage. For each column X’_j, a table of absolute conditional frequencies of the sign values by class is calculated. Each such table includes k rows (one per class) and r columns (one per gradation of the sign X’_j). For simplicity of presentation, the number of gradations is assumed to be the same for all signs and equal to r. Note that the frequency series can be calculated for numeric signs as well as for rank and nominal signs with a finite number of values.

Figure 3. Classified table of the signs rank values.

The tables are then normalized by rows: each row is divided by the number of elements s_t in the corresponding class, so that the elements of each row sum to one. The normalized tables are shown in Fig. 4. These frequency series are sample conditional distributions of the signs X’ for the given values of Y’.
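The third stage can be sketched for one restoration sign as follows. The function name `conditional_frequencies`, the 1-based coding of ranks and classes, and the example data are assumptions for illustration.

```python
def conditional_frequencies(x_ranks, y_classes, k, r):
    """Return a k x r table of conditional frequencies and the class sizes.

    Row t-1 holds the distribution of X' among observations with Y' = t.
    """
    counts = [[0] * r for _ in range(k)]
    for x, y in zip(x_ranks, y_classes):
        counts[y - 1][x - 1] += 1                 # absolute frequencies
    sizes = [sum(row) for row in counts]          # class sizes s_t
    freqs = [[c / s if s else 0.0 for c in row]   # row-normalized table
             for row, s in zip(counts, sizes)]
    return freqs, sizes


# Hypothetical exemplary sampling: one sign X' with r = 3 gradations, k = 2 classes.
x_ranks = [1, 1, 2, 3, 3, 3]
y_classes = [1, 1, 1, 2, 2, 2]
freqs, sizes = conditional_frequencies(x_ranks, y_classes, k=2, r=3)
print(sizes)     # [3, 3]
print(freqs[1])  # [0.0, 0.0, 1.0]
```

Calling the function once per restoration sign gives the m tables of Fig. 4.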

Figure 4. Frequency rows of signs by the classes of sign values.

Fourth stage. At this stage the etalon (class standard) vector is calculated. The etalon consists of m parts, one per sign X. Each part consists of r elements, one per discrete value of the sign X’. The etalon layout is shown in Fig. 5. Consider the rule for calculating its elements. Each part of the etalon is calculated from the corresponding table in Fig. 4. The total number of columns in all the tables in Fig. 4 equals m·r; accordingly, the length of the etalon is m·r. Each element is calculated from the data of one table column: the maximum value in the column is found, and the number of the row (class number) in which it occurs is assigned to the corresponding element of the etalon.

The calculation of the etalon elements has a geometric interpretation. Fig. 6 gives a graphic illustration of an example in which one part of the etalon is calculated. All other parts are calculated in the same way.

Figure 5. Class sample scheme.

When the fifth element of the etalon part is calculated in the example in Fig. 6, an ambiguity arises: the maximum is reached in two rows at once, the second and the third. In this case preference is given to the class (row of the table of conditional distributions) with the larger number of elements s_t. In our example the comparison of the class sizes s_2 and s_3 resolves the tie.
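The rule for one part of the etalon, including the tie-break by class size, can be sketched as follows. The frequency table and class sizes are invented for illustration (they are not the values of Fig. 6), and the sketch assumes that the third class is larger than the second; `etalon_part` is a hypothetical name.

```python
def etalon_part(freqs, sizes):
    """For each column (interval of X'), return the class with the largest
    conditional frequency; ties go to the class with more observations.
    Classes are numbered from 1."""
    k, r = len(freqs), len(freqs[0])
    part = []
    for col in range(r):
        best = max(range(k), key=lambda t: (freqs[t][col], sizes[t]))
        part.append(best + 1)
    return part


# Hypothetical table for one sign: k = 3 classes, r = 5 intervals.
freqs = [
    [0.5, 0.3, 0.1, 0.1, 0.0],
    [0.1, 0.2, 0.4, 0.1, 0.2],
    [0.0, 0.1, 0.3, 0.4, 0.2],
]
sizes = [10, 10, 20]   # assumed class sizes s_1, s_2, s_3
print(etalon_part(freqs, sizes))  # [1, 1, 2, 3, 3]
```

In the fifth column classes 2 and 3 share the maximum 0.2, and the larger class 3 is chosen. The full etalon of length m·r is the concatenation of the parts computed for all m signs.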

Figure 6. Graphic interpretation of the numeric sample calculation for one part of the class sample.

Fifth stage. At this stage the multidimensional data of the “control sampling” are compared with the etalon, and the class number of the restored sign Y is forecast for each observation of the “control sampling”. The comparison procedure is demonstrated by a numeric example (Fig. 7).

The calculation is made in several steps:

1. The auxiliary selection vector A is formed. The values “1” mark the number of the interval into which the value of each sign X’_j falls;

2. The auxiliary vector B is formed as the element-wise product of the vector A and the etalon vector;

3. The class ratings are calculated as the number of occurrences of each class number in the vector B;

4. The class of the “control sampling” observation is forecast as the class with the maximum rating.

At the fourth step an ambiguity may arise again. It occurs when the maximum class rating is reached by several classes at once. In this case preference is again given to the class with the largest volume s_t.
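The four steps and the tie-break can be sketched in Python as follows. The etalon, class sizes, and observations are hypothetical (they do not reproduce Fig. 7), and `predict_class` is an illustrative name.

```python
from collections import Counter


def predict_class(obs_ranks, etalon, r, sizes):
    """Forecast the class of one control observation.

    obs_ranks: ranks X'_1..X'_m of the observation (each in 1..r)
    etalon:    flat list of length m*r holding class numbers
    sizes:     class sizes s_t, used to break ties
    """
    m = len(obs_ranks)
    # Step 1: vector A has a single 1 in each part, at the observed interval.
    a = [0] * (m * r)
    for j, rank in enumerate(obs_ranks):
        a[j * r + rank - 1] = 1
    # Step 2: vector B is the element-wise product of A and the etalon.
    b = [ai * ei for ai, ei in zip(a, etalon)]
    # Step 3: the rating of a class is how often its number occurs in B.
    ratings = Counter(c for c in b if c != 0)
    # Step 4: the highest rating wins; ties go to the larger class.
    return max(ratings, key=lambda c: (ratings[c], sizes[c - 1]))


# Hypothetical etalon for m = 3 signs and r = 5 intervals.
etalon = [1, 1, 2, 3, 3,
          1, 2, 2, 3, 3,
          1, 1, 1, 2, 3]
sizes = [10, 10, 20]
print(predict_class([3, 2, 5], etalon, 5, sizes))  # 2 (ratings: class 2 twice, class 3 once)
print(predict_class([1, 3, 5], etalon, 5, sizes))  # 3 (three-way tie, largest class wins)
```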

Figure 7. Class sample scheme.

Sixth stage. This is the final stage. If the original values of the restored sign Y are measured on a rank or nominal scale, the forecast class itself determines the value that fills the gap. If the original values of the restored sign Y are measured on a numeric scale, the forecast class numbers for the observations of the “control sampling” are replaced by the mean values of the sign Y within the corresponding classes, calculated from the data of the “exemplary sampling”.
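For a numeric restored sign, the last stage can be sketched as follows, assuming that each forecast class number is replaced by the mean of the original Y values of that class in the exemplary sampling. The data and the name `class_means` are illustrative.

```python
def class_means(y_values, y_classes, k):
    """Mean of the original Y within each class 1..k of the exemplary sampling."""
    sums, counts = [0.0] * k, [0] * k
    for y, c in zip(y_values, y_classes):
        sums[c - 1] += y
        counts[c - 1] += 1
    # An empty class has no mean; None marks that case.
    return [s / n if n else None for s, n in zip(sums, counts)]


# Hypothetical exemplary sampling: original Y values and their classes Y'.
y_exemplary = [10.0, 12.0, 30.0, 34.0, 50.0]
classes_exemplary = [1, 1, 2, 2, 3]
means = class_means(y_exemplary, classes_exemplary, k=3)

# Classes forecast for three control observations at the fifth stage.
forecast_classes = [2, 3, 1]
restored = [means[c - 1] for c in forecast_classes]
print(restored)  # [32.0, 50.0, 11.0]
```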

According to the classification given in Fig. 1, the considered algorithm belongs to the class of complex and global algorithms. It is classed as complex not because of the complexity or the multiple stages of the calculations, but because the researcher applying it to a specific task faces a problem of choice: the number of intervals must be set for the restoration signs and for the restored sign. This may require a small experiment. With complex algorithms the researcher should understand how the algorithm operates. Since the algorithm allows its accuracy to be optimized, criteria are needed for comparing different variants of the resulting decision rule.

The considered method is not very sensitive to the selection of restoration signs. The selection problem nevertheless remains, because superfluous “non-informative” signs can “clog” the useful information. It is recommended to start with a small number of restoration signs and to increase it gradually. When selecting numeric signs, it is reasonable to include first the signs that correlate strongly with the restored sign. For rank signs, rank correlation coefficients can be applied. Common sense and a meaningful analysis of the signs should not be neglected either.

When restoring data obtained in socio-economic research, it may be very useful to include among the restoration signs a generalized rank sign that reflects the researcher's notion of the socio-demographic profile of population groups. Such a sign is formed on the basis of several signs.

An illustrative example shows what happens when the sociological portrait is not sufficiently taken into account. Suppose that, during restoration of a sign, the value “70 and over” turns out to be the best fit for the age of some observation. The algorithm may have operated correctly and the mean errors may be minimal, but if these are data on full-time students, the correctness of such a restoration is questionable.

IV. Discussion. Methods to Evaluate the Data Restoration Accuracy

To compare the accuracy of data restoration achieved by different methods, or with different data used for restoration, quality criteria are required.

Many authors hold that the basic properties of the sample (estimates of the density functions, means, and variances of the signs) should be preserved after data restoration. When the percentage of restored data is small, these parameters remain practically unchanged after the restored data are included in the sampling.

In the author's opinion, the most universal means of comparing results is the error estimates calculated by the method of sliding examination [17, 18].

In this method the decision rule for data restoration is checked on the exemplary sampling, which contains complete data for both the restoration signs and the restored sign. The sliding examination procedure removes one observation at a time from the exemplary sampling and then restores it using the remaining observations. An error is the assignment of an observation of the restored sign to a wrong class. The procedure is repeated n0 times. With sufficiently large samplings (thousands of observations) this may take considerable time, but with modern computers it is not a problem. The sliding examination procedure yields the restoration error matrix:

$$E = \left( e_{ij} \right), \quad i, j = 1, \ldots, k \qquad (2)$$

where e_ij is the number of values of the restored sign from class i that were assigned to class j during restoration.

The number of correct restorations is therefore equal to the sum of the diagonal elements of the matrix:

$$n_c = \sum_{i=1}^{k} e_{ii} \qquad (3)$$

The sum of the elements in a row gives the volume of the corresponding class of the exemplary sampling:

$$s_i = \sum_{j=1}^{k} e_{ij} \qquad (4)$$

The quality of restoration can be assessed by the indicator δ, the percentage of restoration errors, and by the indicators δ_i, the shares of errors by class:

$$\delta = \frac{n_0 - n_c}{n_0} \cdot 100\% \qquad (5)$$

$$\delta_i = \frac{s_i - e_{ii}}{s_i}, \quad i = 1, \ldots, k \qquad (6)$$

A more detailed analysis of errors can be made using the normalized restoration error matrix. Its values are calculated by dividing each row of the error matrix by the volume s_i of the corresponding class of the exemplary sampling. The elements of each row of the normalized matrix then sum to one. The elements of the normalized error matrix are:

$$\tilde{e}_{ij} = \frac{e_{ij}}{s_i} \qquad (7)$$
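The sliding examination and the error indicators can be sketched in Python as follows. The decision rule is passed in as a function; the nearest-neighbour rule used in the example is only a simple stand-in chosen to keep the sketch short and is not the etalon rule of the paper. The indicator formulas follow the verbal definitions in the text (sum of diagonal elements, row sums, percentage of errors, shares of errors by class, row normalization). All names and data are hypothetical.

```python
def sliding_exam(rows, classes, k, predict):
    """Leave each complete observation out in turn, forecast its class from the
    rest, and count outcomes in a k x k error matrix e[i][j]
    (true class i+1, forecast class j+1)."""
    e = [[0] * k for _ in range(k)]
    for idx in range(len(rows)):
        train_rows = rows[:idx] + rows[idx + 1:]
        train_classes = classes[:idx] + classes[idx + 1:]
        guess = predict(train_rows, train_classes, rows[idx])
        e[classes[idx] - 1][guess - 1] += 1
    return e


def error_rates(e):
    """Error indicators computed from the error matrix."""
    k = len(e)
    n0 = sum(sum(row) for row in e)
    correct = sum(e[i][i] for i in range(k))         # sum of diagonal elements
    sizes = [sum(row) for row in e]                  # class volumes (row sums)
    total_error = 100.0 * (n0 - correct) / n0        # percentage of errors
    class_error = [(s - e[i][i]) / s if s else 0.0   # share of errors by class
                   for i, s in enumerate(sizes)]
    normalized = [[c / s if s else 0.0 for c in row]
                  for row, s in zip(e, sizes)]       # rows sum to one
    return total_error, class_error, normalized


def nearest_neighbour(train_rows, train_classes, row):
    """Stand-in decision rule (NOT the paper's etalon rule): class of the closest row."""
    dists = [sum(abs(a - b) for a, b in zip(t, row)) for t in train_rows]
    return train_classes[dists.index(min(dists))]


# Hypothetical exemplary sampling with two ranked signs and k = 2 classes.
rows = [[1, 1], [1, 2], [2, 1], [4, 5], [5, 4], [5, 5], [4, 4]]
classes = [1, 1, 1, 2, 2, 2, 1]
e = sliding_exam(rows, classes, 2, nearest_neighbour)
total_error, class_error, normalized = error_rates(e)
print(e)            # [[3, 1], [0, 3]]
print(class_error)  # [0.25, 0.0]
```

Here the last observation belongs to class 1 but lies among the observations of class 2, so it is the single error: one of seven observations overall and one of four in class 1.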

Errors may be distributed unevenly across classes, and restoration of some data may be refused. Restoration error matrices can help to understand the reasons for non-response, which is important for organizing subsequent studies when monitoring socio-economic processes. Many authors consider identifying the reasons for non-response even more important than data restoration itself.

The described procedure for estimating the level of restoration errors is suitable for any type of restored sign. When numerical signs are restored, the variance of the deviations between the original values and the estimates obtained during restoration can additionally be calculated on the exemplary sampling. Such estimates can be calculated both for the entire sampling and by class.

V. Conclusion

Before data restoration, a thorough analysis of possible errors in the data must be made and outliers must be identified. For this we use procedures that automate the process [19]. With large samplings and a great number of signs, this cannot be done without special software.

It should be noted that the laborious procedure of data restoration is justified if the researcher uses multidimensional statistical methods. Many researchers limit themselves to studying one-dimensional signs; for them restoring data gaps is not reasonable, since it may introduce deviations into results that could be obtained from the complete data.

Some multidimensional statistical methods can be implemented in software so that they take the gaps in the data into account. The simplest example is calculating the covariance matrix of data that contain gaps. The mean values used in calculating the covariance matrix can be computed for each sign independently, using the available values. More complex examples could be given. However, accounting for gaps in software can considerably complicate the program, and it requires agreed designations for missing data in the data tables. Signs of different types need their own conventional designations. Therefore, this approach is used only in exceptional cases. We, for instance, have used it at the preliminary stages of data processing when identifying outliers and gross errors.
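The covariance example can be sketched as follows, under assumptions that the text does not fix: gaps are marked with `None`, each mean is taken over the available values of its sign, each covariance is taken over the rows in which both signs are present, and the divisor is the number of such rows minus one. The function name and data are illustrative.

```python
def covariance_with_gaps(table):
    """Covariance matrix of a table whose gaps are marked with None."""
    m = len(table[0])
    # Means are computed for each sign independently, over its available values.
    means = []
    for j in range(m):
        col = [row[j] for row in table if row[j] is not None]
        means.append(sum(col) / len(col))
    cov = [[None] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            # Only rows in which both signs are present contribute.
            pairs = [(row[a], row[b]) for row in table
                     if row[a] is not None and row[b] is not None]
            if len(pairs) > 1:
                total = sum((x - means[a]) * (y - means[b]) for x, y in pairs)
                cov[a][b] = total / (len(pairs) - 1)
    return cov


# Hypothetical table with two signs and two gaps.
table = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
    [None, 4.0],
]
print(covariance_with_gaps(table))  # [[1.0, 4.0], [4.0, 4.0]]
```

Because every element is estimated from a different subset of rows, a matrix obtained this way is not guaranteed to be positive semi-definite (the one printed above is not), which is one of the complications of handling gaps inside the program.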

As for the considered algorithm, it can be added in conclusion that the method can provide more precise results than some other methods because it extends the range of signs that can be used.

A series of experiments on model data has shown the advantages of the method over others. In the experiments we made wide use of a program for modeling multidimensional normal distributions. We are continuing experiments on model data to identify the conditions and limitations for the use of this method. It has been confirmed that any method depends heavily on the available data. If the data are “bad”, even the most perfect methods are powerless.

It can also be added that the software tools we have developed either include built-in blocks for carrying out experiments or include parameters that automate the experimental work, which facilitates the work of the researcher.

The developed software has been tested on large data arrays collected in studies devoted to the analysis of socio-economic problems and the social well-being of the population of the Primorye region (Russia).

REFERENCES

 [1]   Zloba, E. and Yatskiv, I. Statistical methods for recovering missing data, in Computer modeling and new technologies, 2002, vol. 6, no. 1, pp. 51–61

 [2]   Belov, Yu. S. and Frolov, P. V. Classical methods of working with data with omissions, in Electronic journal: science, technology and education, 2015, no. 1, pp. 25–32

 [3]   Zhang, P. Multiple imputation: theory and method, in International Statistical Review, 2003, vol. 71(3), pp. 581–592

 [4]   Riphahn, R. T. and Serfling, O. Item non-response on income and wealth questions, in Empirical Economics, 2005, vol. 30(2), pp. 521–538

 [5]   Martyshenko, S. N. and Martyshenko, N. S. Method for detecting errors in empirical data, in News of universities. North Caucasus Region, 2008, no. 1, pp. 11–14

 [6]   Pawlak, Z. Rough sets and intelligent data analysis, in Information sciences, 2002, vol. 147(1), pp. 1–12

 [7]   Martyshenko, N. S. and Martyshenko, S. N. Technology to improve the quality of data in the questionnaire, in Practical marketing, 2008, no. 1, pp. 8–13

 [8]   Zagoruiko, N. G. Applied methods of data analysis and knowledge. Novosibirsk: Institute of Mathematics, 1999, 270 p.

 [9]   Johnson, T. P., et al. Culture and survey nonresponse, in Survey nonresponse, 2002, pp. 55–69

[10]   Andridge, R. R. and Little, R. J. A review of hot deck imputation for survey non‐response, in International Statistical Review, 2010, vol. 78(1), pp. 40–64

[11]    Reilly, M. Data analysis using hot deck multiple imputation, in The Statistician, 1993, pp. 307–313

[12]    Myers, T. A. Goodbye, listwise deletion: Presenting hot deck imputation as an easy and effective tool for handling missing data, in Communication Methods and Measures, 2011, vol. 5(4), pp. 297–310

[13]    Zangieva, I. K. The problem of omissions in sociological data: meaning and approaches to solving, in Sociology: methodology, methods, mathematical modeling, 2011, vol. 33, pp. 28–56

[14]    Osipov, P. A., Verkhoturova, M. V., et al. Filling in the gaps in the input using the non-parametric identification algorithm, in Siberian Journal of Science and Technology, 2018, vol. 19, no. 4, pp. 589–597

[15]    Stashkova, O. V. and Shestopal, O. V. The use of artificial neural networks to restore gaps in the array of source data, in News of higher educational institutions. North Caucasus region. Technical science, 2017, vol. 1 (193), pp. 37–42

[16]    Martyshenko, S. N. and Martyshenko, N. S. Modern methods of processing marketing information, Vladivostok: VSUES Publishing House, 2014, 148 p.

[17]    Press, J. The role of Bayesian and frequentist multivariate modeling in statistical Data Mining, in Statistical Data Mining and Knowledge Discovery, 2004, pp. 1–14

[18]    Sarychev, A. P. The Scheme of Sliding Examination for Optimal Set Features Determination in Discriminant Analysis Problem, in Journal of Automation and Information Sciences, 2002, vol. 34(11)

[19]    Martyshenko, S. N. Analysis of monitoring data of socio-economic processes in municipalities, in Information technology modeling and management, 2012, vol. 6 (78), pp. 506–512