Sergey Martyshenko, Department of Mathematics and Modeling, Vladivostok State University of Economics and Service
Vladivostok, Russia, firstname.lastname@example.org
Abstract: In recent years, a growing share of scientific research has drawn on data from the Internet. Most Internet information is qualitative data. Feedback collected from the population is also mostly qualitative. A typical source of qualitative data is respondents’ answers to open-ended questionnaires. Processing becomes harder when the data volume is large. A wide range of researchers is therefore interested in special software tools for processing qualitative data, and methodological approaches and instrumental tools for this purpose have been developed. The main idea of the system considered here is to automate the procedure of developing typologies. The system makes it possible to analyze qualitative data of varying complexity. Complex qualitative data can be represented as a composite feature consisting of several simpler statements (linguistic units). The system supports two levels of typology. It also contains elements of intelligent systems: as the researcher works, the system is trained and a knowledge base is formed. All actions the researcher performs while developing typologies are stored in a special database. When socio-economic processes are monitored, this approach reduces the time needed to process qualitative data severalfold. For complex linguistic units, the analysis is simplified by compiling lists of keywords for individual typologies. The system has been tested in numerous studies of socio-economic processes and proved highly effective. Its major advantages are relative ease of use and accessibility to a wide range of researchers. The software complex is implemented as an EXCEL application.
Keywords: qualitative data, big data, typologies, software complex, intelligent systems, knowledge base, structure analysis
© The Authors, published by CULTURAL-EDUCATIONAL CENTER, LLC, 2020
This work is licensed under Attribution-NonCommercial 4.0 International
I. Introduction
In recent years, a growing share of scientific research has drawn on data from the Internet. Most Internet information is qualitative data. Feedback collected from the population is also mostly qualitative. A typical source of qualitative data is respondents’ answers to open-ended questionnaires. Processing becomes harder when the data volume is large. A wide range of researchers is therefore interested in special software tools for processing qualitative data.
Formalized analysis of sociological data is automated with various software tools, including the packages Atlas.ti, MAXQDA, and NVivo. Alongside these programs, qualitative data analysis programs are also being developed in Russia.
However, such software requires special training, and most researchers find many of its functions excessive for their work. It is therefore relevant to develop simpler software tools that are easy to master and sufficient for a wide range of researchers who wish to expand their capabilities through qualitative data analysis.
II. Literature Review and Research Methods
Software tools for qualitative data analysis improve continuously as the theory evolves. A review of qualitative data analysis methods is given in the work . The principles of formalized qualitative data analysis are set out in [5–7]. Algorithms for processing qualitative data are given in [8–10]. Technologies for analyzing qualitative information are discussed in [11, 12]. Systems for intelligent processing of qualitative data are considered in [13–15]. The possibilities of revealing empirical structure through formalized qualitative analysis of sociological data are identified in [16, 17]. Methods for analyzing qualitative data from social networks are presented in [18, 19].
Figure 1. Scheme of the information technology for processing qualitative data.
The instrumental tools for analyzing qualitative information are based on typological analysis.
Typological analysis is a meta-method of data analysis: a set of methods for studying a social phenomenon that makes it possible to distinguish socially relevant, internally homogeneous, and qualitatively different groups of empirical objects. These groups are characterized by type-forming features of different nature and are interpreted as carriers of different types of existence of the phenomenon.
The subject of a typology is the set of main characteristics of a social phenomenon that determines the assignment of empirical objects to the same type group.
The basis of a typology is a set of judgments (statements) about the proximity (similarity, likeness) of objects that carry information about the social phenomena (events, processes) under study.
We now describe the basic elements of the software complex developed for processing qualitative data. The data are assumed to be respondents’ answers to open-ended questionnaires. The software complex converts a set of qualitative data (a column of initial data) into a structured form to which data analysis methods can be applied (Fig. 1). The software complex is developed as an EXCEL application.
The technology assumes data volumes large enough to produce stable results. The software complex can process qualitative data of varying complexity. Below, the capabilities of the program are described by level of data complexity.
First level of data complexity. The simplest case is when answers with a one-to-one correspondence must be brought to a uniform form, for example the names of settlements, which different respondents may write in different ways. This level also includes short simple statements of one or more words that describe characteristics of the surveyed objects. Such answers are called “simple”.
Qualitative data processing rests on the assumption that respondents formulate similar or repeated answers. The initial data are first converted into a table of unique values, and for each repeated value the frequency is calculated. In this form the data become much easier to survey. All data conversions are performed in the table of unique values. The main data processing operation is typing. Typing involves three things: “typology detection”, “data correction”, and “compression of the unique values table”. Typology detection requires a comprehensive analysis of the data, for which the standard EXCEL operations of sorting, searching, and filtering are very useful.
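The conversion of a column of answers into a table of unique values can be sketched in a few lines. The following Python fragment is only an illustration and not the EXCEL implementation described here; the function name `unique_value_table`, the sample answers, and the trimming of surrounding spaces before comparison are all assumptions.

```python
from collections import Counter


def unique_value_table(answers):
    """Build (value, frequency) rows from a column of raw answers.

    Assumption: answers are compared as exact strings after trimming
    surrounding spaces; empty cells are skipped.
    """
    counts = Counter(a.strip() for a in answers if a.strip())
    # Most frequent values first; ties are ordered by plain string comparison.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


# Hypothetical column of initial data.
column = ["polite", "Polite", "educated", "polite", " punctual ", ""]
table = unique_value_table(column)
for value, frequency in table:
    print(f"{value}\t{frequency}")
```

In this sketch “polite” and “Polite” stay in separate rows, because only completely matching values are merged; bringing such variants together is the task of the typing operation.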
As a result of typing operations, similar statements are grouped together. Often a common group name is entered, which may coincide with the most common value in the group. In this case typing produces a derived value of the form “group name (initial value)”. The group name makes it possible to sort values by group. Systematizing the data involves many typing operations, and more than one session may be needed to bring the data into a structured form. If the final typology is not completed in one session, work can be resumed in the next session from the same place.
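A single typing operation can be illustrated as follows. This is a minimal sketch under stated assumptions, not the program's actual code: the function `apply_typing` is hypothetical, a value identical to the group name is assumed to stay unchanged, and the frequencies are borrowed from the student survey discussed later in the paper.

```python
def apply_typing(table, group_name, members):
    """Rename the chosen unique values to 'group name (initial value)'.

    table   -- list of (value, frequency) rows
    members -- set of values the researcher assigned to the group
    Assumption: a value identical to the group name is left unchanged.
    """
    typed = []
    for value, frequency in table:
        if value in members and value != group_name:
            value = f"{group_name} ({value})"
        typed.append((value, frequency))
    # Sorting by label places all members of a group next to each other.
    return sorted(typed)


table = [("polite", 125), ("educated", 81), ("good manners", 24), ("tolerance", 42)]
typed = apply_typing(table, "good manners", {"polite", "educated", "good manners"})

# Total frequency of the group, found through the common label prefix.
group_total = sum(f for v, f in typed if v.startswith("good manners"))
```

Because the initial value is kept in parentheses, the primary wording is not lost when answers are grouped.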
All data corrections made during a program session are automatically stored in a special knowledge base table called the “Substitution dictionary”.
Data collection is typically not limited to one cycle; after a while it is repeated. In this case the “Substitution dictionary” significantly reduces processing time, because many of the situations in the new data have already been processed before. Data are corrected through the “Substitution dictionary” semi-automatically. When the dictionary is activated, the program only suggests a suitable correction in a separate column of the unique values table. The final decision is made by the researcher, who reviews the list of proposed corrections and accepts or rejects each one. In some cases new data may lead the researcher to revise an earlier decision.
When processing yields too many response groups, an additional (higher) grouping level may be used. This level is specified in a separate column of the unique values table.
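The semi-automatic use of the “Substitution dictionary” can be pictured with the sketch below. It assumes that the dictionary is a simple mapping from an initial value to its correction; the function names, the settlement names, and the set of accepted proposals are hypothetical and serve only as an illustration.

```python
def propose_corrections(unique_values, substitution_dictionary):
    """Pair every unique value with a stored correction (None if unseen)."""
    return [(value, substitution_dictionary.get(value)) for value in unique_values]


def apply_accepted(proposals, accepted):
    """Apply only the corrections the researcher has confirmed."""
    return [
        correction if correction is not None and value in accepted else value
        for value, correction in proposals
    ]


# Hypothetical dictionary accumulated in earlier sessions.
substitution_dictionary = {"Vladivostok city": "Vladivostok", "Vlad": "Vladivostok"}
new_values = ["Vlad", "Vladivostok city", "Nakhodka"]

proposals = propose_corrections(new_values, substitution_dictionary)
# The researcher reviews the proposals and confirms only one of them.
corrected = apply_accepted(proposals, accepted={"Vladivostok city"})
```

The proposal step and the acceptance step are kept separate on purpose, which mirrors the rule that the program only suggests and the researcher decides.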
Second level of data complexity. The second level includes answers in which the respondent gives several response variants at once. Individual responses are separated from each other by a divider sign (e.g. “;”). Such answers (values) are called “composite responses”. When the table of unique answers is formed, each composite response is split into several simple ones.
The second complexity level arises when the question allows greater freedom in formulating answers. In practice respondents often fail to separate the parts of a composite answer clearly, so the researcher has to analyze the complexity of the answers while processing such data. If, after splitting, the unique values table still contains answers that can be divided into simpler ones, the researcher inserts a divider sign between their parts. This operation is called “simplification of text answers”. In the next session such a composite response is then represented by a larger number of simple answers.
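Splitting composite responses by the divider sign may look as follows. The sketch assumes “;” as the divider and uses hypothetical function names and sample answers; it shows only the splitting and counting step, not the researcher's manual simplification.

```python
from collections import Counter


def split_composite(answer, divider=";"):
    """Split a composite response into trimmed, non-empty simple answers."""
    return [part.strip() for part in answer.split(divider) if part.strip()]


def count_simple_answers(answers, divider=";"):
    """Count simple answers over a column of composite responses."""
    counts = Counter()
    for answer in answers:
        counts.update(split_composite(answer, divider))
    return counts


# Hypothetical composite responses.
answers = ["low wages; high prices", "high prices", "bad roads ;; low wages"]
counts = count_simple_answers(answers)
```

Once the researcher has inserted an additional divider into a long answer, running the same step again yields more simple answers from that response.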
Already at the second complexity level the answers may contain excess information that is irrelevant to the analysis. For instance, a fragment such as “in my opinion” or “above all” carries no weight in the analysis of the response and can be deleted without loss of meaning. Text fragments with excess information are accumulated in a special dictionary, the “Excess information dictionary”. When this dictionary is activated, the researcher does not need to search for excess information manually. Each phrase in the dictionary has an additional attribute: “contains”, “starts on”, “ends on”, or “exact value”.
All excess information found in the dictionary is highlighted in color in the unique values table, which makes it easier for the researcher to unify text responses. The researcher can add new text fragments to the “Excess information dictionary” at any time.
After this preliminary processing, work with composite answers reduces to work with simple answers.
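The four matching attributes of the “Excess information dictionary” can be sketched as simple string tests. The layout of the dictionary as a list of (phrase, attribute) pairs, the function name `find_excess`, and the attributes assigned to the two sample phrases are assumptions made for illustration. The function only flags matches, in line with highlighting rather than automatic deletion.

```python
# Each attribute from the text is mapped to an ordinary string test.
MATCHERS = {
    "contains": lambda text, phrase: phrase in text,
    "starts on": lambda text, phrase: text.startswith(phrase),
    "ends on": lambda text, phrase: text.endswith(phrase),
    "exact value": lambda text, phrase: text == phrase,
}


def find_excess(answer, excess_dictionary):
    """Return the (phrase, attribute) entries that match the answer."""
    return [
        (phrase, attribute)
        for phrase, attribute in excess_dictionary
        if MATCHERS[attribute](answer, phrase)
    ]


# Hypothetical dictionary content; the attributes are assumed.
excess_dictionary = [("in my opinion", "starts on"), ("above all", "contains")]

flagged = find_excess("in my opinion prices are too high", excess_dictionary)
```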
Third level of data complexity. At this level no restrictions are placed on the respondents’ answers. Respondents can express their opinions in any form, as several sentences of arbitrary text. The preliminary stage of reducing complex answers to composite and simple ones is therefore a more difficult task. To automate the researcher’s work on such answers, all tools of the knowledge base are used. The efficiency of processing such data can be improved with the “Keywords dictionary”. This dictionary is formed automatically once a preliminary typology has been determined, which is then continuously refined during the data analysis. The “Keywords dictionary” has a rather complex structure that takes frequency characteristics into account. It is used to generate suggestions for data correction. However, working with it requires deeper knowledge of the data systematization tools offered by the system.
All dictionaries are stored in a single Access file. The dictionaries store the experience accumulated by the researcher and together form the knowledge base.
To illustrate the use of the software, we consider two examples of processing qualitative data obtained from sociological surveys.
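The structure of the “Keywords dictionary” is not specified in detail here, so the following is only one possible reading of a keyword list with frequency characteristics. In this sketch each word of an already typed answer is counted per typology group, and a new answer is assigned to the group whose keywords score highest. The function names, the minimum word length, and the sample frequencies are all hypothetical.

```python
from collections import Counter, defaultdict


def build_keywords(typed_answers, min_length=4):
    """Count, per typology group, how often each word occurs.

    typed_answers -- list of (text, group, frequency) rows
    Assumption: words shorter than min_length are ignored.
    """
    keywords = defaultdict(Counter)
    for text, group, frequency in typed_answers:
        for word in text.lower().split():
            if len(word) >= min_length:
                keywords[word][group] += frequency
    return keywords


def suggest_group(text, keywords):
    """Suggest the group with the highest keyword score, or None."""
    scores = Counter()
    for word in text.lower().split():
        scores.update(keywords.get(word, {}))
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical preliminary typology with illustrative frequencies.
typed_answers = [
    ("high utility prices", "Price policy", 17),
    ("poor quality medical care", "Poor life quality", 30),
    ("high prices on medical care", "Poor life quality", 22),
]
keywords = build_keywords(typed_answers)
suggestion = suggest_group("utility prices keep growing", keywords)
```

As with the other dictionaries, such a suggestion would only be a proposal that the researcher confirms or rejects.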
In a questionnaire on the formation of the cultural environment among student youth, the respondents were asked the following open question: “Name the signs of a cultural person (at least 3)”. More than 500 students were surveyed. In total, the students gave more than a thousand variants of the characteristics of a cultural person (counting repeated answers). The responses form a table of unique values, in which variants are considered the same only if they match completely. After the qualitative data were processed with the special computer technology, 282 answers that differ to varying degrees were identified. Content analysis of the unique values yielded 8 groups (typologies) of similar answers that characterize the cultural person from the students’ point of view (Fig. 2).
Figure 2. Assessment of the distribution structure of cultural person characteristics based on the student response data.
When the respondents’ answers were processed, the “good manners” response group came to include 49 unique answers in different forms. Some responses were more common than others. The most frequent answers in this group name such qualities of a cultural person as “polite” (125 answers), “educated” (81 answers), “complies with etiquette rules” (38 answers), “punctual” (24 answers), and the direct answer coinciding with the group name, “good manners” (24 answers). These answers make up 89% of the answers assigned to the “good manners” group. The other answers in this group are quite close in meaning to those above; for example, it is reasonable to assign to this group the answers “compliance with behavior rules”, “politeness”, and “decent behavior”.
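The reported figures for this group can be checked with simple arithmetic. The fragment below uses only the frequencies quoted above; the implied group size is an estimate derived here from the rounded 89% and is not a figure reported in the survey.

```python
# Frequencies of the five most common answers in the "good manners" group,
# as reported in the text.
top_answers = {
    "polite": 125,
    "educated": 81,
    "complies with etiquette rules": 38,
    "punctual": 24,
    "good manners": 24,
}
top_total = sum(top_answers.values())

# The text states that these answers make up 89% of the group.
# The implied group size is an estimate, since 89% is a rounded figure.
implied_group_size = round(top_total / 0.89)
print(top_total, implied_group_size)
```

Under this estimate the five answers account for 292 of roughly 328 responses, so the remaining unique answers of the group are individually rare.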
The answers in the next group, “respect to the surrounding”, could also be assigned to the “good manners” group, but they are narrower in meaning. About 60% of the answers in this group are the following: “tolerance” (42 answers), “respect to the surrounding” (18 answers), “tactfulness” (17 answers), “restraint” (12 answers), and “respect for the elders” (12 answers). Single answers in this group include, for example, “friendliness”, “amiability”, and “self-control”.
In defining the qualities of a cultural person, the students attach great importance to the desire for self-development (22% of respondents). It is noted in the answers of two response groups: “desire for self-development” and “desire for spiritual development”. The difference between these groups is slight. The more general group “desire for self-development” contains such answers as “literacy”, “education”, and “intellectuality”, whereas the group “desire for spiritual development” includes answers concerning literature, art, and knowledge of native history.
The next group of answers also concerns education. It combines all sorts of synonyms and characteristics of literate speech, and it is well separated from the other groups. Among the important qualities of a cultural person the students also named “helpfulness”.
From the survey results it may be concluded that students identify the characteristics of a cultural person adequately. Analysis of the responses showed that the students named all the characteristics of a cultural person identified by well-known sociologists [22, 23]. The specific formulations certainly differ in emotional coloring and diversity, but the final meaning is conveyed correctly. The typology makes it possible to quantify the relevance of the features of a cultural person as seen by modern students. While the characteristics themselves are stable, their estimates may change over time.
We now turn to a more complex example of processing qualitative data.
The social well-being of people in the Primorye region of Russia has been studied for several years. Over six thousand respondents were interviewed during the study period.
Among other questions, the respondents were asked the following open question: “Name the socio-economic, political and other problems that most bother you and people around you”. The respondents expressed their opinions in arbitrary text form and on average reported three problems. After processing there were more than five thousand unique response variants. To analyze the answers, two levels of typology were introduced.
At the first typology level, 80 groups (subclasses) of response variants were formed, each containing very close answers. At the second level, these groups were combined into 18 response classes that are close in subject.
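The relation between the two typology levels can be sketched as a mapping from first-level groups to second-level classes, from which class shares are computed. The function `class_shares`, the group names, and the counts below are hypothetical and are not the survey data.

```python
from collections import Counter


def class_shares(group_counts, group_to_class):
    """Aggregate first-level group counts into second-level class shares."""
    totals = Counter()
    for group, count in group_counts.items():
        totals[group_to_class[group]] += count
    grand_total = sum(totals.values())
    return {name: count / grand_total for name, count in totals.items()}


# Hypothetical first-level groups with illustrative counts.
group_counts = {
    "high utility prices": 30,
    "high housing prices": 20,
    "poor quality medical care": 50,
}
# The second-level assignment is the researcher's subjective decision.
group_to_class = {
    "high utility prices": "Price policy",
    "high housing prices": "Price policy",
    "poor quality medical care": "Poor life quality",
}
shares = class_shares(group_counts, group_to_class)
```

Changing only `group_to_class` regroups the same first-level data into different classes, so the first level can be reused when the second-level classification is revised.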
The first level of the typology hardly distorts the primary information. At the second level the combination of data is subjective and depends on how the researcher understands and interprets the problems. Different researchers may produce different classifications from the same data, but the results are quite similar in structure, apart from the names of the classes, which are assigned by the researcher. The structure of the classes may differ somewhat, since some response groups are borderline and can be assigned to either of two classes. The structure of the response variants obtained by processing the primary data on the problems most relevant to people in the Primorye region is presented in Table 1.
Table 1. Structure of the Responses of People in the Primorye Region on the Problems Most Relevant to Them
| No. | Response class | Share of answers |
|---|---|---|
| 2 | Poor life quality | 0.11 |
| 3 | Care of the future generation | 0.10 |
| 4 | Infrastructure and landscaping | 0.09 |
| 6 | Social tensions and poverty | 0.06 |
| 7 | Corruption and crime in the authorities | 0.06 |
| 8 | Environmental problems and nature protection | 0.06 |
| 9 | Employment and occupation | 0.05 |
| 10 | Low social security | 0.05 |
| 11 | Dissatisfaction with work of the authorities | 0.05 |
| 12 | Crime and personal security | 0.05 |
| 15 | Access to leisure and entertainment | 0.01 |
| 17 | Poor quality of goods and services | 0.01 |
| 18 | Global disasters and catastrophes | 0.003 |
Each class has its own structure of typical responses. As an example, consider the most relevant class of answers, “Price policy”. This class combines 10 answer groups (Fig. 3). Many people in the Primorye region note the urgency of the housing problem (23%), and residents also believe that utility prices are too high (17%).
According to the regional population, quality of life is the second most important problem after prices. In this class the leading answers relate to the quality of medical care: the answers “poor quality medical care” and “high prices on medical care” together account for more than half of the class (52%).
When a specific problem is analyzed, the tables that break down each response group can be consulted to view the respondents’ specific answers.
Understanding a problem is a prerequisite for finding an acceptable solution.
Figure 3. Frequency distribution of the answers given by residents of the Primorye region in the answer class “Price policy”.
Most socio-economic problems do not exist independently but are interrelated. The relationships between problems can be assessed, and a strategy for solving them jointly determined, by means of a cognitive approach. In this research a cognitive model was developed from the typology results. The software complex includes modules for the automated development of cognitive models, but developing cognitive models demands a higher qualification of the researcher.
The software complex contains tools that improve the efficiency of the researcher’s work when processing qualitative data. In addition, it includes programs for analyzing data that have already been systematized. These programs are used to analyze structures and to test hypotheses about the cause-and-effect dependencies typical of the objects or phenomena under study. The software complex is being developed continuously. Recent developments include tools that automate the development of cognitive models.
The software complex was tested on qualitative data obtained in studies of various economic systems. The programs were used to study consumers of the regional tourist complex, the social well-being of the regional population, and other socio-economic systems. In the course of these surveys the knowledge base is constantly updated.
The knowledge base is especially efficient in repeated or systematic studies of particular socio-economic systems. It can be transferred to other researchers working on surveys in related areas, which makes it much easier for them to analyze qualitative data.
In recent years more and more researchers have used qualitative data in their work. The range of information services provided to the population is growing. Managerial decisions require analysis of the opinions of various social groups. All this leads to larger volumes of qualitative data and, therefore, to a need for computer technologies to process them.
The major advantage of the proposed system is its relative ease of use. For most users, minimal knowledge is enough to start working with the system, and its tools can be learned in stages. The software complex is implemented as an EXCEL application, which makes it available to most users who work with qualitative data.
References
[1] Martyshenko, S. N. and Martyshenko, N. S. Modern methods to process marketing information, Vladivostok: VSUES Publishing House, 2014, 148 p.
[2] Kanygin, G. V. Analysis of qualitative data is a scientific method, in St. Petersburg sociology today, 2013, vol. 1, pp. 253–266.
[3] Boyarsky, K. K. and Kanevsky, E. A. Vega: computer system of text classification and analysis, in Scientific-technical bulletin of St. Petersburg State university of information technology, mechanics and optics, 2009, vol. 5 (63), pp. 98–105.
[4] Rihoux, B. Qualitative Comparative Analysis (QCA) and related techniques: Recent advances and challenges, Methoden der vergleichenden Politik-und Sozialwissenschaft, VS Verlag für Sozialwissenschaften, 2009, pp. 365–385.
[5] Mikheenkova, M. A. On the principles of a formalized qualitative analysis of sociological data, in Information Technologies and Computing Systems, 2009, no. 4, pp. 40–56.
[6] Klimova, S. G., et al. DSM-method in qualitative sociological study: basic principles and use experience, in Sociological journal, 2016, vol. 22, no. 2, pp. 8–30.
[7] Chuvgunova, O. A. Planning process characteristics: data quality analysis, in Azimuth of scientific research: pedagogy and psychology, 2017, vol. 6, no. 2 (19), pp. 299–302.
[8] Oleinik, A. N. Collecting, aggregating and processing quality data, in Sociological research, 2014, vol. 5 (361), pp. 121–130.
[9] Babich, N. S. Classification of inequality situations: data analysis algorithms, in Theory and practice of social development, 2013, no. 3, pp. 49–53.
[10] Kotelnikov, E. V. Choosing the data structure to provide hypotheses in DSM-method to analyze texts, in Fundamental research, 2016, no. 10–2, pp. 301–305.
[11] Zavyalova, N. B. and Motorygin, S. V. Principles and technologies for analyzing quality data in economics and management, in Modern tendencies in economics and management: a new perspective, 2012, no. 17, pp. 289–294.
[12] Oleinik, A. N. Content-analysis of big quality data, in International Journal of Open Information Technologies, 2019, vol. 7, no. 10, pp. 36–49.
[13] Pokrovskaya, I. V., et al. Quality data intellectual processing methods, in Machine learning and data analysis, 2014, vol. 1, no. 10, pp. 1396–1406.
[14] Flegontov, A. V. and Fomin, V. V. Intellectual data processing systems, in News of the Russian State Teachers’ University of A. I. Herzen, 2013, vol. 154, pp. 41–48.
[15] Lomazov, A. V. Designing an intellectual system to analyze data of sociological research, in Science news, 2020, vol. 1, no. 1 (22), pp. 160–165.
[16] Klimova, S. G., et al. Opportunities and conditions of using the quality data formalized analysis in sociological researches, in Bulletin of the Russian Foundation for Fundamental Research, 2016, vol. 3 (91), pp. 100–107.
[17] Martyshenko, N. S. Forming a professional image for the future employment of university graduates, in Azimuth of Scientific Research: Economics and Management, 2017, vol. 6, no. 4 (21), pp. 179–183.
[18] Batura, T. V., et al. Methods to analyze and process data from social networks, in Informatics issues, 2014, vol. 2 (23), pp. 39–53.
[19] Televnoy, A. D. and Ivanov, S. E. Hybrid method to cluster data for analyzing social networks, in Eurasian science bulletin, 2019, vol. 11, no. 2, p. 79.
[20] Babich, N. S. and Khomenko, V. I. The typology of measurement levels in the sociology: traditional and alternative approaches, in RGSU Bulletin. Series: Philosophy. Sociology. Art history, 2012, vol. 2 (82), pp. 86–97.
[21] Kuchenkova, A. V. and Tatarova, G. G. The strategy of applying logic-combinatory methods in procedures of typological analysis, in Sociology: methodology, methods, mathematical modeling, 2013, vol. 36, pp. 7–35.
[22] Burlov, A. V. and Tatarova, G. G. The method of non-finished sentences in studying of the “cultural person” image, in Sociology: methodology, methods, mathematical modeling, 1997, vol. 9, pp. 5–31.
[23] Lysenko, N. N. The system of education and “cultural person model”, Position, in Philosophical problems of science and technology, 2012, vol. 5 (5), pp. 148–154.
[24] Martyshenko, S. N. and Martyshenko, N. S. Cognitive model information technology, in Software systems and computing methods, 2016, vol. 4, pp. 362–374.