The Technology of Intelligent Analytical Processing of Digital Network Objects for Detection and Counteraction of Inappropriate Information

Lidia Vitkova, Laboratory of computer security problems St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia, vitkova@comsec.spb.ru

Igor Saenko, Laboratory of computer security problems St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia, ibsaen@comsec.spb.ru

Igor Parashchuk, Laboratory of computer security problems St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia, parashchuk@comsec.spb.ru

Andrey Chechulin, Laboratory of computer security problems St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia, chechulin@comsec.spb.ru

Abstract  Nowadays, the Internet and social networks are becoming one of the most important threats to personal, public and state information security. This determines the need to protect the individual, society and the state from information that is distributed through computer networks and is capable of harming the health of citizens or motivating them to illegal behavior. The paper offers a novel technology of intelligent analytical data processing, which allows one to detect inappropriate information in digital information objects and develop measures to counter this information. The technology is based on machine learning and big data processing methods. Definitions are given for the key concepts of the technology. The principles of technology implementation are formulated that ensure the satisfaction of the requirements for adaptability, scalability and real-time data processing. The main stages of the technology are identified; they carry out distributed intelligent scanning, multidimensional assessment and categorization, elimination of incompleteness and inconsistency, and visualization of inappropriate information, as well as the development of countermeasures. An experimental evaluation of the proposed technology in a parental control system and in a fake news detection system has demonstrated its high efficiency.

Keywords: information security, protection against information, data mining

© The Authors, published by CULTURAL-EDUCATIONAL CENTER, LLC, 2020

This work is licensed under Attribution-NonCommercial 4.0 International

I. Introduction

Nowadays, the Internet and social networks, which can be represented as large sets of interconnected digital network information objects, are becoming one of the most important threats to personal, public and state information security. This determines the need to protect the individual, society and the state from information that is distributed through computer networks and is capable of harming the health of citizens or motivating them to illegal behavior. The problem of protection against inappropriate, dubious and harmful information has an extremely small number of scientific and technical solutions. Although techniques and implementations of individual components of such protection systems have appeared in recent years [1–3], they do not implement the full range of required capabilities.

The paper discusses the basics of the developed technology of intelligent analytical processing of digital network information objects to detect and counteract inappropriate, dubious and harmful information. The technology must provide adaptive and highly scalable analytical processing of information objects, detecting inappropriate digital content in real or near real time and effectively counteracting it. To meet these requirements, the proposed technology implements a number of specific data processing steps that distinguish it from many other data processing technologies.

The theoretical contribution of the work is as follows: (1) the conceptual apparatus for the implementation of the developed technology is substantiated; (2) the main stages of this technology are proposed; (3) the results of an experimental evaluation of the technology are presented and the conditions and features of its implementation for various subject areas are formulated.

The rest of the paper is structured as follows. Section 2 presents the results of an analysis of related work. Section 3 contains the theoretical foundations regarding the conceptual apparatus and stages of the technology. The implementation and experimental evaluation of the technology are discussed in Section 4. Section 5 concludes the work and shows further research directions.

II. Related Work

Related work can be divided into two groups. The first one includes works related to the general issues of detecting and counteracting inappropriate information in the Big Data area. The second group includes works devoted to solving the problem of classifying digital information content.

One of the first references to Big Data is given in [4], where Big Data was understood as an explosion of the number of sources and messages. In later works, the term Big Data came to be understood as data sets, including heterogeneous formats, structured, unstructured, and semi-structured data. For example, in [5] Big Data is considered as a technology that allows one to analyze a huge amount of data.

In [6], it is stated that we are immersed in social networks and information technology, which also determines the need to analyze large volumes of heterogeneous data.

In [7], a technology for creating a visualization system for large volumes of synoptic data was proposed. The technology uses distributed and heterogeneous (parallel and hybrid) calculations.

In [8], solutions for cluster analysis and visualization of large volumes of data are considered. In this work, a system is proposed for loading, normalizing and analyzing input data, conducting cluster analysis, visualizing and saving the results.

In [9], the approach is considered for categorizing content with automatic generation of classification rules. The format of the rules is the disjunctive normal form (DNF). The rule generation algorithm is based on the sequential replacement of one of the conjuncts.

In [10], it is proposed to treat the analyzed document as an array of real-valued coefficients, which are the relative and absolute frequencies of the occurrence of certain words in the classified text.

In [11], the applicability of two types of neural networks is considered: a feed-forward neural network and a convolutional neural network with a bag-of-words transformation on the convolutional layer. The experiments revealed that the first type of neural network demonstrates better performance in terms of classification indicators than the second.

In [12], a method for extracting features is considered as part of a text categorization task. The proposed modification of the genetic algorithm, as shown by experiments, makes it possible to achieve a more compact representation of the training vectors in terms of their dimension and to improve the classification quality of the analyzed text.

In [13], several main successive stages of solving the text classification problem are distinguished: pre-processing or normalization of the text; feature extraction; reduction of the dimension of the feature space; classifier training using machine learning methods; and classifier evaluation.

The features that remain after the reduction stage are used to train the classifier. Currently, the main methods most often used for training text classifiers are: probabilistic methods, for example, the Naïve Bayes classifier [14, 15]; linear methods, for example, Logistic Regression and the Support Vector Classification method [14–16]; and methods using Artificial Neural Networks [17–19].
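
For illustration only, the following minimal sketch trains classifiers of the listed kinds over TF-IDF features using the scikit-learn library; the library choice, the toy data and the feature extraction settings are our assumptions and are not prescribed by the surveyed works.

```python
# A minimal, illustrative text-classification setup: TF-IDF features feeding
# interchangeable classifiers of the kinds listed above (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["buy cheap pills online now", "school chess club meets on friday"]
labels = ["inappropriate", "safe"]  # toy training labels

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),  # normalization + feature extraction
        ("clf", clf),                  # interchangeable learning method
    ])
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["cheap pills for sale"]))
```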

Thus, the analysis of related work shows that the known works on detecting inappropriate information in digital content do not fully meet the stated requirements. At the same time, the efficiency of such systems can be increased by endowing them with intelligent functions for processing Big Data. We have already obtained initial results in this area [20]. In this paper, we generalize them and develop them toward practical application in various case studies.

III. Theoretical Foundations

A. Key Concepts

Inappropriate information is often perceived in the modern scientific community as an element of informational impact. The information effect is interpreted as the main striking factor of information warfare: the impact of an information flow on an information system as an object of attack. The purpose of such an impact is to achieve structural and/or functional changes in the system through its reception and processing of this information.

Formally, the information effect is defined as follows:

R = IE(IO), (1)

where IE(·) is the function determining some information impact, IO is the information object, and R is the result of the information impact.

The Information Object is a logically integral block of information, presented in a certain fixed form, which is created and used in human information activity. Formally, the connection of this concept with other concepts is expressed as IO ∈ I, i.e., an information object is an element of the set I of all analyzed information.

Using the concept of an information object, we give the following definition of inappropriate information. In this case, we will rely on the classification of types of information on the Internet. We denote all information on the Internet as Int. Assume that the set Int contains Risky Information (RI) and Safe Information (SI). The following equality holds between these concepts:

Int = RI + SI. (2)

Inappropriate information (II) is a separate information object and/or a collection of objects on the Internet containing attributes that fall under the categories of useless or unnecessary content. Information objects filtered by a parental control system are the most striking example here. Inappropriate information is one class of risky information; risky information also includes dubious and harmful information, which we define below.

Dubious information (DI) is a separate information object and/or a collection of objects on the Internet containing attributes that fall under risky categories, for example, a phishing site, an untrusted resource, or a resource with a low reputation. An object containing false information or misinformation is also of this type.

Harmful information (HI) is a separate information object and/or a collection of objects on the Internet containing information prohibited for distribution. At the state level, for example, the following types of information fall under this category: (1) information that meets the criteria for evaluating materials and (or) information as prohibited for distribution; (2) information included in the federal list of extremist materials; (3) an information object included in the register of blocked resources.

Thus, the proposed classification of the types of hazardous information on the Internet makes it possible to configure a monitoring system and to split and analyze the data. Obviously, these types of information are related by the following equality:

RI = II + DI + HI. (3)
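
As a simple illustration of how the taxonomy in (2) and (3) can configure a monitoring system, the sketch below routes information objects by their assigned category; the category labels and the routing rule are hypothetical.

```python
# Illustrative encoding of Int = RI + SI and RI = II + DI + HI as disjoint
# category sets; the labels are hypothetical.
RISKY = {"inappropriate", "dubious", "harmful"}  # RI = II + DI + HI
SAFE = {"safe"}                                  # SI
INTERNET = RISKY | SAFE                          # Int = RI + SI

def needs_monitoring(category: str) -> bool:
    """Route an information object to further analysis by its category."""
    if category not in INTERNET:
        raise ValueError(f"unknown category: {category}")
    return category in RISKY

print(needs_monitoring("dubious"))  # True
```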

B. Implementation Principles and Objectives of the Technology

To satisfy the above requirements for the developed digital content processing technology, it is necessary to follow a number of principles on which this technology should be built. The following principles should be highlighted:

Preliminary assessment (categorization) of the information object. This principle provides the possibility of forming an initial judgment on the belonging of an object to a particular category and constructing a final judgment on an information object in case other information about it is not available;

A parallel and hierarchical analysis of various aspects of data about an object within three main areas: a) the contents of the object; b) its structure; c) its history. In the general case, the use of a group of simple classifiers gives a greater gain in the speed, accuracy and maintainability of the categorization system compared to the use of one complex classifier;

Application of methods for combining the results (predictions) of individual classifiers to obtain a final judgment on the object. As a result, each of the main stages of the decision on the category of an information object can be related to a separate aspect of its state;

The use of various methods to counter inappropriate information. For example, the best way to protect against advertising mail messages is automatic deletion; to protect children from inappropriate information, automatic forwarding; and for information prohibited by state legislation, blocking, etc.
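
A minimal sketch of the last principle follows: a policy table that maps a detected information class to a countermeasure. The mapping mirrors the examples above; the category names and the fallback action are assumptions.

```python
# Hypothetical policy table: detected class -> countermeasure, following the
# examples in the text (deletion, forwarding, blocking).
COUNTERMEASURES = {
    "advertising": "delete",     # automatic deletion of advertising mail
    "inappropriate": "forward",  # automatic forwarding away from children
    "harmful": "block",          # blocking of legally prohibited information
}

def select_countermeasure(category: str) -> str:
    # Assumed fallback: send unclassified cases to manual review.
    return COUNTERMEASURES.get(category, "manual_review")

print(select_countermeasure("harmful"))  # block
```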

The main objectives of data processing in the technology are to identify inappropriate, dubious and harmful information in the information objects of the Internet and social networks and to develop measures to counter these types of information.

The main tasks solved by the developed technology are:

• obtaining information from a variety of heterogeneous sources about possible sources of storage and distribution of inappropriate, dubious and harmful information;

• determination of the thematic focus of the analyzed information;

• counteraction to the loading, display and dissemination of inappropriate, dubious and harmful information;

• displaying the received data and the results of their processing in a convenient graphical interface;

• identification of ways to disseminate inappropriate, dubious and harmful information on social networks.

C. Data Processing Stages in the Technology

The developed technology contains several interrelated processing stages that are absent in known technologies and systems of this type. The relationship of the steps is shown in Fig. 1. These stages include:

1) the stage of distributed intelligent scanning of network digital content; it provides prompt, reliable collection and preliminary assessment of Internet and social network information resources that is resistant to external influences;

2) the stage of multi-aspect assessment and categorization of the semantic content of information objects, containing means for analyzing textual, structural, address-based, link-based, multimedia and other features;

3) the stage of ensuring the timeliness of multi-level and multi-module analysis of information objects based on the use of parallel computing;

4) the stage of eliminating the incompleteness and inconsistency of the assessment and categorization of the semantic content of information objects based on the use of processing methods for incomplete, contradictory and fuzzy knowledge;

5) the stage of adaptation and retraining of the analysis system of information objects, including in the operation mode;

6) the stage of development and selection of measures to counteract inappropriate, dubious and harmful information;

7) the stage of implementation of visual interfaces to identify and counteract inappropriate, dubious and harmful information.

Let us consider the content of these stages in more detail.

1) Distributed intelligent scanning of network digital content: this stage provides decentralized modes of collecting and preprocessing the data contained in information objects by endowing scanners with intelligent properties. The main data preprocessing functions are filtering, normalization (reduction to a common format), generalization and prioritization of data [21].

A distinguishing feature of the intelligent scanners is increased data collection efficiency achieved through the use of intelligent methods of distributed information stream processing.
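
The sketch below illustrates the preprocessing chain of such a scanner: filtering, normalization to a common format, generalization and prioritization. The record layout and the priority heuristic are assumptions made for illustration.

```python
# Illustrative scanner preprocessing; the record format and the priority
# rule (shorter texts scanned first) are assumptions.
from dataclasses import dataclass

@dataclass
class RawObject:
    url: str
    text: str

def preprocess(batch: list[RawObject]) -> list[dict]:
    records = []
    for obj in batch:
        if not obj.text.strip():  # filtering: drop empty objects
            continue
        url = obj.url.lower()     # normalization: common URL format
        records.append({
            "url": url,
            "text": " ".join(obj.text.split()),  # normalization: collapse whitespace
            "domain": url.split("/")[2] if "//" in url else "",  # generalization
        })
    # prioritization: an assumed heuristic, scan shorter texts first
    return sorted(records, key=lambda r: len(r["text"]))

print(preprocess([RawObject("http://Example.com/a", "  Some   page text ")]))
```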

2) Multidimensional assessment and categorization of the semantic content of information objects: data mining methods and classifiers built on them are used, taking into account a sufficiently large number of features. These include textual, structural, address-based, link-based, multimedia and other features, as well as features extracted from external data sources about an information object. For the analysis of information objects according to the extracted and formed features, a multilevel classification is used [22]. Elements of the first level are functional blocks oriented to separate categories. They use pre-trained classifiers to determine whether a particular vector characterizing an information object belongs to a certain category. Elements of the second level decide whether the description vector of the information object belongs to one of the specified categories based on the analysis of individual aspects of the object (text, structural features, source, etc.). The final decision is formed at the third level when generating the descriptions of the analyzed information objects.
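
A schematic sketch of this three-level combination is given below. The toy classifiers and the combination rules (best score at level two, majority vote at level three) are our assumptions; the text fixes only the level structure itself.

```python
# Illustrative three-level categorization: level 1 scores per category,
# level 2 decides per aspect, level 3 combines the aspect decisions.
from collections import Counter

def level1(vector, classifiers):
    """Level 1: pre-trained per-category classifiers score one aspect's features."""
    return {cat: clf(vector) for cat, clf in classifiers.items()}

def level2(category_scores, threshold=0.5):
    """Level 2: assign the aspect to its best-scoring category, if confident."""
    cat, score = max(category_scores.items(), key=lambda kv: kv[1])
    return cat if score >= threshold else "unknown"

def level3(aspect_decisions):
    """Level 3: the final decision, here a majority vote across aspects."""
    votes = Counter(d for d in aspect_decisions if d != "unknown")
    return votes.most_common(1)[0] if votes else ("unknown", 0)

# Toy stand-ins: each "classifier" just reads one feature of the vector.
clfs = {"violence": lambda v: v[0], "casino": lambda v: v[1]}
aspects = {"text": [0.9, 0.2], "structure": [0.7, 0.1], "source": [0.3, 0.8]}
print(level3([level2(level1(v, clfs)) for v in aspects.values()]))  # ('violence', 2)
```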

3) Ensuring the timeliness of multi-level and multi-module analysis of information objects: this is implemented on a parallel data processing platform. Hadoop, Spark, Flink and others are considered as prototypes for the development of such a platform. The main indicators used to develop requirements and assess the functioning of these components are scalability and reliability of information storage. Scalability requirements are met by changing the number of executive computing nodes and processes in the platform. Reliability requirements are met by organizing and using the Hadoop Distributed File System (HDFS) for storing information.

Figure 1. The relationship of the stages of the technology.

4) Elimination of the incompleteness and inconsistency of the assessment and categorization of semantic content: this stage gives classifiers the ability to process incomplete, contradictory and fuzzy knowledge. Such incompleteness and inconsistency are inherent in this process due to the very nature of the distribution of inappropriate, dubious and harmful information on the Internet and in social networks. Methods of fuzzy assessment of situations, fuzzy inference and fuzzy optimization are used; their application is among the first in the field of analytical processing of digital content.
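
For illustration, the sketch below applies fuzzy assessment to an object whose category is uncertain. The membership functions, the input features and the max rule are assumptions; the technology does not fix a particular fuzzy formalism here.

```python
# Illustrative fuzzy assessment: graded membership in risky classes computed
# from two assumed features, combined by a simple max rule.
def mu_dubious(reputation: float) -> float:
    """Membership in 'dubious' grows as source reputation drops below 0.6."""
    return min(1.0, max(0.0, (0.6 - reputation) / 0.6))

def mu_harmful(match_ratio: float) -> float:
    """Membership in 'harmful' grows with the share of matched prohibited terms."""
    return min(1.0, max(0.0, match_ratio / 0.3))

def fuzzy_verdict(reputation: float, match_ratio: float):
    # The strongest membership wins; ties and low degrees could instead be
    # escalated to the expert/retraining stage.
    degrees = {"dubious": mu_dubious(reputation), "harmful": mu_harmful(match_ratio)}
    return max(degrees.items(), key=lambda kv: kv[1])

print(fuzzy_verdict(reputation=0.2, match_ratio=0.05))  # ('dubious', ~0.67)
```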

5) Adaptation and retraining of the analysis system of information objects, including in operation mode: this stage uses expert knowledge and existing lists of classified information objects. During retraining, various aspects of information objects are taken into account as fully as possible, including the history of changes and data from third-party sources.

6) Development and selection of measures to counteract inappropriate, dubious and harmful information: this stage is characterized by the use of an information storage built on the basis of the HDFS file system. In addition, an inference mechanism based on expert knowledge is applied. Various techniques are used to assess the harmfulness and danger of digital content and to analyze risk. Methods of automatic inference and visual analysis are combined.

7) Implementation of visual interfaces for identifying and countering inappropriate, dubious and harmful information: the interfaces are distinguished by their orientation toward non-standard visualization models, which include graphs with glyphs, treemaps, Voronoi diagrams and others, as well as virtual and augmented reality models.

IV. Implementation and Experimental Assessment of the Technology

The implementation and experimental evaluation of the developed technology were carried out in various case studies. The first case study is a parental control system, or a system for detecting inappropriate information. In this system, information objects collected from the Internet are classified according to several predefined categories of inappropriate information. The second case study is a fake news detection system.

A. Inappropriate Information Detection System

The inappropriate information detection system was implemented in two versions. In the first (single) version, one machine was used with the following characteristics: macOS High Sierra operating system; 2.3 GHz Intel Core i5 processor; 8 GB 2133 MHz LPDDR3 memory; Intel Iris Plus Graphics 640 (1536 MB) graphics card. In this version, the following library classifiers were investigated: a Decision Tree, the Support Vector method, a multinomial Naïve Bayes classifier, and a Random Forest.

In the second (parallel) version, a Hadoop cluster was used under the control of the VMware ESXi 6.0 hypervisor. The hardware had the following characteristics: Supermicro X9DRL-3F/iF motherboard; two Intel Xeon E5-2620v2 @ 2.1 GHz processors (6 cores and 12 threads each); 131 GB of RAM; two hard drives with a total capacity of 8 TB. Five virtual machines were created, of which four were running Ubuntu Server 18.04 LTS and one was running Ubuntu Desktop 18.04 LTS. To increase computing performance, Spark distributed computing was installed on top of Hadoop. In this version, the classifiers studied were those implemented in parallel mode in the MLlib library of the Spark distributed computing system: Logistic Regression, the Support Vector method, the Naïve Bayes classifier, and Random Forest.
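
The sketch below assembles a Spark ML pipeline of the kind used in the parallel version. The HDFS path, the JSON layout, the split ratio and the tokenization/hashing feature extraction are our assumptions, made only to keep the example self-contained.

```python
# Illustrative parallel training on Spark/HDFS with an MLlib classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, StringIndexer, Tokenizer
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("inappropriate-content").getOrCreate()
pages = spark.read.json("hdfs:///data/webpages.json")  # assumed columns: text, category

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="label"),  # 19 categories -> indices
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    NaiveBayes(featuresCol="features", labelCol="label"),
])
train, test = pages.randomSplit([0.75, 0.25], seed=42)
model = pipeline.fit(train)  # training is distributed across the cluster
model.transform(test).select("category", "prediction").show(5)
```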

The data set consisted of 78,663 web pages collected from the Internet. All web pages belonged to one of 19 categories: Adult English, Beer, Casino, Cigarette, Cigars, Cults, Dating, Jew Related, Marijuana, Occults, Prescription drugs, Racist groups, Religion, Spirits, Sport betting, Violence, Wine, Weapon, and Other. The category of a web page was determined on the basis of expert evaluation. The results of an experimental evaluation of the technology implemented in the inappropriate information detection system in the single and parallel versions, according to the Precision, Recall and F-measure indicators, are shown in Table 1 and Table 2, respectively. Table 3 shows the time spent on training and testing the dataset in the various versions.

Table 1. Estimation Indicators for a Single Version

Classifier                   Precision   Recall   F-measure
Support Vector Classifier    0.92        0.92     0.92
Decision Tree                0.84        0.84     0.84
Multinomial Naïve Bayes      0.81        0.74     0.73
Random Forest                0.84        0.84     0.84

Table 2. Estimation Indicators for a Parallel Version

Classifier                   Precision   Recall   F-measure
Support Vector Classifier    0.89        0.91     0.90
Logistic Regression          0.79        0.81     0.80
Naïve Bayes                  0.84        0.79     0.81
Random Forest                0.81        0.81     0.81

Table 3. Time Indicators for Various Classifiers

Classifier                   Variant    Training time, sec   Testing time, sec
Support Vector Classifier    single     41047                950
Support Vector Classifier    parallel   27123                223
Naïve Bayes                  single     1340                 27
Naïve Bayes                  parallel   1256                 4
Random Forest                single     3715                 29
Random Forest                parallel   2934                 4
Decision Tree                single     4160                 35
Logistic Regression          parallel   288                  1

The obtained experimental results allow us to draw the following conclusions. First of all, the high efficiency of the developed technology in this case study should be noted. This conclusion is confirmed by the rather high values of the estimation indicators.

Further, in both versions the Support Vector Classifier turned out to be the most effective. However, the selection of optimal parameters for this classifier took much longer than for the other classifiers.

Differences in the accuracy of similar classifiers implemented in the single and parallel versions are associated with the features of their software implementation, as well as with the inability to repeat exactly the same experiments on different platforms. It should be noted that the Spark system did not give a significant gain in training time, for the following reasons. First, the Hadoop cluster was deployed on virtual machines, which entails increased memory and processor overhead; moreover, all the virtual machines shared one physical hard drive, which also affected the results. Second, the data set used is not a large data set in the classical sense.

B. Fake News Detection System

The fake news detection system was implemented on a single machine with the following characteristics: Quad-Core Intel Core i7-8565U, 4500 MHz (45 × 100); 16 GB DDR4-2666 SDRAM. To speed up data processing, a GPU was used via the PyTorch Python library. The pre-trained BERT neural network provided by Google was used as the classifier. According to its specification, the BERT model does not require a large dataset for training.

The text news data set available at https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset was used for the experiment. Each record in this data set has the following structure: news headline, main text of the news, area to which the news belongs, and date of publication. The total size of the data set is 34,267 records.

Initially, the data set consisted of two separate parts: real news and fake news. For the experiments, these parts were mixed and combined into a joint structure with an equal number of true and fake news records. After that, the data set was divided into a training set and a testing set in a 75%/25% ratio.

During this experiment, only the text of the news itself was used for classification. This field varies in length, so for convenience, before tokenization each text was divided into several fragments of 300 tokens each. During training, each of these fragments was sequentially fed to the input of the neural network. Training was controlled by stochastic gradient descent, and the standard error was used as the loss function.
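
The following sketch reproduces this fragment-based scheme at inference time, using the HuggingFace transformers wrappers around Google's pre-trained BERT. The wrapper library, the model name and the logit averaging are our assumptions, and the fine-tuning loop (stochastic gradient descent over the fragments) is omitted for brevity.

```python
# Illustrative fragment-based news classification with pre-trained BERT.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # assumed labels: 0 = real, 1 = fake
model.eval()

def classify_news(text: str, fragment_len: int = 300) -> int:
    """Split the news text into 300-token fragments and average the logits."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    fragments = [ids[i:i + fragment_len] for i in range(0, len(ids), fragment_len)]
    logits = []
    with torch.no_grad():
        for frag in fragments:
            # Re-add [CLS]/[SEP] around each fragment before the forward pass.
            inputs = torch.tensor([tokenizer.build_inputs_with_special_tokens(frag)])
            logits.append(model(inputs).logits)
    return int(torch.stack(logits).mean(dim=0).argmax())

print(classify_news("Scientists report new results ..."))  # 0 = real, 1 = fake
```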

According to the experimental results of the technology evaluation on this system, the prediction accuracy was 95%. The precision and recall values were also 95%. Taking into account that the studied data set had a balanced class distribution, we can conclude that the fake news detection system implementing the proposed technology of intelligent analytical processing of digital information objects has a sufficiently high efficiency.

V. Conclusion

The paper offers a novel technology of intelligent analytical data processing, which allows one to detect inappropriate information in digital information objects and develop measures to counter this information. Definitions are given for the key concepts of the technology, which include the information effect and the information object, as well as inappropriate, dubious and harmful information. The principles of technology implementation are formulated that ensure the satisfaction of the requirements for adaptability, scalability and real-time data processing. The main stages of the technology are highlighted: distributed intelligent scanning of network digital content; multidimensional assessment and categorization of the semantic content of information objects; ensuring the timeliness of multi-level and multi-module analysis of information objects; elimination of the incompleteness and inconsistency of assessment and categorization of semantic content; adaptation and retraining of the analysis system of information objects, including in operation mode; development and selection of measures to counteract inappropriate, dubious and harmful information; and implementation of visual interfaces for detecting and countering this information.

An experimental evaluation of the developed technology was carried out for various case studies. For the case study related to the detection of inappropriate information in the parental control system, the technology was investigated in single and parallel versions. In both versions, the technology showed a rather high efficiency of detecting inappropriate information. The highest detection accuracy was obtained using the Support Vector Classifier; however, this classifier takes a long time to train. For the case study related to the detection of fake news, where a neural network was used as the classifier, very high values of the performance indicators were also obtained. The experimental results in this case study likewise indicate the high efficiency of the proposed technology of intelligent analytical processing of digital information objects.

Further research is associated with the implementation and evaluation of the proposed technology of intelligent analytical processing of network digital content on data sets of a significantly larger volume and in a more efficient computing environment.

ACKNOWLEDGMENT

This research is supported by the RSF grant #18-11-00302 in SPIIRAS.

REFERENCES

 [1]   Scott, J. Social network analysis: developments, advances, and prospects, in Social Network Analysis and Mining, January 2011, vol. 1, no. 1, pp. 21–26

 [2]   Baykan, E., Henzinger, M., Marian, L., and Weber, I. Purely URL-based topic classification, in Proceedings of the 18th International Conference on World Wide Web (WWW '09), ACM, New York, NY, USA, 2009, pp. 1109–1110

 [3]   Kotenko, I., Chechulin, A., and Komashinsky, D. Categorisation of web pages for protection against inappropriate content in the Internet, in International Journal of Internet Protocol Technology (IJIPT), 2017, vol. 10, no. 1, pp. 61–71

 [4]   Gandomi, A. and Haider, M. Beyond the hype: big data concepts, methods, and analytics, in Int. J. Inf. Manag., 2015, vol. 35, no. 2, pp. 137–144

 [5]   Oussous, A., Benjelloun, F.-Z., Lahcen, A. A., and Belfkih, S. Big Data technologies: a survey, in Journal of King Saud University — Computer and Information Sciences, 2017, vol. 30, no. 4, pp. 431–448

 [6]   Lazer, D., Pentland, A. S., Adamic, L., Aral, S., and Barabasi, A. L. Life in the network: the coming age of computational social science, in Science, 2009, vol. 323, no. 5915, pp. 721–723

 [7]   Melman, S. V., Bobkov, V. A., and Cherkashin, A. S. Technology and system visualization of large amounts of synoptic data, in Journal Information Science and Control Systems, 2015, vol. 3, no. 45, pp. 63–71

 [8]   Das, T., Saha, R., and Saha, G. Cluster Analysis Using Big Data Visualization, in Proceedings of the 2nd International Conference on Information Systems & Management Science (ISMS), Tripura University, Agartala, Tripura, India, 2019, 5 p.

 [9]   Apté, C., Damerau, F., and Weiss, S. M. Automated learning of decision rules for text categorization, in ACM Transactions on Information Systems (TOIS), 1994, vol. 12, pp. 233–251

[10]   Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval, in Information processing & management, 1988, vol. 24, no. 5, pp. 513–523

[11]   Johnson, R. and Zhang, T. Effective use of word order for text categorization with convolutional neural networks, in Proceeding of the 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 103–112

[12]   Ghareb, A. S., Bakar, A. A., and Hamdan, A. R. Hybrid feature selection based on enhanced genetic algorithm for text categorization, in Expert Systems with Applications, 2016, vol. 49, pp. 31–47

[13]   Feldman, R. and Sanger, J. Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, New York, NY, USA: Cambridge University Press, 2006

[14]   Moraes, R., Valiati, J. F., and Gavião Neto, W. P. Document-level sentiment classification: an empirical comparison between SVM and ANN, in Expert Systems with Applications, 2013, no. 40, pp. 621–633

[15]   Medhat, W., Hassan, A., and Korashy, H. Sentiment analysis algorithms and applications: a survey, in Ain Shams Eng. Jour., 2014, no. 5, pp. 1093–1113

[16]   Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., and Manandhar, S. SemEval-2014 Task 4: Aspect based sentiment analysis, in Proc. 8th Int. Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, 2014, pp. 27–35

[17]   Zhang, X., Zhao, J., and LeCun, Y. Character-level Convolutional Networks for Text Classification, in Proc. of the Neural Information Processing Systems Conf. (NIPS 2015), Montreal, Canada, 2015

[18]   Ghiassi, M., Olschimke, M., Moon, B., and Arnaudo, P. Automated text classification using a dynamic artificial neural network model, in Expert Systems with Applications, 2012, no. 39, pp. 10967–10976

[19]   Fuller, C. M., Biros, D. P., and Delen, D. An investigation of data and text mining methods for real world deception detection, in Expert Systems with Applications, 2011, no. 38, pp. 8392–8398

[20]   Vitkova, L., Saenko, I., and Tushkanova, O. An approach to creating an intelligent system for detecting and countering inappropriate information on the Internet, in Kotenko, I., Badica, C., Desnitsky, V., El Baz, D., and Ivanovich, M. (Eds.), Intelligent Distributed Computing XIII (IDC 2019), Studies in Computational Intelligence, Springer, Cham, 2019, vol. 868, pp. 244–254

[21]   Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies, in The VLDB Journal, August 1998, vol. 7, pp. 163–178

[22]   Dumais, S. and Chen, H. Hierarchical classification of web content, in Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ‘00), ACM, New York, NY, USA, 2000, pp. 256–263