Towards a semi-automatic classifier of malware through tweets for early warning threat detection

This paper presents a method for developing a malware ontology structure by detecting malware instances on Twitter. The ontology represents a semi-automatic classifier fed by the data extracted from tweets. In particular, the automatic part of the presented methodology relies on a pattern-based approach to detect trigger expressions leading to new information about malware.


Introduction
This paper describes a preliminary study on a method to detect new data about malware and structure them in an ontology model. The ontology is a means of building a classifier that structurally organizes the knowledge behind malicious events within the cyber-sphere. The novelty of our approach lies in the source documentation used to construct the malware classifier: both the set of documents considered, i.e., tweets, and the techniques applied to retrieve the information about malware constitute the originality of this work. In detail, we propose an approach that applies Natural Language Processing (NLP) tasks to a normalized group of tweets in order to systematize the obtained data into an ontology framework. The ontology represents the classifier, created in a semi-automatic way, and can help cyber analysts build a conceptual structure to infer knowledge about malicious events, as well as support malware triaging operations from a semantic point of view.

Related Works
The detection of new events from Twitter is a common research branch and usually focuses on interpreting tweets' content with a topic-based approach, as shown by the TwitterStand platform created at the University of Maryland (Sankaranarayanan 2009), which captures late-breaking news from tweets that become popular topics in each country. Regarding cyber threat detection from Twitter, Gaglio (2015) proposed an extension of Soft Frequent Pattern Mining (SFPM) through an improved topic detection algorithm, presenting the Twitter Live Detection Framework (TLDF), which handles new incoming data from a topic detection perspective. Cordeiro (2012) presented a work on inferring events from the social platform by using the Latent Dirichlet Allocation topic inference model based on Gibbs sampling. Concone et al. (2017) also proposed a methodology to detect, and give an alert on, new malware using the data coming from reliable Twitter subscribers by means of a naive Bayes classifier. Specifically, they worked with the "Bayes classifier trained on a set of tweets containing an equal number of i) events related to security attacks, viruses, malware, and ii) generic messages", and realized "groups of tweets discussing the same topic, e.g, a new malware infection, are summarized in order to produce an alert". The authoritativeness of the users selected by the authors is based on an "influence metric" which links the users' interaction with the community in terms of retweets, feelings, answers and number of likes. Another study covering cyber threat detection from Twitter is that of Sabottke (2015), where the authors specifically address exploit detection by creating a Twitter-based exploit detector. This system detects the use of exploits against known vulnerabilities by examining tweets whose text mentions vulnerabilities, comparing them, as ground truth, to CVE IDs and ExploitDB, and classifying them with an SVM classifier.
Given the increase in malware variants, a resource able to analyze similarities and gather these features as informative elements in a classification structure becomes a valid means of enhancing predictive cybersecurity actions. In the literature, malware classification is regarded as an urgent and evolving field of study, and a wide range of techniques has been proposed within the scientific community. The most common way to identify increasingly complicated malware is signature-based; Akhtar and Feng (2022) offer a literature review of new machine-learning-based techniques, analyzing the efficacy of those approaches in identifying polymorphic malware. The method we propose here may benefit malware identification independently of the malware code, by analyzing instead the contents of tweets dealing with new malware outbreaks. A comprehensive overview of Deep Learning (DL) tasks used to classify malware events is offered by Mathews (2019). Still referring to DL approaches, Kalash (2018) reports the better performance of Convolutional Neural Networks (CNN) in identifying and properly classifying malware, as do Tekerek and Yapici (2022), Adem and Yapici (2022), and Habibi (2023). Amongst the main tested malware classification procedures, the call graph clustering approach by Kinable and Kostakis (2011) has been used to detect structural similarities between samples of malicious behaviour. Echo state networks (ESNs) and recurrent neural networks (RNNs) have been exploited, among many others, by Pascanu (2015) to detect malicious files. Annachhatre (2015) applies hidden Markov models and cluster analysis to discover malware classes, whereas Tang (2023) uses LightGBM to identify malware families. Mirza (2018) proposes a combination of machine learning approaches applied to a group of features extracted from a wide corpus of benign and malicious files through a bespoke feature extraction tool. A set of studies focuses on semantics in malware code detection, in light of the obfuscation used by attackers to hide the actual code and the behaviour of malware (Singh 2018; Sahu 2014; Christodorescu 2005).
Against this background one cannot fail to mention the main classification systems that guide the comprehension and representation of malware families, their features and the targets of their attacks, both released by the MITRE Corporation, an American organization that supports government structures specifically in the cybersecurity area. The first is the MITRE ATT&CK platform, a web-based tool that helps enhance knowledge of threat tactics and techniques applicable to several operating systems: it is subdivided into 14 tactics and 188 techniques representing the ways in which attackers can perform a cyber attack against infrastructures. This representation of tactics supports both the acquisition of a knowledge base of cyber adversaries' techniques and a dictionary modeling of this information from a classification perspective in order to overcome cyber threats (Xiong et al. 2022; Georgiadou 2021; Kwon 2020). The second is the Common Attack Pattern Enumeration and Classification (CAPEC) catalogue, which gathers in a tree-like configuration a range of attack mechanisms and attack domains by merging patterns according to the features they share (Kotenko and Doynikova 2015). This structure is essential for understanding adversaries' behaviours and for creating a common dictionary and a classification taxonomy of attack patterns to be used by analysts or developers working in the cyber defense field (Andrei 2019).
In this paper we propose a new method to detect new malware denominations in a predictive way and classify them in a hierarchical structure through an ontological configuration. The proposed method will provide institutions working on cybersecurity strategy plans with a classification tool to manage their knowledge base when exposed to cyber threats. This will be supported by an updated classification system, i.e., an ontology, covering the connections with the attributes of malware families, starting from an analysis of tweets. With respect to other malware ontologies that exist and are widely used in the scientific community (see for instance those of Rastogi 2020; Zareen 2016; Huang 2010), the one proposed in our work aims to potentially include zero-day cyber-attack events by detecting the semantic information within tweets and representing it through parent-child class relationships and object properties, exploiting the advantages of the OWL language (McGuinness and Van Harmelen 2004; Antoniou and Van Harmelen 2004; Wang 2004). The main contributions of this paper can be summarized as follows: (i) a tweet dataset used to build the classification tool; (ii) the terminological analysis of the texts extracted from Twitter, (iii) which leads to (iv) the employment of NLP methods, in particular a pattern-based approach (León-Araúz 2013; Auger and Barrière 2008) to find trigger expressions in tweets, used to retrieve in a predictive way the upcoming malware classes within the cybersecurity spectrum.

Methodology
Our methodology relies on a source corpus of tweets, which, by their intrinsic nature, are marked by a regular configuration as well as by an unstructured way of formulating text from a linguistic perspective (Arora and Kansal 2019; Kumar and Das 2013). Tweets were chosen as the source corpus because of the predictive goal of this work, namely early malware detection and subsequent classification: tweets provide real-time information that allows timely acquisition of the knowledge to be studied. Indeed, "Tweets are free text micro blogging posts of no more than 140 characters, used by millions of people around the world; with one important characteristic, its real-time nature. Although their length per post is limited, the variety of words that can be used is high. If we take in account that each single word represents a different variable, a tweet is considered a high dimensional data." (-Gutierrez 2014, 168) Moreover, as observed by Barnard (2016), each tweet has an inner narrative form of communication that synthetically highlights the salient information to be retrieved. Thanks to the character limit on posts, the information within tweets can be spread immediately, without the semantic noise that generally requires a massive removal of unnecessary text (Gupta and Rao 2020). Using Twitter microblogging as a major source of information (Bakliwal 2012), our approach to predictive malware detection and classification can be described by the steps depicted in Figure 1.
Several methods have been proposed to estimate the influence of Twitter users. Montangero and Furini (2015) proposed a method based on the TRank algorithm, which connects the user's activity, i.e., tweets, and the profile itself in order to reveal the user's level of influence. Cappelletti and Sastry (2012) set out a technique based on the IARank ranking model, which orders information about Twitter users in real time; its logic is to compute the average of users' influence by taking into account retweets and mentions as sources of information amplification, both being features that indicate how likely a user is to be retweeted and mentioned. Yamaguchi (2010) presented an algorithm called TURank (Twitter User Rank), based on the connections between users and tweets, both represented in a user-tweet graph.
For our initial study of how new malware occurrences are publicly shared on Twitter, and with the support of experts in the cybersecurity field, the MalwareBazaar platform was chosen as a first resource from which to test the extraction of Twitter users' profiles. These users usually share information about newly generated malware through their posts on social media. The platform offers cyber analysts statistics through which they can be informed about the latest cyber threats reported by specific users identified as Top Reporters. Therefore, the first task was the crawling of tweets published by the main profiles indicated as 'reporters' on the MalwareBazaar portal, 10597 in total. The crawling, executed through a custom Twitter API client, output a list of files, each containing metadata about a user's activity in tabular format. The columns within each file contain: tweet Id, Text, Name, Screen Name, Date, Favorites, Retweets, Language, Client, Tweet type (e.g., retweet, reply, tweet), URL, number of Hashtags, number of Mentions, Media type (e.g., photo) and Media URLs. The generated files are then parsed in a next step. In the extraction phase, only the column containing the tweets' content (Text) was extracted from the crawling output. Each extracted column was saved as a separate file, treated as a single document to be added to the source corpus for semantic analysis. For instance, the tweet text column of a selected user, e.g., the user tolisec, constitutes a single document containing the 121 tweets published by this user. Subsequently, the texts were cleaned in Python, specifically with the NLTK and spaCy packages; this step removed stopwords as well as symbols and emoticons in order to make the documents processable by the term extraction tool, as shown in Table 1.
ISSN: 2038-1026 online. Open access article licensed under CC-BY. DOI: 10.36253/jlis.it-591
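The extraction and cleaning steps described in this section can be illustrated with a minimal sketch. It uses only the Python standard library rather than the NLTK and spaCy resources employed in the study; the stop-word list, the regular expressions, the sample rows and the function names (`clean_tweet`, `user_document`) are assumptions made for illustration and are not the authors' code.

```python
import csv
import io
import re

# Hypothetical miniature stop-word list; the study relies on NLTK/spaCy resources.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "with", "and", "this"}

URL_RE = re.compile(r"https?://\S+")
# Keep letters, digits, whitespace and a few characters that occur in malware names.
SYMBOL_RE = re.compile(r"[^a-z0-9\s._-]")


def clean_tweet(text):
    """Lowercase the text, then drop URLs, symbols/emoticons and stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = SYMBOL_RE.sub(" ", text)
    tokens = [t.strip("._-") for t in text.split()]
    return " ".join(t for t in tokens if t and t not in STOPWORDS)


def user_document(csv_file):
    """Build one user's document from the 'Text' column of a crawl file."""
    reader = csv.DictReader(csv_file)
    return [clean_tweet(row["Text"]) for row in reader]


# Invented sample rows standing in for one user's crawl output.
sample = io.StringIO(
    "Tweet Id,Text,Screen Name\n"
    '1,"New #ransomware LockBit is hitting the victims 😱 https://t.co/abc",user_a\n'
    '2,"This is a test!",user_a\n'
)
document = user_document(sample)
print(document)
```

Only the Text column is read, mirroring the extraction phase; the remaining metadata columns are ignored by this sketch.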

Terminological extraction
Term extraction was carried out with the software SketchEngine (Kilgarriff 2008; Jakubíček et al. 2014), a corpus analysis tool that outputs a ranked list of the most representative terms in the source corpus. By using a semantic extractor it is possible to analyze the knowledge-domain information from a terminological perspective, see which terms are most frequent in the selected documents, and apply reasoning techniques. To measure the frequency and relevance of terms with respect to a specialized corpus, we used the Term Frequency-Inverse Document Frequency (TF/IDF) measure (Qaiser 2018). This measure ranks first the terms that are most specific to the domain under study, in this case cybersecurity, and last those most commonly used in general language. It supported the identification of the most representative lexical units and the building of a network of co-occurrences that guided the systematization of the trigger expressions. These expressions are the semantic means of discovering, in tweets, new information about malware denominations, which represent the new entities within the classification system. Table 2 shows an extract of the lemmas retrieved from the source documentation, obtained by integrating an English stopword list and setting the minimum frequency threshold to 1 in order to detect as many terms as possible.
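To make the ranking behaviour of TF/IDF concrete, the following sketch computes one common variant of the measure (relative term frequency multiplied by the natural logarithm of the inverse document frequency). The exact scoring applied inside the term extraction tool may differ, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (documents are token lists)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus: each document stands for the cleaned tweets of one user.
corpus = [
    "new ransomware sample spotted today".split(),
    "new trojan sample spotted".split(),
    "new phishing campaign today".split(),
]
scores = tf_idf(corpus)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s}{score:.3f}")
```

In this toy corpus a term that occurs in every document, such as "new", receives a score of zero, while a term confined to one document, such as "ransomware", is ranked first for that document.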

Rule-based pattern recognition
The next step defines the trigger expressions on the basis of the terminological analysis, and then normalizes them through rule-based pattern recognition (Babicm 2008; Anicic 2010). The objective of this phase is closely tied to that of building a malware classification tool: it establishes a set of rules to be applied over the tweet corpus in order to retrieve new information about malware for inclusion in the classification tool, i.e., the ontology, starting from the terms in the list obtained by the extraction (ranked by TF/IDF score). The process continued by checking the co-occurrences of those terms in the source corpus. In SketchEngine, the Word Sketch function makes it possible to analyze each term through its concordances, shown in a distribution organized by the syntactic structures that connect the term to other words in the source corpus.
For instance, according to the output of the semantic tool and with the support of domain experts, the first two terms used to generate an initial list of regular expressions were malware and ransomware. Their collocations represented the key combinations that could return networked knowledge through which to detect new names or to enrich the cyber attack classification. A few of the collocations used in this preliminary stage are the following:
- Modifiers of malware: infect victims with, using a, the most resilient, the most dangerous/interesting malware
- Malware + verbs: attack (malware attacking), call (malware (also) called), identify (malware identified as)
These collocations have been used as trigger expressions to run the automatic identification of unknown malware and of information about them to be included in the classification tool. Each of these expressions has been transformed into a syntactic structure following rule-based pattern recognition (Xu and Cai 2021). In this way the automatic extractor is able to detect the morpho-syntactic structures to be mapped to the new information to be retrieved, and to populate the malware classifier in the form of an ontology. Having established that terms are accompanied by expressions that can lead to knowledge-domain discovery (Ervert 2008), the following phase of this research covered the creation of the trigger expressions to be implemented in the tool in order to represent the training set for the alert on new malware denominations sorted by time.
The definition of the rule-based pattern recognition system is based on regular expressions created with the spaCy library (Vasiliev 2020) in Python. This configuration targets the discovery of new malware denominations starting from the information contained in the documents (the tweets of the users selected through the MalwareBazaar platform), interpreted as clues that signal new data to import. In contrast to the fixed pattern matching of regular expressions, this method allows us to match tokens according to pre-set patterns, and it can draw on features such as part-of-speech tags, entity types, dependency parsing and lemmatization, which extends what plain regular expression patterns can do. The token pattern matcher provided by spaCy takes advantage of word-level features such as LOWER, LENGTH, LEMMA, and SHAPE, as well as flags such as IS_PUNCT, IS_DIGIT, and LIKE_URL. Given an input text, rules can be defined to parse it and determine whether or not it includes the appropriate morpho-syntactic objects in the appropriate sequence.
In order to provide some practical examples of the results obtained by our method, we present the outputs from the execution of the first group in the list of regular expressions presented in Section 2.4. The sentence "infect victims with... malware" is the starting point from which to discover new information about the malware being discussed. The information retrieved by executing the tool over the corpus will be the key entity to be integrated in the classification system. For this sentence, the pattern-based code instructs spaCy to recognize sentences that begin with one or more verbs, followed by one or more nouns, one or more prepositions or postpositions, possibly a determiner, and the lemma "malware", and that end with a noun, which is identified as the potential new malware name. This enables us to identify sentences that contain different words but are constructed in the same syntactic manner.
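The token sequence just described (verbs, nouns, adpositions, an optional determiner, the lemma "malware", and a final noun) can be emulated without any NLP dependency. The sketch below is an illustration under stated assumptions and not the authors' implementation: tokens are annotated by hand as (text, lemma, part-of-speech) triples, whereas in the study a tagger supplies these attributes, and the names `POS_SYMBOL`, `TRIGGER`, `encode` and `find_malware_names` are hypothetical. With spaCy's Matcher, the same sequence would be written using the POS, LEMMA and OP token attributes.

```python
import re

# Map coarse part-of-speech tags to one-letter symbols so that a token
# sequence becomes a string that a regular expression can scan.
POS_SYMBOL = {"VERB": "V", "NOUN": "N", "PROPN": "N", "ADP": "P", "DET": "D"}

# verb+ noun+ adposition+ determiner? lemma "malware", then the candidate name.
TRIGGER = re.compile(r"V+N+P+D?M(N)")


def encode(tokens):
    """Turn (text, lemma, pos) triples into a string with one symbol per token."""
    symbols = []
    for _text, lemma, pos in tokens:
        if lemma == "malware":
            symbols.append("M")  # the lemma test takes priority over the POS tag
        else:
            symbols.append(POS_SYMBOL.get(pos, "O"))
    return "".join(symbols)


def find_malware_names(tokens):
    """Return the token texts captured in the final noun slot of the pattern."""
    return [tokens[m.start(1)][0] for m in TRIGGER.finditer(encode(tokens))]


# Hand-annotated example; a tagger would normally supply lemma and POS.
tweet = [
    ("Attackers", "attacker", "NOUN"),
    ("infect", "infect", "VERB"),
    ("victims", "victim", "NOUN"),
    ("with", "with", "ADP"),
    ("the", "the", "DET"),
    ("malware", "malware", "NOUN"),
    ("Emotet", "Emotet", "PROPN"),
]
print(find_malware_names(tweet))
```

Because the pattern is defined over grammatical categories rather than surface words, a differently worded tweet with the same structure (for example "hit users via malware" followed by a name) is matched by the same rule.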

Results
As a result of our work, various instances of patterns and their representation in the form of spaCy regexes are presented here. These regexes can be used to filter pre-processed and normalized tweets. The following short list gives an overview of the source expressions, selected on the basis of the terminological analysis outcomes, used to construct the patterns applied to the crawled tweets for the detection of new malware-related data.

Classification tool
Many malware classification schemes have been proposed in the literature, such as: the Common Taxonomy for Law Enforcement and the National Network of CSIRTs published by Europol Public Information, which describes a range of incidents according to their class and type and then supports the "mapping each type of incident with the pertinent article of the international legislative framework" (Europol Public Information 2017, 5); the MISP Taxonomies; the CIRCL taxonomy schemes of classification in incident response and detection; OSINT (Open Source Intelligence); the VERIS Framework (Vocabulary for Event Recording and Incident Sharing); and Kaspersky, which reports the types of malware by behavior.
Our proposal relies on the construction of an ontology structure as a classification tool, starting from the entities discovered by applying the pattern-based approach to the tweets' contents.
An ontology is a "[...] hierarchically structured set of concepts describing a specific domain of knowledge that can be used to create a knowledge base." (Blomqvist and Sandkuhl 2005, 1) With reference to the meaning of ontology in computer science, Guarino (2009, 2) gives a clear definition by stating that "Computational ontologies are a means to formally model the structure of a system, i.e., the relevant entities and relations that emerge from its observation, and which are useful to our purposes. An example of such a system can be a company with all its employees and their interrelationships. The ontology engineer analyzes relevant entities and organizes them into concepts and relations, being represented, respectively, by unary and binary predicates. The backbone of an ontology consists of a generalization/specialization hierarchy of concepts, i.e., a taxonomy." An interesting study specifically oriented to the use of ontologies in the cybercrime field has been reported by Donalds and Osei-Bryson (2019), who, besides offering an extensive overview of the cybercrime classification schemes existing in the literature, describe in a practical way the realization of a high-level ontology for cybercrime events, called the cybercrime classification ontology (CCO), through the use of the Protégé platform (Sivakumar and Arivoli 2011), as our study will do. The authors start by isolating the main cybercrime-related concepts (i.e., attack event, vulnerability, tool and technique, objective, offence, location, complainant, victim, target, impact, attacker) and continue by enhancing the parent-child relationships through the use of object properties, which help improve the classification of attack events.
Our work concerned the construction of the classification scheme on the basis of tweet analysis. The structure of the ontology follows a hierarchical configuration of classes and sub-classes, with each newly discovered malware associated as an individual, as the following example demonstrates: Cyber_attacks hasSubclass Malware; Malware hasSubclasses Zombie, Crypto_miner_malware, Trapdoor, Trojan_horse, Banking_malware, Virus, Logic_bomb, Worm, Ransomware.
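The class hierarchy above, together with the attachment of a newly detected name as an individual, can be sketched in plain Python and serialized as OWL statements in Turtle syntax. This is an illustrative sketch only: the study builds its ontology with Protégé, and the `MalwareOntology` class, the `http://example.org/malware#` namespace and the individual `ExampleBanker` are hypothetical placeholders.

```python
from collections import defaultdict

# Class hierarchy taken from the example in the text.
SUBCLASSES = {
    "Cyber_attacks": ["Malware"],
    "Malware": [
        "Zombie", "Crypto_miner_malware", "Trapdoor", "Trojan_horse",
        "Banking_malware", "Virus", "Logic_bomb", "Worm", "Ransomware",
    ],
}


class MalwareOntology:
    def __init__(self, subclasses):
        # Invert the parent -> children mapping into child -> parent.
        self.parent = {c: p for p, children in subclasses.items() for c in children}
        self.classes = set(subclasses) | set(self.parent)
        self.individuals = defaultdict(set)

    def add_individual(self, cls, name):
        """Attach a newly detected malware name as a leaf of an existing class."""
        if cls not in self.classes:
            raise KeyError(f"unknown class: {cls}")
        self.individuals[cls].add(name)

    def ancestors(self, cls):
        """Walk up the hierarchy from a class to the root."""
        chain = []
        while cls in self.parent:
            cls = self.parent[cls]
            chain.append(cls)
        return chain

    def to_turtle(self):
        """Serialize classes, subclass links and individuals as Turtle."""
        lines = [
            "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
            "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
            "@prefix : <http://example.org/malware#> .",  # hypothetical namespace
        ]
        for cls in sorted(self.classes):
            lines.append(f":{cls} a owl:Class .")
            if cls in self.parent:
                lines.append(f":{cls} rdfs:subClassOf :{self.parent[cls]} .")
        for cls in sorted(self.individuals):
            for name in sorted(self.individuals[cls]):
                lines.append(f":{name} a owl:NamedIndividual , :{cls} .")
        return "\n".join(lines)


onto = MalwareOntology(SUBCLASSES)
onto.add_individual("Banking_malware", "ExampleBanker")  # placeholder name
print(onto.ancestors("Banking_malware"))
print(onto.to_turtle())
```

Rejecting individuals whose class is not already in the hierarchy reflects the semi-automatic nature of the approach: new names are attached automatically, while new classes would require a decision by a domain expert.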
In this regard, we targeted the inclusion of new malware denominations in the classification scheme as the outermost leaves of this tree-like configuration; hence the matching will be, for instance, Banking_malware hasIndividual: new name of worm. This process was executed by running the semantic analysis over the tweets and confirmed by cyber defense experts, as well as by relying on the information included in the aforementioned malware classification tools. Indeed, through the implementation of the regular expressions we have obtained encouraging results. As these results demonstrate, another element that can help retrieve the novelty trait is attention to certain verbs, e.g., discover or hit, adjectives, e.g., new, or adverbs such as 'now' supported by gerundive constructions ('are now blocking'), which can lead to the inference of current data to be included in the ontology. The ontology offers a comprehensive way to represent the classified information by structuring it according to the generic-specific principle and associating with each class a set of instances, in our case study the new malware denominations, as Figure 3 depicts. The object properties expressed through the OWL language relate pairs of entities (Glim 2014) and are the means of organizing the data through specific connections that make the conceptualization of the knowledge base explicit. This will be the focus of our next research activity, which will be performed using the additional data we crawled from Twitter.
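The novelty cues mentioned above (verbs such as discover or hit, the adjective new, the adverb now) could be flagged with a simple lexical filter before or alongside the syntactic patterns. The sketch below is a hypothetical illustration: the cue list comes from the text, while the inflected forms and the `novelty_cues` function are assumptions.

```python
import re

# Cue words named in the text; the inflected forms are added for illustration.
NOVELTY_CUES = re.compile(
    r"\b(discover(?:s|ed|ing)?|hit(?:s|ting)?|new|now)\b", re.IGNORECASE
)


def novelty_cues(tweet):
    """Return the lowercased novelty cues found in a tweet, in order."""
    return [m.group(1).lower() for m in NOVELTY_CUES.finditer(tweet)]


cues = novelty_cues("Researchers discovered a new loader; vendors are now blocking it")
print(cues)
```

A tweet containing several cues is a stronger candidate for carrying current information than one containing none, although such a filter says nothing by itself about which token is the malware name.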

Conclusion
This paper develops a new method to predict malware appearance through the analysis of tweets shared by active Twitter users who are identified in the MalwareBazaar database and who spread information on new malware instances. In this work we conducted a semantic analysis of the selected users' tweets after a crawling operation and configured a set of trigger expressions, normalized in the form of regular expressions, that are used as a knowledge inference task to tune the data. The trigger expressions were established after a terminological selection of the terms to be used as morphosyntactic units within the regular expression rules. We then used the entities retrieved by applying the trigger expressions to the tweet dataset to define a classification tool. The classifier is meant to represent the connections of new malware, detected through the above-mentioned steps, within a hierarchical structure. One future direction is the continuous enhancement of the malware classifier (ontology instances), which will also be fed by the CSIRT Settimana cibernetica, enabling the mapping with the new attacks included in the official MITRE framework CAPEC and the associated vulnerabilities. This triangular interconnection could support the association of new types of malware with the existing networked semantic flow of information on the vulnerabilities present in hardware, software or protocol infrastructures. This activity could provide a forecasting knowledge platform for companies to use when considering which elements to analyze in order to reduce the risk of exposure to cyber threats.

Figure 1. Our methodology steps for tweet processing, extraction and classification

Figure 2 depicts the steps followed after the compilation of the source corpus and the employment of the text included in the tweets to be used as a starting point from which to begin the classification process.

Figure 2. A comprehensive view of the processing and classification steps of the tweets

Figure 3. Ontology structure for malware instances

Table 1. Details of text processing of collected tweets

Table 2. Term extraction output