Vol. 15 No. 2 (2024)

Towards a semi-automatic classifier of malware through tweets for early warning threat detection

Claudia Lanza
University of Calabria
Lorenzo Lodi
Zanasi & Partners

Published 2024-05-15


Keywords

  • Malware
  • Classification
  • NLP
  • Twitter
  • Text Mining

How to Cite

Lanza, Claudia, and Lorenzo Lodi. 2024. “Towards a Semi-Automatic Classifier of Malware through Tweets for Early Warning Threat Detection”. JLIS.It 15 (2):101-18. https://doi.org/10.36253/jlis.it-591.

Funding data

  • Ministero dell'Università e della Ricerca
    Grant numbers PON "Ricerca e Innovazione" 2014-2020, Axis IV, Action IV.4, Action IV.6, call DM 1062 of 10.08.2021, full-time RTD-A position, identification code 1062_R10_INNOVAZIONE, competition sector 11/A4, scientific-disciplinary sector M-STO/08.


This paper presents a method for developing a malware ontology structure by detecting malware instances on Twitter. The ontology acts as a semi-automatic classifier fed by data extracted from tweets. The automatic part of the methodology relies on a pattern-based approach to detect trigger expressions that lead to new information about malware, while the manual part covers the evaluation of the results by domain experts, who also validate the reliability of the semantic relationships within the ontology framework. We present preliminary results from applying the methodology to tweets extracted from the MalwareBazaar database, showing how analysing the document collection through Natural Language Processing (NLP) tasks can support knowledge retrieval and document classification for building an early warning system for detected malware. The results were obtained in 2023 and therefore refer to Twitter, the previous version of the social network now called X.
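As a rough illustration of what a pattern-based detection of trigger expressions might look like, the sketch below scans a tweet with a few regular expressions and returns candidate relations. The patterns, the relation names (`is_a`, `targets`, `delivered_via`), and the `Hit` and `extract_candidates` names are hypothetical and chosen only for illustration; they are not the trigger expressions or the implementation used in the paper.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    """A candidate semantic relation found in a tweet (pending expert review)."""
    relation: str
    subject: str
    obj: str


# Hypothetical trigger patterns: each maps a relation name to a regex whose
# named groups capture the two terms linked by the trigger expression.
TRIGGER_PATTERNS = {
    "is_a": re.compile(
        r"(?P<subject>#?\w+) is an? (?:new )?(?P<obj>\w+)", re.IGNORECASE
    ),
    "targets": re.compile(
        r"(?P<subject>#?\w+) (?:targets|targeting) (?P<obj>\w+)", re.IGNORECASE
    ),
    "delivered_via": re.compile(
        r"(?P<subject>#?\w+) (?:delivered|spread|distributed) "
        r"(?:via|through) (?P<obj>\w+)",
        re.IGNORECASE,
    ),
}


def extract_candidates(tweet: str) -> list[Hit]:
    """Return every candidate relation matched by a trigger pattern."""
    hits = []
    for relation, pattern in TRIGGER_PATTERNS.items():
        for match in pattern.finditer(tweet):
            # Drop a leading hashtag marker so "#Emotet" and "Emotet" agree.
            subject = match.group("subject").lstrip("#")
            hits.append(Hit(relation, subject, match.group("obj")))
    return hits


if __name__ == "__main__":
    for text in ("#Emotet is a trojan", "AgentTesla spread via phishing"):
        print(text, "->", extract_candidates(text))
```

Surface patterns of this kind can easily produce spurious matches, so in a workflow like the one described above the output would only be a list of candidates passed on to domain experts for validation before anything enters the ontology.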



