Phishing Message Detection Based on Keyword Matching
Keywords
keyword matching, phishing detection, Naïve Bayes, natural language processing, stemming
Abstract
This paper proposes a Naïve Bayes-based algorithm for phishing detection, specifically in spam emails. It compares probability-based and frequency-based approaches and investigates the impact of imbalanced datasets and of stemming, a natural language processing (NLP) technique. Results show that both algorithms perform similarly in spam detection, so the choice between them depends on factors such as efficiency and scalability. Accuracy is influenced by the dataset configuration and by stemming. With imbalanced datasets, the algorithms detect majority-class emails with higher accuracy but struggle to classify minority-class emails. In contrast, balanced datasets yield high overall accuracy for both spam and ham email identification. Stemming has a minor impact on algorithm performance and occasionally decreases accuracy because of word grouping. Balancing the dataset is crucial for improving algorithm performance and achieving accurate spam email detection. Hence, both probability-based and frequency-based Naïve Bayes algorithms are effective for phishing detection when used with balanced datasets. The frequency-based approach, with a balanced dataset and stemming, achieves a balance between recall and precision, while the probability-based method, with a balanced dataset and no stemming, prioritises overall accuracy.
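To make the ingredients named in the abstract concrete, the following is a minimal sketch of a generic word-count Naïve Bayes classifier with Laplace smoothing and an optional stemming step. It is not the paper's implementation, and it does not claim to reproduce either the probability-based or the frequency-based variant, whose details the abstract does not give. The function names (`tokenize`, `crude_stem`, `train`, `classify`), the six-message balanced toy dataset, and the suffix-stripping rule are all illustrative assumptions; a real system would more likely use an established stemmer such as Porter's.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text, stem=False):
    # Lowercase, split on whitespace, trim surrounding punctuation.
    words = [w.strip(".,!?:;\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    return [crude_stem(w) for w in words] if stem else words

def train(messages, stem=False):
    # messages: list of (text, label) pairs with label in LABELS.
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text, stem))
    vocab = set().union(*(word_counts[label] for label in LABELS))
    return word_counts, doc_counts, vocab

def classify(text, model, stem=False):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        # Log prior from class frequencies in the training set.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokenize(text, stem):
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Assumed toy dataset: balanced, three spam and three ham messages.
train_data = [
    ("verify your account now", "spam"),
    ("urgent: confirm your password", "spam"),
    ("click here to claim your prize", "spam"),
    ("meeting moved to friday", "ham"),
    ("lunch at noon tomorrow", "ham"),
    ("see the attached report", "ham"),
]

model = train(train_data)
stemmed_model = train(train_data, stem=True)

print(classify("please verify your password", model))
print(classify("verifying accounts", stemmed_model, stem=True))
```

In this toy setting, stemming maps "verifying" and "accounts" onto words seen in the spam examples, which shows how word grouping can change a prediction. Whether that helps or hurts on a real corpus is an empirical question of the kind the paper investigates.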