Improving Phishing Email Detection Using the Hybrid Machine Learning Approach
Keywords
machine learning, phishing email detection, hybrid classification
Abstract
Phishing emails pose a severe risk to online users, so effective identification methods are needed to safeguard digital communication. Detection techniques are continuously researched to keep pace with evolving phishing strategies. Machine learning (ML) is a powerful tool for automated phishing email detection, but existing techniques such as support vector machines and Naive Bayes have proven slow or ineffective at spam filtering. This study aims to provide a reliable phishing email detector and classifier using a hybrid ML classifier with term frequency-inverse document frequency (TF-IDF) and an effective feature extraction technique (FET) on a real-world dataset from Kaggle. Exploratory data analysis is conducted to improve understanding of the dataset and to identify conspicuous errors and outliers before detection. The FET converts the email text into a numerical representation that ML algorithms can use. The model's performance is evaluated using accuracy, precision, recall, F1 score, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). The findings indicate that the hybrid model utilising TF-IDF achieved superior performance, with an accuracy of 87.5%. The paper offers valuable knowledge on using ML to identify phishing emails and highlights the importance of combining various models.
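As a rough illustration of two steps the abstract names, the sketch below shows how TF-IDF can turn email text into numeric vectors and how accuracy, precision, recall and F1 score follow from predicted labels. It is not the study's implementation: the whitespace tokenizer, the smoothed IDF variant, the function names `tf_idf` and `binary_metrics`, and the sample emails are all assumptions made for illustration, and the hybrid classifier itself is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the paper's tokenizer is not stated.
    return text.lower().split()


def tf_idf(docs):
    """Return (vocabulary, rows), with one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the term's share of the tokens in the document.
        rows.append([counts[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, rows


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical sample emails, not taken from the Kaggle dataset.
    emails = ["verify your account now", "meeting notes attached"]
    vocab, rows = tf_idf(emails)
    for term, weight in zip(vocab, rows[0]):
        print(f"{term}: {weight:.3f}")
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Terms that occur in fewer emails receive a larger IDF weight, which is why TF-IDF tends to emphasise distinctive wording over common words. The resulting vectors could then be passed to any classifier, including a combination of several models.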