Learning from the Ones that Got Away: Detecting New Forms of Phishing Attacks


The aim of this project is to identify phishing emails using Semi-Automated Feature generation for Phish Classification, to place each detected message under a separate tag, and to use the flagged messages as a training set for identifying further phishing emails.

Existing System:

Phishing continues to be a viable attack vector despite progress in production-deployed phishing filters. Reports from the Anti-Phishing Working Group (APWG) indicate that phishing attacks have been on an increasing trend, with the number of domain names used for phishing reaching an all-time high. Phishing remains prevalent partly because defensive measures tend to be fragile to change; for example, many consist of rigid regular expressions that detect specific patterns in the text. Another challenge is the transient nature of the domains hosting phishing content, which makes it difficult to rely on detection techniques that use URLs. In APWG’s report for the fourth quarter of 2016, the reported increase in URL redirection as a technique to obfuscate phishing websites exemplifies the challenges of detection based on URLs and blacklists. These trends suggest that alternative features within emails must be identified to detect phishing messages accurately. Phishing messages subvert filters that rely on known data, blacklists of known IP addresses, or lexical analysis. Subverting such filters requires only minor structural changes to the message and avoidance of blacklisted IP addresses.
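The fragility of pattern-based filters can be illustrated with a minimal sketch. The rule and the two sample messages below are hypothetical and are not taken from any deployed filter; they only show how a small rewording slips past a fixed regular expression.

```python
import re

# Hypothetical rigid rule: it flags one exact phrasing and nothing else.
RIGID_RULE = re.compile(r"verify your account", re.IGNORECASE)


def rigid_filter(text: str) -> bool:
    """Return True if the message matches the fixed pattern."""
    return RIGID_RULE.search(text) is not None


# Two made-up messages with the same intent but different wording.
original = "Please verify your account to avoid suspension."
reworded = "Please confirm your profile to avoid suspension."

print(rigid_filter(original))  # True
print(rigid_filter(reworded))  # False
```

The second message carries the same request, yet the filter misses it because two words were swapped for synonyms.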

Proposed System:

1) We provide a method to extract features from freeform emails. These features help to reduce the effect of common subterfuges used in crafting phishing emails. We incorporate synonym analysis, Freebase, and named-entity recognition (NER) into our classifier, a combination that has not been presented before.

2) Prior work in NLP for spam or phishing detection falls short in handling the real-world challenges that our data brings out. Specifically, our work improves feature selection so that it is portable and grounded in empirical observation of evolving datasets.

3) We demonstrate the feasibility of online learning, enabling a practical deployment in which newly flagged emails incrementally train the system, so that it improves continuously without the cost of complete retraining on the entire corpus.
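The ideas of synonym-aware features and incremental training can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synonym map, the sample emails, and the simple online logistic-regression learner are all hypothetical stand-ins, and Freebase and NER features are omitted.

```python
import math
import re
from collections import defaultdict

# Hypothetical synonym map standing in for a real synonym resource.
SYNONYMS = {
    "confirm": "verify",
    "validate": "verify",
    "profile": "account",
    "login": "account",
}


def extract_features(text):
    """Count lowercase word tokens, mapping synonyms to one canonical form."""
    feats = defaultdict(float)
    for tok in re.findall(r"[a-z]+", text.lower()):
        feats[SYNONYMS.get(tok, tok)] += 1.0
    return dict(feats)


class OnlineLogisticClassifier:
    """Logistic regression trained one example at a time (assumed learner)."""

    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, feats):
        score = self.bias + sum(self.weights[f] * v for f, v in feats.items())
        score = max(-30.0, min(30.0, score))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, feats, label):
        """Single gradient step on one labeled email (1 = phishing, 0 = benign)."""
        error = label - self.predict_proba(feats)
        for f, v in feats.items():
            self.weights[f] += self.learning_rate * error * v
        self.bias += self.learning_rate * error


# Made-up initial training stream.
clf = OnlineLogisticClassifier()
stream = [
    ("Please verify your account now", 1),
    ("Lunch meeting moved to noon", 0),
]
for _ in range(5):
    for text, label in stream:
        clf.update(extract_features(text), label)

# A reworded phish still maps onto the learned "verify" and "account" features.
reworded_score = clf.predict_proba(extract_features("Kindly confirm your profile"))

# A newly flagged email updates the model without retraining on the stream.
new_feats = extract_features("Your invoice is attached, open the document")
before = clf.predict_proba(new_feats)
clf.update(new_feats, 1)
after = clf.predict_proba(new_feats)
print(reworded_score, before, after)
```

Each call to `update` touches only the weights for the features in that one email, which is what keeps the cost of absorbing a newly flagged message independent of the size of the existing corpus.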