ScamDoc: a free tool to check for scams online

Last summer, I did a two-month internship at the French startup HERETIC, which has been fighting online scams for several years now. They offered me an opportunity to test my AI-engineer skills on a practical problem: how can we use machine learning to estimate how fraudulent an email or a website address looks?


A few years ago, Anthony Legros and Jean-Baptiste Boisseau created an online database of scams and frauds. Thanks to contributions from users all around the world, this collaborative website, www.signal-arnaques.com, is now a search engine that lists over 250,000 scams of various types: phishing, job scams, fraudulent sites…

We connected because they wanted to study how AI could help them create a next-level tool for their users. Their goal was to design and implement a machine learning algorithm that automatically evaluates how trustworthy a “digital identity” (an email address or a website) is. Our collaboration resulted in a new site called ScamDoc that is now available online, for free, where anyone can check the safety of an email or a website!

It is still in beta but already shows great promise (it’s been upvoted by one of the beta testers, check out his opinion over here).

Why this tool?

Each and every day, we are confronted with new scams and frauds. The web is flooded with more and more strange ads, fake job offers and deceitful emails… and what’s worse, authorities are often overwhelmed and unable to react properly. So it is time we took matters into our own hands. What if we could automatically predict how dangerous an email or a domain name is? What if you had a simple tool to check for scams with high accuracy?

Today, sadly, there is no simple and effective way to get an overall estimate of how fraudulent an email address or a domain name seems, and many people face well-crafted scams that are easy to fall for. However, new developments in the field of machine learning can help us deal with this issue and offer brand new solutions to the problem.

ScamDoc is here to separate the wheat from the chaff!

What is ScamDoc, exactly?

During my internship, I was tasked with designing, implementing and optimizing two AI models to predict as accurately as possible the chances that a given email or domain name is a scam. Combined, they now form a web service called ScamPredictor, which ScamDoc uses directly. The tool had to be easy to use for users, maintainable for the company, and built on open-source technologies.

The main goal was quickly complemented by another objective: the setup of many scalable and reusable small tools such as data retrievers, data parsers and processors, data cachers (in files or in a database), model predictors… I ended up creating a complete work pipeline to facilitate the management and creation of AI models similar to the ones in ScamPredictor.

How does it work?

Although I won’t get into all the details (professional secrecy, folks!), I thought I’d give you a rough overview of the various techniques we implemented in ScamPredictor.

Something worth noting is that ScamPredictor only uses the email or website address itself to predict the “trust” score, not the content of the scam. Therefore, we did not study or implement NLP (Natural Language Processing) methods.

At a high level, ScamPredictor works in several steps: it first gathers relevant data on the sample to examine, then computes a raw score, and finally applies various modifiers to refine this result. These modifiers take extra sources of data into consideration to get a better sense of the risks associated with the email or website.
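To make this three-step flow more concrete, here is a minimal sketch in Python. It is not ScamPredictor's actual code: every name in it (evaluate, toy_gather, toy_raw_model, blocklist_modifier), the feature names and the numeric values are hypothetical placeholders, and for simplicity the sketch outputs a risk value between 0 and 1 rather than the trust score shown on the site.

```python
from typing import Callable, Dict, List

# All names below are hypothetical; they only illustrate the three-step flow.
Features = Dict[str, bool]
Modifier = Callable[[float, Features], float]


def clamp(score: float) -> float:
    """Keep a score inside the [0, 1] range."""
    return max(0.0, min(1.0, score))


def evaluate(
    sample: str,
    gather: Callable[[str], Features],
    raw_model: Callable[[Features], float],
    modifiers: List[Modifier],
) -> float:
    """Return a risk value in [0, 1] for an email or website address."""
    features = gather(sample)      # step 1: collect data about the sample
    score = raw_model(features)    # step 2: raw score from the model
    for modify in modifiers:       # step 3: refine with extra signals
        score = clamp(modify(score, features))
    return clamp(score)


# Toy stand-ins for the real components (purely illustrative).
def toy_gather(sample: str) -> Features:
    return {
        "has_digits": any(c.isdigit() for c in sample),
        "on_blocklist": sample.endswith(".example"),
    }


def toy_raw_model(features: Features) -> float:
    return 0.6 if features["has_digits"] else 0.2


def blocklist_modifier(score: float, features: Features) -> float:
    return score + 0.3 if features["on_blocklist"] else score


risk = evaluate(
    "win-money123@lottery.example",
    toy_gather,
    toy_raw_model,
    [blocklist_modifier],
)
```

Passing the data retriever, the raw model and the modifiers as plain functions keeps each step replaceable, which fits the idea of having a different intermediate model for emails and for websites.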

The intermediate model that produces the raw score is a bit different depending on the type of sample to analyze: email or website address.

Evaluating websites

Rating a website address mostly relies on its WHOIS data, which is a bit like the domain’s identity card. By checking this data, you can learn a lot of useful information about a domain name. We implemented a Bayesian filter that looks for a set of binary features (i.e. characteristics of a sample that can be encoded as ‘true/false’ values) and predicts the probability that the website is dangerous.
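For readers who have never seen a Bayesian filter over binary features, here is a small sketch of one common variant, a naive Bayes classifier with Laplace smoothing. I am assuming this variant only for illustration; the feature names (registered_recently, owner_hidden) and the tiny training set are made up and are not the features or the data used in ScamPredictor.

```python
import math
from typing import Dict, List, Tuple

Sample = Dict[str, bool]


def train(samples: List[Tuple[Sample, bool]], names: List[str]):
    """Estimate P(scam) and P(feature is true | class) with Laplace smoothing."""
    counts = {True: 0, False: 0}
    true_counts = {True: dict.fromkeys(names, 0), False: dict.fromkeys(names, 0)}
    for features, is_scam in samples:
        counts[is_scam] += 1
        for name in names:
            if features[name]:
                true_counts[is_scam][name] += 1
    prior = (counts[True] + 1) / (len(samples) + 2)
    likelihood = {
        label: {n: (true_counts[label][n] + 1) / (counts[label] + 2) for n in names}
        for label in (True, False)
    }
    return prior, likelihood


def predict(features: Sample, prior, likelihood) -> float:
    """Return P(scam | features), accumulated as log-odds for stability."""
    log_odds = math.log(prior) - math.log(1 - prior)
    for name, value in features.items():
        p_scam = likelihood[True][name]
        p_legit = likelihood[False][name]
        if value:
            log_odds += math.log(p_scam) - math.log(p_legit)
        else:
            log_odds += math.log(1 - p_scam) - math.log(1 - p_legit)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical WHOIS-style features and a made-up training set.
names = ["registered_recently", "owner_hidden"]
training = [
    ({"registered_recently": True, "owner_hidden": True}, True),
    ({"registered_recently": True, "owner_hidden": False}, True),
    ({"registered_recently": False, "owner_hidden": False}, False),
    ({"registered_recently": False, "owner_hidden": True}, False),
    ({"registered_recently": False, "owner_hidden": False}, False),
]
prior, likelihood = train(training, names)

suspicious = predict({"registered_recently": True, "owner_hidden": True}, prior, likelihood)
benign = predict({"registered_recently": False, "owner_hidden": False}, prior, likelihood)
```

With this toy data, the recently registered domain with a hidden owner gets a higher probability than the older, transparent one. The smoothing keeps every estimated probability strictly between 0 and 1, so the logarithms are always defined.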

Evaluating email addresses

Thanks to the a priori knowledge Anthony and Jean-Baptiste had of email scams, we quickly understood that we had to deal with various profiles of scammers. Each required a different approach and therefore a different AI model.

We discovered that, for some types of scams, a Deep Neural Network (or DNN for short) reached good accuracy after only a few hours of training. With a well-known AI framework, we designed a classical ‘funnel-shaped’ network with an input layer, several hidden layers of decreasing size and finally a 2-node output layer: ‘fraudulent’ or ‘non-fraudulent’.

Classical ‘funnel shaped’ DNN (not all the connections are shown here: the network is fully connected)
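The funnel shape can be illustrated with a short sketch of the forward pass of such a network. To avoid implying which framework we used, it relies on the Python standard library only. The layer sizes, the ReLU activation, the softmax output and the random (untrained) weights are all assumptions made for the example, not the actual ScamPredictor architecture.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """One fully connected layer: n_out rows of n_in weights plus a bias."""
    scale = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] + [0.0] for _ in range(n_out)]


def dense(layer: Matrix, inputs: List[float]) -> List[float]:
    """Weighted sum of the inputs for every node (last row entry is the bias)."""
    return [sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1] for row in layer]


def softmax(values: List[float]) -> List[float]:
    """Turn raw outputs into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers: List[Matrix], inputs: List[float]) -> List[float]:
    """ReLU on the hidden layers, softmax on the 2-node output layer."""
    activations = inputs
    for layer in layers[:-1]:
        activations = [max(0.0, v) for v in dense(layer, activations)]
    return softmax(dense(layers[-1], activations))


# Hypothetical sizes: 16 inputs, hidden layers of 12, 8 and 4 nodes, 2 outputs.
sizes = [16, 12, 8, 4, 2]
rng = random.Random(0)
network = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# The weights are untrained, so these two probabilities carry no meaning yet.
p_fraudulent, p_legitimate = forward(network, [1.0] * 16)
```

In a real project, the training loop (loss function, backpropagation, optimizer) would come from the framework; the sketch only shows how each layer is smaller than the previous one until two output nodes remain.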

For other types of emails, this strategy failed, mostly because we don’t have all the information for every sample, so our model cannot train correctly (many of the providers used by these scammers hide their domain name’s data). We eventually combined the two models presented above and created a DNN that also relies on a Bayesian model for some preprocessing.

When you’re on ScamDoc’s website and you get a result for an input, the model usually shows some of the factors behind the given score, that is, the values of the most important features.

The project was coded in Python. Although it depends on the usual machine learning frameworks for DNNs and on a collection of APIs for data retrieval, Anthony, Jean-Baptiste and I organized and assembled the core parts of the tool ourselves.

A great internship experience

I really enjoyed this internship because it allowed me to take my theoretical knowledge of machine learning algorithms and apply it to a real-world problem.

I got to see how important it is to have good, clean data; how to analyze it to pinpoint relevant facts; how to preprocess it to ease the training phase of your AI models… It was really nice to see the whole working chain and not just spit out a few lines of code to call the best machine learning libraries’ methods and get a result: even if I get that ‘students’ must work on ‘case studies’, I think this sort of internship teaches you a lot more about the reality of a Machine Learning Engineer’s job!

Plus, I discovered or dived deeper into many useful tools: Git, Sphinx (for automatic documentation generation in Python), web frameworks, database ORMs, web crawlers…

So, thanks again to Anthony and Jean-Baptiste for this cool opportunity!

References
    1. ScamDoc’s website: https://www.scamdoc.com/
    2. Signal Arnaques’s website: https://www.signal-arnaques.com/en
    3. Sphinx’s documentation: http://www.sphinx-doc.org/en/master/
    4. Wikimedia Foundation, “Natural language processing”: https://en.wikipedia.org/wiki/Natural_language_processing (December 2018; last accessed 29 December 2018)
