
web / MaliciousDomainClassifier / 0.1.1

README.md

The application classifies domain names as legitimate or malicious. Malicious domains earn their label by engaging in malicious activity such as botnet command-and-control, phishing, and malware hosting. To evade blocklists and other security systems, attackers register domain names produced by domain generation algorithms (DGAs). To detect domains that may be malicious, the algorithm uses a model built on linguistic features that distinguish regular, human-chosen domain names from algorithmically generated ones. Because the model scores how a name looks rather than what the domain does, a readable phishing domain can still score as safe, and a random-looking but harmless name can score as malicious.
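To make the "linguistic features" idea concrete, here is an added example of a feature extractor and a hand-weighted logistic scorer. It is not the project's code or the trained H2O model: the bigram list, the feature set, the weights and the sample domains are all constructed for illustration, and the public-suffix handling assumes a single-label suffix.

```python
"""Illustrative linguistic-feature scorer for DGA-looking domain names.

Added example; the real service uses coefficients learned by H2O logistic
regression over its own feature set. Everything below is an assumption.
"""
import math
from collections import Counter

# Constructed list of common English letter bigrams, used as a crude
# "does this look like a word" signal.
COMMON_BIGRAMS = set(
    "th he in er an re on at en nd ti es or te of ed is it al ar st to nt ng "
    "se ha as ou io le ve co me de hi ri ro ic ne ea ra ce li ch ll be ma si "
    "om ur ca el ta la ns lo pa ec us ge et no un mo go mi tr pr".split()
)
VOWELS = set("aeiou")


def name_part(domain: str) -> str:
    """Everything left of the final label, lowercased ('h2o.ai' -> 'h2o').
    Assumes a single-label public suffix; no public-suffix list is consulted."""
    labels = domain.strip().lower().rstrip(".").split(".")
    return ".".join(labels[:-1]) if len(labels) > 1 else labels[0]


def shannon_entropy(s: str) -> float:
    """Character entropy in bits; 0.0 for the empty string."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values()) if n else 0.0


def longest_vowelless_run(s: str) -> int:
    """Longest run of characters containing no vowel (digits and hyphens count)."""
    best = cur = 0
    for ch in s:
        cur = 0 if ch in VOWELS else cur + 1
        best = max(best, cur)
    return best


def uncommon_bigram_fraction(s: str) -> float:
    """Share of adjacent letter pairs that are not common English bigrams."""
    pairs = [s[i:i + 2] for i in range(len(s) - 1) if s[i:i + 2].isalpha()]
    if not pairs:
        return 0.0
    return sum(p not in COMMON_BIGRAMS for p in pairs) / len(pairs)


def features(domain: str) -> dict:
    name = name_part(domain)
    letters = [c for c in name if c.isalpha()]
    digits = [c for c in name if c.isdigit()]
    return {
        "length": len(name),
        "vowel_ratio": sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0,
        "digit_ratio": len(digits) / len(name) if name else 0.0,
        "longest_vowelless_run": longest_vowelless_run(name),
        "entropy": shannon_entropy(name),
        "uncommon_bigram_fraction": uncommon_bigram_fraction(name),
    }


def classify(domain: str) -> dict:
    """Return one result object in the README's output format."""
    f = features(domain)
    # Hand-set illustrative weights (an assumption, not learned coefficients).
    logit = (
        -4.0
        + 0.5 * max(0, f["longest_vowelless_run"] - 3)
        + 4.0 * f["uncommon_bigram_fraction"]
        + 3.0 * (0.4 - f["vowel_ratio"])
        + 2.0 * f["digit_ratio"]
        + 0.3 * (f["entropy"] - 3.0)
    )
    p_mal = 1.0 / (1.0 + math.exp(-logit))
    return {
        "domain": domain,
        "result": "malicious" if p_mal >= 0.5 else "safe",
        "probSafe": 1.0 - p_mal,
        "probMalicious": p_mal,
    }


def classify_batch(domains) -> list:
    return [classify(d) for d in domains]


if __name__ == "__main__":
    # Constructed sample input; the first three names are the README's examples.
    for r in classify_batch(["algorithmia.com", "h2o.ai", "weoiuwerjdkfssf.net",
                             "xk3j9qz8vb2n.biz", "secure-login-paypal.com"]):
        print(f'{r["domain"]:28} {r["result"]:9} {r["probMalicious"]:.3f}')
```

The long vowelless run and the unusual bigrams push weoiuwerjdkfssf.net and xk3j9qz8vb2n.biz over the threshold, while secure-login-paypal.com scores safe: a readable phishing name carries none of the signals a DGA detector keys on, which is the limitation to keep in mind when using this classifier alone.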

For more context on how it works, check out the blog post: Using H2O.ai to Classify Domains in Production



Input Format: domain(s): String or String[]

Output Format (one object per input domain; probSafe and probMalicious sum to 1.0):

[{
    domain: String,
    result: String (values: safe, malicious),
    probSafe: double,
    probMalicious: double
},  ....]



Example Input - Single Domain

Input:

"algorithmia.com"

Output:

[
  {
    "domain": "algorithmia.com",
    "result": "safe",
    "probSafe": 0.9958709063793443,
    "probMalicious": 0.004129093620655737
  }
]



Example Input - Batch Input

Input:

["algorithmia.com", "h2o.ai", "weoiuwerjdkfssf.net"]

Output:

[
    {
        "domain": "algorithmia.com",
        "result": "safe",
        "probSafe": 0.9958709063793443,
        "probMalicious": 0.004129093620655737
    },
    {
        "domain": "h2o.ai",
        "result": "safe",
        "probSafe": 0.9998094655020227,
        "probMalicious": 0.00019053449797729296
    },
    {
        "domain": "weoiuwerjdkfssf.net",
        "result": "malicious",
        "probSafe": 0.001264585685433861,
        "probMalicious": 0.9987354143145661
    }
]



Performance

On average:

  • A single-input transaction takes ~10ms to run
  • A batch input of 10 domains takes ~17ms to run

Algorithmia's cluster overhead is ~15ms to ~30ms, depending on competing demand. Calling the algorithm from a non-AWS network will introduce substantial latency (in the snapshots below, compare Avg. Total Time with Avg. Run Time). To separate the algorithm's run time from network latency, call the algorithm with cURL and examine the metadata.duration value in the response, which reflects only the time the algorithm itself ran.

Calling the algorithm with batched input will substantially increase your throughput. In the snapshots below, 50 parallel connections (referred to here as Slots) each make a sequence of blocking REST API calls, with each call classifying 10 domains.

Speed Test from Office WiFi (3/24/2017 5pm PST)

Speed Test from AWS US East-1 (3/27/2017 4pm PST)

  • Custom script for speed testing, 50 parallel slots, batching with 10 domains
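The batching and parallel-slot pattern from the speed test, as an added example client (not the project's own script). The endpoint URL and the "Authorization: Simple <key>" header are assumptions about the Algorithmia REST API, and metadata.duration is assumed to be the algorithm's run time in seconds; the offline stand-in transport and the sample input are constructed so the code runs without network access.

```python
"""Batched, parallel client for the domain-classification REST API.

Added example. API_URL and the Authorization header are assumptions about the
Algorithmia API; the response shape follows the README's output format plus a
metadata.duration field assumed to hold the run time in seconds.
"""
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor
from urllib import request

API_URL = "https://api.algorithmia.com/v1/algo/web/MaliciousDomainClassifier/0.1.1"  # assumed


def http_send(domains):
    """Real transport: one blocking POST classifying a batch of domains.
    Not exercised offline; needs ALGORITHMIA_API_KEY in the environment."""
    req = request.Request(
        API_URL,
        data=json.dumps(domains).encode(),
        method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": "Simple " + os.environ["ALGORITHMIA_API_KEY"]},
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Constructed offline stand-in for the service: labels a fixed set of names as
# malicious and reports a fixed 17 ms run time (the README's 10-domain figure).
KNOWN_BAD = {"weoiuwerjdkfssf.net", "xk3j9qz8vb2n.biz"}


def fake_send(domains):
    results = []
    for d in domains:
        p_mal = 0.99 if d in KNOWN_BAD else 0.01
        results.append({"domain": d, "result": "malicious" if p_mal >= 0.5 else "safe",
                        "probSafe": 1.0 - p_mal, "probMalicious": p_mal})
    return {"result": results, "metadata": {"content_type": "json", "duration": 0.017}}


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def classify_domains(domains, send=http_send, batch_size=10, slots=50):
    """Classify many domains using `slots` parallel connections, each sending
    batches of `batch_size`. Returns the flattened results in input order plus
    timing that separates algorithm run time (metadata.duration) from wall time."""
    batches = chunked(list(domains), batch_size)
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=slots) as pool:
        responses = list(pool.map(send, batches))  # map preserves batch order
    wall = time.perf_counter() - started
    results = [r for resp in responses for r in resp["result"]]
    run_time = sum(resp["metadata"]["duration"] for resp in responses)
    return {
        "results": results,
        "calls": len(batches),
        "run_time_s": run_time,
        "wall_time_s": wall,
        "malicious": [r["domain"] for r in results if r["result"] == "malicious"],
    }


if __name__ == "__main__":
    # Constructed input: 25 names, so 3 calls of at most 10 domains each.
    sample = [f"host{i}.example.com" for i in range(23)] + sorted(KNOWN_BAD)
    summary = classify_domains(sample, send=fake_send, slots=5)
    print(summary["calls"], "calls; flagged:", summary["malicious"])
    print(f'run {summary["run_time_s"]*1000:.0f} ms vs wall {summary["wall_time_s"]*1000:.0f} ms')
```

Comparing run_time_s (the sum of metadata.duration) with wall_time_s against a real endpoint reproduces the Avg. Run Time versus Avg. Total Time gap the snapshots show; swap fake_send for http_send to make live calls.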



Credits - Based on H2O Logistic Regression

This algorithm is based on this model, a sample from H2O. It demonstrates how to use H2O's exported POJO (plain Java model object) for fast inference, Jython for pre-processing, and Algorithmia for on-demand distributed scale.