NSF NeTSE: Unsupervised Flow-Based Clustering

Also supported by Cisco Systems URP gift and prior DHS/NSF EMIST/DETER project


G. Kesidis and D.J. Miller (PSU PIs)



            Under a no-cost extension of our earlier cyber-security grant, DHS/NSF EMIST, we studied then-recent papers on flow classification by teams from the University of Cambridge and the University of California at Berkeley LBL/LBNL (cited in Zou et al. and Celik et al.) that used recorded network packet flows. The LBL datasets and similar ones are available through DHS PREDICT. Together with the first two graduate students supported on this grant, G. Zou and Z.B. Celik, we investigated the flow classification results, but omitted layer-4 port numbers as features and considered the hazards of using timing-based features to detect salted attack activity (as in GATech’s BotMiner, cited in Celik et al.), i.e., attack activity recorded in another domain. We discovered that flow classifiers using the features identified by the University of Cambridge (even augmented with some additional features) did not separate HTTP botnet command-and-control activity (particularly Zeus) well from known web activity recorded at LBL. This negative result led us, with our graduate students F. Kocak and J. Raghuram, to focus on this specific problem, to consider an expanded feature set drawn from the packet-flow data samples, and to consider the more complex problem of unsupervised/anomaly detection (of “unknown unknowns”). In summary, we formulated port-80 zero-day attack-detection benchmark experiments involving:

·      Realistic known (nominal, background) port-80 activity from LBL.

·      Realistic attack activity, namely Zeus botnet command-and-control, recorded from another (honeynet) domain and representing a zero-day attack (unknown unknowns). This activity was realistically salted, with highly domain-dependent (e.g., timing-based) features not considered.

·      Incumbent flow classifiers that were unable to separate the zero-day attack activity well from the known activity, particularly in the unsupervised/anomaly-detection setting.
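The shape of such a benchmark can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature names, the synthetic flow generator, and the per-feature Gaussian null model are hypothetical and are not the feature set or detectors used in the cited papers. The sketch only shows the three steps described above: dropping port and timing features, salting attack flows into background flows, and scoring the mixed batch against a model learned from nominal traffic alone. The synthetic attack flows are deliberately easy to separate, unlike the Zeus activity reported here.

```python
import random
import statistics

# Hypothetical feature names; the source does not list the actual feature set.
DOMAIN_DEPENDENT = {"src_port", "dst_port", "inter_arrival_ms"}


def strip_domain_features(flow):
    """Drop port numbers and timing-based features from a flow record."""
    return {k: v for k, v in flow.items() if k not in DOMAIN_DEPENDENT}


def salt(background, attack, seed=0):
    """Mix attack flows into background flows, keeping ground-truth labels."""
    labeled = [(strip_domain_features(f), 0) for f in background]
    labeled += [(strip_domain_features(f), 1) for f in attack]
    random.Random(seed).shuffle(labeled)
    return labeled


def fit_null(flows):
    """Per-feature mean and standard deviation from nominal flows only."""
    model = {}
    for k in flows[0]:
        vals = [f[k] for f in flows]
        model[k] = (statistics.mean(vals), statistics.pstdev(vals) or 1.0)
    return model


def anomaly_score(flow, model):
    """Sum of squared z-scores: larger means less typical of nominal traffic."""
    return sum(((flow[k] - mu) / sd) ** 2 for k, (mu, sd) in model.items())


def synthetic_flow(rng, bytes_mu, pkts_mu):
    """Synthetic stand-in for a recorded port-80 flow (not real trace data)."""
    return {
        "src_port": rng.randint(1024, 65535),
        "dst_port": 80,
        "inter_arrival_ms": rng.expovariate(0.01),
        "total_bytes": rng.gauss(bytes_mu, 50.0),
        "num_packets": rng.gauss(pkts_mu, 2.0),
    }


rng = random.Random(1)
# Nominal training flows, used to learn the null model (no attack labels).
train = [strip_domain_features(synthetic_flow(rng, 1000, 20)) for _ in range(200)]
# Test batch: nominal flows salted with a few flows from "another domain".
background = [synthetic_flow(rng, 1000, 20) for _ in range(50)]
attack = [synthetic_flow(rng, 1500, 40) for _ in range(5)]
test_batch = salt(background, attack)

model = fit_null(train)
# Rank the batch from most to least anomalous.
scored = sorted(test_batch, key=lambda fl: anomaly_score(fl[0], model), reverse=True)
```

The labels attached by `salt` are used only for evaluation, for example to count how many of the top-ranked flows are attack flows; the detector itself never sees them.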




         Selection of papers and personnel participating in work supported in whole or part by this grant:


·      J. Raghuram, D.J. Miller and G. Kesidis. Anomaly detection of synthetic DNS domain names. In Proc. NSF US-Egypt Workshop on Cyber Security, Cairo, Egypt, JARE Springer, May 28, 2013.

·      D.J. Miller, F. Kocak and G. Kesidis. Sequential Anomaly Detection in a batch with growing number of tests: Applications to network intrusion detection. In Proc. IEEE MLSP, Santander, Spain, Sept. 2012.

·      J. Raghuram, D.J. Miller and G. Kesidis. Semisupervised domain adaptation for mixture model based classifiers. In Proc. CISS, Princeton University, March 2012.

·      Z.B. Celik, J. Raghuram, G. Kesidis and D.J. Miller. Salting public traces with attack traffic to test flow classifiers. In Proc. USENIX Cyber Security Experimentation and Test (CSET) Workshop, San Francisco, Aug. 8, 2011.

·      G. Zou, G. Kesidis and D.J. Miller. A Flow Classifier with Tamper-resistant Features and an Evaluation of Its Portability to New Domains. IEEE JSAC Special Issue on Advances in Digital Forensics for Communications and Networking, Aug. 2011.



Subsequent related work (supported by Cisco URP gift):


·    Z. Qiu, D.J. Miller and G. Kesidis. Detecting Clusters of Anomalies on Low-Dimensional Feature Subsets with Application to Network Traffic Flow Data. In Proc. IEEE MLSP, Boston, Sept. 2015.

·    Anomaly Detection Software: https://github.com/zhicongqiu/GAD_MLSP2015

·    Z. Qiu, D.J. Miller and G. Kesidis. Semisupervised and Active Learning with Unknown or Label-Scarce Categories. IEEE Trans. on Neural Networks and Learning Systems (TNNLS), Jan. 2016.

·    F. Kocak, D.J. Miller and G. Kesidis. Detecting Anomalous Latent Classes in a Batch of Network Traffic Flows. In Proc. CISS, Princeton, March 2014.