Jordan Holland will present his FPO "A Generic Framework for Network Traffic Analysis" on May 3, 2022 at 10am via Zoom.

Zoom link: https://princeton.zoom.us/j/92501308623

Jordan's committee is as follows:
Advisers: Nick Feamster and Prateek Mittal;
Readers: Jonathan Mayer and Vitaly Shmatikov (Cornell Tech);
Examiners: Nick Feamster, Ravi Netravali, and Prateek Mittal

All are welcome to attend.

Abstract:
Researchers and practitioners rely on network traffic analysis techniques
for a variety of critical network security and network management tasks. 
Ever-increasing traffic volumes and encryption rates have rendered traditional, 
signature-based solutions less effective. As such, newly developed methods 
almost universally leverage machine learning techniques. The development of new
machine learning-based traffic analysis techniques shares a common methodological
pipeline: curate a network traffic dataset, create a system to separate and 
associate labels with the traffic (e.g., flows, applications), engineer features
for the task, and finally train models using the engineered features. Although this
methodological pipeline is shared across tasks, each instantiated pipeline is
custom-built for the task at hand, requiring new traffic processing systems, 
features, and models.

This dissertation questions the assumption that each stage in the shared 
methodological pipeline should be custom-built for each task, exploring whether
several stages of the common pipeline can be better accomplished using generic 
techniques. First, we examine the process of feature engineering and model 
training--two of the most manual and painstaking steps for any traffic analysis task.
We develop nPrint, a unified packet representation that is amenable to
representation learning and model training for a variety of tasks. We then
integrate nPrint with automated machine learning to produce
nPrintML, a generic feature engineering and model training solution.

Next, we study the data collection and data processing steps of the common
traffic analysis pipeline. Unlike other disciplines, such as image recognition, 
no standard dataset format or "canonical" task exists, forcing researchers to
develop custom dataset formats and processing systems for each task. We survey 
existing literature to show that this approach has led to a reproducibility 
crisis, finding that the lack of a standardized dataset format and the extensive 
usage of ambiguous terminology are primary causes.
We use these findings to develop pcapML, a system that enables reproducible
network traffic analysis by providing a standardized dataset format that
removes ambiguity in the definitions of traffic analysis tasks.

These contributions chart new directions in network traffic analysis, demonstrating
that generic methods can outperform many custom-built approaches and significantly 
enhance the ability to develop, reproduce, and compare new methods.