Jordan Holland will present his Pre-FPO, "New Directions in Network Traffic Analysis," on Thursday, October 7, 2021 at 1 PM via Zoom.

 

Zoom link: https://princeton.zoom.us/j/91304375990

 

Committee:

Examiners: Nick Feamster (Adviser), Prateek Mittal, Ravi Netravali

Readers: Jonathan Mayer and Vitaly Shmatikov (Cornell)

 

All are welcome to attend.

 

Title: New Directions in Network Traffic Analysis

 

Abstract:

The field of network traffic analysis has long focused on creating bespoke solutions for specific problems. More recently, the field has examined the effectiveness of machine learning techniques for a wide range of traffic analysis problems. This approach follows a specific pattern: collect network traffic, build a data processing pipeline that separates the traffic (e.g., by flow, application, or device) and attaches labels to the separated traffic, hand-engineer features specific to the task, and finally train models on the engineered features. This methodology, combined with the number and variety of problems that make up network traffic analysis, has led to a custom analysis pipeline for each task, each requiring its own data processing pipeline, new features, and new models.
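
As a rough illustration of the conventional pattern described above (separate and label traffic, hand-engineer features, train a model), the following minimal sketch may help. Everything in it is an assumption made for illustration and is not taken from the talk: the packet records, the two features (packet count and mean packet size), and the nearest-centroid classifier are all hypothetical stand-ins for whatever a given task would actually use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet records (illustrative only):
# (src_ip, dst_ip, src_port, dst_port, proto, size_bytes, label)
PACKETS = [
    ("10.0.0.1", "10.0.0.2", 5000, 443, "tcp", 1500, "video"),
    ("10.0.0.1", "10.0.0.2", 5000, 443, "tcp", 1400, "video"),
    ("10.0.0.3", "10.0.0.4", 6000, 53, "udp", 80, "dns"),
    ("10.0.0.3", "10.0.0.4", 6000, 53, "udp", 120, "dns"),
]


def split_into_flows(packets):
    """Step 1: separate traffic by 5-tuple and attach the label to each flow."""
    flows = defaultdict(list)
    for *five_tuple, size, label in packets:
        flows[(tuple(five_tuple), label)].append(size)
    return flows


def engineer_features(sizes):
    """Step 2: hand-engineered, task-specific features (an assumed choice)."""
    return (len(sizes), mean(sizes))


def train_centroids(flows):
    """Step 3: train a toy model, here one feature centroid per label."""
    by_label = defaultdict(list)
    for (_, label), sizes in flows.items():
        by_label[label].append(engineer_features(sizes))
    return {
        label: tuple(mean(column) for column in zip(*features))
        for label, features in by_label.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest in squared distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


if __name__ == "__main__":
    model = train_centroids(split_into_flows(PACKETS))
    print(predict(model, engineer_features([100, 90])))
```

Each of the three steps is specific to this toy task, which is the point the abstract makes: a different problem would typically need a different way of splitting traffic, different features, and a different model.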

 

In this talk, we discuss the feasibility of using generalizable techniques for multiple steps in the typical traffic analysis pipeline. First, we introduce nPrint, a tool that generates a unified packet representation that is amenable to representation learning and model training. We integrate nPrint with automated machine learning (AutoML), resulting in nPrintML, an open-source system that eliminates manual feature engineering and model selection for a number of problems.

 

Next, we examine how custom data processing pipelines affect reproducibility in the field. We inspect work that leverages multiple popular datasets and find that current practices make it difficult both to compare and to reproduce results. We then introduce pcapML, an open-source system that standardizes traffic analysis tasks at the dataset level, improving reproducibility and lowering the barrier to entry for testing new traffic analysis techniques.