TNC22: Q-Factor: Real-time data transfer optimization

At TNC22, Jeronimo Beserra’s presentation covered the motivations for tuning data-transfer hosts, how Q-Factor leverages In-band Network Telemetry (INT), the use of Linux socket-level information for tuning, and results from tuning network endpoints.

Developers, scientists, and campus IT staff are investing significant effort to facilitate the efficient transfer of large datasets between sites separated by long geographic distances. A time-consuming and costly part of that effort is tuning the endpoints of the data transfer. Finding the best combination of data-transfer optimization parameters is expensive because current optimization processes are manual and depend on network conditions. Moreover, manual optimization requires a broad set of skills and expertise that is not always available at campuses or network operators.

Q-Factor is a framework that enables high-speed data transfer tuning and optimization based on real-time network state information provided by programmable data planes and network sockets. It addresses tuning and optimization by changing how network endpoints, including Data Transfer Nodes (DTNs), consume network state information. Q-Factor leverages two innovations in the networking ecosystem: the In-band Network Telemetry (INT) application of P4 programmable network devices, including smart NICs, and the advanced network configuration and monitoring frameworks of newer Linux kernels, such as eBPF (extended Berkeley Packet Filter) and XDP (eXpress Data Path). In particular, INT can add network state information (also known as network telemetry or network metadata) to user packets as programmable switches forward them, while eBPF/XDP programs can extract that information and collect additional socket-level metrics at the end hosts.
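
To make the INT idea concrete, here is a minimal sketch of per-hop telemetry accumulating on a packet. It is not Q-Factor's actual data format: the field names (`switch_id`, `queue_occupancy`, `hop_delay_ns`) and the helper functions are hypothetical, chosen only for illustration. Each switch on the path appends one record, and the receiving side reads the stack to find the most congested hop and the total in-switch delay.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HopMetadata:
    """Telemetry one switch adds to a packet (hypothetical field names)."""
    switch_id: int
    queue_occupancy: int  # units queued at the egress port when the packet left
    hop_delay_ns: int     # time the packet spent inside this switch


def add_hop(stack: List[HopMetadata], hop: HopMetadata) -> List[HopMetadata]:
    """Mimic a switch appending its metadata while forwarding the packet."""
    return stack + [hop]


def bottleneck_hop(stack: List[HopMetadata]) -> HopMetadata:
    """Return the hop with the highest queue occupancy (stack must be non-empty)."""
    return max(stack, key=lambda h: h.queue_occupancy)


def path_delay_ns(stack: List[HopMetadata]) -> int:
    """Sum the in-switch delay over every hop on the path."""
    return sum(h.hop_delay_ns for h in stack)


# A packet crossing three switches; the values are made up.
stack: List[HopMetadata] = []
stack = add_hop(stack, HopMetadata(switch_id=1, queue_occupancy=4, hop_delay_ns=900))
stack = add_hop(stack, HopMetadata(switch_id=2, queue_occupancy=310, hop_delay_ns=42000))
stack = add_hop(stack, HopMetadata(switch_id=3, queue_occupancy=12, hop_delay_ns=1500))
```

With this stack, `bottleneck_hop(stack)` picks out switch 2, which is the kind of per-hop visibility an endpoint cannot get from packet loss alone.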

Q-Factor is a novel approach that collects and merges network and socket state information from P4-enabled network devices and BPF hooks in order to dynamically tune the TCP/IP-level properties of the DTN’s operating system. Its key innovation is a Telemetry Agent capable of dynamically tuning user- and kernel-space parameters using network and socket state information. User applications do not need to be changed to work with Q-Factor. The presentation explained the design of Q-Factor and how it performs data transfer tuning without changing user applications.
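
The merge-then-tune idea behind the Telemetry Agent can be sketched as follows. This is an illustration under stated assumptions, not the agent's real logic: the `NetworkState` and `SocketState` types, the thresholds, and the halve-or-grow policy are all hypothetical, and the sketch only computes a suggested send window rather than applying it to the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkState:
    """Summary of in-network telemetry for a path (hypothetical fields)."""
    max_queue_occupancy: int
    max_hop_delay_ns: int


@dataclass(frozen=True)
class SocketState:
    """Summary of socket-level metrics from the end host (hypothetical fields)."""
    retransmits: int
    send_window_bytes: int


def next_send_window(
    net: NetworkState,
    sock: SocketState,
    queue_threshold: int = 100,    # assumed occupancy above which a queue counts as building
    step_bytes: int = 1_000_000,   # assumed additive growth per decision
    max_window_bytes: int = 64_000_000,
    min_window_bytes: int = 1_000_000,
) -> int:
    """Suggest a new send window from merged network and socket state.

    Illustrative policy only: halve the window when either source reports
    trouble, otherwise grow it by a fixed step up to a cap.
    """
    congested = net.max_queue_occupancy > queue_threshold or sock.retransmits > 0
    if congested:
        return max(sock.send_window_bytes // 2, min_window_bytes)
    return min(sock.send_window_bytes + step_bytes, max_window_bytes)


calm = NetworkState(max_queue_occupancy=10, max_hop_delay_ns=2_000)
busy = NetworkState(max_queue_occupancy=500, max_hop_delay_ns=80_000)
sock = SocketState(retransmits=0, send_window_bytes=8_000_000)
```

Here `next_send_window(busy, sock)` backs off even though the socket has not yet seen a retransmission, because the network side already reports a deep queue. Because the decision is made outside the application, the transfer tool itself stays unchanged.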

The presentation covered:

  • Motivations to tune data transfer hosts. One motivation is that TCP congestion control algorithms detect network congestion through a condition that results in packet drops. A second is the need to detect whether queues are forming: queue occupancy leads to packet drops and jitter, and it varies with every packet received. A third is to give users a way to enable high-speed data transfer tuning without requiring changes to their applications.
  • How Q-Factor leverages In-band Network Telemetry (INT) to record network telemetry information in packets while they traverse a path between two endpoints in the network. The presentation described what INT is, how it works, and the real-time granular data in INT reports, which contain the data needed to measure whether queues are forming. It also covered how Q-Factor leverages INT to mitigate buffer utilization, hop delay, and jitter.
  • How Q-Factor uses Linux socket-level information for tuning, collected by eBPF programs in the Linux kernel. The presentation covered what information can be extracted from recent Linux kernels and how BPF programs can be used to tune socket-level parameters (such as receive and send windows, and retransmission timers). It also showed how eBPF metrics are collected from the hosts using Kafka streams.
  • Results from tuning network endpoints using the Q-Factor Telemetry Agent, including findings from the evaluation of different tuning techniques and experiences from developing observation-based learning techniques that mitigate bottlenecks by consuming queue occupancy and hop delay telemetry data.
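
The queue occupancy and hop delay signals mentioned in the list above could be consumed along these lines. This is a rough sketch, not the technique evaluated in the presentation: the function names, the half-window comparison, and the default margin are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def queue_is_forming(
    occupancy: Sequence[int],
    margin: float = 5.0,   # assumed minimum rise that counts as a trend
    min_samples: int = 4,
) -> bool:
    """Report whether queue occupancy is trending upward.

    Compares the mean of the later half of the samples with the mean of the
    earlier half; a rise larger than `margin` is treated as a forming queue.
    """
    if len(occupancy) < max(min_samples, 2):
        return False
    half = len(occupancy) // 2
    return mean(occupancy[half:]) - mean(occupancy[:half]) > margin


def hop_jitter_ns(delays_ns: Sequence[int]) -> float:
    """Mean absolute difference between consecutive hop-delay samples."""
    if len(delays_ns) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ns, delays_ns[1:])]
    return sum(diffs) / len(diffs)


# Per-packet samples for one hop, oldest first; the values are made up.
rising = [2, 3, 2, 3, 10, 14, 20, 25]
steady = [5, 5, 6, 5, 5, 6, 5, 5]
delays = [100, 120, 100, 120]
```

Because the occupancy samples arrive with every packet, a check like `queue_is_forming(rising)` can flag a building queue before any packet is dropped, while `hop_jitter_ns(delays)` summarizes how much the delay at a hop fluctuates.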

Q-Factor is a project led by Florida International University (FIU) and the Energy Sciences Network (ESnet) and funded by the U.S. National Science Foundation (NSF).

Download the presentation here: https://tnc22.geant.org/sessions/#s9