Evaluation of Hash Functions for Multipoint Sampling in IP Networks

Diploma Thesis 2008 104 Pages

Computer Science - Applied







List of Figures

List of Tables

1 Introduction
1.1 Measurements in IP Networks
1.2 Active and Passive Measurements
1.3 Passive Multipoint Measurements
1.4 Measurement Architecture and Measurement Process
1.5 Incentives for Sampling
1.6 Selection Techniques
1.6.1 Sampling
1.6.2 Filtering
1.7 Comparison of Selection Techniques
1.8 Scenarios for Hash-Based Selection
1.9 Problem Statement and Target of this Work
1.9.1 Problem Statement
1.9.2 Targets of this Work
1.10 Document Structure

2 State of Art
2.1 Passive Multipoint Measurements
2.2 Packet Header Fields useful for Hash-Based Selection
2.3 Existing Evaluation Approaches for Hash Functions

3 Evaluation of Packet Content
3.1 Approach
3.1.1 Applicable Header Fields
3.1.2 Entropy per Byte
3.2 Traces Analyzed
3.3 Entropy Evaluation
3.3.1 IPv4
3.3.2 TCP
3.3.3 UDP
3.3.4 ICMP
3.3.5 IPv6
3.3.6 Conclusion Entropy Analysis
3.4 Hash Input Collisions Evaluation
3.4.1 Input Collisions
3.4.2 Experimental Setup
3.4.3 Results

4 Evaluation of Hash Functions
4.1 Desired Properties
4.2 Collection of Hash Functions
4.3 Security Issues for Hash-Based Selection
4.4 Performance Measurements
4.4.1 Experimental Setup
4.4.2 Results
4.5 Avalanche Criterion
4.5.1 Approach
4.5.2 Experimental Setup
4.5.3 Results
4.6 Independence of Sampling Decision and Representativeness . . .
4.6.1 Chi-Square Independence Tests
4.6.2 Introduction to Person's Goodness-Of-Fit Test
4.6.3 Derivation of Goodness-of-Fit from Independence Test . .
4.6.4 Parameters for the Chi-Square Tests
4.6.5 Multiple Independence Tests
4.6.6 Independence Sampling Decision and Packet Length . . .
4.6.7 Independence Sampling Decision and Protocol
4.6.8 Independence Sampling Decision and Byte Value
4.7 Conclusion Hash Function Evaluation

5 Applicability of Hash-Based Selection on other Traces
5.1 Chi-Square Independence Tests for other Traces
5.1.1 Traffic Traces
5.1.2 Measurement Setup
5.1.3 Measurement Results
5.2 Trace Analysis
5.2.1 Reasons for Identical Packets
5.2.2 Consequences for Hash-based Selection

6 Hash Functions for Packet ID Generation
6.1 Introduction
6.2 Approach
6.3 Uniformity of Hash Value Distribution
6.3.1 Experimental Setup
6.3.2 Results
6.4 Collision Probability
6.5 Theoretical Probability
6.5.1 Initial Assumptions
6.5.2 Mathematical Model for Collision Probability
6.6 Empirical Collision Probability
6.6.1 Measurement Setup
6.6.2 Measurement Results
6.7 Packet ID and Selection Hash Value
6.7.1 Selection Range Adjustment for One Hash Function ...
6.7.2 Selection Range Adjustment for Two Hash Functions ...
6.7.3 Difference Between One and Two Hash Function Approach
6.7.4 Conclusion

7 Conclusion
7.1 Packet Content Evaluation
7.2 Hash Functions for Hash-Based Selection
7.3 Emulation of Random Sampling
7.4 Influence of Hash Input Collisions
7.5 Hash Functions for Packet ID generation
7.6 Further Work



A Acronyms

B Symbols

Erklärung der Urheberschaft


Network Measurements play an essential role in operating and developing to­day’s Internet. A variety of measurement applications demand for multipoint network measurements, e.g. service providers need to validate their delay guar­antees from Service Level Agreements and network engineers have incentives to track where packets are changed, reordered, lost or delayed. Multipoint measure­ments create an immense amount of measurement data which demands for high resource measurement infrastructure. Data selection techniques, like sampling and filtering, provide efficient solutions for reducing resource consumption while still maintaining sufficient information about the metrics of interest. But not all selection techniques are suitable for multipoint measurements; only deterministic filtering allows a synchronized selection of packets at multiple observation points. Nevertheless a filter bases its selection decision on the packet content and hence is suspect to bias, i.e the selected subset is not representative for the whole pop­ulation. Hash-based selection is a filtering method that tries to emulate random selection in order to obtain a representative sample for accurate estimations of traffic characteristics

The subject of the thesis is to assess which hash function and which packet con­tent should be used for hash-based selection to obtain a seemingly random and unbiased selection of packets. This thesis empirically analyzes 25 hash functions and different packet content combinations on their suitability for hash-based selection. Experiments are based on a collection of 7 real traffic groups from different networks


Netzwerk Messungen sind essentiell für die Verwaltung und Weiterentwicklung des heutigen Internets. Es existieren immer mehr Messapplikation die Daten vom mehreren Punkten benütigen, z.B. müssen Internet Provider ihre Vereinbarungen über maximale Ubertragungsverzögerung verifizieren und Netzwerk Entwickler mochten wissen, wo im Netzwerk Internetpakete verloren oder verzügert wer­den. Mehrpunktmessungen haben aber den Nachteil, dass sie eine große Menge von Messdaten produzieren, deren Verwaltung und Speicherung hohe Messin­frastrukturkosten mit sich ziehen. Es existieren einige Stichprobenmethoden die es ermüoglichen die Datenmenge zu reduzieren und die Messdaten zu schüatzen. Nicht alle Stichprobenverfahren sind geeignet für Mehrpunktmessungen, nur de­terministische Filter gewahrleisten, dass an jedem Messpunkt die gleiche Stich­probe genommen wird. Allerdings ist die Selektionentscheidung für ein Paket bei einem Filter deterministisch abhüangig von dessen Inhalt und von daher ist nicht gewahrleistet, dass es sich bei den selektierten Paketen um eine repräsentative Stichprobe handelt. Ein auf einer Hashfunktion basierendes Filterverfahren soll dazu verwandt werden eine Zufallstichprobe nachzubilden um moglichst genaue Schützungen der Messdaten zu ermoglichen

In dieser Arbeit wird analysiert welche Hash Funktion und welche Paketinhalte für dieses Verfahren verwendet sollen, um eine müglichst unverzerrte Schatzung zu erhalten. Es werden 25 Hash Funktionen und verschiedene Paketinhalte auf ihre Eignung für das Selektionsverfahren untersucht


First of all I would like to thank Tanja Zseby and Carsten Schmoll for giving me the opportunity to write this diploma thesis in a research centered topic at Fraunhofer Fokus. Their subtle supervision gave me confidence in my own ideas and helped me improve my researching ability. I am also very grateful that Prof. Dr.-Ing. Adam Wolisz pointed out critical issues as the adviser at the TU Berlin. Many thanks go to my colleagues Nikolaos Chatzis and Thomas Hirsch for their help and friendship. When I was struggling to the mist of C++ pointer or droughts of ideas they showed me the right way out of misery

Special thanks go to Nora Eisemann who I could talk to about statistical matters. With her knowledge she helped me to tweak equations and make the calculations more worthwhile

Last but not least I would like to thank my family. My father’s joy about me doing research and my mother’s support encouraged me to delve into the work. Thanks to my beloved Katja, the person behinde strength when no one else can

List of Figures

1 Multipoint Measurements
2 Simplified Measurement Architecture and Measurement Process .
3 Packet Selection
4 Sampling
5 Hash-Based Packet Selection
6 Input Collision of 2 Different Packets
7 Trace Disambiguation of Two Packets with Same Packet ID . . .
8 IPv4 Header
9 Information Efficiency IPv4
10 TCP Header
11 Information Efficiency TCP
12 UDP Header
13 Information Efficiency UDP
14 Information Efficiency ICMP
15 IPv6 Header
16 Information Efficiency IPv6
17 Performance of Hash Functions
18 Example Results of Avalanche Criterion
19 Distribution of Test Statistic T for Independence Test on Length
20 Distribution of Test Statistic T for Independence Test Protocol .
21 Independence Test on Byte for Simple Hash Function
22 Performance of Recommended Hash Functions
23 Independence Test Statistic T - Including and Without Collisions
24 Uniformity Tests on 25 Hash Functions
25 Unique Hash Values - 500 Tests with 2 Million Random Values .
26 Percentage of Non Colliding Packet IDs
27 Additional Traffic in One Hash Function Scenario
28 Measurement Traffic Increase using 1 instead of 2 Hash Functions

List of Tables

1 Comparison of Sampling Techniques
2 Traces Used for Evaluation
3 Traces Protocols
4 Bytes Recommended for Hash-Based Selection
5 Largest Clusters of Identical Packets in Traces
6 Packets in Collisions for Different Selected Byte Combinations .
7 Hash Function Collection
8 Avalanche Test Results
9 Individual and Global Rejections of Independence Test Length .
10 Individual and Global Rejections of Independence Test Protocol
11 Global Rejections of Independence Test For Byte Values
12 Summary Independence Test Results
13 Summarized Test Results
14 Hash Function Collection Properties

Chapter 1 Introduction

The intention of this introduction is to describe the scope of hash-based selec­tion. The first section will explain the motivation for network measurements, main driving forces and why network measurement serves research and commer­cial demands. Multipoint measurements are a challenging area of network mea­surement, because an immense amount of measurement data is created. Hence one often has to assess if it is necessary and applicable to measure all packets. For a variety of scenarios it is rather useful to only select a subset of packets. Different filtering and sampling techniques will be compared on their suitability for multipoint measurements. Further, scenarios for hash-based selection will be presented. At the end of the introduction I will define the targets of the thesis.

1.1 Measurements in IP Networks

Network Measurements play an essential part in operating and developing to­days Internet. From the operating view, Internet Service Providers (ISP) need to measure data volumes and usage duration on a per customer basis in order to account for used services. Research is asking for more measurement informa­tion in order to analyze problems in different topics. The diversity of network nodes and the variety of network hardware, topologies, links, protocols and ap­plications make it a challenge to identify why and where packets get fragmented, changed, reordered, lost or delayed. Network measurements help to understand network behavior and can initiate ideas for improvement. The following classifi­cation of applications shows motivations for network measurements according to [RFC3917] and [Zse05].

Usage-Based Accounting

As already pointed out, ISPs may charge their customers for the usage of Internet access and Internet services depending on usage duration or transfer volume. This implies that the service providers can measure the transmitted traffic in order to bill their customers.

Traffic Engineering

Traffic Engineering is concerned with network performance and resource opti­mization. One goal of traffic engineering is efficient load balancing between net­work nodes to ensure that network nodes on one path do not become overutilized while other adjoining paths are underutilized. On the basis of measurements of link utilization and routing information, one can adjust load balancing policies to avoid congestion. As an example measurements in ad-hoc wireless networks are a wide research area, because engineers want to understand problems and improve routing policies.

Traffic Profiling

Traffic Profiling serves multiple applications. It is a process of characterizing traffic flows by measuring duration, volume, burstiness and packet information like protocols and services. The data is used for network planning, trend analysis and business models development. For example one can measure which influence VoIP traffic has on network performance.

Attack/Intrusion Detection

Network measurement is inevitable for attack and intrusion detection. Depending on the type of network attack, specific patterns can occur in the network traffic. For example an abnormal increase of small flows in a network node is suspicious of a bot net attack. In case those attack patterns are detected a network can commence counter actions to regain normal traffic status.

QoS Monitoring

Service providers and customers often negotiate on different quality metrics of the provided service. These Service Level Agreements (SLA) require the mea­surement of Quality of Service (QoS) parameters in order to validate the con­tracts. For instance customers demand measurements to ensure that their TV- on-demand provider really complies with promised reachability and delay.

Fault management and network maintenance

Network maintenance serves the goal of configuring and monitoring the network. Fault management as a part of network management shall detect, recover and limit the impact of failures in the network. For instance, if a network link is broken, measurements automatically detect the failure and the network can be reconfigured.

1.2 Active and Passive Measurements

Network measurements can be distinguished in active (intrusive) and passive (non intrusive) measurements. Active measurements [Skit] [NLANR] inject test traffic into the network in order to measure network characteristics. For example ping and traceroute send ICMP echo ping requests to a specified destination.

On the basis of the echo reply, the availability of the destination, the path and delay can be measured. Active measurements produce additional traffic load on the network which distorts measurement results. Further one cannot ensure that the active packets are equally handled as the traffic that they are intended to measure. In contrast, passive measurements only use existing traffic to mea­sure the network characteristics. A special probe or some additional intelligence incorporated in a network node extracts relevant traffic information from the observed packet stream and reports the results. Hence passive measurement pro­vides information about traffic already present in the network. Because of this non-intrusive property passive measurements are valuable for SLA validation and usage-based accounting.

1.3 Passive Multipoint Measurements

A broad spectrum of network measurement applications demand for passive mul­tipoint measurements. Providers of interactive services, like audio and video conferencing, guarantee their customer certain delay limits which they need to validate. Multipoint measurements can also be an input for traffic engineering. Packets can be traced throughout the network which allows a detailed picture of the measured domain. Routing loops can be detected and network locations where packets get reordered or lost can be identified.

In Fig. 1 the general concept of passive multipoint measurements is shown. As a packet traverses through the measured network it passes observation points. These can be any device that can listen on the shared medium such as a net­work card or a router. At the observation point a copy of the packet is taken and a packet ID which identifies the packet throughout the network is gener­ated. This can be either parts of the packet or a hash value over the packet content. The packet ID and a timestamp of the packet’s arrival are transferred to a common multipoint collector. The collector can either be a dedicated device or can be co-located at an observation point. The trace of each packet and the delay between the observation points can be calculated by correlating the packet IDs from the different observation points. The delay is gained by subtracting the timestamp values of two corresponding packet IDs from two measurement

illustration not visible in this excerpt

Figure 1: Multipoint Measurements

1.4 Measurement Architecture and Measurement Pro­cess

In order to show a more detailed view of the measurement procedure, a partial measurement architecture and the measurement process as depicted in Fig. 2 shall be explained. The following representation is conform to the IETF drafts [ZMDN07] [Duf07] of the IETF Packet Sampling (PSAMP) working group. The measurement process comprises of packet capturing and a set of packet processing steps.

Packet Capturing records the packets including header information and packet data. The amount of data can be limited by a pre-configured snapsize.

Time Stamping is essential for multipoint delay measurements and needs to be processed as early as possible in the measurement process. For consistent time stamping between measurement points the internal clocks need to be synchronized by either one of the following technologies: network timing protocol (NTP) [RFC1305], global positioning system (GPS), radio signals or precision timing protocol (PTP) [IEEE1588].

Classification is especially relevant for flow-based measurements. The packets are grouped according to identical properties (e.g. source address and des­tination address), where all packets join exactly one group (flow). Moreover stratified sampling is based on classification, whereby estimation accuracy can be improved by grouping the population [Zse03].

Selection During the selection process a subset of the observed packet stream is chosen. This can be done by a deterministic filter on the packet content or by sampling which is packet content independent. Hence, the selection process either extracts relevant data or chooses a representative subset of the packet stream. In contrast to classification, a selector does not allocate each packet into a group, packets are either selected or dropped. Hash- based selection is such a selector.

illustration not visible in this excerpt

Figure 2: Simplified Measurement Architecture and Measurement Process

Aggregation combines several information from multiple packets into an ag­gregated composition of values. Hence a summary of characteristics of the whole traffic and/or of each flow is exported. For multipoint measurements it is required to generate a packet ID during the aggregation step. The packet ID may be a hash value or parts of the packet content.

Export of measurement data The measurement data is transferred from an observation point to a collector for further post-processing. For example Netflow and sFlow are protocols that support the transmission of flow based measurement data. The PSAMP group recommends the IP Flow Informa­tion eXport (IPFIX) [IPFIX] protocol which became an IETF approved standard.

Whereas aggregation is mandatorily the last step of packet processing, classi­fication and/or selection can be used multiple times in arbitrary order. Although it is advised to apply selection before classification as less data needs to be cate­gorized, it may also be useful to apply classification first and sampling afterwards (e.g. stratified sampling). At the completion of a measurement interval the ob­servation point exports the aggregated values to at least one collector. The mea­surement results can be transferred over a dedicated measurement network or the measured network itself. Transferring the measured data over the network itself entails similar problems as active measurements because additional load is intro­duced into the network. This may cause bias of the measured data. Nevertheless one can prevent bias by exporting results 1) at times when no measurement is taken (measurement-gap) or 2) over a network that is not measured.

1.5 Incentives for Sampling

There is a general problem about network measurements. More applications demand for more types of measurement data which the measurement infrastruc­ture may not readily support. On the contrary, if measurement hardware is deployed a single observation point can produce an immense amount of data per hour. With the growing usage of the Internet and higher data rates, even more measurement information is created. This implies an expensive infrastructure of high-performance measurement nodes and collectors in order to cope with the amount of data. There are three constrains for the network measurement infrastructure:

1. Processing capacities. Higher data rates caused by high bandwidth access networks imply high processing capacities for network routers. Therefore only limited processing resources can be allocated for network measure­ments without influencing the normal network operation.
2. Storing capacities. More applications demand for flow-based measurement data which implies that flows need to be maintained in the measurement node. More elaborate measurements even ask for multiple metrics per flow which increases storing demands.
3. Exporting capacities At times where network load is high and more flows are present, the nodes export more measurement data which engraves band­width consumption. Especially if measurement results are transferred over the measured network this leads to distortions in the measurement results.

Multipoint measurements even demand for additional resources because:

- The packet ID generation (e.g. hash value calculation) and timestamping requires processing capacities at the measurement node.
- For every packet at least a packet ID and timestamp is created which need to be stored until the end of the measurement interval.
- More measurement data is exported because packet ID and timestamp cannot be aggregated and both values need to be exported for each packet.

Whenever the storage, processing and exporting resources for measurements are depleted, information will be lost. In case of buffer overflow the exported metrics may not be useful as they do not contain any flow information that occurred after the overflow. In case of bandwidth exhaustion the metrics may be exported with high delay which reduces their value if they are time critical. Service Providers therefore demand for easy to implement methods to reduce data and measurement infrastructure costs. Packet Selection, like sampling and filtering, provide less measurement costs by reducing the measured data. Instead of measuring all relevant data, only a subset is selected and the characteristics of the whole traffic are estimated. Service providers therefore can offer lower tariffs because of less measurement costs.

1.6 Selection Techniques

As pointed out in Section 1.4 selection is a part of the measurement process. In this packet processing step the selection of a subset is conducted, i.e. packets will be either selected or dropped. The variety of selection methods can be distinguished in filtering and sampling. A filter is a deterministic selector based solely on the packet content or a router state. A router state may be the ingress and egress interface of the packet, the origin or destination Autonomous System, the incidence that the packet violates an Access Control List or that the packet cannot be correctly routed. All other selectors that use other information

illustration not visible in this excerpt

Figure 3: Packet Selection

than the packet content are considered to be sampling. A categorization of selection methods according to the PSAMP group is shown in Figure 3.

1.6.1 Sampling

In order to describe the different sampling techniques, the terms in the scope of sampling are presented as defined by the PSAMP working group.

illustration not visible in this excerpt Systematic Sampling

Systematic Sampling is a deterministic selection at a time or count based period, i.e. either a packet is selected at the end of a time interval or every m-th packet in the observed stream is selected. Random Sampling

n-out-of-N A measurement interval is subdivided in blocks with the size of N packets. From each block a random subset of n packets is chosen. The attained and configured sampling fraction is therefore the same. A special case of n-out- of-N sampling is 1-out-of-K sampling.

1-out-of-K is a count-based stratification of n-out-of-N sampling. From each block consisting of K packets one packet is chosen randomly. 1-in-K sampling guarantees that each period of K, one packet is selected but in no fixed period.

illustration not visible in this excerpt

Figure 4: Sampling

Probabilistic Sampling Each packet in the population is selected indepen­dently with a predefined probability. This probability equals the configured sam­pling fraction.

Non-Uniform Sampling An enhanced sampling method is non-uniform ran­dom sampling, where first packets are classified and afterwards sampled with different probabilities depending on the category they are in. This is done be­cause random sampling in IP networks has one major flaw. The estimation for small flows is very inaccurate. This is caused by the lack of packets from small flows, e.g. a flow that consists of 10 packets where each packet is selected with a probability of 10% has a chance of (0.9)10 = 35% not to be even in the sample. The problem here is that actually a small number of packets introduce the most traffic on the link [ClMT98]. One can improve estimation accuracy by using non-uniform sampling.

1.6.2 Filtering Property Match Filtering

Property match filtering selects a packet if specific fields within the packet and/or properties of the router state equal a predefined value [ZMDN07]. Packet prop­erties include header field values like netmask prefix of source and destination address or the used transport protocol. For instance Internet Service Providers may distribute IP addresses with the same netmask for customers that do not use flatrate services. In this way the Service Provider can easily filter flows that need to be measured for accounting. The selection can also be based on router states like ingress and egress interface of the router or the autonomous system where the packet is originated. Hash-Based Selection

Hash-based selection is a filtering method because the selection decision is solely based on the packet content. In the researched papers the terms hash-based sam­pling, hash-based filtering and hash-based selection are optionally used and all refer to the same selection technique. In this paper the term hash-based selection is used because it complies with the categorization within the measurement pro­cess as defined by the PSAMP working group. Before the hash-based selection technique is explained an introduction to the terminology of hash functions and hash-based selection is given.

illustration not visible in this excerpt

Figure 5: Hash-Based Packet Selection

illustration not visible in this excerpt

The technique of hash-based selection shall be explained with the use of Fig. 5. The content of an IP packet is structured; starting with the IP header, followed by the transport protocol header (e.g. TCP header) and further data which is labeled here as payload. For hash-based selection only parts of the header information and possibly some payload is cropped out from the rest of the packet. Optionally a not publicly known secret key is combined with the selected packet content to make up the hash input. The hash function h assigns to the hash input c a correspondent hash value h(c) by applying bit or bytewise operations on the input. The measurement operator priorly defines a selection range S C R which is part of the hash function's hash range. When the calculated hash value falls into this selection range the packet is selected. Provided that the hash input, hash function and selection range are equal at the different observation points, this technique ensures that the selection decision for every packet is consistent throughout the network.

1.7 Comparison of Selection Techniques

The different selection techniques are compared in Table 1 in terms of resource consumption, security and applicability for multipoint measurements.

Multipoint Measurement applicability Systematic sampling is not suitable for multipoint measurements. One cannot assure to select the same packets at each observation point along the packets path because of packet reordering and packets that are coming along and leave the packet’s path. Random selection techniques are not applicable for multipoint measurements because packets are independently selected at each observation point. If the configured sampling fraction at each observation point is f (f < 1) and one samples at n measurement points, the collector receives on average fn traceable packets, i.e. packets which are sampled at all observation points. Property-match filtering and hash-based selection provide a consistent selection of packets because the selection decision is based on the packet content. Hence selected packets from different measurement points can be correlated.

Bias Random sampling acquires an unbiased estimation of the population’s characteristics, because packets are selected independently. The deterministic selection techniques bear the risk of bias because packets are not selected inde­pendently. For systematic sampling bias is introduced when packets with certain attributes occur in between the selection periods. Because the hash-based selec­tion is based on the packet, the selection decision is suspect to bias. Depending on the traffic in the network, the underlying hash function and the configured hash input, the selection may prefer packets with certain attributes. Property match filtering is biased by definition, because packets with certain packet attributes are selected. Nevertheless there exist approaches to filter packets according to packet attributes that are uncorrelated to other packet attributes. For example Gon Jian [JiGu02] proposes a filtering based on the packet IP identification field. With hon-uniform sampling packets are categorized according to their attributes

illustration not visible in this excerpt

Table 1: Comparison of Sampling Techniques

and different selections probabilities are used for these categorizes, hence the selection decision is content dependent.

Selection Effort In terms of resource consumption systematic sampling is very lightweight because a counter with one random initial number or an internal clock is sufficient to derive the selection decision. Random sampling techniques require a random number generator or a stored pseudo random number table within each observation point. For the random sampling techniques the selection decision can already be made before the measurement interval, i.e. the random packet numbers that are going to be selected can be generated before the actual measurement. This is not possible for filtering methods because the selection decision is based on the packet content. Property match filtering requires the extraction of the filtering information from the packet. The packet content is compared with the filter expression. This method can be done at line speed, e.g. with a berkeley packet filter (BPF). Compared to the other selection methods hash-based selection is very resource intensive. Hash-based selection involves the extraction of the hash input and a hash value calculation for each packet during the measurement.

Security In the scope of selection techniques the term security describes the possibilities of an adversary to intentionally influence the selection probability of packets. Random selection techniques are secure, because an adversary cannot influence the packet selection. For systematic sampling an adversary may gene­rate malicious packets periodically and places them in the middle of the selection period. Property match filtering is very insecure, because the adversary can craft packets that are disproportionally selected very easily in case he knows the fil­ter expression. The security of hash-based selection is discussed in [GoRe07]. Goldberg and Rexford prove that the security of hash-based selection is depen­dent on the underlying hash function and the use of a secret key. They prove that hash-based selection based on a linear hash function is insecure whereas an adversary cannot craft intentionally selected packets with the use of a keyed Pseudorandom Function (PRF). The security topic of hash-based selection will be more thoroughly assessed in Section 4.3.

1.8 Scenarios for Hash-Based Selection

Hash-based selection enables the correlation of packets which are selected at dif­ferent measurements without exporting the whole population. One can follow a small subset of packets traversing through the network with only little measurement data. In case the packets are representative for the whole popula­tion, this subset can be used for estimating the behavior of the whole traffic.

I. Traffic Engineering

Routing Improvement Because hash-based selection can trace a packet through the network, one can identify routing loops and routing inefficien­cies on the packets path. This data can be used to improve routing policies.

Duffield, Grossglauser [DuGr01] and Niccolini et al. [NMRT04] show how paths of packets can be made traceable with the use of packet IDs.

Network Topology Discovery In case of a broad deployment of routers that are able of hash-based selection, it is possible to discover the network topology [EBNC07] without active measurements. One collector can gather information about where packets are traversing, with which delay and loss between each network node. Added with some more information like band­width utilization, the collector can draw a detailed picture of the network topology of the collector's domain. The ingress and egress points matrix can be made visible by looking for points where new packet IDs appear and disappear.

II. QoS Monitoring

Multipoint Metrics Because packets can be correlated between two mea­surement points, the collector is able to calculate multipoint QoS metrics. Service providers can use hash-based selection to figure out if they comply with their Service Level Agreements in terms of one-way delay, one-way loss and derived metrics like delay variation. The measurement of one way delay with the use of hash-based selection is discussed in various papers [DuGr01] [ZsZC01] [RaQB04] [NMRT04].

III. Attack/Intrusion Detection

IP traceback Network attacks are still a wide problem in the Internet. The nature of the IP protocol makes it difficult to identify the true source of malicious packets if the source wants to conceal it. An attacker can generate packets with spoofed addresses (i.e. forging another IP address) and hide his localization. In case hash-based selection is widely deployed the multipoint collector can trace selected packets back to their origin by simply following the packet IDs path [SPSJ01] [SPSJ02]. In case of DoS attacks where multiple packets are sent this is already sufficient.

Another applicable scenario is an adversary using a proxy to hide his net­work address. The adversary sends a packet to a proxy which changes the source network address in the packet and replaces it with its own. In case hash-based selection is deployed and the hash input does not comprise the source network address, the packet is still traceable beyond the proxy.

1.9 Problem Statement and Target of this Work

1.9.1 Problem Statement

Hash-based selection is a deterministic selection of packets and thus it is sus­ceptible to a biased selection decision. The selected subset may include a higher fraction of packets with certain attributes than there is in the population. This biased selection decision concludes to a biased estimation of traffic characteristics. In case packets with e.g. certain lengths are preferably selected it is impossible to give a sound estimation of the packet length distribution.

1.9.2 Targets of this Work Emulation of Random Sampling

Random Sampling provides features which are desirable for network measure­ments. Random sampling selects packets independently which excludes bias and ensures the selection of a representative subset. Based on the representative subset the traffic characteristics can be estimated with predictable accuracy. In case hash-based selection can emulate random sampling, it can be modeled as random probabilistic sampling for which already mathematical models for bias and accuracy are available [Zse03].

The main target of the work is to evaluate with which configuration hash-based selection can emulate random sampling in terms of unbiasedness and representa­tiveness of the selected subset. There are two factors in the hash-based selection technique that have an influence on the selection decision quality:

1. the hash function
2. the packet content used as hash input Packet Fields for Hash-Based Selection

The choice of the packet content (header fields and payload) to be used for the hash input is of utter importance, because this decision can introduce bias. This is shown by the example in Fig. 6. This figure shows two different packets 1 and 2 (only the IP Header) which have similar packet contents. It is assumed that only the first 8 bytes of the packet are used as the hash input. Since these fields are equal for both packets, the same hash input c is formed. Thus both have the same hash value h(c) and the same selection decision. Let’s assume there are more packets like Packet 1 and 2 that are equal in the most significant 8 bytes, all having the length of 276. They all map to the same hash value and either all or none of them are selected. This means that packets with 276 bytes length are either over- or underrepresented in the sampled subset.

This example shows that the selected bytes from the packet content should be sufficient to identify unique packets in order to avoid input collisions, i.e. pack­ets with the same hash input. Especially for flow-based sampling where packets are already categorized according to common properties one has to ensure that packets within a flow can be distinguished. As a restraint it has to be clear that it is not suitable to use the whole packet content as hash input because routers

illustration not visible in this excerpt

Figure 6: Input Collision of 2 Different Packets

that serve as measurement points have only a limited capacity for processing packets and measurements. The hash function has to be applied for each packet and therefore the amount of bytes that is used for hashing should be kept at a minimum, because this will reduce the calculation effort.

One target of the thesis is to find a hash input configuration that includes low input collisions with minimal input length. Hash Functions for Hash-Based Selection

The choice of the hash function is of crucial importance to avoid bias. Hash functions of poor quality do not well distribute the hash input over the hash range. Hash inputs which are similar will be mapped closely together, therefore packets that are similar will have a high probability to possess the same selection decision. This implies bias, because similar packets are not independently selected. Moreover the hash function should be secure to prevent adversaries to influence the selection decision. Contradictory to the security is that the hash function also has to be fast, because the hash value has to be calculated on every packet that is passing the observation point.

In this thesis one target is to find a hash function that is suitable for hash-based selection and is able to select an unbiased sample based on the hash input. Packet ID Generation

An adjacent topic to hash-based selection is packet ID generation. Each selected packet will be assigned a packet ID and a timestamp which are exported to the multipoint collector as described in Sec.1.3. This packet ID is not necessarily the hash-value that is obtained by hash-based selection, because the packet ID re­quires different properties than the hash-based selection hash value. In particular the packet ID requires low collision probability. Hence, related work [DuGr01] [NMRT04] [MoND05] proposes to use two different hash functions for packet ID generation and hash-based selection. A target of this work is to find hash func­tions and hash input configurations that are applicable for packet ID generation. Further it is evaluated under which scenarios one may use one hash value for packet ID generation and hash-based selection.

1.10 Document Structure

The document is structured as follows:

In Chapter 2 existing literature in the scope of hash-based selection is presented. Further this chapter shows existing approaches to evaluate the hash input and hash function for hash-based selection. Chapter 3 is dedicated to the evaluation of IPv4 and IPv6 packet content for hash input. First the investigated traces will be presented, followed by the empirical analysis. The results of the analysis will give suggestions which parts of the packet content shall be used as hash input.

In Chapter 4 quality criteria for hash functions will be defined. After some consid­erations about security issues, a collection of 25 hash functions will be evaluated according to these quality criteria. Each hash function will use the recommended combination of header fields from chapter 3 as hash input.

In Chapter 5 the applicability of hash-based selection for different traffic classes will be investigated. The influence on the selection decision and reasons of du­plicate packets is discussed.

In Chapter 6 it is analyzed which hash functions are suitable for packet ID ge­neration and if hash-based selected packets need an extra packet ID hash value.

The concluding Chapter 7 will summarize the results. Still open problems will be presented for future research.

Chapter 2 State of Art

At first relevant work on multipoint measurements which is in the scope of hash- based selection will be presented. Later relevant work for hash-based selection is analyzed according to the targets of the work: 1) which packet content to use and 2) how to evaluate hash functions for hash-based selection and packet ID generation.

2.1 Passive Multipoint Measurements

Passive multipoint measurement techniques have been proposed for packet tra­cing and one-way delay measurements in [RaQB04] [NMRT04] [ZsZC01][DuGr01]. In contrast to active measurements, passive measured packets are not marked but their occurrences recorded at the measurement point. For this purpose a packet ID is generated based on the packet content. In case different packets have the same packet ID in the collector domain, the collector needs to identify which packet ID belongs to which packet. In [DuGr01] Duffield and Grossglauser show which trajectories of colliding packet IDs are ambiguous and which can be dis­ambiguated. In Fig.7 two exemplary trajectories of two packets with the same packet ID originating at the ingress nodes A and B are depicted. Whereas in example (a) the trajectories of both packets can be fully reconstructed, there are multiple possible packet traces in example (b). Although it is possible to

illustration not visible in this excerpt

Figure 7: Trace Disambiguation of Two Packets with Same Packet ID

disambiguate the trajectories in example (a) it is not possible to measure the delay between the nodes, because one cannot resolve in which order the packets originating at A and B arrive at C and E.

Again Duffield [DuGG02] [DuGr04] and Niccolini et al. [NMRT04] show algo­rithms to differentiate packets according to their timestamp. They argue that the only place to identify packet ID collisions is at the measurement domains ingress nodes, i.e. nodes where traffic is introduced into the network. If a packet ID is observed more than once at any ingress node within a given time window than all observations of this packet ID are excluded from delay measurements. The time window has the size of the measurement domains maximum delay tmax, starting at the first occurrence of the packet ID at an ingress node. In order to compensate for the eliminated packet samples, the sampling fraction is adjusted by the packet ID collision probability.

In [DuGr04] Duffield and Grossglauser further propose the use of a Bloom filter to check whether a packet ID is already present in an unreliable network. A Bloom filter consists of a bit array of size M and a set of v hash functions with codomains [1..M]. Starting off with an empty Bloom filter where all bits of the array are set to zero, the packet IDs is used as the hash input for each of the v hash functions. The v hash values are then coded into the bit array by setting each bit that corresponds to the hash values to one. If the same packet ID arrives again, than the bits for this packet ID are already set within the bit array and all observations of this packet ID will not be used for delay measurements. Although all duplicate packet IDs will be identified with this technique, there is a chance of false positives, i.e. packet IDs that have not been observed yet but are identified as duplicates, because the bits in the array were already set by other packet IDs. Duffield and Grossglauser [DuGr04] propose to compensate for these accidental discarded packets by adjusting the sample size.

An approach by Eriksson et al. [EBNC07] enables the reduction of the obser­vation points by deducing the network topology based on the IP time to live value. He proposes to send additional the TTL together with the packet ID to the collector. With this approach it is for example possible to trace a packet in Fig. 1 traversing from Point A to B to C without measuring at point B, because these packets will have a reduced TTL of 2, whereas packets directly traversing from A to B only have a reduced TTL of 1.

Raspell et al. [RaQB04] proposes a method to configure passive measurements using signaling based on the NSIS Transport Layer Protocol (NTLP). NTLP packets are send from a source to a destination configuring each visiting router. In this way measurements for a specific flow can be configured, e.g. an adjust­ment of the selection range or defining the start of the measurement interval.

An IP traceback system as proposed by Snoeren et al. [SPSJ01] [SPSJ02] uses a similar technique to hash-based selection in order to trace single malicious pack­ets to their destination by storing hash-values over invariant packet content in Bloom filters. Wang et. al [WaYL06] presents a heuristic method for traceback systems that derives the amount of required routers to identify an attack within a predefined hop-count distance.

2.2 Packet Header Fields useful for Hash-Based Se­lection

Duffield et al. [DuGr01] is the first to present the hash-based selection method for tracing packets through the network. In order to distinguish packets throughout the network, he declares that those packet fields should be used for hash-based selection and packet ID generation that are

1. constant between nodes, but
2. variable between packets.

He divides the IPv4 header fields into three categories 1) variable, 2) low entropy and 3) high entropy fields. He identifies the TTL, type of service and header checksum as variable fields which cannot be used for hash-based selec­tion, because they change between network nodes. Low entropy fields are ver­sion, header length and protocol. High entropy fields are source and destination address, identification, fragment offset, flags and total length. The low and high entropy fields are invariant between network hops and therefore can be used for hash-based selection.

In order to find out the amount of bytes that should be used for hashing, Duffield identifies collisions between packets when variable fields are masked out. Colli­sions are here defined as different packets that have the same hash input value, because not the whole packet is used as hash input. He uses traces of packets that are recorded at a campus network LAN interface. His results show that the input collisions drop the more bytes are used. Increasing the hash input length beyond 40 consecutive bytes of the packet does not reduce the collisions any fur­ther.

Molina [MoND05] uses the same approach to test for collisions but after applying the hash function. He evaluates how many hash values collide depending on the input length. With increasing hash input length the collision rate drops until a certain length after which there is no further significant decrease.

Snoeren [Sn01] uses the hash-based approach for IP traceback systems and iden­tifies unique packets by considering different input length. All three authors (Duffield, Molina, Snoeren) conclude that an input length between 28 and 40 consecutive bytes (without variable fields) is sufficient for hash-based purposes. Niccolini et. al[NMRT04] uses the same header fields like Duffield and Molina except the fragment offset and flags fields because they can change at network hops. Additionally he uses an arbitrary amount of 12 byte long blocks of the transport header.

The PSAMP [ZMDN07] working group advises the use of IPv4 identification field, flags field, fragment offset, source IP address, destination IP address and a configurable number of bytes from the IP payload, starting at a configurable offset. Duffield, Snoeren and Molina only consider general assumptions about header fields that are suitable for hash-based selection but lack the basis of an empirical and bytewise analysis.

The entropy-based approach used in this work enables the ordering of header bytes according to their suitability for hash-based selection. This systematic ap­proach is used in order to find a smaller subset of bytes that show similar or better results than the recommended PSAMP bytes.

Entropy-based measurements have already been proposed in the field of selection techniques by Gong Jian [JiGu02]. He analyzes the entropy of IPv4 Header fields per bit and byte on a CERNET backbone link. The analysis includes the first 20 bytes of the IPv4 header of 10 million different packets. He shows that there are major differences of entropy and that only the identification field has an entropy efficiency of close to 1. He proposes to sample according to the IP Identification field because it is variable and does not change during measurement points. Fur­thermore entropy based measurements have been proposed for network anomaly and worm detection in IP networks [WaPl05] [LaCD05]. In case of network at­tacks the distribution of packet characteristics change which can be detected by entropy measurements. Usually the entropy of e.g. address bytes drops because the attack traffic is less variable in those bytes than the normal traffic.

2.3 Existing Evaluation Approaches for Hash Func­tions

A general performance comparison of cryptographic hash functions can be found at [Dai] [Erh]. The MD5 and SHA-1 algorithm are well known and show com­parably good calculation time. Nevertheless cryptographic hash functions are generally very computation expensive. Mulvey [Mul] and Jenkins [Jen] show ap­proaches to evaluate hash functions on their randomness and uniformity. They uses the Avalanche criterion and chi-square uniformity tests to evaluate hash functions for general purposes.

Evaluation of Hash Functions for Hash-Based Selection Molina [MoND05] analyzes four hash functions (CRC32, MMH, IPSX, BOB) on their suitability for hash-based selection and packet ID generation. As criteria for the suitability for hash-based selection he uses performance and the ability to uniformly distribute the hash values. He argues that the uniformity of hash values is a quality criteria for hash-based selection as the selection range can be easily configured to gain a certain sample size. Further Molina analyzes the collision probability of the hash values to evaluate their applicability for packet ID generation. He uses syn­thetically generated packets and real packet traces for the evaluation. The BOB hash function performs slightly better than the 2nd placed CRC32 function for hash-based selection and packet ID generation.

Zseby [ZsZC01] compares 6 hash functions for packet ID generation (40 bytes of unprocessed fields, CRC-16, a 16-bit hash function, CRC-32, a folded 32bit MD5 and MD5-128bit) on 1) the size of the resulting ID, 2) probability of collision 3) processing time and 4) effort to encode additional information. The measure­ment results are based on a modular meter implementation. The unprocessed fields do not require any processing and include minimal collisions, but imply maximum measurement traffic. The 16 bit hash values include large amounts of collisions and the calculation time is not significantly less than for CRC 32bit hash function. CRC-32, folded 32bit MD5 and 128 bit MD5 provide low collision probability for artificial packets and for a real RTP-Flow.

Duffield [DuGr01] evaluates a simple modulus hash function using four different traces from a campus network. He uses the chi-square independence test in order to analyze if there is dependence between network address prefix and sampling decision. He investigates dependence between single bits and sampling decision. There has been no statistical sign for dependence if 40 input bytes where used. Furthermore he evaluates if hash-based sampling can genuinely estimate fractions of packets with common network addresses. He reasons that good hash functions can ’’randomize” sampling decisions such that the set of sampled packets are re­presentative.

Niccolini et. al [NMRT04] and Raspell [RaQB04] use the MMH hash function for hash-based selection and packet ID generation, because of its calculation speed and uniform hash value distribution.

The PSAMP[ZMDN07] working group advices that the BOB hash function should be used for hash-based selection whereas IPSX and CRC may be used.

Molina analyzes only 4 hash functions solely on their performance and uniform hash value distribution as criteria for hash-based selection. Nevertheless, one can adjust the selection range according to any hash value distribution in order to obtain a certain sample size. In this work it is reasoned that unbiasedness and representativeness of the selected subset are more important for accurate measurements than the uniformity of hash value distribution. Duffield evaluates only one linear hash function on the independence of sampling decision and one packet attribute (network address).

In this work a collection of 25 hash functions will be analyzed on the independence of multiple packet attributes and the sampling decision. Further, the chi-square goodness-fit test is used to evaluate if hash-based selection is able to produce a representative subset of the observed packet stream and hence can emulate ran­dom sampling.

Chapter 3 Evaluation of Packet Content

3.1 Approach

3.1.1 Applicable Header Fields

The IP and transport header fields upon which the hash-based selection decision is based require two properties. The header fields need to be

1. static between network nodes. The header fields need to be static be­cause the sampling decision is required to be consistent throughout the net­work. If variable header fields are used as hash input the sampling decision based on these fields will differ from measurement point to measurement point. Header fields that are not static between network nodes are: Time to Live and IP Checksum. In case of packet fragmentation on the packets path, the packet length, fragmentation flags and fragment offset fields may change. Nevertheless it is here assumed that most fragmentation issues are negotiated between the end nodes and that fragmentation in the network is rare. Hence packet length, fragmentation offset and fragmentation flags are considered to be static.
2. variable among packets. In order to prevent bias caused by input col­lisions it is favorable that the hash input of every unique packet is unique as well (as shown in The higher the variability in one header field the higher the probability that two packets differ in this field.

In this section the entropy per byte is used to evaluate the variability of each header byte. The higher the entropy in a byte, the higher the probability that packets differ in this byte and hence the byte is more suitable for hash-based selection. In the second part of the evaluation it is analyzed which influence the choice of high entropy bytes has on the probability of hash input collisions.

3.1.2 Entropy per Byte

The entropy per byte [JiGu02] is defined as:

illustration not visible in this excerpt

Where B is the discrete variant of byte values; pi is the probability that the byte value i occurs. The maximum entropy denotes the state that all possible byte values have the same probability of occurrence. For scaling purposes the byte entropy is devided by its maximum value Hmax to obtain the information efficiency E [JiGu02].

illustration not visible in this excerpt

The information efficiency E measures the byte randomness. An information efficiency of 0 denotes no randomness (i.e only one value occurs) and a value of 1 denotes maximum entropy.

3.2 Traces Analyzed

illustration not visible in this excerpt

Table 2: Traces Used for Evaluation

When performing experiments which involve packet content it is inevitable to use real traffic traces. Since traffic differs between locations in the Internet (backbone, edge-routers, campus, hosts), geographical locations and time, one can only use many different traces in order to reflect possible scenarios.

The evaluations will be based on a set of 7 trace groups which are recorded at different observation points. All traces can be accessed via the MOME database [MOME]. The NZIX traces [NZIX] were recorded by thee WAND research group at a peering; point among a number of major IST5 ’s in New Zealand. FH Salzburg trrces are captured at an WAN Access Network on a student campus. The Twente traces are captured at in the netherlands at an aggregated uplink of an ADSL access network with a couple of hundred ADSL customers, mostly rtudent dorms. There are two trace groups that were recorded by a large european telecommunication operator and are noted as LEO1 and LEO2. The MAWI traces [MAWI][ChMK00] are the only accessible IPv6 traces. The traces were captured at two samplepoints at lines connected to the 6Bone and WIDE-6Bone. The 6Bone is located on a FastEthernet segment connected to an IPv6 internet exchange point in Tokyo. All but the Ciril and IPv6 MAWI traces include parts of the transport header (max. 20 bytes).

Anonymization Most traffic traces are anonymized in order to protect data privacy. This anonymization implies that the IP address fields (source and des­tination) are irreversibly scrambled. The IP addresses of the Twente and FH Salzburg traces are anonymized with the Tcpdpriv ”-A50” option. This option ensures that within a trace two original addresses which equal in the most sig­nificant n bytes will have scrambled addresses that are equal in these n bytes as well. For example, the two source addresses a.b.c.1 and a.b.c.7 will be mapped to x.y.z.n and x.y.z.k. For the other anonymized traces (NZIX and MAWI) the mode is not known. The LEO 1 and LEO 2 traces are not anonymized.

Classes of Autonomous Systems According to [DKRC06] Autonomous Sys­tems can be categorized in six classes.

1. Large ISPs: Large intercontinental backbone providers.
2. Small ISPs: Regional and access providers with small metropolitan or larger regional networks.
3. Customer ASes: Companies or organizations that run their own networks but do not provide Internet access.
4. Universities: University or college networks. These networks are distin­guished from members of the Customer AS class, since they typically have substantially larger networks that serve thousands of end hosts.
5. Internet exchange points (IXPs): Small networks serving as interconnection points for ISPs.
6. Network information centers (NICs): Networks hosting important network infrastructure, such as root or top level domain servers.

This categorization in classes of autonomous systems is applied to the measure­ment points of the available traces in Table 2. The LEO1, LEO2 and Ciril traces were captured in a large ISP network but the measurement point location is un­known. The FH Salzburg traces were captured in an University network at a WAN access point. The Twente traces are not assigned in the class of Univer­sity networks, because the traffic is captured at an ADSL uplink that provides Internet Access. The NZIX and MAWI traces were captured at peering points between a number of ISPs in New Zealand and Japan.

Although each trace is assigned to a specific AS class it cannot be assured that the presented traces are representative for this AS type. The traffic can differ depending on the geographic and network location and capturing time.

Protocols and Applications Table 3 shows that each trace group represents a different traffic scenario indicated by their different transport protocol mix and applications. Applications are assessed by the transport header port addresses. Because Ciril and MAWI traces do not include transport headers their protocol mix could not be evaluated. The FHSalzburg traces consist to 90% of http traffic. The LEO1 traffic is composed of 25% edonkey traffic and 5% http traffic; other protocols do not make up more than 1% of the traffic. The traffic captured in

illustration not visible in this excerpt

Table 3: Traces Protocols

the trace group LEO2 consists to 60% of the layer 2 tunnel protocol over UDP and 10% of edonkey traffic. The Twente traces include a variety of applications with only 2% http traffic and applications each with at most 1% of the traffic. The NZIX traces consists of about 50% http and 5% Quake packets.

3.3 Entropy Evaluation

The entropy per byte is calculated with a tool written in C++ which uses the libpcap library to read packet traces. The tool records the frequencies of each byte value and calculates the entropy.

3.3.1 IPv4

illustration not visible in this excerpt

Figure 8: IPv4 Header

All six IPv4 traces were used for measuring the entropy of the IP Header (cf. Fig. 8). The IP source and destination address fields from Twente, NZIX, FH- Salzburg traces are anonymized and therefore do not reveal the real entropy. The Twente and FHSalzburg traces show the real entropy in the most signifi­cant byte, because they were encoded using the tcpdpriv ”-A 50” option. Figure 9 illustrates the results. The diagram does not include the Time To Live and Checksum field as these are variable from hop-to-hop.

Version, IP Header Length and the least significant byte (LSB) of Fragment Offset show an entropy very close to zero and are unsuitable for hash-based se­lection. The identification (ID) field has the highest entropy which shows that the values are randomly distributed. But the identification field bears a problem; many packets include an IP ID of 0 caused by differences in IP ID handling of operating systems.

Since there exist many small packets with a length below 255 bytes, the entropy for the most significant byte (MSB) of the length field is comparably low (about 0.2 information efficiency). The figure shows that there is an increase of entropy

illustration not visible in this excerpt

Figure 9: Information Efficiency IPv4

from MSB to LSB of the IP source and destination addresses. Nevertheless, it is expected that the information efficiency of all 8 address bytes is less than all its components, because there is a strong correlation between the address bytes. The highest information efficiencies are found in these bytes: Identification, Source Address (two LSB), Destination Address (two LSB), Length(LSB).

3.3.2 TCP

illustration not visible in this excerpt

Figure 10: TCP Header

For analyzing the TCP header fields on their entropy, the Ciril and MAWI traces had to be omitted, because they do not include any transport header data. The TCP Header (Fig. 10) follows the IP header and comprises of at least 20 bytes depending if Options are present. All header fields are applicable for hash-based selection because they are invariant between network nodes. The urgent pointer field is never used in the analyzed traces and has an entropy of 0 and is therefore not depicted in Figure 11. Only the two bytes comprising offset, reserved and control bits include low entropy. Other header fields show high entropy. Because the FH Salzburg traces comprise mostly of http traffic, the source port entropy is significantly lower compared to other traces. The most suitable fields for hash- based selection with an information efficiency close to 1 are: Checksum, Sequence Number and Acknowledgment Number.

illustration not visible in this excerpt

Figure 11: Information Efficiency TCP

3.3.3 UDP

illustration not visible in this excerpt

Figure 12: UDP Header

The UDP Header consists of 8 bytes which are all invariant between network nod es. Figure 13 illustrates the results of the entropy analysis. The LSB of the port addresses show higher entropy than the MSB. For the FH Salzburg traces which includes mostly http traffic, the entropy of the port addresses is lower than for the other traces. Most UDP packets are DNS requests emerging through the usage of http.

In the LEO2 traces the checksum is not used and set to zero for most packets.

illustration not visible in this excerpt

Figure 13: Information Efficiency UDP

Most of the packets with no UDP checksum include the Layer 2 Tunnel Protocol with IP and TCP in its payload. Packets with the Tunnel Protocol include the same source and destination port numbers 1701 which causes the low entropy for the LEO2 traces in port addresses as well. The Layer 2 Tunnel Protocol is applied in virtual private networks (VPN). Although the Checksum is optional it is still advisable to use the UDP Checksum field for hash-based packet selection as it includes very high entropy. Other fields with high entropy are: SrcPort (LSB), DstPort (LSB), Length (LSB).

3.3.4 ICMP

The Internet Control Message Protocol (ICMP) differs in length and content depending on the type of message. All ICMP packets have four bytes in common: 1 Byte Type, 1 Byte Code, 2 Bytes Checksum. Following bytes can include other internet packets, an identifier, a sequence number and/or timestamps. Since there are only few ICMP packets in the analyzed traces it is refrained to distinguish each ICMP message type to analyze the entropy. Figure 14 illustrates the results. The entropy is calculated without associating any specific field to the value, hence Byte 5 denotes the fifth byte in the header. The FHSalzburg traces did not include a sufficient amount of ICMP packets, therefore¡m the entropy is very variable and is not shown in the diagram. The following bytes have the highest information efficiency and are most suitable for hash-based selection: Checksum, Byte 12-13, Byte 18-19.

illustration not visible in this excerpt

Figure 14: Information Efficiency ICMP

3.3.5 IPv6

illustration not visible in this excerpt

Figure 15: IPv6 Header

The MAWI traces are used to evaluate the variability of fields in the IPv6 header (cf. Fig. 15). The MAWI traces are captured at 2 different measurement points C(6Bone) and D(WIDE_6Bone). Both sampling points are evaluated separately. The IPv6 Header is at least 40 bytes long. IPv6 allows to chain extension headers by using the next header field. Only the first header is considered for evaluation. The Hop Limit field is similar to the Time To Live field of IPv4 and is decremented each network hop and therefore cannot be used for hash-based selection. Since the source and destination addresses are anonymized, the entropy of the original fields cannot be recovered from the original addresses. The results for the remaining 7 bytes are shown in Fig. 16.

It is obvious that there is only very low variability in those bytes. The Version field (first 4 bits) is static because only IPv6 packets are considered. The only byte that includes some more entropy is the LSB of the Length field which might be useful for hash-based selection purposes.

illustration not visible in this excerpt

Figure 16: Information Efficiency IPv6

illustration not visible in this excerpt

Table 4: Bytes Recommended for Hash-Based Selection

3.3.6 Conclusion Entropy Analysis

The entropy of the packet varies from byte to byte. For certain bytes only very few values are used (e.g. protocol) or title distribution of the values is concentrated around few values (e.g. MSB of IP length). The entropy of packet bytes is strongly dependent on the applications within the traces. The LEO2 traces that have a comparably low entropy caused by tunneled packets. There are certain bytes in the IP and transport layer header that are more suitable for hash-based selection than others. The amount of bytes used for hashing should be kept at a minimum because every additional byte increases the hash calculation and processing time at the measurement node which may cause buffer overflow. The bytes with the highest entropy are most suitable for hash-based selection as there is a higher probability that packets differ in high variable bytes. A summary of fields with high entropy is shown in Table 4. Although there are some high variable fields in the IPv4 header it is shown that it is not sufficient to use Identification, Length, Source and Destination Address. In the analyzed traces there are packets whose ID field is often zero with the same source and destination address. This is caused by different handling of the identification field in operating systems. Consequently one has to include more variability by using transport header fields as well. Especially the TCP header includes many high entropy bytes, whereas UDP lacks some. It has to be assessed if the variability for UDP is sufficient or if some payload bytes should additionally be used. The ICMP packets are usually less frequent in a packet stream. Because of the similarity and the short length ICMP packets easily collide in the hash input if not enough bytes are used. As ICMP and UDP packets are strongly similar in their headers one may even consider to use an additional amount of bytes for the different transport protocols in order to enable a better differentiation of packets. These bytes may consist of further bytes of the headers or payload.

IPv6 packets The IPv6 address fields could not be evaluated because of anonymiza­tion of the MAWI traces. Nevertheless it is insufficient to use only the 40 byte IPv6 Header because at one observation point many packets will include the same destination and source address. Other bytes do not show high variability and cannot help to distinguish all packets. Hence it is required to use additional bytes from the transport header. In future, when IPv6 traces with transport header data become available it has to be evaluated if the transport headers used with IPv6 show similar entropy as the IPv4 transport headers.

3.4 Hash Input Collisions Evaluation

3.4.1 Input Collisions

An input collision consists of multiple packets that provide the same hash input. As pointed out in Section, packets with the same hash input introduce bias to the hash-based selection. All packets within an input collision have the same sampling decision and therefore the packets are either over- or underrepresented in the sampled subset. There are two reasons why packets hash inputs' collide: 1) the colliding packets are identical or 2) the packets are not all identical but the selected bytes for hash input are equal. Input collisions of non identical packets can be reduced by using some more bytes as hash input, but to deal with identical packets is difficult. In order to cope with identical packets that occur randomly, it is recommended to change the selection range periodically and synchronically at all observation points. In this way it is ensured that packets with large interarrival times have different selection decisions and bias can be reduced.

Identical Packets While analyzing the packet traces it became obvious that there is a high amount of identical packets that occur at an observation point. Here it is defined that identical packets are packets that are identical for the whole captured snaplength including the IP and transport header except the IP Time to Live and IP Checksum field. The IP Time To Live field and IP Checksum are excluded because identical packets can take different paths to the destination or may take detours in routing loops.

Of course it is not sure that for equal packet headers the packets are all identical. Nevertheless, the UPD, TCP and ICMP checksum is calculated over the whole packet content and hence packets with equal IP and transport header are likely to be identical. For TCP packets with equal acknowledgment and sequence number an identity is even more probable. In the presented evaluation all packets with equal headers (IP and transport) are labeled to be identical, because there is not more data available to distinguish the packets for traces where the payload is missing. This is done despite the awareness that these identical labeled packets may not be identical (e.g. for UDP packets that do not use the optional UDP checksum).

3.4.2 Experimental Setup

A tool written in C++ cuts a combination of fields from all packet and matches double values. As a result, one obtains all input collisions with the corresponding number of occurrences.

In order to measure the amount of collisions caused by identical packets, the collisions that appear for the whole IP and Transport header (except TTL and IP Checksum) are analyzed. Secondly, 8 high entropy bytes are used in order to compare how many additional input collision are implied by using only parts of the packet content. The amount of identical packets will be compared to collisions caused by a bad hash input configuration (only IP header without checksum and TTL), an 8 high entropy bytes hash input configuration and a 16 bytes combination used by Molina in [MoND05].

1. Recommended 8 bytes - based on the entropy evaluation results the IP identification field and 6 Bytes depending on the transport protocol are chosen: TCP (Checksum, 2 LSB of Sequence and Acknowledgment Num­ber), UDP (Checksum, Source Port, LSB Destination Port, LSB Length), ICMP (Checksum, Bytes 12,13,18,19)
2. Molina’s 16 bytes - Molina [MoND05] proposes the use of these 16 bytes: Total Length, Identification, Source and Destination IP address and the 2nd 32 bit word of the transport header.

3.4.3 Results Comparison of Header Byte Combinations

The size of a collision is the amount of packets with the same hash input. Large collisions are of more concern than small collisions as the colliding packets in large collisions all have the same selection decision and are either over- or under­represented in the selected subset. The same amount of packets within several small collisions is less crucial, because the packets from different small collisions can have different selection decisions and there is no ”all or none” decision. Hence the largest input collisions are of major concern.

Here the 20 largest input collisions in each trace are analyzed. The amount of packets contained in these input collisions were added for the whole trace group. For all traces there is a great amount of packets that occur in bursts and have different TTL and IP Checksum values but are equal in the remaining bytes. Detailed analysis showed that these packets are caused by routing loops or the lack of the IP ID field (IP ID=0). The results in Tab. 6 are explained by an example of the 36 Twente trace files that are evaluated. The 20 largest collisions of each trace include 13,120 packets because the packets are identical. Using only the IP header as hash input about 475,000 packets are observed in the largest collisions. With the use of the recommended 8 Bytes 16,004 packets collided, whereas with Molina’s 16 bytes 49,477 packets collide. For the FHSalzburg trace group, the content combinations of Molina and ours identify the same amount of identical packets. Our subset shows a similar amount of identical hash inputs for the NZIX traces whereas Molina’s subset has 3 times more collisions. The LEO2 trace group that consists of VPN tunnel traffic shows a different result: Molina’s 16 byte subset includes less collisions. This is caused by the low entropy of our recommended bytes in the LEO2 UDP header fields, especially the missing UDP

illustration not visible in this excerpt

Table 5: Largest Clusters of Identical Packets in Traces

illustration not visible in this excerpt

Table 6: Packets in Collisions for Different Selected Byte Combinations

checksum causes more identical packets. From the results of the input collision comparison it is concluded that it is important to use high variable fields as hash input in order to keep hash input collisions at minimum. The recommended combination is sufficient for most traces but can produce larger clusters as in the LEO1 Trace Group. In order to improve the results one can add some more high entropy bytes which are proposed, in Table 4. Traces Suitable for Hash Function Comparison

The evaluation of input collisions also identifies traces that are not suitable for hash-based selection. Many input collisions in a trace that are caused by identical packets lead to a biased selection. This bias is not introduced by the hash function but by input collisions. Hence, the traces that have very few input collisions are best suited for evaluating hash functions. As shown in Table 5 the trace groups from NZIX and LEO2 include more than 8% of identical packets. Although the LEO1 trace group includes only 0.35% identical packets, high amounts of packets share the same hash input in the 20 largest collisions (cf. Tab. 6). The NZIX, LEO1 and LEO2 trace groups will imply bias for the hash-based selection decision and should not be used for hash functions comparison. In Section 5 the reason for doubled packets in all traces will be discussed more thoroughly. This may enable a classification of traffic traces where hash-based selection is applicable. Because the Twente and FH Salzburg traces include only few collisions the evaluation of hash functions in the next chapter will be based on them.

Chapter 4 Evaluation of Hash Functions

In this chapter a set of hash functions will be evaluated on their suitability for hash-based selection. The main goal is to find a combination of hash input and hash function which selects a representative subset of packets. Quality criteria are defined upon which each function is tested. Further security issues are discussed when using hash functions for packet selection purposes.

4.1 Desired Properties

Following requirements a hash function intended for hash-based selection has to fulfill:

1. fast calculation time
2. nonlinearity
3. unbiasedness
4. representativeness of sampled subset

The hash value has to be calculated on each packet which is captured at an ob­servation point. Hence, performance is very critical for measurements because observation point like network routers only have a limited amount of processing capacity. Measurements shall not influence the normal operation of the router and only a small contingent of the nodes resources are allocated for measure­ments.

The second criterion is the nonlinearity of hash values, i.e that hash input are distributed randomly over the hash range. In case packets which only differ in some bits of the hash input are mapped closely together, they have the same sampling decision and the sampled subset is biased. This may even be exploited by an adversary who intentionally crafts packets in order to influence the sam­pled subset (more in Sect. 4.3). The Avalanche Criterion is used to assess the ability to distribute hash values randomly over the hash range.

The third requirement is that the selection decision of the hash function based on an hash input should be unbiased. The selection should not favor any packets or bitstreams to others in the selection decision. The chi-square independence test is used to analyze if the sampling decisions based on the hash input and the hash function is independent to packet attributes.

The last requirement is that of the representativeness of the sampled sub­set. The distribution of packet attributes in the sampled subset have to be the same as in the population. The chi-square distribution test also known as the Pearsons’s goodness-of-fit test compares the proportion of packet attributes in the sampled subset and the population. It will be shown that unbiased sampling leads to representative subsets. The chi-square independence test will be derived from the chi-square distribution test in order to prove that hash functions that sample unbiased to a certain packet attribute also select a representative subset too.

Following requirements are proposed by Molina [MoND05]. These requirements may be fulfilled for hash functions intended for hash-based selection, but should be fulfilled for hash functions creating packet IDs.

- uniform distribution
- low collisions

Uniform distribution and low collision probability are two concurring criteria for hash functions. In order to obtain low collisions the hash function has to distribute the has values uniformly over the hash range. For hash-based selection it is more convenient to adjust the selection range for a certain target sampling ratio if the hash values distribute uniformly although uniformity is not a prerequisite. For uniform hash values and a certain target sampling fraction t %, the selection range has to consist of t% values of the hash range. For non­uniform hash value distributions the selection range has to be delicately adjusted according to the hash value distribution (empirically measured) in order to obtain a certain target sample size.

Collisions are critical for cryptographic purposes like verifying the integrity of data (see section 4.3). The hash function should be collision resistant for packet ID generation as well. Packet IDs are intended for distinguishing different packets at the multipoint collector. In case of packet ID collisions packets may not be properly distinguished and need to be discarded which decreases the techniques efficiency. For hash-based selection collisions do not matter as long as they appear randomly and there is no dependency between any packet property of the colliding packets. Random colliding packets are still representatively and unbiasedly selected.

4.2 Collection of Hash Functions

The computational overhead introduced by hash-based packet selection is critical for multipoint measurements. The resource consumption at the observation point has to be minimized. From previous tests [MoND05] it is already known that CRC32 is already very slow for hash-based selection purposes. In comparisons of cryptographic hash functions [Dai] [Erh] it is shown that most cryptographic hash functions are even slower than CRC32. Therefore the hash function col­lection that is going to be analyzed consists of 23 fast non-cryptographic hash functions. All hash functions have a 32 bit digest (hash value) which makes it

illustration not visible in this excerpt

Table 7: Hash Function Collection

more easy to compare the functions. The cryptographic hash functions SHA1 and MD5 are added to the collection as well. These cryptographic functions shall be compared to the other functions in order to assess if they show better qual­ity. The SHA-128 bit and MD5-160 bit hash values are trimmed to 32 bits by adding up each 32 bit subblock of them. The functions were available at various online sources [Lov][Adl][Jen][Mul][Neu][RFC3174][Plu][HaKr97][Nol][Schm]. An overview of the Collection is shown in Table 7.

4.3 Security Issues for Hash-Based Selection

A recent paper published by Goldberg and Rexford [GoRe07] deals with secu­rity vulnerabilities arising; with hash-based selection. In the context of packet sampling serurity is broken if an adversary can influence the selection process in a way, that the adversary’s packets are disproportionally included in the the sampled subset.

Need for keyed hash function Goldberg and Rexford demonstrate that only a keyed hash function is secure. A keyed hash functions appends some secret key to the hash input before calculating the hash value. Suppose a unkeyed hash function is used for hash-based packet selection and the hash function and selection range is known to the adversary. If the adversary does not want to be billed for his traffic he calculates the hash value prior to sending his packets and ensures by padding some additional bits that the hash value does not fall into the selection range.

Even if the selection range is secret the adversary can learn the range by looking at his bill. He crafts packets that have hash values in a small range A. If A is disjoint with the selection range S than the packets are not selected for accounting. Even if the packets are selected the adversary learns that A and S overlap and will use a different range A in the next billing period until he finds a range that is not included in S.

These examples demonstrate that secure hash-based selection requires a secret key and a private selection range at all observation points. The secret key can be 1) an initial value for the hash function 2) some bytes that are appended to the hash input or 3) some private hash function parameter.



ISBN (eBook)
ISBN (Book)
File size
1.8 MB
Catalog Number
Institution / College
Technical University of Berlin
evaluation hash functions multipoint sampling networks



Title: Evaluation of Hash Functions for Multipoint Sampling in IP Networks