Lecture9-12-01

Course OR 778

Class Notes 9/12/01

Last Time:

- Improved analysis of tail of Response Size Distributions

- Pareto(1.2) gave acceptable (?) fit

- So did Pareto(1.5) ??

- log normal was (surprisingly ?) close but inadequate

- how should we think about "heavy tails"????

- in context of:

Heavy tailed durations

Long Range Dependence

Interesting "news" report from UNC

Date: 31 Aug 2001 14:28:23 -0400
From: ITS Change <itschang@hapi.isis.unc.edu>
Newsgroups: unc.support
Subject: [support] MAJOR: Rate limits in place for certain file-sharing applications

At approximately 2:30PM on August 28th, a rate limit policy was instituted on the campus Internet router, limiting traffic from two file-sharing applications (KaZaA/Morpheus and Gnutella) to T1 network capacity for inbound traffic and T1 capacity for outbound traffic. This action was taken to maintain the integrity of the campus network infrastructure and to ensure appropriate bandwidth for those applications that are critical to the education and research mission of the University.

Over the past week, we had noticed that the combined inbound and outbound traffic from the two aforementioned applications represented more than two to three times the traffic from all Web transactions across the campus Internet link. In the case of the South Campus residence halls, we had noticed that the link was nearly saturated with approximately 65% of that traffic representing KaZaA and Gnutella traffic. In those residence halls, a number of students had reported extreme difficulties accessing Web pages necessary for their course work during that period.

Given the need to ensure the availability of critical applications over the network, as well as the mounting costs associated with commercial Internet bandwidth (180 Mb/sec of commercial Internet bandwidth costs between $650,000 and $1,000,000 per year), ATN decided that it was necessary to impose a rate limit on these two applications (KaZaA and Gnutella). We believe it is much more desirable to limit the traffic related to these applications than it is to block it all together.

ATN, through both the Security group and ResNet, has been educating the campus community on the issues associated with copyright and file sharing applications and will continue to do so.

Aside on Response Size Distributions

Interesting graphic by Felix Hernandez, UNC Computer Science

21 log-log CCDF plots of response sizes (log CCDF vs. log size):

- from 4 hour blocks

- taken at 3 different time periods

- for 7 consecutive days

Notes:

- general shapes similar to before

- surprisingly consistent "kinks"?

- i.e. "tail index" changes in a systematic way

- motivates "tail index curve" idea??
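A minimal sketch of how one such log-log CCDF curve and its tail slope could be computed. The simulated Pareto(1.2) sample stands in for real response sizes, and the helper names and the 0.9 quantile cutoff are illustrative assumptions, not the method behind the graphic.

```python
import numpy as np

def empirical_ccdf(x):
    """Return the sorted values and the empirical CCDF P(X > x) at each of them."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    # After sorting, n - (i + 1) observations exceed the i-th value (ties ignored).
    ccdf = 1.0 - np.arange(1, n + 1) / n
    return xs, ccdf

def tail_slope(x, lower_quantile=0.9):
    """Least-squares slope of log10(CCDF) vs. log10(x) above a chosen quantile."""
    xs, ccdf = empirical_ccdf(x)
    # Drop the largest point, where the empirical CCDF is exactly 0.
    keep = (xs >= np.quantile(xs, lower_quantile)) & (ccdf > 0)
    slope, _ = np.polyfit(np.log10(xs[keep]), np.log10(ccdf[keep]), 1)
    return slope

rng = np.random.default_rng(0)
alpha = 1.2  # tail index of the simulated stand-in data
# Generator.pareto draws Lomax variates; adding 1 gives a classical Pareto
# with x_min = 1, i.e. P(X > x) = x**(-alpha) for x >= 1.
sample = rng.pareto(alpha, size=200_000) + 1.0
slope = tail_slope(sample)
print("fitted tail slope:", slope, "(a pure Pareto would give", -alpha, ")")
```

For a pure Pareto the curve is a straight line with slope equal to minus the tail index. Repeating the fit over different size ranges gives a slope that varies with size, which is one way to read the "tail index curve" idea.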

Heavy tails and other fields

Key question: what range of data is of interest?

Insurance: Prob. of disasters generally beyond range of data?

Internet: Care most about regions where have data

Finance: ???

Why Care About Heavy Tails (for internet traffic)?

Current Folklore (for aggregated data):

Heavy tailed durations ⇒ Long Range Dependence

Toy Graphics, Exponential Durations

Toy Graphics, Pareto (1.5) Durations

(caused by the “few elephants”, but mice are there, too)

- Mandelbrot (60's)

- Paxson and Floyd (1995)

- Feldmann, Gilbert and Willinger (1998)

- Riedi and Willinger (1999)

Go here for reference details
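The toy comparison above can be imitated with a small simulation. This is a sketch under stated assumptions: flows start at Poisson times, each lasts an independent duration, and we count how many are active at each time. The rate, horizon and lag are arbitrary illustrative choices, not the settings of the toy graphics.

```python
import numpy as np

def active_counts(starts, durations, t_grid):
    """Number of flows active at each grid time t (start <= t < start + duration)."""
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(starts + durations)
    started = np.searchsorted(sorted_starts, t_grid, side="right")  # starts <= t
    ended = np.searchsorted(sorted_ends, t_grid, side="right")      # ends <= t
    return started - ended

def lag_corr(x, k):
    """Plain correlation between the series and itself shifted by k steps."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

rng = np.random.default_rng(1)
rate, horizon = 20.0, 20_000.0          # assumed arrival rate and time span
n_flows = rng.poisson(rate * horizon)
starts = rng.uniform(0.0, horizon, size=n_flows)

# Both duration laws have mean 3: Exponential(mean 3), and Pareto(1.5) with
# x_min = 1 (mean = alpha * x_min / (alpha - 1) = 1.5 / 0.5 = 3).
exp_durations = rng.exponential(scale=3.0, size=n_flows)
pareto_durations = rng.pareto(1.5, size=n_flows) + 1.0

t_grid = np.arange(1000.0, horizon)     # skip the start-up period
exp_counts = active_counts(starts, exp_durations, t_grid)
pareto_counts = active_counts(starts, pareto_durations, t_grid)

print("mean active flows:", exp_counts.mean(), pareto_counts.mean())
print("lag-10 correlation:", lag_corr(exp_counts, 10), lag_corr(pareto_counts, 10))
```

The two series have nearly the same mean level, but the Pareto version stays correlated over much longer lags: a few very long flows (the "elephants") overlap many time points.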

Investigation II: Long Range Dependence?

Question 1: Is it really there?

- Early conceptions: no

(renders classical queueing theory useless?)

- Current thought: yes

- Very recent work (Cleveland, et al.): not important

- Motivates a very careful look

Investigation II: Long Range Dependence? (cont.)

Time series 1: Aggregated point process data,

1 million Packet Arrival times (from 1998), over ~ 3 minutes

Simple analysis: time series of bin counts

(Caution: different view of data from above Response Sizes)

Toy example Graphic

10,000 bins, ~100 obs’s per bin

Binwidth ~ 0.02 sec
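A minimal sketch of turning arrival times into a time series of bin counts. The arrivals here are simulated from a homogeneous Poisson process purely as a stand-in (an assumption, not the 1998 trace), scaled so that 1 million arrivals span about 200 seconds and 10,000 bins give a binwidth near 0.02 sec.

```python
import numpy as np

def bin_counts(arrival_times, n_bins):
    """Counts of arrivals in n_bins equal-width bins spanning the observed range."""
    t = np.asarray(arrival_times, dtype=float)
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    counts, _ = np.histogram(t, bins=edges)
    return counts, edges[1] - edges[0]

rng = np.random.default_rng(2)
n_packets = 1_000_000
# Stand-in data: exponential interarrival times with mean 200 s / 1e6 packets.
arrivals = np.cumsum(rng.exponential(scale=200.0 / n_packets, size=n_packets))

counts, binwidth = bin_counts(arrivals, 10_000)
print("bins:", counts.size, "binwidth:", binwidth, "mean count per bin:", counts.mean())
```

The `counts` array is the "vertical" view of the data: one number per time bin, rather than one number per response.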

Classical Time Series Dependence Measure:
Autocorrelation

Correlation:

- Measure of “dependence” between variables

- 0 for independent

- +1 (-1) for linear relationship with slope > (<) 0

Autocorrelation: of time series $X_1, X_2, \ldots, X_n$

For “lag” $k$: $\rho(k) = \mathrm{corr}(X_t, X_{t+k})$
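In practice the lag-k autocorrelation is estimated from the data. A minimal sketch of the usual sample version (lagged products of the centered series, divided by the lag-0 sum of squares); the function name is an illustrative choice.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a series at lags 0, 1, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()                    # center the series
    denom = np.dot(xc, xc)               # lag-0 sum of squares
    return np.array([np.dot(xc[:x.size - k], xc[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
white_noise = rng.normal(size=10_000)    # independent values: no dependence at any lag
acf = sample_acf(white_noise, 20)
print(acf[:5])
```

For independent data the values at lags 1 and above scatter around 0, with spread of roughly 1/sqrt(n); lag 0 is always exactly 1.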

Autocorrelation as a Dependence Measure (cont.)

Is this sensible? (L2, i.e. 2nd moment, based)

- likely misleading for heavy tailed distributions

- but here heavy tails are "horizontal"

(again see toy graphic)

- we are looking "vertically" at bin counts

- Poisson - Gaussian marginals a useful approx'n ???

A quick first look:

Incoming Counts

Outgoing Counts

Investigation II: Long Range Dependence? (cont.)

Autocorrelation plot

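- View 1: Approximate as: WhiteNoise + AR(1).

AR(1) part has coefficient $\phi$ read off from the slope of $\log \rho(k)$ vs. $k$ (lag), since an AR(1) series has autocorrelation $\rho(k) = \phi^{k}$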

- nearly “unit root”

- close to nonstationary random walk
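A sketch of the View 1 calculation on simulated data. For AR(1) plus independent white noise, the autocorrelation at lags 1 and above is a constant times the AR coefficient raised to the lag, so the log autocorrelation is linear in the lag and its slope is the log of the coefficient. The coefficient 0.9 and the noise level are illustrative assumptions, not the values fitted in the lecture.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(xc, xc)
    return np.array([np.dot(xc[:xc.size - k], xc[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(4)
n, phi_true = 200_000, 0.9               # phi_true is illustrative only

# AR(1) recursion: a[t] = phi * a[t-1] + shock[t]
shocks = rng.normal(size=n)
ar = np.empty(n)
ar[0] = shocks[0]
for t in range(1, n):
    ar[t] = phi_true * ar[t - 1] + shocks[t]

series = ar + rng.normal(scale=2.0, size=n)   # add independent white noise

lags = np.arange(1, 11)
acf = sample_acf(series, 10)[1:]         # drop lag 0, where the white noise adds a jump
slope, intercept = np.polyfit(lags, np.log(acf), 1)
phi_hat = np.exp(slope)
print("estimated AR(1) coefficient:", phi_hat)
```

The white noise only lowers the intercept of the fitted line; the slope still recovers the AR coefficient. A coefficient very close to 1 gives a nearly flat line, which is the "nearly unit root" case.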

Investigation II: Long Range Dependence? (cont.)

Autocorrelation Plot (cont.)

- View 2: Hurst parameter ~ 0.86

(from slope of $\log \rho(k)$ vs. $\log k$)

Periodogram based C.I. is (0.82, 1.06), based on analysis and graphics by Richard Smith

- H ~ 0.86 ⇒ Long Range Dependent, “self similar”, …

- Consistent with above “heavy tail” theory
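A sketch of how a Hurst parameter relates to the slope of log autocorrelation vs. log lag. Under long range dependence the autocorrelation decays like a power of the lag with exponent 2H - 2, so H = 1 + slope/2. The example uses the exact autocorrelation of fractional Gaussian noise as a stand-in model (an assumption, the lecture does not name a model) with H set to 0.86.

```python
import numpy as np

def fgn_acf(k, hurst):
    """Exact autocorrelation of fractional Gaussian noise at lag k >= 0."""
    k = np.asarray(k, dtype=float)
    two_h = 2.0 * hurst
    return 0.5 * ((k + 1) ** two_h - 2.0 * k ** two_h + np.abs(k - 1) ** two_h)

hurst_true = 0.86                        # value quoted in View 2
lags = np.arange(50, 1001)               # large lags, where the power law holds well
rho = fgn_acf(lags, hurst_true)

# log rho(k) is close to const + (2H - 2) * log k, so the slope gives H.
slope, _ = np.polyfit(np.log(lags), np.log(rho), 1)
hurst_hat = 1.0 + slope / 2.0
print("slope:", slope, "recovered H:", hurst_hat)
```

With real data the autocorrelations are noisy at large lags, which is one reason periodogram-based estimates with confidence intervals are used instead of a raw slope.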

Investigation II: Long Range Dependence? (cont.)

Recent controversy:
Cao, J., Cleveland, W. S., Lin, D. and Sun, D. X. (2001) The effect of statistical multiplexing on internet packet traffic: theory and empirical study. Bell Labs Tech. Report.

- study interarrival times, not bin counts

- fine scale structure is approx. Poisson process

- Long Range Dependence is there

- but only at larger time scales

- not important for queueing considerations

(i.e. local queue length behavior at a buffer)

Investigation II: Long Range Dependence? (cont.)

Controversy motivates question:

How does dependence (autocorr.) change across scales?

Approach:

Zooming Autocorrelation 1 (for same 1 mil. packets):

- change binwidth: