Course  OR 778

Class Notes   9/12/01





Last Time:
 

    -    Improved analysis of tail of Response Size Distributions

    -    Pareto(1.2) gave acceptable (?) fit

    -    So did Pareto(1.5)  ??

    -    log normal was (surprisingly ?) close but inadequate

    -    how should we think about "heavy tails"????

    -    in context of:

Heavy tailed durations   ⇒   Long Range Dependence









Interesting "news" report from UNC





Date: 31 Aug 2001 14:28:23 -0400
From: ITS Change <itschang@hapi.isis.unc.edu>
Newsgroups: unc.support
Subject: [support] MAJOR: Rate limits in place for certain file-sharing applications

At approximately 2:30PM on August 28th, a rate limit policy was instituted on the campus Internet router, limiting traffic from two file-sharing applications (KaZaA/Morpheus and Gnutella) to T1 network capacity for inbound traffic and T1 capacity for outbound traffic.  This action was taken to maintain the integrity of the campus network infrastructure and to ensure appropriate bandwidth for those applications that are critical to the education and research mission of the University.

Over the past week, we had noticed that the combined inbound and outbound traffic from the two aforementioned applications represented more than two to three times the traffic from all Web transactions across the campus Internet link.  In the case of the South Campus residence halls, we had noticed that the link was nearly saturated with approximately 65% of that traffic representing KaZaA and Gnutella traffic.  In those residence halls, a number of students had reported extreme difficulties accessing Web pages necessary for their course work during that period.

Given the need to ensure the availability of critical applications over the network, as well as the mounting costs associated with commercial Internet bandwidth (180 Mb/sec of commercial Internet bandwidth costs between $650,000 and $1,000,000 per year), ATN decided that it was necessary to impose a rate limit on these two applications (KaZaA and Gnutella).  We believe it is much more desirable to limit the traffic related to these applications than it is to block it all together.

ATN, through both the Security group and ResNet, has been educating the campus community on the issues associated with copyright and file sharing applications and will continue to do so.
 
 
 
 
 
 


Aside on Response Size Distributions







Interesting graphic by Felix Hernandez, UNC Computer Science
 

21 log-log CCDF plots of response sizes (log CCDF vs. log size):

    -    from 4 hour blocks

    -    taken at 3 different time periods

    -    for 7 consecutive days
 


 

Notes:

    -    general shapes similar to before

    -    surprisingly consistent "kinks"?

    -    i.e.  "tail index" changes in a systematic way

    -    motivates "tail index curve" idea??
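
A minimal sketch of the "tail index curve" idea, under stated assumptions: the data are simulated Pareto(1.2) draws (not the UNC response sizes), and the function names `empirical_ccdf` and `loglog_slope` are hypothetical. It estimates the slope of log(CCDF) vs. log(size) separately in each decade of size; for an exact Pareto the slope is constant, so systematic "kinks" in real data would show up as a slope that changes with the window.

```python
import math
import random

def empirical_ccdf(sample):
    """Sorted values and the empirical CCDF P(X > x) evaluated at each value."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(n - 1 - i) / n for i in range(n)]

def loglog_slope(xs, ccdf, lo, hi):
    """Least-squares slope of log10(CCDF) against log10(x) for lo <= x < hi."""
    pts = [(math.log10(x), math.log10(p))
           for x, p in zip(xs, ccdf) if lo <= x < hi and p > 0]
    mu = sum(u for u, _ in pts) / len(pts)
    mv = sum(v for _, v in pts) / len(pts)
    sxy = sum((u - mu) * (v - mv) for u, v in pts)
    sxx = sum((u - mu) ** 2 for u, _ in pts)
    return sxy / sxx

rng = random.Random(0)
alpha = 1.2  # Pareto tail index mentioned in the notes
sample = [rng.paretovariate(alpha) for _ in range(50_000)]  # toy data, values >= 1
xs, ccdf = empirical_ccdf(sample)

# "Tail index curve": minus the local slope, estimated decade by decade.
windows = [(1, 10), (10, 100)]
tail_index_curve = [-loglog_slope(xs, ccdf, lo, hi) for lo, hi in windows]
print(tail_index_curve)
```

For the simulated Pareto both entries should sit near 1.2; a curve that drifts across windows would be the numerical counterpart of the kinks seen in the plots.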
 
 
 
 
 


Heavy tails and other fields









Key question:    what range of data is of interest?
 
 
 

Insurance:    Prob. of disasters generally beyond range of data?
 
 

Internet:   Care most about regions where we have data
 
 

Finance:   ???
 
 
 
 
 


Why Care About Heavy Tails (for internet traffic)?







Current Folklore (for aggregated data):
 
 

Heavy tailed durations   ⇒   Long Range Dependence

Toy Graphics, Exponential Durations

Toy Graphics, Pareto (1.5) Durations

(caused by the “few elephants”, but mice are there, too)
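
A toy simulation in the spirit of the graphics above (a sketch only, not the code behind them). Assumptions: flows arrive as a Poisson process, each flow stays active for a random duration, and the "traffic" is the number of active flows at integer times. The arrival rate, warm-up length, and function names are hypothetical; both duration laws have mean 3, so only the tail differs.

```python
import math
import random

def active_flow_counts(draw_duration, n_steps, rate, rng, warmup=1000):
    """Number of active flows at integer times, with Poisson flow arrivals."""
    total = n_steps + warmup
    diff = [0] * (total + 1)          # difference array of the count series
    t = rng.expovariate(rate)
    while t < total:
        end = t + draw_duration(rng)
        first = math.ceil(t)                 # first integer time with t <= s
        last = min(math.ceil(end), total)    # exclusive: integer times s < end
        if first < last:
            diff[first] += 1
            diff[last] -= 1
        t += rng.expovariate(rate)
    counts, running = [], 0
    for d in diff[:total]:
        running += d
        counts.append(running)
    return counts[warmup:]                   # drop the start-up transient

def sample_acf(x, lag):
    """Sample autocorrelation of the series x at a single lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

n_steps, rate, lag = 20_000, 5.0, 20
exp_counts = active_flow_counts(lambda r: r.expovariate(1 / 3), n_steps, rate,
                                random.Random(1))   # exponential, mean 3
par_counts = active_flow_counts(lambda r: r.paretovariate(1.5), n_steps, rate,
                                random.Random(1))   # Pareto(1.5), mean 3
acf_exp = sample_acf(exp_counts, lag)
acf_par = sample_acf(par_counts, lag)
print(acf_exp, acf_par)
```

With exponential durations the lag-20 autocorrelation should be essentially zero, while the Pareto(1.5) version keeps visible correlation: the few very long flows ("elephants") tie distant time points together.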








    -    Mandelbrot (60's)

    -    Paxson and Floyd (1995)

    -    Feldmann, Gilbert and Willinger (1998)

    -    Riedi and Willinger (1999)

Go here for reference details









Investigation II:  Long Range Dependence?







Question 1:    Is it really there?
 
 

    -    Early conceptions:   no

(renders classical queueing theory useless?)







    -    Current thought:   yes
 

    -    Very recent work (Cleveland et al.):    not important
 

    -    Motivates a very careful look
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Time series 1:    Aggregated point process data,

1 million Packet Arrival times (from 1998), over ~ 3 minutes








Simple analysis:    time series of bin counts

(Caution:  different view of data from above Response Sizes)

Toy example Graphic
 
 

        10,000 bins,      ~100 obs’s per bin

        Binwidth  ~  0.02 sec
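
A small sketch of the binning step, assuming arrival times are available as a list of seconds. The data here are simulated Poisson arrivals (a stand-in, not the 1998 trace), and `bin_counts` and the rate are hypothetical. The first lines just redo the arithmetic quoted above.

```python
import random

# Arithmetic from the notes: 1 million packets, 10,000 bins, about 3 minutes.
obs_per_bin = 1_000_000 / 10_000      # 100 observations per bin
binwidth_notes = 180 / 10_000         # 0.018 sec, i.e. roughly 0.02 sec

def bin_counts(arrival_times, binwidth, n_bins):
    """Count arrivals in consecutive bins of equal width, starting at time 0."""
    counts = [0] * n_bins
    for t in arrival_times:
        j = min(int(t // binwidth), n_bins - 1)   # clamp the right endpoint
        counts[j] += 1
    return counts

# Toy stand-in for the packet trace: a Poisson arrival process.
rng = random.Random(0)
rate = 5000.0                          # hypothetical packets per second
t, arrivals = 0.0, []
while len(arrivals) < 100_000:
    t += rng.expovariate(rate)
    arrivals.append(t)

n_bins = 1000
binwidth = arrivals[-1] / n_bins
counts = bin_counts(arrivals, binwidth, n_bins)   # the time series of bin counts
print(obs_per_bin, binwidth_notes, sum(counts) / n_bins)
```

The resulting list `counts` is the "vertical" view of the data: one number per time bin, rather than one size or duration per response.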
 
 
 
 


Classical Time Series Dependence Measure:
Autocorrelation







Correlation:

    -    Measure of “dependence” between variables

    -    0   for independent

    -    +1 (-1) for linear relationship with slope > (<) 0
 

Autocorrelation:    of time series $X_1, \ldots, X_n$

    For “lag” $k$:    $\rho(k) = \mathrm{corr}(X_t, X_{t+k})$
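
For reference, the usual sample version of the autocorrelation at lag $k$, written for a series $X_1, \ldots, X_n$ (the notation is an assumption here, since the original formulas were graphics). This is the standard textbook estimator, not necessarily the exact one used for the plots in these notes.

```latex
\[
\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t ,
\qquad
\hat{\rho}(k) =
\frac{\sum_{t=1}^{n-k} (X_t - \bar{X})\,(X_{t+k} - \bar{X})}
     {\sum_{t=1}^{n} (X_t - \bar{X})^2},
\qquad k = 0, 1, 2, \ldots
\]
```

Note $\hat{\rho}(0) = 1$, and the estimator only uses first and second moments, which is why its use with heavy tailed data needs care.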
 
 


Autocorrelation as a Dependence Measure (cont.)








Is this sensible?     (L2,  i.e. 2nd moment, based)
 

    -   likely misleading for heavy tailed distributions
 

    -    but here heavy tails are "horizontal"

(again see toy graphic)







    -    we are looking "vertically" at bin counts
 

    -    Poisson - Gaussian marginals a useful approx'n ???
 
 

A quick first look:

Incoming Counts

Outgoing Counts
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Autocorrelation plot
 

    -    View 1:  Approximate as:     WhiteNoise + AR(1).

AR(1) part has coefficient $\phi$ close to 1
(from slope of log(autocorr.) vs. (lag),
since the autocorr. of an AR(1) at lag $k$ is $\phi^k$)





    -    nearly “unit root”
 

    -    close to nonstationary random walk
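
A sketch of how View 1 can be checked numerically, on simulated data only. Assumptions: the series is a Gaussian AR(1) with coefficient `phi` plus independent white noise (the value 0.9 and the noise levels are made up for illustration). For such a sum the autocorrelation at lag k >= 1 is c * phi**k with 0 < c < 1, so log(autocorr.) is linear in the lag with slope log(phi); the white noise lowers the level but not the slope.

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    var = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + k] for i in range(n - k)) / var
            for k in range(1, max_lag + 1)]

rng = random.Random(0)
phi, n = 0.9, 50_000                 # hypothetical AR(1) coefficient
state, series = 0.0, []
for _ in range(n):
    state = phi * state + rng.gauss(0.0, 1.0)     # AR(1) part
    series.append(state + rng.gauss(0.0, 2.0))    # plus white noise

lags = list(range(1, 11))
acf = sample_acf(series, 10)
log_acf = [math.log(r) for r in acf]

# Least-squares slope of log(autocorr.) vs. lag estimates log(phi).
mk = sum(lags) / len(lags)
ml = sum(log_acf) / len(log_acf)
slope = (sum((k - mk) * (l - ml) for k, l in zip(lags, log_acf))
         / sum((k - mk) ** 2 for k in lags))
phi_hat = math.exp(slope)
print(acf[0], phi_hat)
```

The lag-1 autocorrelation comes out well below `phi`, yet the slope still recovers it. When the slope is close to 0, the fitted coefficient is close to 1, which is the "nearly unit root" reading.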
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Autocorrelation Plot (cont.)
 

    -    View 2:   Hurst parameter  ~  0.86

(from slope of log(autocorr.) vs. log(lag))

Periodogram based C.I. is: (0.82,1.06),
Based on analysis and graphics by Richard Smith







    -    0.86   ⇒   Long Range Dependent, “self similar”, …
 

    -    Consistent with above “heavy tail” theory
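
A worked version of the slope-to-Hurst conversion, assuming the usual power-law form of the autocorrelation for a long range dependent series ($\rho(k)$ is the autocorrelation at lag $k$, $c$ a constant; the notation is an assumption). The last line uses the standard relation between the duration tail index $\alpha$ (for $1 < \alpha < 2$) and $H$ in on/off type models; whether it applies to this trace is not established here.

```latex
\[
\rho(k) \approx c\,k^{2H-2} \quad (k \to \infty)
\;\Longrightarrow\;
\log \rho(k) \approx \log c + (2H-2)\,\log k ,
\qquad
H = 1 + \tfrac{1}{2}\,(\text{slope}).
\]
\[
H = 0.86 \;\Longleftrightarrow\; \text{slope} = 2(0.86) - 2 = -0.28 ,
\qquad
\tfrac{1}{2} < H < 1 \;\Longrightarrow\; \sum_{k} \rho(k) = \infty .
\]
\[
H = \frac{3 - \alpha}{2}
\;\Longrightarrow\;
\alpha = 3 - 2(0.86) = 1.28 .
\]
```

Under that relation, $H \approx 0.86$ corresponds to a tail index between the Pareto(1.2) and Pareto(1.5) fits discussed last time.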
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Recent controversy:
Cao, J., Cleveland, W. S., Lin, D. and Sun, D. X. (2001) The effect of statistical multiplexing on internet packet traffic: theory and empirical study.  Bell Labs Tech. Report.  Downloadable from here.

    -    study interarrival times, not bin counts

    -    fine scale structure is approx. Poisson process

    -    Long Range Dependence is there

    -    but only at larger time scales

    -    not important for queueing considerations

(i.e. local queue length behavior at a buffer)











Investigation II:  Long Range Dependence?  (cont.)








Controversy motivates question:

    How does dependence (autocorr.) change across scales?
 
 

Approach:

Zooming Autocorrelation 1  (for same 1 mil. packets):

    -    change binwidth: 

    -    so # of bins: 

    -    and obs’s / bin: 
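
A sketch of the rebinning behind a zooming autocorrelation, on toy data. Widening the bins by a factor m multiplies the binwidth and the observations per bin by m and divides the number of bins by m; in code this is just summing m adjacent bins. Assumptions: the series is a simulated AR(1) plus white noise (all parameters made up), and `aggregate` and `lag1_acf` are hypothetical names. This toy model is one setting where correlation rises with scale, not a claim about the packet data.

```python
import random

def aggregate(counts, factor):
    """Sum adjacent bins: binwidth * factor, number of bins // factor."""
    n = len(counts) // factor
    return [sum(counts[i * factor:(i + 1) * factor]) for i in range(n)]

def lag1_acf(x):
    """Sample autocorrelation at lag 1."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return sum(dev[i] * dev[i + 1] for i in range(n - 1)) / sum(d * d for d in dev)

# Toy fine-scale series: slowly varying AR(1) level plus strong white noise.
rng = random.Random(0)
phi, n = 0.98, 2 ** 15               # hypothetical parameters
level, series = 0.0, []
for _ in range(n):
    level = phi * level + rng.gauss(0.0, 1.0)
    series.append(level + rng.gauss(0.0, 5.0))

# Lag-1 autocorrelation at three scales (aggregation factors 1, 4, 16).
zoom = {m: lag1_acf(aggregate(series, m)) for m in (1, 4, 16)}
print(zoom)
```

In this toy series the white noise averages out faster than the slowly varying part, so the autocorrelation grows as the bins widen.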
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Zooming Autocorrelation 1:
 

    -    smallest scale nearly uncorr’d   (Cleveland)
 

    -    Correlation “lifts vertically”???
 

    -    gets to long range dependence (folklore)
 

    -    for larger scale, sample noise dominates?
 
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)








Unexpected feature?
 

    -    Dependence “lifts vertically”
 

    -    Instead of “coming from right”
 
 

Time span $T$ for lag $k$, at scale (binwidth) $b$ is $T = k b$

So “dependence at time scale $T$”,

as $b$ increases, should appear at lag $k = T / b$.
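
A worked instance of the lag and time-span relation, with $T$ the time span, $k$ the lag and $b$ the binwidth (symbols assumed here, since the originals were graphics). The binwidth 0.02 sec is the one quoted earlier in the notes; the tenfold wider bin is a made-up comparison value.

```latex
\[
T = k\,b
\quad\Longleftrightarrow\quad
k = \frac{T}{b};
\qquad
T = 1\ \text{sec}:\quad
b = 0.02\ \text{sec} \;\Rightarrow\; k = 50,
\qquad
b = 0.2\ \text{sec} \;\Rightarrow\; k = 5 .
\]
```

So a feature tied to a fixed time span should slide toward smaller lags as the bins widen, i.e. enter the autocorrelation plot from the right.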













Investigation II:  Long Range Dependence?  (cont.)








Notion of large lump on right (in autocorr.):

consistent with “periodicities”?








Caution 1:    periodicities   ⇒   large lump,

but not clear that      large lump   ⇒   periodicity








Caution 2:    TCP has its periodicities

Individual TCP connection zooming graphic