Class Notes 9/24/01
Last Time (self contained):
- Overview of time series basics
- Classical theory
- stationarity, autocorrelation, AR(I)MA,
- spectral analysis
- Long Range Dependence
- autocorrelation and spectral characterization
- Fractional ARIMA
- Hurst scaling (expanding histogram graphic)
Time before last:
- in context of:
- Zooming autocorrelation analysis
- Bin counts nearly independent at small scales
- Heavy dependence at large scales
- Heading toward zooming SiZer analysis
1st doing SiZer
Philosophical Aside:
What are the basic questions
being addressed here?
I.e. what do network researchers
really want to know?
Context 1: Classical Applied Statistics
- i.e. statistical consulting
- before analyzing data, identify research question
- frequently different from what is first asked!
- excellent paradigm to avoid "finding what isn't real"
(5% of true null H0s are "rejected")
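The "5% of true nulls rejected" point can be checked with a small simulation. This is only a sketch: the number of tests, the sample size, and the two-sided z-test with known unit variance are assumptions made for simplicity, and `false_rejection_rate` is a hypothetical helper name.

```python
import numpy as np

def false_rejection_rate(n_tests=5000, n_obs=50, z_crit=1.96, seed=0):
    """Fraction of true null hypotheses (mean = 0) rejected at the 5% level."""
    rng = np.random.default_rng(seed)
    # Every data set is pure N(0, 1) noise, so every null hypothesis is true.
    data = rng.standard_normal((n_tests, n_obs))
    # z-statistic for the sample mean, using the known sigma = 1 (assumption).
    z = data.mean(axis=1) * np.sqrt(n_obs)
    return float(np.mean(np.abs(z) > z_crit))

rate = false_rejection_rate()
print(f"rejected {rate:.1%} of true nulls")
```

Running many tests without a prior research question therefore yields a steady stream of "findings" that are not real.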
Philosophical Aside (cont.)
Context 2: Challenging and "vague" scientific areas
- bioinformatics (genomics, etc.)
- data mining ("finding info" in large data bases)
- electronic security, e.g. "intrusion defense"
- mathematical finance???
- internet traffic
Common aspects of such research:
- unclear what the basic questions are
- they are developed in interaction with analysis
- close collaboration is vital
- "multiple comparison" issues are
endemic and challenging
Investigation III: Zooming SiZer
Idea: Study "dependence" in terms of
"non-stationarity in mean"
Recall SiZer
finds "significant slopes"
Need for zooming: to
view wide range of scales
SiZer Background
settings: scatterplot smoothing and histograms
Fossils data
Incomes data
- Central Question:
Which features are “really there”?
Solution Part I, Scale Space
Solution Part II, SiZer
SiZer Background (cont.)
Smoothing Setting 1: Scatterplot smoothing of Fossils data (local linear)
dotted line: undersmoothed (feels sampling variability)
dashed line: oversmoothed (important features missed?)
solid line: smoothed about right?
Central question: Which
features are “really there”?
SiZer Background (cont.)
Smoothing Setting 2: Histograms
Family Income Data: British Family Expenditure Survey, 1975
- Distribution of Incomes
~ 7000 families
Kernel Density Estimation Analysis:
- Again under- and over- smoothing issues
Perhaps 2 modes in data?
Central question: Which features are “really there”?
(e.g. 2 modes?)
SiZer Background (cont.)
“Scale Space” – idea from
Computer Vision
Conceptual basis:
- Oversmoothing = “view from afar” (macroscopic)
- Undersmoothing = “zoomed in view” (microscopic)
Main idea: all smooths contain useful information,
so study “full spectrum” (i.e. all smoothing levels)
Fun views: Spectrum
Overlay & Spectrum Surface
Note: this viewpoint makes
“data based bandwidth selection”
much less important (than I once thought….)
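A minimal sketch of the scale space idea for the histogram setting: compute a whole family of kernel density estimates, one per bandwidth, rather than picking a single "best" one. The Gaussian kernel, the simulated two-component sample, the bandwidth range, and the helper names `kde` and `count_modes` are all assumptions for illustration; this is not the Incomes data.

```python
import numpy as np

def kde(data, grid, h):
    """Gaussian kernel density estimate on `grid` with bandwidth h."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def count_modes(density):
    """Number of local maxima of a curve sampled on a grid."""
    d = np.diff(density)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

rng = np.random.default_rng(1)
# Hypothetical bimodal sample: equal mixture of N(-2, 1) and N(2, 1).
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
grid = np.linspace(-7, 7, 400)
bandwidths = np.logspace(-1.5, 0.8, 12)   # small h = zoomed in, large h = from afar

modes = [count_modes(kde(data, grid, h)) for h in bandwidths]
for h, m in zip(bandwidths, modes):
    print(f"h = {h:6.3f}: {m} mode(s)")
```

The smallest bandwidths show many noise wiggles and the largest shows a single bump, so the number of modes alone cannot say which features are "really there"; that is the gap the inference step is meant to fill.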
SiZer Background (cont.)
Significance of Zero crossings,
of the derivative, in scale space
- needed statistical inference
- novel visualization
To get: a powerful exploratory
data analysis method
Chaudhuri, P. and Marron, J. S. (1999) SiZer for exploration of structure in curves, Journal of the American Statistical Association, 94, 807-823.
SiZer Background (cont.)
Basic idea: a “bump” is characterized by:
an increase, followed by a decrease
Generalization: many “features of interest” captured by
sign of the slope of the smooth
SiZer Basis:
Statistical inference on slopes, over scale space
SiZer Background (cont.)
Visual presentation:
map over scale space:
- Blue:
slope significantly upwards (deriv. CI above 0)
- Red:
slope significantly downwards (deriv. CI below 0)
- Purple:
slope insignificant (deriv. CI contains 0)
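The color coding can be sketched as follows for the scatterplot smoothing setting. This is a simplified illustration, not the published procedure: it assumes a Gaussian kernel, a locally weighted residual variance, and a pointwise normal quantile (z = 1.96) with no adjustment for looking at many locations and scales at once. The gray rule for sparse regions (effective sample size below 5) is likewise an assumed threshold, and all function names are hypothetical.

```python
import numpy as np

def local_linear_slope(x, y, x0, h):
    """Local linear fit at x0 with Gaussian kernel bandwidth h.

    Returns (slope, standard error, effective sample size).
    """
    u = x - x0
    w = np.exp(-0.5 * (u / h) ** 2)              # kernel weights, maximum 1
    s0, s1, s2 = w.sum(), (w * u).sum(), (w * u**2).sum()
    denom = s0 * s2 - s1**2
    slope_wts = w * (s0 * u - s1) / denom        # slope = sum(slope_wts * y)
    icpt_wts = w * (s2 - s1 * u) / denom         # intercept = sum(icpt_wts * y)
    slope = (slope_wts * y).sum()
    intercept = (icpt_wts * y).sum()
    resid = y - (intercept + slope * u)
    sigma2 = (w * resid**2).sum() / s0           # locally weighted noise variance
    se = np.sqrt(sigma2 * (slope_wts**2).sum())
    return slope, se, s0

def sizer_color(slope, se, ess, z=1.96, min_ess=5.0):
    """Map a derivative confidence interval to a SiZer-style color."""
    if ess < min_ess:
        return "gray"      # too little data for inference (assumed threshold)
    if slope - z * se > 0:
        return "blue"      # CI entirely above 0
    if slope + z * se < 0:
        return "red"       # CI entirely below 0
    return "purple"        # CI contains 0

def sizer_map(x, y, grid, bandwidths):
    """One row of colors per bandwidth, one column per grid location."""
    return [[sizer_color(*local_linear_slope(x, y, x0, h)) for x0 in grid]
            for h in bandwidths]

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 400))
y = 2 * x + rng.normal(0, 0.1, 400)      # hypothetical data with an upward trend
grid = np.linspace(0.05, 0.95, 19)
colors = sizer_map(x, y, grid, bandwidths=[0.05, 0.2])
print(colors[1])
```

Each row of `colors` corresponds to one smoothing level, so stacking the rows gives the map over scale space.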
SiZer Background (cont.)
SiZer analysis of Fossils data:
Upper Left: Scatterplot,
family of smooths, 1 highlighted
Upper Right: Scale space
rep’n of family, with SiZer
Lower Left: SiZer
map, easier to view
Lower Right: SiCon map –
replace "slope" by "curvature"
Slider (in movie viewer)
highlights different smoothing levels
SiZer Background (cont.)
SiZer analysis of Fossils data (cont.)
Decreases at left, not on right
Medium smoothed:
- Main valley significant, and leftmost increase
- smaller valley not statistically significant
- “noise wiggles” not significant
Additional SiZer color: gray = not enough data for inference
SiZer Background (cont.)
SiZer analysis of Fossils data (cont.)
Common Question: which is “right”?
- decreases on left, then flat
- up, then down, then up again
- no significant features
Answer: All are “right”, just different “scales of view”,
i.e. “levels of resolution of data”
SiZer Background (cont.)
SiZer analysis of Incomes data:
Oversmoothed: Only one mode
Medium smoothed: Two modes statistically significant
Confirmed by PhD dissertation of H. P. Schmitz (U. Bonn):
Schmitz, H. P. and Marron, J. S. (1992) Simultaneous estimation of several size distributions of income, Econometric Theory, 8, 476-488.
Undersmoothed: many “noise
wiggles”, not significant
Again: all are “correct”,
just different “scales”
SiZer Background (cont.)
Simulated example 1: Marron - Wand Trimodal, #9
only one mode "significant"
two modes now "appear from background noise"
finally all 3 modes are "really there"
Simulated example 2: Marron - Wand Discrete Comb, #15
- similar lessons to above
someday: "draw" local bandwidth on SiZer
SiZer Background (cont.)
Finance "tick data":
(time, price) of single stock transactions
Idea: "on line" version
of SiZer
for viewing and understanding
- "trends" depend heavily on "scale"
- "double points" and more
"background color" transition
SiZer Background (cont.)
Usefulness of SiZer
in exploratory data analysis:
- Smoothing experts: saves time
- Smoothing beginners: avoids terrible mistakes
- don’t find things that “aren’t there”
- do find important features
- Directly targets critical scientific question:
Is a deeper analysis worthwhile?
SiZer Background (cont.)
Would you like to try a SiZer analysis?
Matlab software:
JAVA version (demo, beta):
Follow the SiZer link from the Wagner Associates home page:
More details, examples and discussions:
Investigation III: Zooming SiZer (cont.)
Recall time series 1: Aggregated point process data,
1 million Packet Arrival times (from 1998), over ~ 3 minutes
Recall 1st zooming
autocorrelation plot
smallest scale nearly uncorr’d (Cleveland)
Correlation “lifts vertically”
gets to long range dependence (folklore)
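A rough sketch of how autocorrelation of bin counts can be examined at several scales, here by aggregating fine bins into coarser ones. The Poisson counts are a synthetic stand-in with no dependence at any scale (not the packet trace), and the helper names and aggregation factors are assumptions.

```python
import numpy as np

def autocorr(counts, max_lag):
    """Sample autocorrelation of a count series at lags 1..max_lag."""
    y = counts - counts.mean()
    denom = (y**2).sum()
    return np.array([(y[:-k] * y[k:]).sum() / denom for k in range(1, max_lag + 1)])

def aggregate(counts, factor):
    """Sum consecutive groups of `factor` bins to view a coarser time scale."""
    n = len(counts) // factor * factor
    return counts[:n].reshape(-1, factor).sum(axis=1)

rng = np.random.default_rng(4)
fine_counts = rng.poisson(20.0, size=2**14)   # independent counts: the null picture
for factor in (1, 16, 256):
    acf = autocorr(aggregate(fine_counts, factor), max_lag=3)
    print(f"aggregation x{factor:3d}: lags 1-3 = {np.round(acf, 3)}")
```

For independent counts the autocorrelations stay near zero at every aggregation level; the point of the zooming plot is that the real trace departs from this as the scale grows.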
Investigation III: Zooming SiZer (cont.)
Alternate view:
Zooming SiZer
- local linear smoothing of bincounts
to avoid "edge effects"
across very wide range of scales
needs more pixels than screen allows
thus do zooming view (zoom in over time)
zoom in to yellow bd’ry in next frame
readjust vertical axis
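The zooming steps can be sketched as a sequence of nested time windows, each re-binned at full resolution. This is only an illustration: the centered zoom, the shrink factor of 0.5, the number of bins, and the uniform synthetic arrival times are assumptions, and `bin_counts` and `zoom_windows` are hypothetical helper names.

```python
import numpy as np

def bin_counts(arrivals, t_start, t_end, n_bins):
    """Bin centers and counts of arrival times falling in [t_start, t_end]."""
    counts, edges = np.histogram(arrivals, bins=n_bins, range=(t_start, t_end))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts

def zoom_windows(t_start, t_end, n_frames, shrink=0.5):
    """Nested windows; each is the centered fraction `shrink` of the previous one."""
    windows = []
    for _ in range(n_frames):
        windows.append((t_start, t_end))
        mid = 0.5 * (t_start + t_end)
        half = 0.5 * shrink * (t_end - t_start)
        t_start, t_end = mid - half, mid + half
    return windows

rng = np.random.default_rng(5)
arrivals = rng.uniform(0.0, 180.0, 100_000)   # synthetic stand-in for ~3 minutes
windows = zoom_windows(0.0, 180.0, n_frames=4)
for t0, t1 in windows:
    centers, counts = bin_counts(arrivals, t0, t1, n_bins=500)
    # The count range changes with each zoom, hence the vertical axis readjustment.
    print(f"[{t0:7.2f}, {t1:7.2f}] s: counts range {counts.min()}-{counts.max()}")
```

Each frame would then get its own smoothing and SiZer map, with the next window marked on the current frame.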
Investigation III: Zooming SiZer (cont.)
Notes on Zooming SiZer:
- Coarse scales: amazing amount of "significant structure"
- reminiscent of “self-similar fractal” type process
- fewer significant features at small scale
- but they exist, so not Poisson process
- Poisson approximation OK at small scale???
- smooths (top part) "stable" at large scales?
variation diminishes as mean increases?
Investigation III: Zooming SiZer (cont.)
Is this "significant structure"
really important?
Simple comparison:
SiZer analysis of 1 million i.i.d. uniforms
- SiZer map all purple, i.e. no structure
- except at edges
- due to using kernel density estimation
- Shows internet data wiggles are statistically significant
But "practically significant"????
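A much reduced version of the i.i.d. uniform comparison, looking only at the coarsest scale, where a local linear fit with a very large bandwidth approaches an ordinary least squares line. The bin count, the least squares z-statistic, and the contrasting sample with a rising arrival rate are assumptions for illustration; `global_slope_z` is a hypothetical helper, and this does not reproduce the full SiZer map.

```python
import numpy as np

def global_slope_z(counts):
    """z-statistic for the least squares slope of bin counts against bin index."""
    t = np.arange(len(counts), dtype=float)
    t -= t.mean()
    y = counts - counts.mean()
    slope = (t * y).sum() / (t**2).sum()
    resid = y - slope * t
    sigma2 = (resid**2).sum() / (len(counts) - 2)
    return slope / np.sqrt(sigma2 / (t**2).sum())

rng = np.random.default_rng(3)

# Null case: 1 million i.i.d. uniform "arrival times".
uniform_times = rng.uniform(0.0, 1.0, 1_000_000)
counts_null, _ = np.histogram(uniform_times, bins=1000, range=(0.0, 1.0))
z_null = global_slope_z(counts_null)

# Hypothetical contrast: arrival rate rising linearly (density 2t on [0, 1]).
trend_times = np.sqrt(rng.uniform(0.0, 1.0, 1_000_000))
counts_trend, _ = np.histogram(trend_times, bins=1000, range=(0.0, 1.0))
z_trend = global_slope_z(counts_trend)

print(f"uniform: z = {z_null:.2f}   rising rate: z = {z_trend:.1f}")
```

With a million points even a mild trend gives an enormous z-statistic, which is one way to see why statistical and practical significance have to be judged separately.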