Class Notes 9/24/01
Last Time (self contained):
- Overview of time series basics
- Classical theory
- stationarity, autocorrelation, AR(I)MA,
- spectral analysis
- Long Range Dependence
- autocorrelation and spectral characterization
- Fractional ARIMA
- Hurst scaling (expanding histogram graphic)
Time before last:
- in context of:
- Zooming autocorrelation analysis
- Bin counts nearly independent at small scales
- Heavy dependence at large scales
- Heading toward zooming SiZer analysis
- 1st doing SiZer background
Philosophical Aside:
What are the basic questions being addressed here?
I.e. what do network researchers really want to know?
Context 1: Classical Applied Statistics
- i.e. statistical consulting
- before analyzing data, identify research question
- frequently different from what is first asked!
- excellent paradigm to avoid "finding what isn't real"
(5% of true null H0s are "rejected")
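The 5% figure can be checked by simulation. The sketch below is not from the original notes: it assumes a two-sided z-test at level 0.05 on Gaussian data with known variance, and the function name and sample sizes are illustrative only.

```python
import math
import random
from statistics import NormalDist

def false_rejection_rate(n_tests=2000, n_obs=30, alpha=0.05, seed=0):
    """Fraction of true null hypotheses (mean = 0) rejected by a z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_tests):
        # Every sample is drawn under the null: mean 0, sigma 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # Sigma is known to be 1, so z = sample mean * sqrt(n).
        z = (sum(sample) / n_obs) * math.sqrt(n_obs)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_tests
```

Calling `false_rejection_rate()` should return a value close to `alpha`: even when nothing is "really there", roughly one test in twenty flags something.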
Philosophical Aside (cont.)
Context 2: Challenging and "vague" scientific areas
- bioinformatics (genomics, etc.)
- data mining ("finding info" in large data bases)
- electronic security, e.g. "intrusion defense"
- mathematical finance???
- internet traffic
Common aspects of such research:
- unclear what the basic questions are
- they are developed in interaction with analysis
- close collaboration is vital
- "multiple comparison" issues are endemic and challenging
Investigation III: Zooming SiZer
Idea: Study "dependence" in terms of "non-stationarity in mean"
Recall SiZer finds "significant slopes"
Need for zooming: to view wide range of scales
SiZer Background
- settings: scatterplot smoothing and histograms
- Fossils data
- Incomes data
- Central Question: Which features are “really there”?
- Solution Part I, Scale Space
- Solution Part II, SiZer
SiZer Background (cont.)
Smooths of Fossil Data (local linear)
- dotted line: undersmoothed (feels sampling variability)
- dashed line: oversmoothed (important features missed?)
- solid line: smoothed about right?
Central question: Which features are “really there”?
SiZer Background (cont.)
Smoothing Setting 2: Histograms
Family Income Data: British Family Expenditure Survey, 1975
- Distribution of Incomes
- ~ 7000 families
Kernel Density Estimation Analysis:
- Again under- and over-smoothing issues
- Perhaps 2 modes in data?
Central question: Which features are “really there”?
(e.g. 2 modes?)
SiZer Background (cont.)
“Scale Space” – idea from Computer Vision
Conceptual basis:
- Oversmoothing = “view from afar” (macroscopic)
- Undersmoothing = “zoomed in view” (microscopic)
Main idea: all smooths contain useful information,
so study “full spectrum” (i.e. all smoothing levels)
Fun views: Spectrum Overlay & Spectrum Surface
Note: this viewpoint makes “data based bandwidth selection”
much less important (than I once thought...)
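The "full spectrum" idea can be sketched as computing one smooth per bandwidth instead of choosing a single bandwidth. This is an illustrative sketch only, assuming a Gaussian kernel density estimate; the function names are hypothetical and are not those of the Matlab SiZer software.

```python
import math

def gaussian_kde(data, x, h):
    """Kernel density estimate at x with Gaussian kernel and bandwidth h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def scale_space(data, grid, bandwidths):
    """Family of smooths: one estimated curve per bandwidth."""
    return {h: [gaussian_kde(data, x, h) for x in grid] for h in bandwidths}

def count_modes(curve):
    """Number of strict interior local maxima of a curve on a grid."""
    return sum(1 for i in range(1, len(curve) - 1)
               if curve[i - 1] < curve[i] > curve[i + 1])
```

Applying `count_modes` to each curve returned by `scale_space` shows how the number of visible bumps changes with the level of smoothing: small bandwidths give the "zoomed in" view, large bandwidths the "view from afar".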
SiZer Background (cont.)
SiZer: Significance of Zero crossings, of the derivative, in scale space
Combines:
- needed statistical inference
- novel visualization
To get: a powerful exploratory data analysis method
Chaudhuri, P. and Marron, J. S. (1999) SiZer for exploration of structure in curves, Journal of the American Statistical Association, 94, 807-823.
SiZer Background (cont.)
Basic idea: a “bump” is characterized by:
an increase, followed by a decrease
Generalization: many “features of interest” captured by
sign of the slope of the smooth
SiZer Basis:
Statistical inference on slopes, over scale space
SiZer Background (cont.)
Visual presentation:
Color map over scale space:
- Blue: slope significantly upwards (deriv. CI above 0)
- Red: slope significantly downwards (deriv. CI below 0)
- Purple: slope insignificant (deriv. CI contains 0)
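The color rule can be sketched for one (location, bandwidth) pixel in the density estimation setting. This is a simplified illustration, not the published procedure: it assumes a Gaussian kernel and a pointwise normal-theory confidence interval, whereas Chaudhuri and Marron (1999) use confidence limits adjusted for simultaneous inference. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def slope_terms(data, x, h):
    """Per-observation contributions to the derivative of a Gaussian KDE at x."""
    terms = []
    for xi in data:
        u = (x - xi) / h
        phi = math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        terms.append(-u * phi / (h * h))  # d/dx of phi((x - xi) / h) / h
    return terms

def sizer_color(data, x, h, alpha=0.05):
    """Classify the slope at (x, h) as 'blue', 'red' or 'purple'."""
    terms = slope_terms(data, x, h)
    est = mean(terms)                             # estimated derivative
    se = math.sqrt(variance(terms) / len(terms))  # its standard error
    q = NormalDist().inv_cdf(1 - alpha / 2)       # pointwise quantile (assumption)
    if est - q * se > 0:
        return "blue"    # CI entirely above 0: significantly increasing
    if est + q * se < 0:
        return "red"     # CI entirely below 0: significantly decreasing
    return "purple"      # CI contains 0: slope not significant
```

Evaluating `sizer_color` over a grid of locations and bandwidths gives one colored pixel per (x, h) pair, which is the layout of a SiZer map.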
SiZer Background (cont.)
SiZer analysis of Fossils data:
Upper Left: Scatterplot, family of smooths, 1 highlighted
Upper Right: Scale space representation of family, with SiZer colors
Lower Left: SiZer map, easier to view
Lower Right: SiCon map – replace "slope" by "curvature"
Slider (in movie viewer) highlights different smoothing levels
SiZer Background (cont.)
SiZer analysis of Fossils data (cont.)
Oversmoothed:
- Decreases at left, not on right
Medium smoothed:
- Main valley significant, and leftmost increase
- smaller valley not statistically significant
Undersmoothed:
- “noise wiggles” not significant
Additional SiZer color: gray, not enough data for inference
SiZer Background (cont.)
SiZer analysis of Fossils data (cont.)
Common Question: which is “right”?
- decreases on left, then flat
- up, then down, then up again
- no significant features
Answer: All are “right”, just different “scales of view”,
i.e. “levels of resolution of data”
SiZer Background (cont.)
SiZer analysis of Incomes data:
Oversmoothed: Only one mode
Medium smoothed: Two modes statistically significant
Confirmed by PhD dissertation of H. P. Schmitz (U. Bonn):
Schmitz, H. P. and Marron, J. S. (1992) Simultaneous estimation of several size distributions of income, Econometric Theory, 8, 476-488.
Undersmoothed: many “noise wiggles”, not significant
Again: all are “correct”, just different “scales”
SiZer Background (cont.)
Simulated example 1: Marron - Wand Trimodal, #9
n=100: only one mode "significant"
n=1000: two modes now "appear from background noise"
n=10,000: finally all 3 modes are "really there"
Simulated example 2: Marron - Wand Discrete Comb, #15
- similar lessons to above
- someday: "draw" local bandwidth on SiZer map
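To repeat the sample size experiment of simulated example 1, one needs draws from the trimodal mixture. The sketch below is an assumption-laden illustration: the component weights, means and standard deviations are the values usually quoted for Marron - Wand density #9 and should be checked against the Marron and Wand tables before use. The function name is hypothetical.

```python
import random

# (weight, mean, standard deviation) per normal component.
# ASSUMPTION: values as commonly listed for Marron - Wand #9 (Trimodal);
# verify against the original tables.
TRIMODAL = [(0.45, -1.2, 0.6), (0.45, 1.2, 0.6), (0.10, 0.0, 0.25)]

def sample_mixture(n, components=TRIMODAL, seed=0):
    """Draw n observations from a finite normal mixture."""
    rng = random.Random(seed)
    weights = [w for w, _, _ in components]
    draws = []
    for _ in range(n):
        # Pick a component by its weight, then draw from that normal.
        _, mu, sigma = rng.choices(components, weights=weights)[0]
        draws.append(rng.gauss(mu, sigma))
    return draws
```

Samples of size 100, 1000 and 10,000 from `sample_mixture` can then be passed to any SiZer implementation to see at which sample size each mode becomes significant.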
SiZer Background (cont.)
Finance "tick data": (time, price) of single stock transactions
Idea: "on line" version of SiZer for viewing and understanding trends
Notes:
- "trends" depend heavily on "scale"
- "double points" and more
- "background color" transition
SiZer Background (cont.)
Usefulness of SiZer in exploratory data analysis:
- Smoothing experts: saves time
- Smoothing beginners: avoids terrible mistakes
- don’t find things that “aren’t there”
- do find important features
- Directly targets critical scientific question:
Is a deeper analysis worthwhile?
SiZer Background (cont.)
Would you like to try a SiZer analysis?
Matlab software:
http://www.stat.unc.edu/faculty/marron/marron_software.html
JAVA version (demo, beta):
Follow the SiZer link from the Wagner Associates home page:
http://www.wagner.com/www.wagner.com/SiZer/
More details, examples and discussions:
http://www.stat.unc.edu/faculty/marron/DataAnalyses/SiZer_Intro.html
Investigation III: Zooming SiZer (cont.)
Recall time series 1: Aggregated point process data,
1 million Packet Arrival times (from 1998), over ~ 3 minutes
Recall 1st zooming autocorrelation plot
- smallest scale nearly uncorrelated (Cleveland)
- Correlation “lifts vertically”
- gets to long range dependence (folklore)
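The zooming autocorrelation idea can be sketched as binning the arrival times at several bin widths and computing the correlation of neighboring bin counts at each width. This is a minimal sketch, assuming arrival times are given as a list of floats; it shows only the lag-1 autocorrelation, and all function names are hypothetical.

```python
def bin_counts(times, t_start, t_end, n_bins):
    """Aggregate a point process into counts over n_bins equal-width bins."""
    width = (t_end - t_start) / n_bins
    counts = [0] * n_bins
    for t in times:
        if t_start <= t < t_end:
            # min() guards against float rounding at the right end.
            counts[min(int((t - t_start) / width), n_bins - 1)] += 1
    return counts

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a sequence of counts."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[i] - m) * (series[i + 1] - m) for i in range(n - 1))
    return num / denom

def zooming_autocorrelation(times, t_start, t_end, bin_numbers):
    """Lag-1 autocorrelation of bin counts, one value per number of bins."""
    return {k: lag1_autocorrelation(bin_counts(times, t_start, t_end, k))
            for k in bin_numbers}
```

More bins over the same window means a finer scale, so scanning `bin_numbers` from large to small moves from small-scale to large-scale dependence.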
Investigation III: Zooming SiZer (cont.)
Alternate view: Zooming SiZer
- local linear smoothing of bin counts to avoid "edge effects"
- across very wide range of scales
- needs more pixels than screen allows
- thus do zooming view (zoom in over time)
- zoom in to yellow boundary in next frame
- readjust vertical axis
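The reason for local linear smoothing can be sketched by comparing it with a plain kernel-weighted average at the edge of the data. This is an illustration under assumptions: a Gaussian kernel, bin centers and counts supplied as lists, and hypothetical function names.

```python
import math

def kernel_weights(xs, x0, h):
    """Gaussian kernel weights of the points xs around location x0."""
    return [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]

def local_constant(xs, ys, x0, h):
    """Kernel-weighted average (biased at the edges of the data)."""
    w = kernel_weights(xs, x0, h)
    return sum(wi * y for wi, y in zip(w, ys)) / sum(w)

def local_linear(xs, ys, x0, h):
    """Weighted least-squares line around x0; returns (level, slope) at x0."""
    s0 = s1 = s2 = sy = sdy = 0.0
    for x, y, w in zip(xs, ys, kernel_weights(xs, x0, h)):
        d = x - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        sy += w * y
        sdy += w * d * y
    det = s0 * s2 - s1 * s1
    level = (s2 * sy - s1 * sdy) / det  # fitted value at x0
    slope = (s0 * sdy - s1 * sy) / det  # fitted derivative at x0
    return level, slope
```

On exactly linear data `local_linear` reproduces the line even at the first or last bin, while `local_constant` is pulled toward the interior values. The slope returned by `local_linear` is also the quantity on which SiZer does its inference.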
Investigation III: Zooming SiZer (cont.)
Notes on Zooming SiZer:
- Coarse scales: amazing amount of "significant structure"
- reminiscent of “self-similar fractal” type process
- fewer significant features at small scale
- but they exist, so not Poisson process
- Poisson approximation OK at small scale???
- smooths (top part) "stable" at large scales?
- variation diminishes as mean increases?
Investigation III: Zooming SiZer (cont.)
Is this "significant structure"
really important?
Simple comparison:
SiZer analysis of 1 million i.i.d. uniforms
- SiZer map all purple, i.e. no structure
- except at edges
- due to using kernel density estimation
- Shows internet data wiggles are statistically significant
- But "practically significant"????
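The i.i.d. uniform comparison can be imitated on a small scale. The sketch below is a simplified stand-in for the SiZer analysis, not the analysis itself: it assumes a Gaussian kernel density estimate, a pointwise 95% interval for the slope, a much smaller sample than 1 million, and one bandwidth only. Function names and default values are hypothetical.

```python
import math
import random

def slope_color(data, x, h, z_crit=1.96):
    """Color of the KDE slope at (x, h) using a pointwise normal interval."""
    n = len(data)
    total = total_sq = 0.0
    for xi in data:
        u = (x - xi) / h
        term = -u * math.exp(-0.5 * u * u) / (h * h * math.sqrt(2 * math.pi))
        total += term
        total_sq += term * term
    est = total / n
    var = max((total_sq - n * est * est) / (n - 1), 0.0)
    se = math.sqrt(var / n)
    if est - z_crit * se > 0:
        return "blue"
    if est + z_crit * se < 0:
        return "red"
    return "purple"

def uniform_check(n=5000, h=0.03, seed=0):
    """Slope colors for i.i.d. uniforms on [0, 1]: both edges, then interior."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    left = slope_color(data, 0.0, h)    # density jumps up at the left edge
    right = slope_color(data, 1.0, h)   # density drops at the right edge
    interior = [slope_color(data, k / 10, h) for k in range(1, 10)]
    return left, right, interior
```

In the interior most locations should come out purple, with an occasional false positive because the intervals here are only pointwise. The two edges are flagged because a kernel density estimate smooths across the jump in the density at 0 and 1.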