Hi Shikharesh, Hi Azzedine,
Thanks very much for the excellent reviews of the
Performance Evaluation submission "Variable Heavy Tails in
Internet Traffic" by Hernandez-Campos, Samorodnitsky, Smith
and me. In addition to the careful attention to detail, we
also appreciate the philosophical issues that the reviewers
have raised.
Many of the points made were right on target, and we
have implemented suitable changes. Points that seem to
need additional discussion are:
Referee 1:
Overuse of quotation marks: The point is very well taken,
and the analysis very clear. We have tried hard to address
this problem, in particular following these recommendations.
Page 6: Yes, this is on target. We have added more
discussion in the Size Distribution Analysis section.
Referee 2:
Major Comments:
1. There are different personal opinions on this type of
organization, with probably something to be said for all
sides of the issue. But as there seems to be some strong
feelings on the matter at this point, we have instituted
these changes. The potential for confusion between sizes
and durations was a previously murky issue that we have
worked hard to clarify.
2. We prefer distributions that arise from "naturally
occurring phenomena". The Gaussian is the best known of
these (the natural distribution associated with summing and
averaging), and other well known ones include the Poisson,
the log-normal, and various types of mixture distributions.
In our view, a big plus of all of the distributions we use
is that they are also of this type. We are aware of earlier
work where distributions are constructed in this piecewise
way, but view this as a serious disadvantage, because we
don't know of physical processes which give rise to such
distributions. For us this outweighs the fact that one may
on occasion get slightly better fits from such distributions.
We note further that many of these early papers were fitting
data before the Double Pareto log normal distribution was
defined. But now that the distribution is known and
understood, it seems to make sense to use it as a natural
model when the data have this special form, instead of these
unnatural piecewise things. Such approaches were sensible
to use when that was all that could be done. But with the
advent of the double Pareto log normal distribution, that is
no longer the case. We recognize that this choice is
personal, and understand that people with other personal
criteria will make other choices. We contemplated adding
discussion on this point to the paper, but decided against
it on the grounds that this point is already well treated in
Gong, et al. Also we do not view it as appropriate to say
negative things about the work of Barford, Arlitt and others
because what they did was quite sensible in view of what was
known at the time.
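For concreteness, here is a small sketch (not from the paper; all parameter values are illustrative) of why we regard the double Pareto log-normal as naturally arising: following Reed's representation, a dPlN variate is the exponential of a normal plus a difference of scaled exponentials, so it emerges from a log-normal body combined with power-law tails on both ends.

```python
import numpy as np

def rdpln(n, alpha, beta, mu, sigma, rng):
    """Draw n double Pareto log-normal variates via Reed's
    representation: X = exp(mu + sigma*Z + E1/alpha - E2/beta),
    with Z standard normal and E1, E2 unit exponentials.
    The E1/alpha term gives the Pareto(alpha) upper tail,
    E2/beta the power-law lower tail, and Z the log-normal body."""
    z = rng.standard_normal(n)
    e1 = rng.standard_exponential(n)
    e2 = rng.standard_exponential(n)
    return np.exp(mu + sigma * z + e1 / alpha - e2 / beta)

rng = np.random.default_rng(0)
# Illustrative parameters only, not fitted values from the paper.
x = rdpln(100_000, alpha=1.2, beta=2.0, mu=7.0, sigma=1.5, rng=rng)
```

On a log-log survival plot such a sample shows a near-linear right end (the Pareto tail) blending smoothly into a log-normal middle, which is the shape the paper argues the traffic data take.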
3. This discussion gets into "statistical style" issues.
One way of dichotomizing statistical methods is into
"exploratory" and "confirmatory" analyses. Most good
quality data analyses involve both types, first working in
exploratory mode to understand what is happening, and then
working in confirmatory mode to be sure the ideas are
correct. In this paper we have worked mostly in exploratory
mode. The simulated envelope in the Q-Q plot is something
of a hybrid, in that it makes confirmatory suggestions while
working really in exploratory mode. One could indeed push
this further, in a more precise statistical way, by properly
modelling the variation in these curves. But we are
skeptical that this will be worth the effort, and instead
believe that for our exploratory purposes, the present
approach is adequate (and our statistical energies are
better devoted to other issues). Another reason for not
pursuing confirmatory analysis here is that with millions of
data points, it makes much less sense to use, say, classical
goodness of fit methods. This is because with such large
data sets there is a huge amount of power in the data which
is expected to result in the rejection of almost any
distribution (caused by very small scale departures from the
given distribution). About Figure 4 and the "common wobbles
for all 21", the idea was that if the wobbles were random
phenomena, then one would expect them to appear in different
locations for different realizations. We added a sentence
to clarify this point in the paper. About applying the
simulated envelope analysis to all of the 21 time blocks:
sure we did this, and as noted in Section 2.1 of our
original submission, we found very similar results (as
expected from Figure 4). We also referred to an earlier
paper which gave a web link. To make the point clearer,
we now state explicitly in the text that Hernández-Campos,
et al, (2003) is a web site.
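To make the envelope idea concrete, here is a minimal sketch of the general construction (not our exact procedure; the exponential model and all sizes are illustrative stand-ins): fit a reference distribution, simulate replicate samples of the same size from the fit, and take pointwise extremes of their order statistics as the band a correct model's Q-Q plot should mostly stay inside.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: an exponential sample standing in for real sizes.
data = rng.exponential(scale=3.0, size=2000)
n = data.size

# Fit the reference distribution (MLE for the exponential scale).
scale_hat = data.mean()

# Simulate B replicate samples of size n from the fitted model
# and sort each one to get its order statistics.
B = 100
sims = np.sort(rng.exponential(scale=scale_hat, size=(B, n)), axis=1)

# Pointwise envelope across replicates.
lower, upper = sims.min(axis=0), sims.max(axis=0)

# Fraction of the data's order statistics inside the envelope;
# for a correct model this should be close to one.
srt = np.sort(data)
inside = np.mean((srt >= lower) & (srt <= upper))
```

Wobbles that stay inside such a band are consistent with sampling variation; wobbles that recur in the same locations across blocks, or escape the band, are not.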
4. Yes, we were vague on this point. We've added some
discussion at the end of the Variable Tail Index Section.
The location of the "start of the tail" is well known in the
world of extreme value theory to be a very slippery issue.
We are unaware of any reasonable automatic way of doing this,
and in fact after understanding the "variable tail" lessons
that this paper is about, it seems that we have provided new
understanding as to why this problem can be so challenging.
5. This returns to the exploratory vs confirmatory
statistical goals discussed in point 3 above. The concern
about "arbitrariness" and "may lead to biases" seems a
little ironic, because the point of our adding envelopes of
simulated curves (which I believe nobody has previously done
in this literature) is precisely to address these issues.
But there is a clear message here that we need to give
better guidelines about how to do this properly, so we have
done so in the section on Pareto Tail Fitting. We have not
chosen to compute the KS statistic, because we don't think
it is very interpretable, as noted in point 3 above (recall
the point about nothing fitting precisely for such large
data sets).
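The power point is easy to demonstrate with a toy example (illustrative numbers, not our data): a small contamination of a normal sample is essentially invisible to the Kolmogorov-Smirnov test at a modest sample size, yet is rejected emphatically at a million points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def contaminated_normal(n):
    # 95% N(0,1) with a 5% wider component: a small-scale
    # departure from standard normality.
    base = rng.standard_normal(n)
    wide = 3.0 * rng.standard_normal(n)
    mask = rng.random(n) < 0.05
    return np.where(mask, wide, base)

# At a modest sample size the departure is hard to detect...
p_small = stats.kstest(contaminated_normal(500), "norm").pvalue
# ...but with a million points the test rejects decisively.
p_large = stats.kstest(contaminated_normal(1_000_000), "norm").pvalue
print(p_small, p_large)
```

This is why, for data sets of our size, a tiny KS p-value carries almost no information about whether a distribution is a useful model.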
6. One could always look at more and more data sets, and
there are generally good reasons such as those raised for
doing so. But one also needs to stop and publish at some
point, and it seems OK to us to stop at this point. For
this paper we already expanded beyond those that we had
done for the MASCOTS and Allerton proceedings.
7. Regular variation is a standard concept in the branch of
probability called extreme value theory. Thanks for
pointing out that we should include a reference.
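For the reference we will use the standard definition from extreme value theory: a distribution has a regularly varying right tail with index alpha > 0 when

```latex
\bar F(x) \;=\; 1 - F(x) \;=\; x^{-\alpha} L(x),
\qquad
\lim_{x \to \infty} \frac{L(tx)}{L(x)} = 1 \quad \text{for all } t > 0,
```

where L is called slowly varying; the Pareto tail is the case of constant L.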
8. We have discussed this point right at the very end.
Referee 3:
1. Yes, there are many ways to organize a paper, and there
does not seem to be an overall notion of best, probably
because of differing personal preferences. We prefer the
organization that we used, and would like to call this an
issue on which "two good authors may differ".
2. Good point, we have worked hard on this.
Best,
Steve