Statistics 6D,   Visualizing Data

Class Notes:  Thursday 9/10/02
 
 


    -    Check new material on student pages (from Class Home Page)

    -    Excel construction of histograms (from Computing Tips)

(need to finish "chosen bins" part)




Continue Analysis of Buffalo Snowfall data...

Background:    City of Buffalo, N.Y., known for heavy snows

Data:    Time series of annual accumulated snowfall (inches)
 


Recall Excel default histogram constructed in:  Toy Example Excel File
 

Comments:

    -    Excel chose binwidth  =  ~14

    -    Only 8 bins chosen; is the binwidth too large?

    -    Too few bins for "serious structure"?

    -    Note one year unusually small
 
 


Binwidth deliberately "too small"

    -    Tried binwidth  =  3

    -    Requires many bins to include all the data

    -    Histogram looks "very bumpy"

    -    Hard to see "large scale features of distribution"
 
 


Binwidth "clearly too big"

    -    Tried binwidth  =  30

    -    10 times as big as above

    -    Averages taken over too big a range

    -    Obscures potential interesting population structure
 
 


Binwidth "about right"???

    -    Tried binwidth  =  10

    -    "In between" the two binwidths above?

    -    Large enough to remove "sampling artifacts"?

    -    Small enough to suggest 3 modes?

    -    Interesting question:    are modes "important underlying structure"???
 


Again highlights important issue for histograms:   choice of binwidth
 
 

Recommendation:  try several binwidths

    Including both too big, and too small
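
A minimal Python sketch of this recommendation, for anyone who wants to experiment outside Excel. The data are made-up stand-in values (NOT the Buffalo snowfall series), and the helper names histogram_counts and print_histogram are hypothetical. It prints a text histogram for binwidths 3, 10 and 30:

```python
import math
import random

def histogram_counts(data, binwidth, start=None):
    """Count observations in consecutive bins of equal width.

    Bin k covers [start + k*binwidth, start + (k+1)*binwidth).
    """
    if start is None:
        # Align the first bin edge to a multiple of the binwidth.
        start = math.floor(min(data) / binwidth) * binwidth
    nbins = int((max(data) - start) // binwidth) + 1
    counts = [0] * nbins
    for x in data:
        counts[int((x - start) // binwidth)] += 1
    return start, counts

def print_histogram(data, binwidth):
    """Print one row of stars per bin."""
    start, counts = histogram_counts(data, binwidth)
    print(f"binwidth = {binwidth} ({len(counts)} bins)")
    for k, c in enumerate(counts):
        lo = start + k * binwidth
        print(f"  [{lo:5.0f}, {lo + binwidth:5.0f})  {'*' * c}")

# Made-up stand-in data, generated only for illustration.
random.seed(0)
data = [random.gauss(80, 24) for _ in range(60)]

for bw in (3, 10, 30):
    print_histogram(data, bw)
```

Aligning the first bin edge to a multiple of the binwidth is just one convention; shifting the bin edges can also change the picture, so that is worth trying as well.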
 
 


Third Class Assignment:    Explore a new data set with histograms

    -    Start with data in spreadsheet StudyHabitsIndexData.xls

            *    Number that attempts to quantify "quality of study habits"

            *    Measured for 18 females and 20 males

            *    How do the populations compare???

    -    Address this question by an Excel analysis based on histograms

            *    Just try something, then we compare and discuss

    -    Display your results and conclusions on a new web page

            *    Linked to your home page

            *    You select format and style of presentation

            *    But insert some graphics generated by Excel

    -    Some graphics ideas to consider:

            *    Look at two separate histos, or some "combined version"???

            *    I.e. single graphic showing both "together" (experiment with Excel)

            *    Answers depend on binwidth, how to effectively display several?

    -    Some additional questions (answer on your web page, w/ discussion):

            *    Which group "looks better on average"?

            *    Can you "quantify this idea"?  (e.g. give numerical measures)

            *    Which group "looks more spread" (i.e. has "greater variation")?

            *    Quantify this idea by using the STDEV function in Excel

            *    Suppose you are an employer who must hire
                     somebody from one of the two groups.
                     Would you hire a female or a male, if:

                        +    You are forced to choose "at random"

                        +    You can carefully select from a large group of each type

                     Why?
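
For comparison with the Excel approach, here is a small Python sketch of "numerical measures" of center and spread. The scores are hypothetical (they are not taken from StudyHabitsIndexData.xls) and the helper name summarize is made up. Python's statistics.stdev uses the n - 1 divisor, as Excel's STDEV does:

```python
import statistics

# Hypothetical scores, for illustration only.
group_a = [52, 61, 58, 70, 66, 59]
group_b = [40, 75, 55, 82, 48, 63]

def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)

for name, scores in (("A", group_a), ("B", group_b)):
    center, spread = summarize(scores)
    print(f"group {name}: mean = {center:.1f}, stdev = {spread:.1f}")
```

In this made-up example the two means are close while the spreads differ a lot, which is the kind of contrast the hiring question asks you to think about.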
 
 


Fun Questions:

    -    How should data be gathered?

    -    Does it make much difference??

    -    Are larger samples always better???
 


An interesting historical context:

Political polls for presidential elections



Source (also has additional related information):


1936:   Roosevelt vs. Landon

Popular Poll:    Literary Digest Magazine

    -    Correctly called every election since 1916

    -    Mailed survey to 10 million voters

    -    Got 2.4 million responses

    -   Largest political poll ever in history!
 

Results:

"Landslide" for Roosevelt, but Literary Digest totally missed!
 


Why???

    +    Problem with how the sample was chosen!

    +    Who got the survey form?     ("selection bias")

            -    Literary Digest readers

            -    Addresses from phone books

            -    Addresses from country club membership lists, ....

    +    Who filled out the form?    ("nonresponse bias")

            -    Makes sample "even less representative" of population...

    +    When sampling is biased, a bigger sample size doesn't help;

         it only repeats the mistake on a larger scale!

Big Lesson:    need a sample that is representative of the population
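
A rough simulation sketch of this lesson in Python. Everything here is invented for illustration: the population size, the fraction of people on mailing lists, and the support rates are assumptions, not 1936 data. It compares a large sample drawn only from "listed" people with a much smaller simple random sample:

```python
import random

random.seed(1936)  # fixed seed so the run is repeatable

# Hypothetical population: each person is (on_list, supports_candidate).
N = 200_000
population = []
for _ in range(N):
    on_list = random.random() < 0.30        # appears on a mailing list
    p_support = 0.40 if on_list else 0.70   # assumed: listed people differ
    population.append((on_list, random.random() < p_support))

def support_rate(people):
    """Fraction of the given people who support the candidate."""
    return sum(supports for _, supports in people) / len(people)

true_rate = support_rate(population)

# Large but biased sample: drawn only from people on the lists.
listed = [person for person in population if person[0]]
biased_big = random.sample(listed, 20_000)

# Small simple random sample from the whole population.
random_small = random.sample(population, 1_000)

print(f"true support in population:     {true_rate:.3f}")
print(f"biased sample (n = 20,000):     {support_rate(biased_big):.3f}")
print(f"random sample (n = 1,000):      {support_rate(random_small):.3f}")
```

Making the biased sample even larger only pins down the wrong number more precisely, since it estimates support among listed people rather than in the whole population.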
 
 


An alternate survey method:   quota sampling

    -    Done by Gallup poll in 1936

    -    Idea:  try hard to "make sample like population"

    -    Avoid non-response bias by personal interviews

(today done by telephone)

    -    Each interviewer has quotas:

                ___%  male
                ___%  income groups
                ___%  religion ....

    -    Used sample of only 50,000 (<< 2.4 million) to correctly call election

    -    Also correctly called L. D.'s bad prediction

(by asking: "did you return the L. D. survey?")

    -    Quota sampling was used successfully until....
 


1948:    Truman vs. Dewey
 

The polls, and the results:

Famous Picture:    Truman smiling with newspaper headlined "Dewey Defeats Truman"
 

Why???    Problem with quota sampling?

"Unintentional bias"    -    consequence of "human choice" of pollsters

    -    E.g. may prefer to search for quota in "nicer neighborhood"

    -    Always gave 5 - 6% error (also in 1936)

    -    Only mattered for this close election
 

Main Lesson:   Can't get a "representative sample" by human choice!
 


Toy E.g:    Choose a "random number" from {1,2,3,4}

Interesting Fact:    Too many people tend to choose 3
 

Solution:    Choose samples "at random"
 

Paradoxical terminology:  Random sampling is called "scientific sampling"

(has a better sound)





What does "random sampling" mean?

Each member of the population is "equally likely to be in sample".
 

Toy E.g:    Use mechanism where each of 1,2,3,4 is chosen "1/4 of the time"
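
One possible such mechanism, sketched in Python with the standard random module (the seed, the number of draws, and the population labeled 1..100 are arbitrary choices for illustration):

```python
import random
from collections import Counter

random.seed(6)  # fixed seed so the run is repeatable

# Mechanism: each of 1, 2, 3, 4 is equally likely on every draw.
draws = [random.choice([1, 2, 3, 4]) for _ in range(10_000)]
counts = Counter(draws)
fractions = {v: counts[v] / len(draws) for v in (1, 2, 3, 4)}
for value, frac in fractions.items():
    print(f"{value}: chosen {frac:.3f} of the time")

# Same idea for sampling: pick 10 distinct members from a
# hypothetical population labeled 1..100, all equally likely.
sample = random.sample(range(1, 101), 10)
print(sorted(sample))
```

Unlike people asked to "pick a number", the mechanism has no preference for 3: over many draws each value should come up close to 1/4 of the time.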
 

Note:    This motivates study of probability theory
 
 

Added payoff to "scientific" (random!) sampling:

Can use Probability Theory to quantify uncertainty!

(learn about this in other statistics courses)
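
As a small preview of what that quantification can look like (a sketch only; the formula is the standard one for a proportion under simple random sampling, and is not derived in these notes): the standard error of a sample proportion is sqrt(p(1-p)/n). The helper name standard_error and the sample sizes are illustrative choices:

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

# p = 0.5 gives the largest standard error for a given n.
for n in (100, 1_000, 50_000):
    se = standard_error(0.5, n)
    print(f"n = {n:>6}: standard error = {se:.4f}")
```

This only describes the uncertainty due to random sampling; it says nothing about the error from a biased sample, which does not shrink as n grows.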
 
 


Generate random numbers using Excel,

Part 10, in Computing Tips
 
 

