California State University, Hayward

Journal of Statistics Education v.3, n.2 (1995)

Copyright (c) 1995 by Bruce E. Trumbo, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

*Restrictions.* The programs accompanying this article are
also copyrighted 1995 by Bruce E. Trumbo, all rights reserved.
Permission to use the programs and any portion of this article
for any nonprofit educational purpose is hereby granted, except
that any use that involves gambling for money or for items or
services of value is not permitted.

Although extensively tested, the computer programs should be regarded as developmental. They may contain errors, some of which may cause unpredictable results, including computer "crashes." Source code is not available. Please report errors to the author. NO WARRANTY OR REPRESENTATION OF FITNESS FOR ANY PURPOSE IS MADE OR IMPLIED. The user assumes all risks of any kind and waives the right to claim damages.

*Requirements.* Programs in this series are intended for
use with IBM-PC compatible machines equipped with EGA graphics
or better, although some of them may run successfully on other
machines.

**Key Words**: Bivariate normal distribution; Correlation;
Expectation; Hypergeometric distribution; Gambling; Keno;
Pedagogy; Simulation.

In this second paper of a series, two programs for EGA-equipped IBM-PC compatible machines are included with indications of their pedagogical uses in the teaching of elementary probability and statistics. Concepts illustrated include the coefficient of correlation, the expectation of a discrete distribution, the concept of a fair game, and the hypergeometric distribution. Three datasets useful for illustrating correlation are also documented and appended.

1 As in the first paper in this series (Trumbo 1994), the emphasis here is on building student intuition and understanding of probability concepts through the use of simple computer programs in class. Both of the programs made available with this article are based on simulation. Of course, the results of simulations can be presented without having a computer, but the advantage of interactive computer use is that simulations can be repeated often enough that general principles can be discerned above the inevitable eccentricities of each individual simulation run.

2 The package of probability demonstration programs included with Trumbo (1994) consisted of the following: BRUN40.EXE (Microsoft utility), PROBDEMO.EXE (Entry program), PDMEN.EXE (Menu program), PDLLN.EXE (Part 1), and PDPPR.EXE (Part 2). The present article adds PDBNS.EXE (Part 3) and PDKEN.EXE (Part 4) to the list of programs released.

3 Each of the four demonstration programs released to date is called from a Main Menu activated by the command PROBDEMO. The menu program is written to detect the presence of these four programs (as well as others that may be released later), displaying only programs currently loaded into the same directory. The title page of each program tells how to start running it. For all except the program in Part 4, the F1 and F2 keys give technical details and additional options for more advanced users.

4 These programs were written for student use in computer labs, and most of the material provided here is intended to help instructors and students use them to the best advantage in that setting. However, if suitable projection equipment is available, the programs could also be used in large lecture sessions, perhaps with commentary partially based on the laboratory notes provided.

5 Readers are referred to Trumbo (1994) for further comments on the rationale, design, development, testing, advantages, limitations, and classroom use of the programs in this series.

6 Part 3 is intended to give students an intuitive grasp of the correlation coefficient, including an understanding of the meaning of various numerical values of correlation between -1 and +1. (Advanced students can benefit from establishing some of the distribution theory involved in the workings of the program.)

7 Part 4 allows students to play a simulation of the casino game Keno. It provides insight into the lure of a gambling game in which highly advertised large winnings are possible, but at which the player will, on average, lose more than a quarter of the money bet. The hypergeometric probability distribution is used to compute probabilities and expected winnings for this game---two crucial kinds of information not generally provided by casinos.

8 Part 3 simulates the drawing of 500 observations from a bivariate normal distribution. Both means are fixed at zero and both standard deviations are fixed at unity. The user may select any value of the population correlation \rho between -1 and 1.

9 A bivariate normal density function can be viewed as a mound-shaped surface above a plane. The shape of the mound varies as \rho varies, but it is always possible to find an ellipse in the plane above which exactly 95% of the probability (volume) of the mound lies. The shape of this ellipse can vary from circular (for \rho = 0) to "football" shaped (for \rho around +.7 or -.7) to "cigar" shaped (for \rho nearer to +1 or -1). In the present graphical display, the inclination of the principal axis of the ellipse changes from +45 degrees to -45 degrees depending on whether \rho is positive or negative. As the simulation begins, this ellipse appears in deep blue background on the plotting axes. Its shape provides a valuable visual link to the numerical value of \rho. When the simulation is complete this ellipse should contain about 95% of the points. Thus, about 5% of the 500 points (about 25) will lie outside the ellipse.
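The 95% claim is easy to check numerically. For unit variances and correlation \rho, the quadratic form (x^2 - 2 \rho xy + y^2)/(1 - \rho^2) has a chi-squared distribution with 2 degrees of freedom, and 5.99 is its .95 quantile, so the ellipse is the set of points where this form is at most 5.99. The following Python sketch (a modern illustration, not part of the original DOS package) simulates points and counts the fraction falling inside:

```python
import math
import random

random.seed(1)
rho = 0.7            # any population correlation strictly between -1 and 1
n = 100_000          # many points, so the observed fraction is stable

inside = 0
for _ in range(n):
    # Bivariate normal observation with zero means, unit variances, correlation rho
    x = random.gauss(0.0, 1.0)
    w = random.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho**2) * w
    # Quadratic form is chi-squared with 2 df; 5.99 is its .95 quantile
    q = (x*x - 2.0*rho*x*y + y*y) / (1.0 - rho**2)
    if q <= 5.99:
        inside += 1

print(inside / n)    # close to 0.95
```

With only 500 points per run, as in the program, the observed fraction naturally fluctuates a bit more around 95%.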

10 As the simulated points are sampled from the bivariate normal distribution they are plotted on the axes, and the updated sample correlation r is printed after every 5 observations. After all 500 points are plotted, the user is asked if he or she wants to do another simulation. The possible responses are K for "Keep the same value of \rho and run again," Y for "Yes, with the opportunity to select a different value of \rho," and N for "No, return to the title page."

11 Not surprisingly, most students who encounter the idea of correlation for the first time have no feel for what the various numbers in the allowable range from -1 to +1 mean. In particular, many expect that any correlation above .9 or below -.9 will correspond to almost perfect fit to a line. Exploring with this program helps to build intuition about sample correlations in several ways:

1. Students can experiment with several runs each of various values of the population correlation. The sample correlation in each case will be fairly near the selected value of \rho. Thus, they can form an impression of what a correlation of .5 or .9 or -.99 looks like.

2. The distinction between population and sample correlation is illustrated. The population correlation \rho is a parameter, which is chosen when one specifies a particular bivariate probability model. The sample correlation is a random quantity, slightly different with each run, resulting from the sampling process. The near agreement of the sample correlation with the population correlation shows that the sample correlation can be used as an estimate of the population correlation.

3. The ellipse in the background ("football" or "cigar" shape) gives students a vivid graphical image to associate with the population correlation.

12 The F1 key shows technical details about the generation of the bivariate normal observations. The Box-Muller method is used to convert pairs of independent uniform random variables, generated easily by the computer, into pairs of independent standard normal random variables, which are then transformed to yield a bivariate observation with the appropriate correlation. (See Zelen and Severo in Sec. 26.8 of Abramowitz and Stegun (1972) or Sec. 6.5.1 of Kennedy and Gentle (1980).) Plotting peculiarities for the special cases where \rho is equal to +1 or -1 are also discussed briefly. The F2 key shows additional options (monochrome operation, if needed for projection, and speed controls). Return to the title page from either the F1 or F2 screen by pressing ESC.
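The two transformations described above can be sketched in a few lines. This Python version is offered only as an illustration of the method (the original program is a compiled DOS executable whose source is not available):

```python
import math
import random

def box_muller(u, v):
    """Convert two independent Uniform(0,1) variates into two
    independent standard normal variates (Box-Muller method)."""
    r = math.sqrt(-2.0 * math.log(u))
    return r * math.sin(2.0 * math.pi * v), r * math.cos(2.0 * math.pi * v)

def bivariate_normal(rho):
    """One observation (x, y) from the standard bivariate normal
    distribution with correlation rho."""
    u = random.random() or 1e-12   # guard against log(0); u in (0, 1)
    v = random.random()
    w, x = box_muller(u, v)
    # Linear combination of independent standard normals with the
    # appropriate coefficients yields correlation rho.
    y = rho * x + math.sqrt(1.0 - rho**2) * w
    return x, y
```

Five hundred calls to bivariate_normal with a chosen rho reproduce the kind of scatterplot the program draws.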

13 Ideally, the following should be done just after the concept of correlation or formula for the coefficient of correlation has been introduced in lecture:

14 Ask students what their reaction would be if you were to run your eye around the classroom and guess that the average height of the students present is about 75 inches (6'3"). Ask what a better guess would be.

15 Recall the Empirical Rule, which says that (for roughly normal data) 95% of the observations will fall within two standard deviations of the mean. Armed with this information, ask them what their reaction would be if you guessed the standard deviation of the weights in the class to be about two pounds. Ask what a better guess would be.

16 Ask for 10-15 student volunteers willing to disclose (truthfully) their heights and their weights. As each student provides this information, plot the point on an overhead projector or blackboard for all to see. Also input these numbers into a computer with projection display running a statistical package such as Minitab, or enter the numbers into a hand calculator. Prepare to reveal the correlation, but do not do so yet.

17 Alternative scenario: Gather the height and weight information at the end of the previous class and have an overhead projection slide and sample correlation figure ready. (I like to appear to live dangerously, but always have a back-up dataset from some previous class just in case not enough volunteers emerge or the computer crashes.)

18 Typically, of course, there is a positive association between heights and weights. Ask what the class thinks would be a good guess for the correlation represented by the scatterplot of heights and weights. If this discussion is taking place when the concept of correlation is very new to the class, you will mainly get blank stares and bad guesses. Reveal the correct answer.

19 Tell students that a major purpose of the computer demonstration to follow in lab is for them to be able to gain enough intuition that they will be able to make rough, but intelligent guesses as to the correlation represented by a scatterplot.

20 Optional additional demonstration: Show overhead projection slides with scatterplots of several bivariate datasets. (These should be scaled so that the height and width of each scatterplot are approximately equal, and so that each scatterplot uses most, but not all, of the height of the slide.) Ask students which slides represent a clear positive association, which represent clear negative association, and which represent no significant association at all. Comparing two datasets that exhibit positive association, point out the one with the larger association and stress that it will have a correlation nearer to +1 than the other.

21 Several datasets are documented in Appendix B and provided with this article. Other datasets might be obtained from the text you are using. The height and weight data collected in class might also be made available. For the larger datasets you will probably want to use Minitab or some other statistical package to produce the scatterplots.

22 In designing a scatterplot, attention needs to be given to the selection of the scales. The amount of unused space surrounding the data cloud can influence perception of the correlation (Cleveland et al. 1982). I think it is a mistake to introduce this complication too early while students are learning to use scatterplots to understand correlation. Thus the program has been designed so that the only changeable parameter is \rho. (Both population means are fixed at 0 and both population standard deviations are fixed at 1.) The parenthetical suggestion just above on scaling scatterplots for classroom demonstration is in this same spirit.

23 After students have had experience using scatterplots and begin to feel comfortable using them to interpret correlation, you may want to show them examples in which the same data are presented on several scatterplots with different scales. (See, for example, Moore 1995, page 112.) My preference for an elementary course is to do just enough of this so that students will be aware that proper scaling is important. Many computer packages for statistical analysis (including Minitab) give you the ability to manipulate scatterplot scales at will. At the suggestion of a referee I have added to my program the capability to double the scale of either or both axes; the blue ellipse (which, of course, is not a feature of ordinary scatterplots where scaling may be an issue) may also be suppressed. (From the Title Page of the program, press F2 for instructions on the use of the scaling and ellipse suppression options.)

24 There is considerable literature on the making and interpretation of scatterplots. Additional references, some dealing with issues beyond correlation, are Strahan and Hansen (1978), Shipp and Margolin (1982), Cleveland and McGill (1984), Raveh (1985), Huber (1987), Lewandowsky and Spence (1989), Meyer and Shinar (1992), and Spence and Garrison (1993).

25 Type PROBDEMO at the DOS prompt and press ENTER at the "copyright" page to get to the "Main Menu," where you should select item 3. To adjust your monitor for the most effective use of this program, press ENTER to start the program and type in the value 0 for "rho". On this adjustment run, ignore the points and concentrate on the blue figure in the background.

(a) Adjust the brightness of your monitor so that this figure is clearly visible, but not really bright, and

(b) if your monitor has a vertical size adjustment, try to adjust it so that the figure is a perfect circle.

26 Notes: (1) The symbol \rho is not conveniently available on the PC screen, so the name of this symbol "rho" is spelled out. We use the symbol \rho in the rest of these lab notes. (2) In your work with this program, you may wish to change the speed at which it runs. Press \ (slow) or / (fast) at the title page or continuation prompt to do so; return to normal speed with |.

27 Type N to return to the title page. On the title page it is suggested that you try the following values of \rho: 0, -.2, .4, -.8, .9, -.99, .999, and 1. Do this. (Press Y after each run to do another run with a different value of \rho.) Then let a neighbor pick several of these values (perhaps changing the sign of some of them), covering the bottom line of the screen so that the values of \rho and r are not visible to you. Can you figure out which values of \rho your neighbor chose for each run? Pick several values of \rho for your neighbor to guess from scatterplots.

28 Some special features of the program may help in such guessing games: If the person guessing does not watch the simulation run, the person running the computer can press L when it is complete to hide the legend on the bottom line. If you are working on your own, return to the Title Page and press G. This will put the program into a mode in which it selects values of \rho for you, and hides the resulting numbers until the run is completed and you have had a chance to make your guess. Also, you can press X, at the Title Page or when asking for the next simulation run, to keep the ellipse from being plotted in the background; you will probably find that this makes guessing more challenging.

29 How accurate are your guesses expected to be? You are not trying to substitute your eyeball for a computer. You are only trying to get a rough idea what various values of r look like in practice. Correlations range from -1 to +1. For values near +1 and -1 you should be able to guess r correctly to the nearest 0.1 or even better; for values nearer to 0 you may be off by as much as 0.2 or even 0.3. If r is far enough from 0 you should have no trouble seeing whether it is positive or negative.

1. Do five runs of the program using any values of \rho you wish (for easiest viewing, it may be best to keep \rho between -.9 and +.9). The blue ellipse shown for each run is supposed to contain 95% of the 500 points. Thus, it is expected that roughly (.05)(500) = 25 points in each run will fall outside the ellipse. For each of the five runs do your best to count the points that fall outside. (For some of the ones near the boundary you may have to guess.) Do your results seem to confirm that about 5% of the points fall outside of the ellipse?

2. The value of the POPULATION correlation \rho that you choose is a constant for any one probability model; it is a fixed "population parameter." Notice, however, that the SAMPLE correlation r is a random variable. It will be different for each sample of 500 observations you select from the population. Why is r different from \rho?

(a) Choose \rho = .9, then do five runs with this same value of \rho (press K after all but the last run), writing down the five values of r that result.

(b) Repeat the steps in (a), but use \rho = .3.

(c) Are your values of r more variable in (a) or in (b)?

(Answer: Of course we cannot say for sure what happened in your particular simulation runs, but the theoretical variance of r [i.e., V(r)] is smaller when \rho = .9 than when \rho = .3. It would be very surprising if your sampling did not produce results in the same direction.)

3. For this exploration put the program into slow mode by pressing \ at the title page or the continuation prompt before starting each run. (You will see the \-symbol in the lower right-hand corner of the screen when you do.) In slow mode the first 20 points are plotted very slowly. For five runs with \rho = .9, try to note the value of r when n is about 10. (Values print only at n = 5, 10, 15, etc. Do not worry if you have to settle for 15 instead of 10 or if you botch a run completely and must try again.) For each of the five runs, note the values of r very early in the run (n = 10 or 15) and at the end of the run (n = 500). Anytime after you have written down the value of r early in each run, you can press | for medium or / for fast speed.

The sample correlation r is an estimator of the population correlation \rho. As with other widely used estimators in statistics, r tends to be a better estimator when it is based on a larger sample. Do your records from the five runs confirm this principle? (Answer: During a run, the values of r tend to fluctuate at first before beginning to settle down to something near \rho. Of course, it is possible for a run to begin with nearly the correct value of \rho and then to drift somewhat afield, but this is much less common.)

30 Note: Questions 4-6 use datasets provided in Appendix B. The instructions below are written for use with Minitab (and the Minitab worksheet versions of the data files), but they can be adapted for use with other statistical packages (and the ASCII versions of the data files). Especially if you are using Minitab, you may want to provide your students with a copy of Appendix C which shows explicitly how to proceed with Question 4.

4. In a study of the concentration of red blood cells in blood samples from newborn babies, two measures of red blood cells were used: The "hematocrit" is a measure of the percent by volume of blood that consists of red blood cells. The "hemoglobin" is a quantitative chemical analysis for the protein hemoglobin, the substance that gives these cells their characteristic red color. These are two ways to try to attach quantitative measures to the concept "concentration of red blood cells." They should be highly correlated. Anemic babies (ones with not enough red blood cells) should be low on both scales; polycythemic babies (ones with too many red cells) should be high on both scales.

In Minitab, retrieve the worksheet REDCELL.MTW and make a scatterplot of hematocrit (HCrit) against hemoglobin (Hgb). Try to find the center of gravity of the data cloud by estimating the sample means of each variable and finding the corresponding point on the plot. Try to imagine an ellipse centered there, of the right shape to match the data cloud, and of the right size to contain 95% of the data points---all but perhaps a couple of them. Try to make a rough intuitive guess as to the value of r. (Write down your guesses for the means and the correlation before you continue. It would be very surprising if your guesses were exactly correct, but you won't develop your ability to make educated guesses without practicing.)

Finally, use Minitab to find the means of the two variables and the correlation between them. How well did you locate the center of gravity? How good was your estimate of r?

5. The worksheet EUROPREC.MTW contains annual precipitation (in mm) for three European cities, Manchester, Paris, and Madrid, for the 100-year period 1870-1969. "Precipitation" is rainfall plus snowfall converted to equivalent amounts of rain. Weather patterns in Europe would lead to the supposition that rainfall for Manchester and Paris will show a significant correlation, whereas precipitation for Manchester and Madrid (distant cities in different climates) would not. Follow the same procedures (scatterplot, guessing, correlation, etc.) as in Question 4 twice: once for Manchester-Paris and once for Manchester-Madrid. What values of r did you guess? What are the actual computed values? Which pair of cities shows the highest correlation?

6. The worksheet RAINGRAD.MTW shows high school graduation rates (in percents) and typical annual precipitation (in inches) for the 50 United States plus the District of Columbia. Follow the same procedures as in Question 4 for plotting and guessing r. Can you think of a mechanism or rationale to explain this correlation? If so, write down your speculation. If not, write an explanation of how you think it was possible to find data with a value of r so far from 0. (Do this before you look at the answer.)

(Answer: Here is how the data were actually obtained: For such a small sample size (here n = 51) it does not take much looking around through an almanac to find two variables that happen to show a correlation quite different from 0. It would be very difficult to imagine a causative link between rainfall and high school graduation rates.)

31 Students near the end of a first calculus-prerequisite course in probability theory or students in a second probability course should be able to demonstrate the validity of the Box-Muller transformation and of the transformation used to obtain bivariate normal observations (as presented briefly on the page that shows when F1 is pressed from the title page of Part 3). These exercises are phrased formally below as Advanced Problems 1 and 2. Students can also be expected to derive the equation for the ellipse that contains 95% of the probability---Advanced Problems 3 and 4 below.

Advanced Problem 1. Let U and V be independent random variables, each distributed uniformly on the interval [0, 1). Show that the random variables W and X defined below are independent random variables, each distributed standard normal:

W = \sqrt{-2 \ln U} \sin(2 \pi V),

X = \sqrt{-2 \ln U} \cos(2 \pi V).

Advanced Problem 2. If W and X are independent random variables, each distributed standard normal, and if Y is defined as below, then show that (X,Y) has a bivariate normal distribution with E(X) = E(Y) = 0, V(X) = V(Y) = 1, and correlation \rho:

Y = \rho X + \sqrt{1 - \rho^2} W.

Advanced Problem 3. With W and X defined as in Advanced Problem 2, show that the random vector (W,X) falls inside the circle

W^2 + X^2 = 5.99

with probability .95.

Advanced Problem 4. Use the result of Advanced Problem 3 and the definitions of X and Y in Advanced Problem 2 to find the equation of the ellipse that contains the point (X,Y) with probability .95.
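The constant 5.99 in Advanced Problem 3 can be checked numerically. Since W^2 + X^2 is the sum of squares of two independent standard normals, it is chi-squared with 2 degrees of freedom, which is the same as the exponential distribution with mean 2. The quick Python check below is offered only as a hint toward the problem, not a full solution:

```python
import math

# P(W^2 + X^2 <= c) for independent standard normals W, X:
# chi-squared with 2 df is exponential with mean 2, so the CDF is 1 - exp(-c/2).
c = 5.99
p = 1.0 - math.exp(-c / 2.0)
print(round(p, 4))   # prints 0.95
```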

32 I usually use this program in class as soon as the idea of correlation has been introduced. Some elementary books introduce the sample correlation r without mentioning the population correlation \rho. In this case, I explain briefly in lecture that just as \mu is the population parameter corresponding to \bar{X} (quantifying centrality) and \sigma is the population parameter corresponding to s (dispersion), so---for bivariate data---\rho is the population parameter corresponding to r (association). Later, when the population correlation has been formally introduced, we take another look at the program, emphasizing the distinction between the population parameter and the statistic, which can be used as its estimate.

33 Finally, towards the end of a first calculus-based probability course or in a second course, I have found that students are motivated to look at the mathematics behind the program, as outlined in the advanced problems given above.

34 A demonstration somewhat parallel to this program, but not quite as graphically elegant or easy to use, can be done using Minitab (see Appendix A). This demonstration uses Minitab's procedure for generating standard normal random variables, so the procedure for obtaining standard normal variates from uniform ones is not explicitly displayed.

35 Note: Instructors interested in a comprehensive collection of simulation experiments in statistics using Minitab may wish to consider Keller (1994). Although this book has no experiments specifically dealing with correlation, and its primary emphasis is computational rather than graphical, many of the author's purposes are similar to the ones that prompted my programs. Spurrier et al. (1995) also contains laboratory material appropriate for elementary statistics and probability courses.

36 Keno is a lottery game played in many gambling casinos. Typically, 80 balls numbered from 1 through 80 are agitated in an air stream by a machine so that 20 of them can be selected at random during play. Before play begins the gambler marks a ticket printed with the numbers from 1 through 80 in an effort to predict some of the numbers that will be among the 20 drawn. Many variations of the game are possible, but we concentrate here on the simplest, in which the gambler decides to "mark" a certain number of "spots" (i.e., predict a certain number of balls that will be selected); for us the number of predictions must be between one and nine.
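The mechanics of a single game are easy to mimic in a few lines. Here is a minimal Python sketch of one five-spot game (the marked numbers are arbitrary choices, and this is not the program distributed with the article):

```python
import random

random.seed(7)
marked = {3, 15, 27, 44, 61}            # five "spots" marked by the gambler (arbitrary)
draw = random.sample(range(1, 81), 20)  # 20 of the 80 balls, drawn without replacement
hits = len(marked & set(draw))          # "hits": marked numbers among those drawn
print(hits)
```

Because the draw is a simple random sample, the gambler's choice of numbers has no effect on the distribution of hits.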

37 After the 20 balls are selected by the machine, the casino notes how many "hits" (successful predictions) the gambler has accomplished. Each casino publishes lists of payoffs that depend on the amount bet, the number of spots marked, and the number of hits achieved. Relatively small proportions of hits receive no payoff. Relatively large proportions of hits (although extremely unlikely) can yield very large payoffs. The payoff schedules used in this program are ones recently advertised by Harrah's casinos in Reno and South Lake Tahoe, Nevada. The maximum possible payoff, for perfect tickets with large numbers of spots marked, is $50,000. Here is the payoff schedule for $2 bet on a game with only five spots marked, in which the maximum payoff is $1640.

        Number        Corresponding
        of Hits          Payoff
        ---------------------------
           0                0
           1                0
           2                0
           3                2
           4               18
           5             1640

38 One goal of the casino, evident from the way Keno games are promoted, is to get the gambler to focus on the largest possible payoff. The minuscule probabilities of such bonanzas are not advertised. Another goal seems to be to foster the illusion among gamblers that Keno is a game of strategy and skill in which there is some advantage in making wise guesses. It would be illegal to make this claim forthrightly. (But it is not illegal to encourage players to give very careful consideration to the numbers they pick, nor to act as if it is worthwhile to go through this thought process afresh for each new game.) Honestly played, Keno is a game of pure chance owing to the randomness of the selection of balls.

39 The probabilities of each number of hits can be computed using the hypergeometric probability distribution. The total number of outcomes from the drawing is the number of combinations of 80 things taken 20 at a time: C(80, 20) = 3.54 x 10^18. The number of ways to achieve exactly four hits is the number of ways to pick four hits out of five spots marked, C(5, 4) = 5, times the number of ways to pick 16 non-hits out of the 75 spots not marked, C(75, 16) = 8.55 x 10^15. The product is 4.28 x 10^16. Thus, the probability of getting exactly four hits is (4.28 x 10^16)/(3.54 x 10^18) = .01209.
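These counts are exact binomial coefficients, so the rounded figures above can be reproduced directly, for instance with Python's math.comb:

```python
from math import comb

# P(exactly 4 hits) on a five-spot ticket: choose 4 of the 5 marked spots
# to be hits and 16 of the 75 unmarked numbers to fill out the 20 balls
# drawn, out of C(80, 20) equally likely draws.
p4 = comb(5, 4) * comb(75, 16) / comb(80, 20)
print(round(p4, 5))   # prints 0.01209
```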

40 Similar computations are used to fill in the rest of the probabilities in the table below for our example of a five-spot game, on which $2 is bet.

        Hits   Payoff   Probability   Product
        -------------------------------------
          0        0      .22718        0.00
          1        0      .40569        0.00
          2        0      .27046        0.00
          3        2      .08394        0.17
          4       18      .01209        0.22
          5     1640      .00064        1.06
                        ---------------------
                         1.00000        1.44

41 The expected amount won E(W) is computed as the sum of the products shown in this table: $1.44. Thus, in our example, the expected return on a bet of $2.00 is about $1.44. It is typical of most Keno games that the gambler loses on average a little more than 25% of the amount bet. Based on the criterion of the expected percentage of the bet lost on each play, Keno is among the least favorable of the common "honest" games of pure chance legally available anywhere in the United States. Only the state lotteries, with their payoffs of approximately 50%, are worse, but with these there is the hope that the proceeds will be used more or less efficiently for some public good.
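The whole table, and E(W), can be regenerated in a few lines of Python; the payoffs are those quoted above for a $2 bet on a five-spot game:

```python
from math import comb

payoff = {0: 0, 1: 0, 2: 0, 3: 2, 4: 18, 5: 1640}  # $2 five-spot game

def p_hits(k, spots=5, drawn=20, balls=80):
    """Hypergeometric probability of exactly k hits."""
    return comb(spots, k) * comb(balls - spots, drawn - k) / comb(balls, drawn)

# Expected winnings: sum of payoff times probability over all hit counts
ew = sum(payoff[k] * p_hits(k) for k in range(6))
print(round(ew, 2))   # prints 1.44
```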

42 One can only speculate about the reasons for the popularity of a game with such miserable odds. One reason is surely the possibility (if not the probability) of winning big for only a small bet. Many players will have a vivid image in their minds of how they would feel if they "win big" and what they might do with the money. They may feel a thrill as each number is drawn and posted. Another reason might be that the game requires a minimum of knowledge or concentration---or even sobriety---to play. In fact, "Keno runners" are available throughout the casinos, even in restaurants and cocktail lounges, to submit marked Keno tickets for play, and the results of each game appear on ubiquitous screens. A third reason might be that it is possible to imagine that one's failure on the game just finished was, nevertheless, "nearly" a success. "If only I had picked 15 instead of 25" (15 instead of 14, 15 instead of 16, 15 instead of 5, Aunt Sue's birthday which is the 15th instead of Uncle Dan's birthday which is the 27th, etc.). There are so many ways to imagine that one "almost" won that the probability of "almost" winning (if quantifiable at all) may be quite large indeed.

43 The program in Part 4 gives students the opportunity to try their luck at a simulated game of Keno. They begin with a "grant" of $20 in play money and can continue playing until it is lost. Then it is easy enough to start the program again with a fresh $20 stake. The illusion that it makes a difference what numbers are chosen is maintained to an extent by the ability to make changes until the ticket is marked just the way the player desires. However, unlike in the casino atmosphere, the probabilities of winning and the expected loss are clearly posted for each game.

44 From the title page the user has the option to start play at once or to read an introduction in which the game is explained and an indication (much less thorough than provided above in this paper) is given as to how probabilities are computed. Since several pages of explanation are provided in the introduction, there is no F1-page to provide technical details.

45 Since the look and feel of this program depends on the display of color text and since different computers with monochrome displays treat text so diversely, I could see no feasible way to write a program that would use a monochrome display with appealing and predictable results. Also the game seems to lose something if the speed is changed much either way. For these reasons there is no F2-page offering monochrome or speed options. For similar reasons, no Minitab analogue is offered.

46 Some preliminary words of caution are necessary. Somehow computers and recreational games seem to go together in the minds of students at all levels. Before introducing ANY demonstration into a statistics or probability lab, it is important to make sure that there is something tangible to be gained and that adequate thought is given to preventing confusion or abuse. In the case of a game, however, an extra degree of foresight may be necessary.

47 Before using this game in class it is important to understand exactly what you expect to accomplish by its presentation and to make sure students understand what this is. Written instructions on how to proceed and a requirement for a written report on what was accomplished are especially important here. It is also wise to make Keno the last item presented at a particular lab session so that there is a natural ending point to playing it.

48 One approach for introducing the Keno program into an elementary or intermediate probability course is to explain in lecture how Keno is played and to present the payoff table for some specific instance, such as the one given above for a $2 bet on a five-spot game. (Any relevant payoff table can be obtained by running the program.) Ask if anyone has ever played the game in a casino. Ask students if it looks like a game they think they could win. Ask how they would judge whether the game is "fair" and what they mean by fair.

49 If the idea of using expectation as a criterion for fairness does not emerge, propose a simpler lottery in which 1000 tickets have been sold and there is only one prize---$1500 for the single winner. Does $1.50 emerge as the "fair" price for a ticket in such a lottery?
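The arithmetic behind this simpler lottery takes only a line or two; here is a Python sketch (Python is a choice of this illustration only, not the language of the original programs):

```python
# One prize of $1500 among 1000 tickets sold: the lottery is "fair" when
# the ticket price equals the expected payoff per ticket.
tickets, prize = 1000, 1500
expected_payoff = prize / tickets    # each ticket wins with probability 1/1000
print(expected_payoff)               # 1.5 -- so $1.50 is the "fair" price
```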

50 Depending on how the discussion goes and on the interests of the instructor, there might be some room at this point for a brief and intuitive mention of the idea commonly called "utility" by game theorists. (A presentation of the formal ideas would not be difficult, but would take the discussion away from its main purpose.) Here are some points that might be profitably mentioned: Some people may consider $1.50 to be too small to be of any value, except for the momentary entertainment value it might have. After all, people put such sums of money into video games with NO chance of a monetary return. On the other hand, $1500 might seem to be a large enough sum to buy them some truly valued item. For such a person $1500 might be worth more than a thousand times $1.50---formally, a "non-linear utility function." If so, $1.50 might seem like a bargain price for the lottery ticket. What about such a person who gets carried away and participates in 1000 lotteries within some short span of time---the beginnings of gambling fever?

51 Next, whether or not the hypergeometric distribution is covered in the text for the course, one might show from combinatorial principles how to compute several of the probabilities that go with the payoff table. The additional probabilities can be supplied without computation and left as an exercise. Then the expected winnings can be computed, and the profitability of the game for the casino can be discussed.
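For instructors who prefer to verify the payoff table by machine rather than by hand, the hypergeometric computation just described can be sketched in Python. The payoffs used here ($2 for three hits, $18 for four, $1640 for five on a $2 five-spot ticket) are those appearing elsewhere in this article; any other payoff table could be substituted.

```python
from math import comb

def hits_pmf(marked=5, drawn=20, total=80):
    """Hypergeometric distribution of the number of hits when `marked`
    spots are chosen and the house draws `drawn` of `total` numbers."""
    return {k: comb(marked, k) * comb(total - marked, drawn - k) / comb(total, drawn)
            for k in range(marked + 1)}

pmf = hits_pmf()
payoff = {3: 2, 4: 18, 5: 1640}   # $2 five-spot payoffs, as in this article
expected_win = sum(payoff[k] * pmf[k] for k in payoff)
print(round(pmf[1], 5))           # about .40569: one hit is the most likely outcome
print(round(expected_win, 2))     # about 1.44, versus the $2 price of the ticket
```

The expected winnings of about $1.44 on a $2 bet give the expected loss of $0.56 per game used in the exercises below.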

52 It might also be worth showing an overhead projection slide of a Keno ticket, using the layout on the computer screen as a model. Show a game with five spots marked. Then show an overlay with 20 balls selected that produce only three hits but lots of "near" hits in the immediate vicinity of spots marked. Would students feel that they had "almost" won $18---or maybe even $1640---and be encouraged to try again?

53 Finally, one might discuss the public perception of how often gamblers win compared with the known, computable probabilities of winning. How likely is someone who does win big at Keno to let everyone know? How likely is someone who gambles all weekend and loses heavily to advertise his or her lack of success? Certainly, within the confines of the casino, games with no winners are quietly ignored and the next game is started at once. A game with a big winner is the subject of hoopla for hours if not days or weeks in every medium of exposure available to the casino.

54 (It might also be added here that hoopla governs the public perception of events other than gambling. The news media cover the most interesting events, and do not emphasize more common but uninteresting occurrences. What are the real chances of being devastated by an earthquake in California? Is it more dangerous to fly from Chicago to Atlanta than it is to drive? Statistical methods based on random sampling are necessary in the pursuit of truth precisely because our intuitive data gathering about the relative frequencies of even simple events is so easily biased.)

55 Type PROBDEMO at the DOS prompt and then press ENTER at the "copyright" page to get to the "Main Menu," where you should select Item 4. Begin by reading the introduction. Then select a $2 bet on a five-spot game. Before you play the game notice the number of hits that is most likely. In the long run, over 40% of five-spot games will result in one hit (which yields no payoff).

56 Continue by playing several more five-spot games with $2 bets. Keep track of the number of hits you get on each game. (Also keep track of the number of games with no payoff in which you feel you "almost won" either an $18 or a $1640-payoff.) Compare notes with other students nearby. Does the 40% figure seem reasonable? (According to your definition of "almost winning," what percentage of the games fall into that category?)
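Readers without the program at hand can approximate this experiment with a short simulation. A Python sketch (the marked numbers below are arbitrary; any five do equally well, or badly):

```python
import random

def five_spot_hits(marked, rng):
    """One simulated Keno game: draw 20 of the numbers 1-80, count hits."""
    drawn = rng.sample(range(1, 81), 20)
    return len(set(marked) & set(drawn))

rng = random.Random(1995)          # fixed seed so the run is repeatable
marked = [5, 14, 15, 16, 25]
games = 10_000
one_hit_games = sum(five_spot_hits(marked, rng) == 1 for _ in range(games))
print(one_hit_games / games)       # close to the theoretical .40569
```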

57 Pedagogical Note: The Keno game works well at a variety of academic levels. Letters in parentheses indicate the level of difficulty: E for elementary, M for intermediate, and A for advanced.

58 Questions 1-4 refer to $2 five-spot games.

1. (E) If you play one game, what are the chances of losing your $2? (Answer: .22718 + .40569 + .27046 = .90333.)

2. (E) If you started with $20,000 and played 10,000 games, about how much money would you expect to have left? (Answer: $14,400.)

3. (M) In Problem 2, what is the probability that you will lose all of your money? (Answer: (.90333)^10,000 or very nearly zero.)

4. (M) Find the variance of the amount won in playing one game. (Answer: Using rounded numbers from the table in the program, we have E(W^2) = 4(.08394) + 324(.01209) + 2,689,600(.00064) = .33576 + 3.91716 + 1721.344 = 1725.59692, V(W) = 1725.60 - 1.44^2 = 1723.)

5. (M) (Continuation of 2 and 4) Find the standard deviation of the amount won in playing 10,000 independent games. Suppose that the Empirical Rule holds so that there is about a 95% chance that you will wind up with an amount of money that is within two standard deviations of the expected winnings. What interval of likely winnings does this give? (Answer: For the 10,000 games the variance is 10,000 times the answer to Question 4. The square root of this is $4151. The interval is $14,400 plus or minus twice $4151 or about $6100 to $22,700. This answer cannot be exact because we are using some rounded numbers from the computer screen for input.)

6. (E) Suppose you play a two-spot game and mark the spots 13 and 66. What is the probability that you will get exactly one hit? (Answer: From the program: .37975. Computed: C(2,1) C(78,19) / C(80,20).)

7. (M) If you bet $2 on the game in Question 6, what payoff for two hits would make this a fair game? (Answer: The probability of two hits is .06013. Since fewer hits will not pay, we need the product of this probability and the payoff to be equal to the $2 bet; $33.26 comes very close.)

8. (A) (Continuation of 6) Further suppose in the two-spot game in Question 6 that you consider the numbers 3, 12, 14, and 23 to be "near" to 13, and the numbers 56, 65, 67, and 76 to be "near" to 66. What is the probability that you will get exactly one hit, but feel that one or more other numbers drawn "nearly" gave you a second hit? (Answer: First, we find the probability of exactly one hit and EXACTLY ONE near hit. In the numerator select which of the two marked numbers is the hit, select one of four allowable numbers for the near hit, and then select the 18 other numbers: thus, the probability of exactly one hit and exactly one near hit is:

C(2,1) C(4,1) C(74,18) / C(80,20) = .1644.

Probabilities of one hit along with exactly two, three, and four near hits are computed similarly. The total probability of one hit and one or more near hits is .2585. Over a quarter of the time you will think you "nearly" won.)
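Several of the numerical answers above can be checked with a few lines of Python (an illustrative sketch, assuming the $2, $18, and $1640 payoffs for three, four, and five hits as in Question 4):

```python
from math import comb

C = comb
denom = C(80, 20)

# Five-spot hit distribution (cf. Question 1)
p = [C(5, k) * C(75, 20 - k) / denom for k in range(6)]
p_lose = p[0] + p[1] + p[2]
print(round(p_lose, 5))            # .90333

# Question 4: variance of the winnings W on a $2 five-spot game
payoff = {3: 2, 4: 18, 5: 1640}
EW = sum(w * p[k] for k, w in payoff.items())
EW2 = sum(w * w * p[k] for k, w in payoff.items())
var_W = EW2 - EW ** 2
print(round(var_W))                # about 1737 at full precision; the answer
                                   # 1723 above uses probabilities rounded on screen

# Question 8: one hit plus j of the 4 numbers "near" the missed spot
p_near = sum(C(2, 1) * C(4, j) * C(74, 19 - j) / denom for j in range(1, 5))
print(round(p_near, 4))            # about .2586 -- just over a quarter
```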

59 I have used this program most often in elementary probability courses as soon as the concept of expectation for discrete distributions has been introduced. I have used it successfully even when the text does not include an explanation of the hypergeometric distribution. I either offer my own brief treatment keyed specifically to Keno, or just say that it is possible to compute the probabilities, provide them without proof, and then focus on how they are used to find the expected winnings. (In the latter case, several students usually insist on private outside-of-class explanations of how to find the probabilities.)

60 Keno can be used at a much lower academic level than the other programs in this package. At California State University at Hayward we have a variety of periodic events in which high school students and junior high school students come to campus (usually on a Saturday) for a day of classes, demonstrations, and exhibits. Some of these events are targeted at disadvantaged students or gifted students, and sometimes any students in the area and their parents are invited. The Keno program has become a standard attraction for these events. I admit that attempts to accompany it with some appreciation of the probability principles involved meet with varying degrees of success depending on the audience. The sessions always have waiting lines, even with 20 or more available computer stations, and they are never boring. I see no harm in this sort of carnival atmosphere if one goes into it with eyes open and worthy ulterior motives in mind.

61 The programs presented here can be used in an introductory probability course to build student interest and understanding of the concepts of randomness, expected value, and correlation. The programs' most effective use is in an interactive laboratory setting, with (1) adequate introduction and guidance to focus attention on the concepts to be learned, (2) an opportunity for students to compare and discuss results, and (3) a structure for students to provide written answers to relevant questions and to summarize what they have learned.

Preparation of this article and development of the programs provided with it were partially supported by NSF Grant USE 91-50433 and by California State University, Hayward. Newton Wai, a statistics graduate student at Cal State Hayward, has read drafts of this paper and made helpful suggestions. The author also wishes to thank the editor and the referees for their careful and helpful comments.

Computer programs were compiled using Microsoft QuickBasic (Version 4.0). The utility BRUN40.EXE, which must be present to run the programs, is property of Microsoft Corporation and is used with permission.

MINITAB Alternative to Part 3

The attached programs should run on any IBM-PC compatible machine with a display that is EGA or better. In a Windows environment it is usually possible to run DOS programs; many recent Macintosh machines are also able to emulate or run DOS. However, for those users who need or prefer to use Minitab, macros are included that capture most of the spirit of Part 3. (Values of r based on only part of a simulation, such as those used in Question 3 of Part 3, are not available in the Minitab macros.) The stored programs BIVCOR.MTB, BIVCOR1.MTB, and BIVCOR2.MTB are intended to be used together and should work in most pre-Windows releases of Minitab. The stored programs WBIVCOR.MTB, WBIVCOR1.MTB, and WBIVCOR2.MTB will run on Minitab for Windows.

(1) The three non-Windows Minitab stored programs were written using Release 7 for DOS. BIVCOR2.MTB uses the GPLOT command with a LINE subcommand, neither of which is supported by Windows versions. WBIVCOR2.MTB uses the new Windows PLOT * command and the new LINE subcommand that goes with it. All of the Minitab stored programs are annotated with comments (following #-symbols) that explain the key steps.

(2) To operate these programs make sure Minitab addresses the path that contains them (use the CD command, "change directory," if necessary), and at the MTB > prompt type the appropriate one of the following commands:

MTB > exec 'bivcor' or MTB > exec 'wbivcor'

The appropriate command must be repeated for each run.

Documentation of Data Sets in Part 3

Three datasets are presented; each is provided in three formats. Files with the extension .MTW are worksheets in the format produced by Minitab Release 7 for DOS and readable by many other DOS and Windows versions of Minitab. The extension .MTP indicates a Minitab worksheet saved using the PORTABLE subcommand and retrievable by all versions of Minitab. The extension .dat.txt indicates DOS ASCII format with unlabeled variables, which appear in the columns specified below for each dataset. These files can be used with almost any statistical computer software.

Data Set 1. Data are taken from Herzog and Felton (1994) based on blood samples from 43 newborn babies at a Northern California community hospital. Variables are labeled HCrit (hematocrit in percent) and Hgb (hemoglobin in grams per deciliter). Data files are named REDCELL.MTW, etc. In the ASCII file, hematocrit appears in columns 1-4, hemoglobin in columns 8-11. Both variables carry one digit after the decimal point.

A printout of a Minitab session in response to Question 4 of Part 3 is shown in Appendix C.

Data Set 2. Data are taken from Mitchell (1975). Variables are Year (1870-1969), Manchester (Manchstr), Paris, Madrid. The columns for cities give annual precipitation in millimeters. Data files are named EUROPREC.MTW, etc. In the ASCII file, Year is in columns 1-4, Manchester in columns 6-9, Paris in columns 11-15, and Madrid in columns 17-20. All variables are integer-valued. I thank Jason Stover, a graduate student in statistics at California State University, Hayward, for calling these data to my attention.

Data Set 3. Data are taken from The World Almanac and Book of Facts (1994). Variables are State (two-letter abbreviation), typical annual rainfall (data in inches, credited to National Climatic Data Center, U.S. Department of Commerce), and percentage graduation rates from public high schools (credited to National Center for Educational Statistics, U.S. Department of Education). Files are named RAINGRAD.MTW, etc. In the ASCII file, State appears in columns 1-2, Rainfall in columns 8-9, Graduation Rate in columns 16-17. State is an alphanumeric variable and the other two are integer variables.

All three of these datasets are taken from Trumbo (forthcoming), a compendium of classroom materials based on the exploration of real datasets.

Sample MINITAB Run for a Data Set Used in Part 3

This Appendix presents a printout for Question 4 of Part 3 made using Minitab Release 7 for DOS. A character graphics scatterplot was used here so that results could be printed using ASCII characters. A higher quality plot could be obtained using the Minitab command GPLOT (or the new PLOT command or menu selection in Minitab for Windows).

**Technical notes:** (1) To be readable, this output should be printed
in a format with line widths of at least 76 characters and in a
mono-spaced font (such as Courier). (2) It is assumed that the Minitab
worksheet is located in the path C:\JSE.

MTB > retr 'c:\jse\redcell'
[output confirming successful retrieval suppressed]
MTB > plot c1 c2

[character-graphics scatterplot of HCrit (vertical axis, roughly 40 to above 60)
against Hgb (horizontal axis, 14.0 to 22.0), showing a tight linear trend]

MTB > desc c1 c2

             N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
HCrit       43    52.43    52.10    52.44     6.62     1.01
Hgb         43   17.826   17.500   17.810    2.292    0.349

           MIN      MAX       Q1       Q3
HCrit    39.40    64.70    48.80    57.00
Hgb     13.400   22.700   16.600   19.800

MTB > corr c1 c2

Correlation of HCrit and Hgb = 0.990

Analogous Minitab sessions should be performed in response to Questions 5 and 6 in Part 3.
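For readers without Minitab, the printout above can be roughly reproduced in Python. This is an illustrative sketch only: the fixed-width column positions follow the ASCII layout documented above for Data Set 1, and the local filename in the commented-out usage is an assumption.

```python
import statistics

def read_redcell(path):
    """Parse the REDCELL ASCII file: hematocrit in columns 1-4 and
    hemoglobin in columns 8-11 (1-based), one decimal place each."""
    hcrit, hgb = [], []
    with open(path) as f:
        for line in f:
            hcrit.append(float(line[0:4]))
            hgb.append(float(line[7:11]))
    return hcrit, hgb

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical local filename; the Minitab session found r = 0.990:
# hcrit, hgb = read_redcell("redcell.dat.txt")
# print(round(pearson_r(hcrit, hgb), 3))
```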

Cleveland, W. S., Diaconis, P., and McGill, R. (1982), "Variables on Scatterplots Look More Highly Correlated When the Scales Are Increased," Science, 216, 1138-1141.

Cleveland, W. S., and McGill, R. (1984), "The Many Faces of a Scatterplot," Journal of the American Statistical Association, 79, 807-822.

Herzog, B., and Felton, B. (1994), "Hemoglobin Screening for Normal Newborns," Journal of Perinatology, 14(4), 285-289.

Huber, P. J. (1987), "Experiences With Three-Dimensional Scatterplots," Journal of the American Statistical Association, 82, 448-453.

Keller, G. (1994), Statistics Laboratory Manual: Experiments Using Minitab, Belmont, CA: Duxbury Press.

Kennedy, W. J., Jr., and Gentle, J. E. (1980), Statistical Computing, New York and Basel: Marcel Dekker, Inc.

Lewandowsky, S., and Spence, I. (1989), "Discriminating Strata in Scatterplots," Journal of the American Statistical Association, 84, 682-688.

Meyer, H., and Shinar, D. (1992), "Estimating Correlations From Scatterplots," Human Factors, 34, 335-349.

Mitchell, B. R. (1975), European Historical Statistics, New York: Columbia University Press.

Moore, D. S. (1995), The Basic Practice of Statistics, New York: W. H. Freeman and Company.

Raveh, A. (1985), "On Quick Estimates of Pearson's r From Scatter Diagrams [Letter]," The American Statistician, 39, 239-240.

Shipp, C. E., and Margolin, C. G. (1982), "Graphical Display of Scatter Data Using the Standard Deviation Ellipse," in Proceedings of SAS Users Group International Conference, 7, pp. 171-175.

Spence, I., and Garrison, R. F. (1993), "A Remarkable Scatterplot," The American Statistician, 47, 12-19.

Spurrier, J. D., Edwards, D., and Thombs, L. A. (1995), Elementary Statistics Laboratory Manual, Belmont, CA: Duxbury Press.

Strahan, R. F., and Hansen, C. J. (1978), "Underestimating Correlation From Scatterplots," Applied Psychological Measurement, 2, 543-594.

Trumbo, B. E. (1994), "Some Demonstration Programs for Use in Teaching Elementary Probability: Parts 1 and 2," Journal of Statistics Education, v.2, n.2.

Trumbo, B. E. (forthcoming), Exploring Real Data With Minitab (provisional title), Belmont, CA: Duxbury Press.

The World Almanac and Book of Facts (1994), Mahwah, NJ: World Almanac Books.

Zelen, M., and Severo, N. C. (1964), "Probability Functions," in Handbook of Mathematical Functions (1974 ed.), eds. M. Abramowitz and I. A. Stegun, U.S. Department of Commerce, National Bureau of Standards, Applied Mathematics Series #55 (Tenth Printing with corrections), Washington: U.S. Government Printing Office, pp. 927-995.

Bruce E. Trumbo

Department of Statistics

California State University, Hayward

Hayward, CA 94542

**Download Programs to a Local File**

To unpack the files BRUN40.EXE, PROBDEMO.EXE, PDMEN.EXE, PDLLN.EXE, PDPPR.EXE, PDBNS.EXE, and PDKEN.EXE, type prob02 at the DOS prompt.

**Download Minitab Programs and Datasets to a Local File**
