B. Examples of Assessment Items
Assessment items to avoid using on tests: True/False, pure computation without a context
or interpretation, items with too much data to enter and compute or analyze,
items that only test memorization of definitions or formulas.
We first give some examples of assessment items with problems, along with commentary about the nature of the difficulty.
(1) A teacher taught two sections of elementary statistics last semester, each with 25 students, one at
Critique: The teacher has all of the population data so there is no need to do
statistical inference.
(2) An economist wants to compare the mean salaries for male and female CEOs. He gets a random sample of 10 of each and does a t-test. The resulting p-value is .045.
(a) State the null and alternative hypotheses.
(b) Make a statistical conclusion.
(c) State your conclusion in words that would be understood by someone with no training in statistics.
Critique: The question doesn’t address the conditions necessary for a t-test,
and with the small sample sizes they are almost surely violated here. Salaries
are almost surely skewed.
(3) Which of the following gives the definition of a p-value?
(a) It’s the probability of rejecting the null hypothesis when the null hypothesis is true.
(b) It’s the probability of not rejecting the null hypothesis when the null hypothesis is true.
(c) It’s the probability of observing data as extreme as that observed.
(d) It’s the probability that the null hypothesis is true.
Critique: None of these answers is quite correct. Answers (b) and (d) are
clearly wrong; answer (a) is the level of significance and answer (c) would be
correct if it continued “... or more extreme, given that the null hypothesis is
true.”
Examples showing ways to improve some assessment items:
True/false items, even when well written, do not provide much information on student knowledge because there is always a 50% chance of getting the item right without any knowledge of the topic. One current approach is to change the items into forced-choice questions with three or more options. For example,
(4) The size of the standard deviation of a data set depends on where the center is. True or False
Changed to:
(4) Does the size of the standard deviation of a data set depend on where the center is located?
(a) Yes, the higher the mean, the higher the standard deviation.
(b) Yes, because you have to know the mean to calculate the standard deviation.
(c) No, the size of the standard deviation is not affected by the location of the distribution.
(d) No, because the standard deviation only measures how the values differ from each other, not how they differ from the mean.
(5) A correlation of +1 is stronger than a correlation of -1. True or False
Rewritten as:
(5) A recent article in an educational research journal reports a correlation of +.8 between math achievement and overall math aptitude. It also reports a correlation of -.8 between math achievement and a math anxiety test. Which of the following interpretations is the most correct?
(a) The correlation of +.8 indicates a stronger relationship than the correlation of -.8
(b) The correlation of +.8 is just as strong as the correlation of -.8
(c) It is impossible to tell which correlation is stronger
Context is important for helping students see and deal with statistical ideas in real-world situations.
(6) Once it is established that X and Y are highly correlated, what type of study needs to be done in order to establish that a change in X causes a change in Y?
A context is added:
(6) A researcher is studying the relationship between an experimental medicine and T4 lymphocyte cell levels in HIV/AIDS patients. The T4 lymphocytes, a part of the immune system, are found at reduced levels in patients with the HIV infection. Once it is established that the two variables, dosage of medicine and T4 cell levels, are highly correlated, what type of study needs to be done in order to establish that a change in dosage causes a change in T4 cell levels?
(a) correlational study
(b) controlled experiment
(c) prediction study
(d) survey
Try to avoid repetitious/tedious calculations on exams that may become the focus of the problem for the students at the expense of concepts and interpretations.
(7) A First Year Program course used a final exam that contained a 20 point essay question that asked students to apply Darwinian principles to analyze the process of expansion in major league sports franchises. To check for consistency in grading among the four professors in the course, a random sample of six graded essays was selected from each instructor. The scores are summarized in the table below. Construct an ANOVA table to test for a difference in means among the four instructors.
Instructor  Scores
----------  -----------------
Affinger    18 11 10 12 15 12
Beaulieu    14 14 11 14 11 14
Cleary      19 20 15 19 19 16
Dean        17 14 17 15 18 15
Critique: The version of the question above requires a fair amount of pounding on the calculator to get the results and never even asks for an interpretation. The revision below still requires some calculation (which can be adjusted depending on the amount of computer output provided) but the calculations can be done relatively efficiently, especially by students who have a good sense of what the computer output is providing.
(7) A First Year Program course ... (same intro as above) ... The scores are summarized in the table below, along with some Descriptive Statistics for the entire sample and a portion of the Oneway ANOVA output.
Instructor  Scores
----------  -----------------
Affinger    18 11 10 12 15 12
Beaulieu    14 14 11 14 11 14
Cleary      19 20 15 19 19 16
Dean        17 14 17 15 18 15

Descriptive Statistics
Variable   N    Mean    Median  TrMean  StDev   SEMean
Score      24   15.000  15.000  15.000  2.919   0.596

One-way Analysis of Variance
*** ANOVA TABLE OMITTED ***
                                Individual 95% CIs For Mean
                                Based on Pooled StDev
Level      N    Mean    StDev   ------+---------+---------+---------+
Affinger   6    13.000  2.966    (------*------)
Beaulieu   6    13.000  1.549    (------*------)
Cleary     6    18.000  2.000                        (------*------)
Dean       6    16.000  1.549                (------*------)
                                ------+---------+---------+---------+
Pooled StDev = 2.098                12.5      15.0      17.5      20.0
(a) Unfortunately, we are missing the ANOVA table from the Minitab output. Use the information given above to construct the ANOVA table and conduct a test (5% level) for any significant differences among the average scores assigned by the four instructors. Be sure to include hypotheses and a conclusion. If you have trouble getting one part of the table that you need to complete the rest (or the next question), make a reasonable guess or ask for assistance (for a small point fee).
(b) After completing the ANOVA table, construct a 95% confidence interval for the average score given by Dr. Affinger. Note: Your answer should be consistent with the graphical display given by Minitab.
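The ANOVA table requested in part (a) can be rebuilt from the raw scores alone. The following is a minimal sketch in plain Python for checking the arithmetic; the variable names are illustrative and are not part of the original item. No p-value is computed here, so the F statistic would still be compared with a table value for 3 and 20 degrees of freedom (about 3.10 at the 5% level).

```python
from statistics import mean, variance

# Essay scores from the item, six per instructor.
scores = {
    "Affinger": [18, 11, 10, 12, 15, 12],
    "Beaulieu": [14, 14, 11, 14, 11, 14],
    "Cleary":   [19, 20, 15, 19, 19, 16],
    "Dean":     [17, 14, 17, 15, 18, 15],
}

k = len(scores)                                   # number of groups
n_total = sum(len(v) for v in scores.values())    # total sample size
grand_mean = mean(x for v in scores.values() for x in v)

# Between-group and within-group sums of squares.
ss_groups = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in scores.values())
ss_error = sum((len(v) - 1) * variance(v) for v in scores.values())
ss_total = ss_groups + ss_error

df_groups, df_error = k - 1, n_total - k
ms_groups = ss_groups / df_groups
ms_error = ss_error / df_error
f_stat = ms_groups / ms_error

print(f"{'Source':<10}{'DF':>4}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'Instructor':<10}{df_groups:>4}{ss_groups:>10.2f}{ms_groups:>10.2f}{f_stat:>8.2f}")
print(f"{'Error':<10}{df_error:>4}{ss_error:>10.2f}{ms_error:>10.2f}")
print(f"{'Total':<10}{n_total - 1:>4}{ss_total:>10.2f}")
```

With these scores the sketch gives a sum of squares of 108 for instructors and 88 for error, and an F statistic of about 8.18. The error mean square of 4.4 agrees with the pooled StDev of 2.098 shown in the output.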
Some additional examples of good assessment items
(8) Let Y denote the amount a student spends on textbooks for one semester. Suppose Nancy, who is statistically savvy, wants to know how fall, semester 1, and spring, semester 2, compare. In particular, suppose she is interested in the averages µ1 and µ2.
You may assume that
Rudd says “Conduct an α=.05 test of H0: µ1=µ2 vs. HA: µ1≠µ2 and tell
Linda says “Report a 95% confidence interval for µ1 - µ2.”
Steve says “Conduct a test of H0: µ1=µ2 vs. HA: µ1≠µ2 and report to
Gloria says “Compare to . If > then test H0: µ1=µ2 vs. HA: µ1>µ2 using α=.05 and tell
Rank the 4 pieces of advice from worst to best and explain why you rank them as you do. That is, explain what makes one better than another.
(9) Researchers took random samples of subjects from two populations and applied a Wilcoxon-Mann-Whitney test to the data; the P-value for the test, using a non-directional alternative, was .06. For each of the following, say whether the statement is True or False and say why.
(a) There is a 6% chance that the two population distributions really are the same.
(b) If the two population distributions really are the same, then a difference between the two samples as extreme as the difference that these researchers observed would only happen 6% of the time.
(c) If a new study were done that compared the two populations, there is a 6% probability that H0 would be rejected again.
(d) If α = .05 and a directional alternative were used, and the data departed from H0 in the direction specified by the alternative hypothesis, then H0 would be rejected.
(10) An article on the CNN web page on Monday (http://www.cnn.com/HEALTH/9612/16/faith.healing/index.html) begins with the sentence "Family doctors overwhelmingly believe that religious faith can help patients heal, according to a survey released Monday." Later the article states "Medical researchers say the benefits of religion may be as simple as helping the immune system by reducing stress" and Dr. Harold Koenig is reported to say that "people who regularly attend church have half the rate of depression of infrequent churchgoers."
Use the language of statistics to critique the statement by Dr. Koenig and the claim, suggested by the article, that religious faith and practice help people fight depression. You will want to select some of the following words in your critique: observational study, experiment, blind, double-blind, precision, bias, sample, spurious, confounding, causation, association, random, valid, reliable.
(11) Francisco Franco (Class of '98) weighed 100 Hershey's Kisses (with almonds). He found that the sample average was 4.80 grams and the SD was .28 grams. In the context of this setting, explain what is meant by the sampling distribution of an average.
(12) A gardener wishes to compare the yields of three types of pea seeds - type A, type B, and type C. She randomly divides the type A seeds into three groups and plants some in the east part of her garden, some in the central part of the garden, and some in the west part of the garden. Then she does the same with the type B seeds and with the type C seeds.
(a) What kind of experimental design is the gardener using?
(b) Why is this kind of design used in this situation? (Explain in the context of the situation.)
(13) The following scatterplot shows how divorce rate, y, and marriage rate, x, are related for a collection of 10 countries. The regression line has been added to the plot.

(a) The
(b) Think about the scatterplot and regression line after the
(14) Researchers wanted to compare two drugs, formoterol and salbutamol, in aerosol solution, to a placebo for the treatment of patients who suffer from exercise-induced asthma. Patients were to take a drug or the placebo, do some exercise, and then have their "forced expiratory volume" measured. There were 30 subjects available. (Based on A.N. Tsoy, et al., European Respiratory Journal 3 (1990): 235; via Berry, Statistics: A Bayesian Perspective.)
(a) Should this be an experiment or an observational study? Why?
(b) Within the context of this setting, what is the placebo effect?
(c) Briefly explain how to set up a randomized blocks design (RBD) here.
(d) How would an RBD be helpful? That is, what is the main advantage of using an RBD in a setting like this?
(15) I noticed that 8 students from the 114 class attended the review session prior to the second exam (in April). The average score among those 8 students was lower than the average for the 21 students who did not attend the review session. Suppose I want to use this information in a study of the effectiveness of review sessions.
(a) What kind of study is this: observational or experimental? Why?
(b) What kind(s) of sampling error(s) or bias(es) might be of concern here?
(c) (Hypothetical) I gave the data for the 8 who attended and for the 21 who did not attend to my friend George. George used the data to conduct a hypothesis test. Does a hypothesis test make sense? If so, what is H0? If not, why not?
(16) For each of the following three settings, state the type of analysis you would conduct (e.g., one-sample t-test, regression, Chi-square test of independence, Chi-square goodness-of-fit test, etc.) if you had all of the raw data and specify the roles of the variable(s) on which you would perform the analysis, but do not actually carry out the analysis.
(a) Elizabeth Larntz (Class of ‘02) measured the effect of exercise on pulse for each of 13 students. She measured pulse before and after exercise (doing 30 jumping jacks) and found that the average change was 55.1 and the SD of the changes was 18.4. How would you analyze the data?
(b) Three HIV treatments were tested for their effectiveness in preventing progression of HIV in children. Of 276 children given drug A, 259 lived and 17 died. Of 281 children given drug B, 274 lived and 7 died. Of 274 children given drug C, 264 lived and 10 died. How would you analyze the data?
(c) A researcher was interested in the relationship between forearm length and height. He measured the forearm lengths and heights of each of 16 women. How would you analyze these data?
(17) I had Data Desk construct parallel dotplots of the data from four samples. I then conducted a test of H0:µ1=µ2=µ3=µ4 and rejected H0 at the α=.05 level. I also tested H0:µ1=µ2=µ3 and rejected H0 at the α=.05 level. However, when I tested H0:µ2=µ3 using α=.05 I did not reject H0. Likewise, when I tested H0:µ1=µ4 using α=.05 I did not reject H0.
(a) Your job is to sketch a graph of the parallel dotplots of the data. That is, based on what I told you about the tests you should have an idea of how the data look. Use that idea to draw a graph. Indicate the sample means with triangles that you add to the dotplots.
(b) It is possible to get data with the same sample means that you graphed in part (a), but for which the hypothesis H0:µ1=µ2=µ3=µ4 is not rejected at the α=.05 level. Provide a graph of this situation. That is, keep the same sample means (triangles) you had from part (a), but show how the data would have been different if H0 were not to be rejected.
(18) Atley Chock (Class of '02) collected data on a random sample of 12 breakfast cereals. He recorded x = fiber (in grams/ounce) and y = price (in cents/ounce). A scatterplot of the data shows a linear relationship. The fitted regression model is
![]()
![]()
The sample correlation coefficient, r, is .23. The SE of b1 is .81. Also, sy|x = 3.1.
(a) Find r² and interpret r² in the context of this problem.
(b) Suppose that a cereal has 2.63 grams of fiber/ounce and costs 17.3 cents/ounce. What is the residual for this cereal?
(c) Interpret the value of sy|x in the context of this problem. That is, what does it mean to say that sy|x = 3.1?
(d) In the context of this problem explain what is meant by "the regression effect."
(19) Give a rough estimate of the sample correlation for the data in each of the scatterplots below.
[Three scatterplots, each followed by a blank to fill in: r = _____]
(20) A matched pairs experiment compares the taste of a regular cheese pizza from Pizza Joe’s with one from Domino’s. Each subject tastes two unmarked pieces of pizza, one of each type, in random order and states which he or she prefers. Of the 50 subjects who participate in the study, 21 prefer Pizza Joe’s.
(a) Find a 96% confidence interval for the proportion of the population who prefer Pizza Joe’s to Domino’s.
(b) How large a sample is required if the desired margin of error of the confidence interval is 4%?
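The calculations behind item (20) can be sketched as follows, assuming the usual large-sample (normal approximation) interval for a proportion. The variable names are illustrative, and the choice between planning with the observed proportion and planning with the conservative value 0.5 is an assumption that the item itself leaves open.

```python
from math import sqrt, ceil
from statistics import NormalDist

n, successes = 50, 21
conf_level = 0.96

p_hat = successes / n
# Critical value for a two-sided 96% interval (upper 2% point of the normal).
z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

margin = z * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

# Sample size needed for a 4-percentage-point margin of error.
target = 0.04
n_using_p_hat = ceil(z**2 * p_hat * (1 - p_hat) / target**2)
n_conservative = ceil(z**2 * 0.25 / target**2)  # assumes p = 0.5

print(f"z = {z:.4f}, interval = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"n (using p_hat) = {n_using_p_hat}, n (conservative) = {n_conservative}")
```

Under these assumptions the interval runs from roughly 0.28 to 0.56, and the required sample size is 643 when planning with the observed proportion or 660 with the conservative value.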
(21) It was claimed that 1 out of 5 cardiologists takes an aspirin a day to prevent hardening of the arteries. Suppose that the claim is true. If 1500 cardiologists are selected at random, what is the probability that at least 275 of the 1500 take an aspirin a day?
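Item (21) is typically answered with a normal approximation to the binomial. The sketch below compares that approximation (with a continuity correction, which is an assumption about the intended method) against the exact binomial sum. The helper `log_binom_pmf` is a hypothetical name introduced only for this illustration; it works on the log scale to avoid overflow for n = 1500.

```python
from math import sqrt, lgamma, log, exp
from statistics import NormalDist

n, p, threshold = 1500, 0.2, 275

# Normal approximation with continuity correction: P(X >= 275) ~ P(Y >= 274.5).
mu = n * p
sigma = sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mu, sigma).cdf(threshold - 0.5)

def log_binom_pmf(k, n, p):
    """Log of the binomial probability P(X = k), computed with log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

# Exact tail probability, summed term by term.
exact = sum(exp(log_binom_pmf(k, n, p)) for k in range(threshold, n + 1))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

Both values come out at about 0.95, which suggests the approximation is adequate for a problem of this size.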
(22) Identify whether a scatterplot would or would not be an appropriate visual summary of the relationship between the variables. In each case, explain your reasoning.
(a) Blood pressure and age
(b) Region of country and opinion about stronger gun control laws
(c) Verbal SAT and math SAT score
(d) Handspan and gender (male or female)
(23) The paragraphs that follow each describe a situation which calls for some type of statistical analysis. For each you should:
(i) Give the name of an appropriate statistical procedure to apply (from the list below). You may use the same procedure more than once and some questions might have more than one correct answer.
(ii) In some problems you will also be given a p-value. Use it to reach a conclusion for that specific problem. Be sure to say something more than just Reject Ho or Fail to Reject Ho. (Assume a 5% significance level)
Some statistical procedures you might choose:
Confidence interval (for a mean, p, ...) Normal distribution
Determining sample size Correlation
Test for a mean Simple linear regression
Test for proportion Multiple regression
Difference in means (paired data) Two-way table (Chi-square test)
Difference in means (two independent samples) ANOVA for difference in means
Difference in proportions Two-way ANOVA for means
-----------------------------------------------------------------------------------------------
(a) Anthropologists have found two burial mounds in the same region. They know that several different tribes lived in the region and that the tribes have been classified according to different lengths of skulls. They measure a random sample of skulls found in each burial mound and wish to determine if the two mounds were made by different tribes. (p-value = 0.0082)
-----------------------------------------------------------------------------------------------
(b) The Hawaiian Planters Association is developing three new strains of pineapple (call them A, B, and C) to yield pulp with higher sugar content. Twenty plants of each variety (60 plants in all) are randomly distributed into a two acre field. After harvesting, the resulting pineapples are measured for sugar content and the yields are recorded for each strain. Are there significant differences in average sugar content between the three strains? (p-value = 0.987)
-----------------------------------------------------------------------------------------------
(c) Researchers were commissioned by the Violence In Children’s Television Investigative Monitors (VICTIM) to study the frequency of depictions of violent acts in Saturday morning TV fare. They selected a random sample of 40 shows which aired during this time period over a twelve week period. Suppose that 28 of the 40 shows in the sample were judged to contain scenes depicting overtly violent acts. How should they use this information to make a statement about the population of all Saturday morning TV shows?
-----------------------------------------------------------------------------------------------
(d) The Career Planning Office is interested in seniors' plans and how they might relate to their majors. A large number of students are surveyed and classified according to their MAJOR (Natural Science, Social Science, Humanities) and FUTURE plans (
-----------------------------------------------------------------------------------------------
(e) Sophomore Magazine asked a random sample of 15 year olds if they were sexually active (yes or no). They would like to see if there is a difference in the responses between boys and girls. (p-value = 0.029)
-----------------------------------------------------------------------------------------------
(f) Every week during the Vietnam War, a body count (number of enemy killed) was reported by each army unit. The last digits of these numbers should be fairly random. However, suspicions arose that the counts might have been fabricated. To test this, a large random sample of body count figures was examined and the frequency with which the last digit was a 0 or a 5 was recorded. Psychologists have shown that people making up their own random numbers will use these digits less often than random chance would suggest (i.e. 103 sounds like a more "real" count than 100). If the data were authentic counts, the proportion of numbers ending in 0 or 5 should be about 0.20. (p-value=0.002)
-----------------------------------------------------------------------------------------------
(g) In one of his adventures, Sherlock Holmes found footprints made by the criminal at the scene of a crime and measured the distance between them. After sampling many people, measuring their height and length of stride, he confidently announced that he could predict the height of the suspect. How?
-----------------------------------------------------------------------------------------------
(24) How accurate are radon detectors of a type sold to homeowners? To answer this question, university researchers placed 12 detectors in a chamber that exposed them to 105 picocuries per liter of radon. The detector readings found are below. A printout of the descriptive statistics from Minitab follows.
 91.9   97.8  111.4  122.3  105.4   95.0
103.8   99.6   96.6  119.3  104.8  101.7

Variable   N    Mean    Median  TrMean  StDev  SE Mean
readings   12   104.13  102.75  103.54  9.40   2.71

Variable   Minimum  Maximum  Q1     Q3
readings   91.90    122.30   96.90  109.90
(a) Is there convincing evidence that the mean reading of all detectors of this type differs from the true value of 105? Perform the appropriate hypothesis test with α = .05.
(b) What is the Type I error associated with this problem?
(c) What is the Type II error associated with this problem?
(d) What is the probability of a Type II error if the reading of the detectors is too low by 5 picocuries (really 100 when it should read 105)?
(25) According to a Food and Drug Administration (FDA) study, a cup of coffee contains an average of 115 mg of caffeine, with the amount per cup ranging from 60 to 180 mg depending on the brewing method. Suppose you want to repeat the FDA experiment to obtain an estimate of the mean caffeine content to within 5 mg with 95% confidence using your favorite brewing method. In problems such as this, we can estimate the standard deviation of the population to be ¼ of the range. How many cups of coffee must you brew?
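The sample-size arithmetic in item (25) can be sketched as below, assuming a normal critical value and the range/4 rule of thumb stated in the item. The variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

low, high = 60, 180            # stated range of caffeine per cup, in mg
sigma_guess = (high - low) / 4 # rule of thumb from the item: SD ~ range / 4
margin = 5                     # desired margin of error, in mg

z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
n_cups = ceil((z * sigma_guess / margin) ** 2)

print(f"estimated SD = {sigma_guess} mg, cups needed = {n_cups}")
```

With a guessed standard deviation of 30 mg this gives 139 cups, rounding up so that the margin of error is no larger than 5 mg.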
(26) An advertisement claims that by applying a particular drug, hair is restored to bald-headed men. Outline the design of an experiment that you would use to examine this claim. Assume that you have money to use 20 bald men in this experiment.
(27) A study of iron deficiency among infants compared samples of infants following different feeding regimens. One group contained breast-fed infants, while the children in another group were fed a standard baby formula without any iron supplements. Here are the summary results on blood hemoglobin levels at 12 months of age:
Group       n    x̄     s
Breast-fed  23   13.3   1.7
Formula     19   12.4   1.8
Assume that the blood hemoglobin levels in children (both breast-fed and formula-fed) are normally distributed. Do a significance test to determine the statistical significance of the observed difference.
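One way to carry out the test in item (27) is the unpooled (Welch) two-sample t procedure; the item does not say whether a pooled or unpooled test is intended, so that choice is an assumption here. The sketch computes the test statistic and approximate degrees of freedom from the summary table; the p-value would then be read from a t table or software.

```python
from math import sqrt

# Summary statistics from the table (hemoglobin at 12 months).
n1, mean1, sd1 = 23, 13.3, 1.7   # breast-fed
n2, mean2, sd2 = 19, 12.4, 1.8   # formula

v1, v2 = sd1**2 / n1, sd2**2 / n2
se = sqrt(v1 + v2)                       # standard error of the difference
t_stat = (mean1 - mean2) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"difference = {mean1 - mean2:.2f}, SE = {se:.3f}")
print(f"t = {t_stat:.3f} on about {df_welch:.1f} degrees of freedom")
```

Under this assumption t is about 1.65 on roughly 37.6 degrees of freedom, which falls short of the two-sided 5% critical value of about 2.03.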
(28) Which implies a stronger linear relationship, a correlation of +.4 or a correlation of −.6? Briefly explain your choice.
(29) A group of physicians subjected the polygraph to the same careful testing given to medical diagnostic tests. They found that if 1000 people were subjected to the polygraph and 500 told the truth and 500 lied, the polygraph would indicate that approximately 185 of the truth-tellers were liars and 120 of the liars were truth-tellers. In the application of the polygraph test, an individual is presumed to be a truth-teller until the test indicates that he or she is a liar. What is a Type I error in the context of this problem? What is the probability of a Type I error in the context of this problem? What is a Type II error in the context of this problem? What is the probability of a Type II error in the context of this problem?
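Because the presumption of truth-telling plays the role of the null hypothesis in item (29), the two error rates follow directly from the counts given. A minimal sketch, with illustrative variable names:

```python
truth_tellers, liars = 500, 500
truth_flagged_as_liar = 185     # null true (truthful) but rejected
liar_passed_as_truthful = 120   # null false (lying) but not rejected

# Type I error: a truth-teller is labeled a liar.
p_type1 = truth_flagged_as_liar / truth_tellers
# Type II error: a liar is labeled a truth-teller.
p_type2 = liar_passed_as_truthful / liars

print(f"P(Type I)  = {p_type1:.2f}")
print(f"P(Type II) = {p_type2:.2f}")
```

This gives estimated probabilities of 0.37 for a Type I error and 0.24 for a Type II error.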
(30) Audiologists have recently developed a rehabilitation program for hearing-impaired patients in a Canadian program for senior citizens. A simple random sample of 30 residents of a particular senior citizens home was taken, and the seniors were diagnosed for degree and type of sensorineural hearing loss, which was coded as follows: 1 = hearing within normal limits, 2 = high-frequency hearing loss, 3 = mild loss, 4 = mild-to-moderate loss, 5 = moderate loss, 6 = moderate-to-severe loss, and 7 = severe-to-profound loss. The data are as follows:
6 7 1 1 2 6 4 6 4 2
5 2 5 1 5 4 6 6 5 5
5 2 5 3 6 4 6 6 4 2
(a) Create a boxplot of the data.
(b) Give a good description of the data.
(c) Find a 95% confidence interval for the mean hearing loss of senior citizens in this Canadian program. The mean and standard deviation of the above data are 4.2 and 1.808 respectively. Interpret the interval.
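The summary values quoted in part (c) and the five-number summary needed for the boxplot in part (a) can be checked with a short sketch. The critical value `t_star` is a table value for 29 degrees of freedom and is entered by hand as an assumption; the quartile convention is the default of Python's `statistics.quantiles`, which may differ slightly from other software.

```python
from math import sqrt
from statistics import mean, stdev, quantiles

# Hearing-loss codes for the 30 sampled residents.
codes = [6, 7, 1, 1, 2, 6, 4, 6, 4, 2,
         5, 2, 5, 1, 5, 4, 6, 6, 5, 5,
         5, 2, 5, 3, 6, 4, 6, 6, 4, 2]

n = len(codes)
x_bar = mean(codes)
s = stdev(codes)

# Five-number summary for the boxplot.
q1, median, q3 = quantiles(codes, n=4)
five_number = (min(codes), q1, median, q3, max(codes))

# 95% t interval for the mean; t_star is an assumed table value (df = 29).
t_star = 2.045
half_width = t_star * s / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

print(f"n = {n}, mean = {x_bar:.3f}, SD = {s:.3f}")
print(f"five-number summary = {five_number}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The mean of 4.2 and standard deviation of 1.808 match the values given in the item, and the interval is roughly 3.5 to 4.9 on the coded scale.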
(31) A utility company was interested in knowing if agricultural customers would use less electricity during peak hours if their rates were different during those hours. Customers were randomly assigned to continue to get standard rates or to receive the time-of-day structure. Special meters were attached that recorded usage during peak and off-peak hours; the technician who read the meter did not know what rate structure each customer had.
(a) Is this an observational study or an experiment? Defend your answer.
(b) What are the explanatory and response variables?
(c) List a potential confounding variable in this work.
(d) Is this a matched pairs design? Defend your answer.
(32) At the beginning of the semester, we measured the width of a page in our statistics book. Below is the scatterplot of first measurement vs. the second measurement.

(a) Describe the distribution.
(b) Estimate the correlation with and without the circled point.
(33) A study in the Journal of Leisure Research investigated the relationship between academic performance and leisure activities. Each in a sample of 159 high school students was asked to state how many leisure activities they participated in weekly. From the list, activities that involved reading, writing or arithmetic were labeled “academic leisure activities.” Some of the results are as follows:
                                        mean    standard deviation
GPA                                     2.96    0.71
Number of leisure activities            12.38   5.07
Number of academic leisure activities   2.77    1.97
Based on these numbers (and knowing that the GPA is a value between 0 and 4 and that the number of activities cannot be negative), discuss the potential skewness of each of the above variables.
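One informal way to approach item (33) is to ask how many standard deviations fit between each mean and the nearest natural bound; a bound that sits well under two standard deviations from the mean hints at skew away from that bound. This is a heuristic sketch rather than a method stated in the item, and the variable names are illustrative.

```python
# (mean, SD, lower bound, upper bound); None marks "no natural upper bound".
summaries = {
    "GPA": (2.96, 0.71, 0.0, 4.0),
    "Number of leisure activities": (12.38, 5.07, 0.0, None),
    "Number of academic leisure activities": (2.77, 1.97, 0.0, None),
}

room = {}
for name, (mean, sd, lower, upper) in summaries.items():
    below = (mean - lower) / sd                          # SDs down to the floor
    above = None if upper is None else (upper - mean) / sd  # SDs up to the ceiling
    room[name] = (below, above)
    above_text = "unbounded" if above is None else f"{above:.2f} SDs"
    print(f"{name}: {below:.2f} SDs above the floor, ceiling at {above_text}")
```

Here the GPA mean is only about 1.5 standard deviations below the ceiling of 4, and the mean number of academic leisure activities is only about 1.4 standard deviations above zero, so those are the two variables where a bound is most likely to force skewness.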
(34) Events A and B are disjoint. Discuss whether or not A and B can be independent.
(35) A sample of 200 mothers and a sample of 200 fathers were taken. The age of the mother when she had her first child and the age of the father when he had his first child were recorded. Below are the dotplots.

(a) Describe the data for the mother’s age.
(b) Describe the data for the father’s age.
(c) Compare the distributions.
(d) A suggestion is made to check the correlation between the ages if we wish to compare the two populations. Is this a good suggestion? Why or why not?