B. Examples of Assessment Items

Assessment items to avoid using on tests: True/False items, pure computation without a context or interpretation, items with too much data to enter and compute or analyze, and items that only test memorization of definitions or formulas.

We first give some examples of assessment items with problems, along with commentary about the nature of the difficulty.

(1)      A teacher taught two sections of elementary statistics last semester, each with 25 students, one at 8am and one at 4pm.  The means and standard deviations for the final exams were 78 and 8 for the 8am class, and 75 and 10 for the 4pm class. In examining these numbers, it occurred to the teacher that the better students probably sign up for 8am classes instead of 4pm classes. So she decided to test whether or not the mean final exam scores were equal for her two groups of students. State the hypotheses and carry out the test.

Critique: The teacher has all of the population data so there is no need to do statistical inference.

 

(2)      An economist wants to compare the mean salaries for male and female CEOs. He gets a random sample of 10 of each and does a t-test. The resulting p-value is .045.

(a)      State the null and alternative hypotheses.

(b)      Make a statistical conclusion.

(c)      State your conclusion in words that would be understood by someone with no training in statistics.

Critique: The question doesn’t address the conditions necessary for a t-test, and with the small sample sizes they are almost surely violated here. Salaries are almost surely skewed.

 

(3)      Which of the following gives the definition of a p-value?

(a)      It’s the probability of rejecting the null hypothesis when the null hypothesis is true.

(b)      It’s the probability of not rejecting the null hypothesis when the null hypothesis is true.

(c)      It’s the probability of observing data as extreme as that observed.

(d)      It’s the probability that the null hypothesis is true.

Critique: None of these answers is quite correct. Answers (b) and (d) are clearly wrong; answer (a) is the level of significance and answer (c) would be correct if it continued “... or more extreme, given that the null hypothesis is true.” 


Examples showing ways to improve some assessment items:

True/false items, even when well written, do not provide much information on student knowledge because there is always a 50% chance of getting the item right without any knowledge of the topic.   One current approach is to change the items into forced-choice questions with three or more options.  For example,

(4)      The size of the standard deviation of a data set depends on where the center is.  True or False

Changed to:

(4)      Does the size of the standard deviation of a data set depend on where the center is located?

(a)      Yes, the higher the mean, the higher the standard deviation.

(b)      Yes, because you have to know the mean to calculate the standard deviation.

(c)      No, the size of the standard deviation is not affected by the location of the distribution.

(d)      No, because the standard deviation only measures how the values differ from each other, not how they differ from the mean.
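The idea behind item (4) can be checked numerically. The short Python sketch below uses a small made-up data set (the values are illustrative assumptions, not taken from this report) and shows that adding a constant to every value moves the mean but leaves the standard deviation unchanged.

```python
import statistics

# Hypothetical data, chosen only for illustration.
scores = [10, 12, 15, 18, 20]

# Shift every value by 100: the center moves, the spread does not.
shifted = [x + 100 for x in scores]

mean_original = statistics.mean(scores)
mean_shifted = statistics.mean(shifted)
sd_original = statistics.stdev(scores)
sd_shifted = statistics.stdev(shifted)

print("means:", mean_original, mean_shifted)
print("standard deviations:", round(sd_original, 4), round(sd_shifted, 4))
```

A student who runs a check like this can see directly why the location of the distribution and its spread are separate features.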

 

(5)      A correlation of +1 is stronger than a correlation of -1.   True or False

Rewritten as:

(5)      A recent article in an educational research journal reports a correlation of   +.8 between math achievement and overall math aptitude. It also reports a correlation of  -.8 between math achievement and a math anxiety test.  Which of the following interpretations is the most correct?

(a)      The correlation of +.8 indicates a stronger relationship than the correlation of -.8

(b)      The correlation of +.8 is just as strong as the correlation of -.8

(c)      It is impossible to tell which correlation is stronger

 

Context is important for helping students see and deal with statistical ideas in real world situations.

(6)      Once it is established that X and Y are highly correlated, what type of study needs to be done in order to establish that a change in X causes a change in Y?

A context is added:

(6)      A researcher is studying the relationship between an experimental medicine and T4 lymphocyte cell levels in HIV/AIDS patients. The T4 lymphocytes, a part of the immune system, are found at reduced levels in patients with the HIV infection. Once it is established that the two variables, dosage of medicine and T4 cell levels, are highly correlated, what type of study needs to be done in order to establish that a change in dosage causes a change in T4 cell levels?

(a)      correlational study

(b)      controlled experiment

(c)      prediction study

(d)      survey

Try to avoid repetitious/tedious calculations on exams that may become the focus of the problem for the students at the expense of concepts and interpretations.

 

(7)      A First Year Program course used a final exam that contained a 20 point essay question that asked students to apply Darwinian principles to analyze the process of expansion in major league sports franchises.  To check for consistency in grading among the four professors in the course, a random sample of six graded essays was selected from each instructor.  The scores are summarized in the table below.  Construct an ANOVA table to test for a difference in means among the four instructors.

Instructor  Scores

----------  ---------------

Affinger 18 11 10 12 15 12

Beaulieu 14 14 11 14 11 14

Cleary   19 20 15 19 19 16

Dean     17 14 17 15 18 15

 

Critique: The version of the question above requires a fair amount of pounding on the calculator to get the results and never even asks for an interpretation.  The revision below still requires some calculation (which can be adjusted depending on the amount of computer output provided) but the calculations can be done relatively efficiently - especially by students who have a good sense of what the computer output is providing.

 

(7)      A First Year Program course ... (same intro as above) ... The scores are summarized in the table below, along with some Descriptive Statistics for the entire sample and a portion of the Oneway ANOVA output.

Instructor  Scores

--------  -------------

Affinger 18 11 10 12 15 12

Beaulieu 14 14 11 14 11 14

Cleary   19 20 15 19 19 16

Dean     17 14 17 15 18 15

 
 


Descriptive Statistics

Variable        N     Mean   Median   TrMean    StDev   SEMean

Score          24   15.000   15.000   15.000    2.919    0.596

 

One-way Analysis of Variance

              *** ANOVA TABLE OMITTED ***

                                   Individual 95% CIs For Mean

                                   Based on Pooled StDev

 Level      N      Mean     StDev  ------+---------+---------+---------+

Affinger    6    13.000     2.966   (------*------)

Beaulieu    6    13.000     1.549   (------*------)

Cleary      6    18.000     2.000                       (------*------)

Dean        6    16.000     1.549               (------*------)

                                   ------+---------+---------+---------+

Pooled StDev =    2.098               12.5      15.0      17.5      20.0

(a)      Unfortunately, we are missing the ANOVA table from the Minitab output. Use the information given above to construct the ANOVA table and conduct a test (5% level) for any significant differences among the average scores assigned by the four instructors.  Be sure to include hypotheses and a conclusion.   If you have trouble getting one part of the table that you need to complete the rest (or the next question), make a reasonable guess or ask for assistance (for a small point fee).

(b)      After completing the ANOVA table, construct a 95% confidence interval for the average score given by Dr. Affinger.  Note: Your answer should be consistent with the graphical display given by Minitab.
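For instructors preparing a key, the following Python sketch shows one way the omitted ANOVA table and the interval in part (b) could be reconstructed. It works from the raw scores in the table rather than from the summary output, and the two critical values are table look-ups entered by hand (assumptions of this sketch, not computed by the code).

```python
import math

# Scores from the table in item (7).
scores = {
    "Affinger": [18, 11, 10, 12, 15, 12],
    "Beaulieu": [14, 14, 11, 14, 11, 14],
    "Cleary": [19, 20, 15, 19, 19, 16],
    "Dean": [17, 14, 17, 15, 18, 15],
}

all_scores = [x for group in scores.values() for x in group]
n_total = len(all_scores)
k = len(scores)
grand_mean = sum(all_scores) / n_total

# Between-instructor and within-instructor (error) sums of squares.
ss_groups = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in scores.values())
ss_error = sum((x - sum(g) / len(g)) ** 2 for g in scores.values() for x in g)

df_groups, df_error = k - 1, n_total - k
ms_groups = ss_groups / df_groups
ms_error = ss_error / df_error
f_stat = ms_groups / ms_error
pooled_sd = math.sqrt(ms_error)

# Critical values taken from printed tables (entered by hand, not computed).
F_CRIT_3_20 = 3.10  # upper 5% point of F with 3 and 20 df
T_CRIT_20 = 2.086   # upper 2.5% point of t with 20 df

# 95% CI for Dr. Affinger's mean, based on the pooled standard deviation.
n_a = len(scores["Affinger"])
mean_a = sum(scores["Affinger"]) / n_a
margin = T_CRIT_20 * pooled_sd / math.sqrt(n_a)
ci_affinger = (mean_a - margin, mean_a + margin)

print(f"Instructor  df={df_groups}  SS={ss_groups:.1f}  MS={ms_groups:.2f}  F={f_stat:.2f}")
print(f"Error       df={df_error}  SS={ss_error:.1f}  MS={ms_error:.2f}")
print(f"Pooled SD = {pooled_sd:.3f}, reject H0 at 5%: {f_stat > F_CRIT_3_20}")
print(f"95% CI for Affinger: ({ci_affinger[0]:.2f}, {ci_affinger[1]:.2f})")
```

The pooled standard deviation computed this way agrees with the value printed in the Minitab output, which is a useful check that the reconstructed table is consistent with the information students are given.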

 


 

Some additional examples of good assessment items  

 

(8)      Let Y denote the amount a student spends on textbooks for one semester.  Suppose Nancy, who is statistically savvy, wants to know how fall (semester 1) and spring (semester 2) compare.  In particular, suppose she is interested in the averages μ1 and μ2.  You may assume that Nancy has taken several statistics courses and knows a lot about statistics, including how to interpret confidence intervals and hypothesis tests.  You have random samples from each semester and are to analyze the data and write a report.  You seek advice from 4 persons:

Rudd says “Conduct an α=.05 test of H0: μ1=μ2 vs. HA: μ1≠μ2 and tell Nancy whether or not you reject H0.”

Linda says “Report a 95% confidence interval for μ1−μ2.”

Steve says “Conduct a test of H0: μ1=μ2 vs. HA: μ1≠μ2 and report to Nancy the p-value from the test.”

Gloria says “Compare ȳ1 to ȳ2.  If ȳ1 > ȳ2 then test H0: μ1=μ2 vs. HA: μ1>μ2 using α=.05 and tell Nancy whether or not you reject H0.  If ȳ1 < ȳ2 then test H0: μ1=μ2 vs. HA: μ1<μ2 using α=.05 and tell Nancy whether or not you reject H0.”

 

Rank the 4 pieces of advice from worst to best and explain why you rank them as you do.  That is, explain what makes one better than another.

 

 

(9)      Researchers took random samples of subjects from two populations and applied a Wilcoxon-Mann-Whitney test to the data; the P-value for the test, using a non-directional alternative, was .06.  For each of the following, say whether the statement is True or False and say why.

(a)      There is a 6% chance that the two population distributions really are the same.

(b)      If the two population distributions really are the same, then a difference between the two samples as extreme as the difference that these researchers observed would only happen 6% of the time.

(c)      If a new study were done that compared the two populations, there is a 6% probability that H0 would be rejected again.

(d)      If α = .05 and a directional alternative were used, and the data departed from H0 in the direction specified by the alternative hypothesis, then H0 would be rejected.

 

(10)  An article on the CNN web page on Monday (http://www.cnn.com/HEALTH/9612/16/faith.healing/index.html) begins with the sentence "Family doctors overwhelmingly believe that religious faith can help patients heal, according to a survey released Monday."  Later the article states "Medical researchers say the benefits of religion may be as simple as helping the immune system by reducing stress" and Dr. Harold Koenig is reported to say that "people who regularly attend church have half the rate of depression of infrequent churchgoers." 

Use the language of statistics to critique the statement by Dr. Koenig and the claim, suggested by the article, that religious faith and practice help people fight depression.  You will want to select some of the following words in your critique: observational study, experiment, blind, double-blind, precision, bias, sample, spurious, confounding, causation, association, random, valid, reliable.

 

(11)  Francisco Franco (Class of '98) weighed 100 Hershey's Kisses (with almonds).  He found that the sample average was 4.80 grams and the SD was .28 grams.  In the context of this setting, explain what is meant by the sampling distribution of an average.

 

(12)  A gardener wishes to compare the yields of three types of pea seeds - type A, type B, and type C.  She randomly divides the type A seeds into three groups and plants some in the east part of her garden, some in the central part of the garden, and some in the west part of the garden.  Then she does the same with the type B seeds and with the type C seeds.

(a)      What kind of experimental design is the gardener using?

(b)      Why is this kind of design used in this situation?  (Explain in the context of the situation.)

 

(13)  The following scatterplot shows how divorce rate, y, and marriage rate, x, are related for a collection of 10 countries.  The regression line has been added to the plot.

 

(a)      The U.S. is not one of the 10 points in the original collection of countries.  It happens that the U.S. has a higher marriage rate than any of the 10 countries.  Moreover, the divorce rate for the U.S. is higher than one would expect, given the pattern of the other countries.  How would adding the U.S. to the data set affect the regression line?  Why?

(b)      Think about the scatterplot and regression line after the U.S. has been added to the data set.  Provide a sketch of the residual plot.  Label the axes and identify the U.S. on your plot with a triangle.

 

(14)  Researchers wanted to compare two drugs, formoterol and salbutamol, in aerosol solution, to a placebo for the treatment of patients who suffer from exercise-induced asthma.  Patients were to take a drug or the placebo, do some exercise, and then have their "forced expiratory volume" measured.  There were 30 subjects available. (Based on A.N. Tsoy, et al., European Respiratory Journal 3 (1990): 235; via Berry, Statistics: A Bayesian Perspective.)

(a)      Should this be an experiment or an observational study?  Why?

(b)      Within the context of this setting, what is the placebo effect?

(c)      Briefly explain how to set up a randomized blocks design (RBD) here.

(d)      How would an RBD be helpful?  That is, what is the main advantage of using an RBD in a setting like this?

 

(15)  I noticed that 8 students from the 114 class attended the review session prior to the second exam (in April).  The average score among those 8 students was lower than the average for the 21 students who did not attend the review session.  Suppose I want to use this information in a study of the effectiveness of review sessions.

(a)      What kind of study is this: observational or experimental?  Why?

(b)      What kind(s) of sampling error(s) or bias(es) might be of concern here?

(c)      (Hypothetical)  I gave the data for the 8 who attended and for the 21 who did not attend to my friend George.  George used the data to conduct a hypothesis test.  Does a hypothesis test make sense?  If so, what is H0?  If not, why not?

 

(16)  For each of the following three settings, state the type of analysis you would conduct (e.g., one-sample t-test, regression, Chi-square test of independence, Chi-square goodness-of-fit test, etc.) if you had all of the raw data and specify the roles of the variable(s) on which you would perform the analysis, but do not actually carry out the analysis.

(a)      Elizabeth Larntz (Class of ‘02) measured the effect of exercise on pulse for each of 13 students.  She measured pulse before and after exercise (doing 30 jumping jacks) and found that the average change was 55.1 and the SD of the changes was 18.4.  How would you analyze the data?

(b)      Three HIV treatments were tested for their effectiveness in preventing progression of HIV in children.  Of 276 children given drug A, 259 lived and 17 died. Of 281 children given drug B, 274 lived and 7 died.  Of 274 children given drug C, 264 lived and 10 died.  How would you analyze the data?

(c)      A researcher was interested in the relationship between forearm length and height.  He measured the forearm lengths and heights of each of 16 women.  How would you analyze these data?

 

(17)  I had Data Desk construct parallel dotplots of the data from four samples.  I then conducted a test of H0: μ1=μ2=μ3=μ4 and rejected H0 at the α=.05 level.  I also tested H0: μ1=μ2=μ3 and rejected H0 at the α=.05 level.  However, when I tested H0: μ2=μ3 using α=.05 I did not reject H0.  Likewise, when I tested H0: μ1=μ4 using α=.05 I did not reject H0.

(a)      Your job is to sketch a graph of the parallel dotplots of the data.  That is, based on what I told you about the tests you should have an idea of how the data look.  Use that idea to draw a graph.  Indicate the sample means with triangles that you add to the dotplots.

(b)      It is possible to get data with the same sample means that you graphed in part (a), but for which the hypothesis H0: μ1=μ2=μ3=μ4 is not rejected at the α=.05 level.  Provide a graph of this situation.  That is, keep the same sample means (triangles) you had from part (a), but show how the data would have been different if H0 were not to be rejected.

 

(18)  Atley Chock (Class of '02) collected data on a random sample of 12 breakfast cereals.  He recorded x = fiber (in grams/ounce) and y = price (in cents/ounce).  A scatterplot of the data shows a linear relationship.  The fitted regression model is

The sample correlation coefficient, r, is .23.  The SE of b1 is .81.   Also, s_y|x = 3.1.

(a)      Find r² and interpret r² in the context of this problem.

(b)      Suppose that a cereal has 2.63 grams of fiber/ounce and costs 17.3 cents/ounce.  What is the residual for this cereal?

(c)      Interpret the value of s_y|x in the context of this problem.  That is, what does it mean to say that s_y|x = 3.1?

(d)      In the context of this problem explain what is meant by "the regression effect."

 

(19)  Give a rough estimate of the sample correlation for the data in each of the scatterplots below.

r = _____
r = _____
r = _____

(20)  A matched pairs experiment compares the taste of a regular cheese pizza from Pizza Joe’s to one from Domino’s.  Each subject tastes two unmarked pieces of pizza, one of each type, in random order and states which he or she prefers.  Of the 50 subjects who participate in the study, 21 prefer Pizza Joe’s.

(a)      Find a 96% confidence interval for the proportion of the population who prefers Pizza Joe’s to Domino’s.

(b)      How large a sample is required if the desired margin of error of the confidence interval is 4%?
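One possible worked solution to item (20) is sketched below in Python. It assumes the usual large-sample (Wald) interval for a proportion, and for part (b) it shows both the conservative choice p = 0.5 and the choice of plugging in the sample proportion, since the item does not say which planning value is intended.

```python
import math
from statistics import NormalDist

successes, n = 21, 50
p_hat = successes / n

# Critical value for 96% confidence: 2% in each tail.
z = NormalDist().inv_cdf(0.98)

# Large-sample (Wald) confidence interval for the population proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z * se, p_hat + z * se)

# Sample size needed for a margin of error of 4 percentage points.
margin = 0.04
n_conservative = math.ceil((z / margin) ** 2 * 0.25)  # worst case, p = 0.5
n_using_p_hat = math.ceil((z / margin) ** 2 * p_hat * (1 - p_hat))

print(f"96% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print("n (p = 0.5):", n_conservative, " n (p = p_hat):", n_using_p_hat)
```

The two sample sizes differ only slightly here because the sample proportion is close to one half.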

 

(21)  It was claimed that 1 out of 5 cardiologists takes an aspirin a day to prevent hardening of the arteries.  Suppose that the claim is true.  If 1500 cardiologists are selected at random, what is the probability that at least 275 of the 1500 take an aspirin a day?
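Item (21) is usually answered with a normal approximation to the binomial. The Python sketch below computes that approximation (with a continuity correction, which is an assumption about the intended method) and compares it with the exact binomial tail probability, evaluated on the log scale to avoid numerical overflow.

```python
import math
from statistics import NormalDist

n, p, threshold = 1500, 0.20, 275

# Normal approximation with continuity correction:
# P(X >= 275) is approximated by P(Normal > 274.5).
mean = n * p
sd = math.sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mean, sd).cdf(threshold - 0.5)

def log_binom_pmf(k):
    """Log of the binomial probability P(X = k), using log-gamma for the coefficient."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)

# Exact tail probability: sum P(X = k) for k = 275, ..., 1500.
exact = sum(math.exp(log_binom_pmf(k)) for k in range(threshold, n + 1))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```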

(22)  Identify whether a scatterplot would or would not be an appropriate visual summary of the relationship between the variables.  In each case, explain your reasoning. 

(a)      Blood pressure and age

(b)      Region of country and opinion about stronger gun control laws

(c)      Verbal SAT and math SAT score

(d)      Handspan and gender (male or female)

 

 

(23)  The paragraphs that follow each describe a situation which calls for some type of statistical analysis.  For each you should:

         (i)     Give the name of an appropriate statistical procedure to apply (from the list below). You may use the same procedure more than once and some questions might have more than one correct answer.

         (ii)    In some problems you will also be given a p-value. Use it to reach a conclusion for that specific problem.  Be sure to say something more than just Reject H0 or Fail to Reject H0.  (Assume a 5% significance level.)

Some statistical procedures you might choose:

   Confidence interval (for a mean, p, ...)         Normal distribution
   Determining sample size                          Correlation
   Test for a mean                                  Simple linear regression
   Test for proportion                              Multiple regression
   Difference in means (paired data)                Two-way table (Chi-square test)
   Difference in means (two independent samples)    ANOVA for difference in means
   Difference in proportions                        Two-way ANOVA for means

 

-----------------------------------------------------------------------------------------------

(a) Anthropologists have found two burial mounds in the same region. They know that several different tribes lived in the region and that the tribes have been classified according to different lengths of skulls. They measure a random sample of skulls found in each burial mound and wish to determine if the two mounds were made by different tribes.  (p-value = 0.0082)

-----------------------------------------------------------------------------------------------

(b) The Hawaiian Planters Association is developing three new strains of pineapple (call them A, B, and C) to yield pulp with higher sugar content. Twenty plants of each variety (60 plants in all) are randomly distributed into a two acre field. After harvesting, the resulting pineapples are measured for sugar content and the yields are recorded for each strain.  Are there significant differences in average sugar content between the three strains? (p-value = 0.987)

-----------------------------------------------------------------------------------------------

(c)  Researchers were commissioned by the Violence In Children’s Television Investigative Monitors (VICTIM) to study the frequency of depictions of violent acts in Saturday morning TV fare.  They selected a random sample of 40 shows which aired during this time period over a twelve week period.  Suppose that 28 of the 40 shows in the sample were judged to contain scenes depicting overtly violent acts.  How should they use this information to make a statement about the population of all Saturday morning TV shows?

-----------------------------------------------------------------------------------------------

(d)  The Career Planning Office is interested in seniors' plans and how they might relate to their majors.  A large number of students are surveyed and classified according to their MAJOR (Natural Science, Social Science, Humanities) and FUTURE plans (Graduate School, Job, Undecided).  Are the type of major and future plans related?  (p-value = 0.047)

-----------------------------------------------------------------------------------------------

(e)  Sophomore Magazine asked a random sample of 15 year olds if they were sexually active (yes or no).  They would like to see if there is a difference in the responses between boys and girls.     (p-value = 0.029)

-----------------------------------------------------------------------------------------------

 (f) Every week during the Vietnam War, a body count (number of enemy killed) was reported by each army unit. The last digits of these numbers should be fairly random.  However, suspicions arose that the counts might have been fabricated.  To test this, a large random sample of body count figures was examined and the frequency with which the last digit was a 0 or a 5 was recorded.  Psychologists have shown that people making up their own random numbers will use these digits less often than random chance would suggest (i.e. 103 sounds like a more "real" count than 100).  If the data were authentic counts, the proportion of numbers ending in 0 or 5 should be about 0.20.  (p-value=0.002)

-----------------------------------------------------------------------------------------------

(g) In one of his adventures, Sherlock Holmes found footprints made by the criminal at the scene of a crime and measured the distance between them. After sampling many people, measuring their height and length of stride, he confidently announced that he could predict the height of the suspect. How?

-----------------------------------------------------------------------------------------------

 

(24)  How accurate are radon detectors of a type sold to homeowners?  To answer this question, university researchers placed 12 detectors in a chamber that exposed them to 105 picocuries per liter of radon.  The detector readings found are below.  A printout of the descriptive statistics from Minitab follows.

 91.9   97.8  111.4  122.3  105.4   95.0

103.8   99.6   96.6  119.3  104.8  101.7

 

Variable       N       Mean     Median     TrMean      StDev    SE Mean

readings      12     104.13     102.75     103.54       9.40       2.71

 

Variable       Minimum    Maximum         Q1         Q3

readings         91.90     122.30      96.90     109.90

 

(a)      Is there convincing evidence that the mean reading of all detectors of this type differs from the true value of 105?  Perform the appropriate hypothesis test with α = .05.

(b)      What is the Type I error associated with this problem?

(c)      What is the Type II error associated with this problem?

(d)      What is the probability of a type II error if the reading of the detectors is too low by 5 picocuries (really 100 when it should read 105)?
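A possible check of part (a) is sketched below in Python. It recomputes the summary statistics from the twelve readings and forms the one-sample t statistic; the critical value is a table look-up entered by hand (an assumption of this sketch), since the standard library has no t distribution.

```python
import math
import statistics

# Detector readings from item (24).
readings = [91.9, 97.8, 111.4, 122.3, 105.4, 95.0,
            103.8, 99.6, 96.6, 119.3, 104.8, 101.7]
mu_0 = 105  # true radon level in the chamber

n = len(readings)
mean = statistics.mean(readings)
sd = statistics.stdev(readings)  # sample standard deviation
se = sd / math.sqrt(n)
t_stat = (mean - mu_0) / se

# Two-sided 5% critical value for t with 11 df, taken from a printed table.
T_CRIT_11 = 2.201
reject_h0 = abs(t_stat) > T_CRIT_11

print(f"mean = {mean:.2f}, sd = {sd:.2f}, se = {se:.2f}")
print(f"t = {t_stat:.2f} on {n - 1} df, reject H0 at 5%: {reject_h0}")
```

The mean, standard deviation, and standard error produced here match the Minitab printout, so students can also carry out the test directly from the printed summary.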

 

(25)  According to a Food and Drug Administration (FDA) study, a cup of coffee contains an average of 115 mg of caffeine, with the amount per cup ranging from 60 to 180 mg depending on the brewing method.  Suppose you want to repeat the FDA experiment to obtain an estimate of the mean caffeine content to within 5 mg with 95% confidence using your favorite brewing method.  In problems such as this, we can estimate the standard deviation of the population to be ¼ of the range.  How many cups of coffee must you brew?
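The arithmetic for item (25) can be sketched in a few lines of Python. The sketch takes the standard deviation to be one quarter of the stated range, as the item suggests, and uses the standard normal critical value for 95% confidence.

```python
import math
from statistics import NormalDist

low, high = 60, 180              # range of caffeine per cup, in mg
sigma_guess = (high - low) / 4   # rule of thumb from the item: SD is about range / 4
margin = 5                       # desired margin of error, in mg

z = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence

# Solve z * sigma / sqrt(n) <= margin for n and round up to a whole cup.
n_cups = math.ceil((z * sigma_guess / margin) ** 2)

print("sigma guess:", sigma_guess, "cups needed:", n_cups)
```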

 

(26)  An advertisement claims that by applying a particular drug, hair is restored to bald headed men.  Outline the design of an experiment that you would use to examine this claim.  Assume that you have money to use 20 bald men in this experiment. 

 

(27)  A study of iron deficiency among infants compared samples of infants following different feeding regimens.  One group contained breast-fed infants, while the children in another group were fed a standard baby formula without any iron supplements.  Here are the summary results on blood hemoglobin levels at 12 months of age:

Group         n    Mean     s

----------   --   -----   ---

Breast-fed   23    13.3   1.7

Formula      19    12.4   1.8

 

Assume that the blood hemoglobin levels in children (both breast-fed and formula fed) are normally distributed.  Do a significance test to determine the statistical significance of the observed difference.
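One way to set up the test in item (27) from the summary statistics is sketched below in Python. It assumes the unpooled (Welch) two-sample t procedure, which is a choice made for this sketch; the item itself does not say whether a pooled or unpooled test is expected.

```python
import math

# Summary statistics from item (27): sample size, mean, standard deviation.
n1, mean1, s1 = 23, 13.3, 1.7   # breast-fed
n2, mean2, s2 = 19, 12.4, 1.8   # formula

# Estimated variance of each sample mean.
v1 = s1 ** 2 / n1
v2 = s2 ** 2 / n2

# Unpooled standard error and t statistic for the difference in means.
se = math.sqrt(v1 + v2)
t_stat = (mean1 - mean2) / se

# Welch-Satterthwaite approximate degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"difference = {mean1 - mean2:.2f}, se = {se:.3f}")
print(f"t = {t_stat:.2f} with about {df_welch:.1f} df")
```

The t statistic would then be compared with a t table (or software) at the approximate degrees of freedom to obtain the p-value.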

 

(28)  Which implies a stronger linear relationship, a correlation of +.4 or a correlation of −.6?  Briefly explain your choice.

 

(29)  A group of physicians subjected the polygraph to the same careful testing given to medical diagnostic tests.  They found that if 1000 people were subjected to the polygraph and 500 told the truth and 500 lied, the polygraph would indicate that approximately 185 of the truth tellers were liars and 120 of the liars were truth-tellers.  In the application of the polygraph test, an individual is presumed to be a truth-teller until indicated that s/he is a liar. What is a Type I error in the context of this problem?  What is the probability of a Type I error in the context of this problem?  What is a Type II error in the context of this problem?  What is the probability of a Type II error in the context of this problem?
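The two error probabilities in item (29) follow directly from the counts given. The Python sketch below treats "the individual is a truth-teller" as the null hypothesis, as the item states, so a Type I error is labeling a truth-teller a liar and a Type II error is labeling a liar a truth-teller. The variable names are made up for this illustration.

```python
truth_tellers, liars = 500, 500

truth_flagged_as_liar = 185      # null true, but rejected
liar_passed_as_truthful = 120    # null false, but not rejected

type_i_rate = truth_flagged_as_liar / truth_tellers
type_ii_rate = liar_passed_as_truthful / liars

print("P(Type I error)  =", type_i_rate)
print("P(Type II error) =", type_ii_rate)
```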

 

(30)  Audiologists have recently developed a rehabilitation program for hearing-impaired patients in a Canadian program for senior citizens.  A simple random sample of 30 residents of a particular senior citizens home was selected, and the seniors were diagnosed for degree and type of sensorineural hearing loss, which was coded as follows: 1 = hearing within normal limits, 2 = high-frequency hearing loss, 3 = mild loss, 4 = mild-to-moderate loss, 5 = moderate loss, 6 = moderate-to-severe loss, and 7 = severe-to-profound loss.  The data are as follows:

6  7  1  1  2  6  4  6  4  2  5  2  5  1  5

4  6  6  5  5  5  2  5  3  6  4  6  6  4  2

(a)      Create a boxplot of the data.

(b)      Give a good description of the data.

(c)      Find a 95% confidence interval for the mean hearing loss of senior citizens in this Canadian program.   The mean and standard deviation of the above data are 4.2 and 1.808 respectively. Interpret the interval.
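The summary values quoted in part (c) and the resulting interval can be checked with the Python sketch below. It treats the hearing-loss codes as numeric values, as the item does, and uses a t critical value entered by hand from a table (an assumption of this sketch).

```python
import math
import statistics

# Hearing-loss codes for the 30 residents in item (30).
codes = [6, 7, 1, 1, 2, 6, 4, 6, 4, 2, 5, 2, 5, 1, 5,
         4, 6, 6, 5, 5, 5, 2, 5, 3, 6, 4, 6, 6, 4, 2]

n = len(codes)
mean = statistics.mean(codes)
sd = statistics.stdev(codes)  # sample standard deviation

# Two-sided 95% critical value for t with 29 df, taken from a printed table.
T_CRIT_29 = 2.045
margin = T_CRIT_29 * sd / math.sqrt(n)
ci = (mean - margin, mean + margin)

print(f"n = {n}, mean = {mean:.1f}, sd = {sd:.3f}")
print(f"95% CI for the mean code: ({ci[0]:.2f}, {ci[1]:.2f})")
```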

 

(31)  A utility company was interested in knowing if agricultural customers would use less electricity during peak hours if their rates were different during those hours.  Customers were randomly assigned to continue to get standard rates or to receive the time-of-day structure.  Special meters were attached that recorded usage during peak and off-peak hours; the technician who read the meter did not know what rate structure each customer had.

(a)      Is this an observational study or an experiment?  Defend your answer.

(b)      What are the explanatory and response variables?

(c)      List a potential confounding variable in this work.

(d)      Is this a matched pair design?  Defend your answer.

(32)  At the beginning of the semester, we measured the width of a page in our statistics book.  Below is the scatterplot of first measurement vs. the second measurement.

(a)      Describe the distribution.

(b)      Estimate the correlation with and without the circled point.

 

(33)  A study in the Journal of Leisure Research investigated the relationship between academic performance and leisure activities.  Each in a sample of 159 high school students was asked to state how many leisure activities they participated in weekly.  From the list, activities that involved reading, writing or arithmetic were labeled “academic leisure activities.”  Some of the results are as follows:

                                                          

Variable                                  Mean    Standard deviation

-------------------------------------    -----    ------------------

GPA                                       2.96    0.71

Number of leisure activities             12.38    5.07

Number of academic leisure activities     2.77    1.97

 

Based on these numbers (and knowing that the GPA is a value between 0 and 4 and the number of activities cannot be negative), discuss the potential skewness of each of the above variables.

 

(34)  Events A and B are disjoint.  Discuss whether or not A and B can be independent.

 

(35)  A sample of 200 mothers and a sample of 200 fathers were taken.  The age of the mother when she had her first child and the age of the father when he had his first child were recorded.  Below are the dotplots.

(a)      Describe the data for the mother’s age.

(b)      Describe the data for the father’s age.

(c)      Compare the distributions.

(d)      A suggestion is made to check the correlation between the ages if we wish to compare the two populations.  Is this a good suggestion?  Why or why not?

 

 

 
