University Students' Ability to Apply Statistical Procedures

Paul L. Gardner and Ingrid Hudson
Monash University

Journal of Statistics Education v.7, n.1 (1999)

Copyright (c) 1999 by Paul L. Gardner and Ingrid Hudson, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.


Appendix 2: Qualitative Findings

This appendix reports the qualitative data obtained during the interviews. Quotes attributed to students are based on interviewers' notes and are not necessarily verbatim. Some minor editing has been done to facilitate readability. The appendix has been organised into categories, each containing a group of items related to a major statistical theme. The number in parentheses following the item code has been added here to facilitate location of the relevant item in Appendix 1.

Descriptive Comparisons Using z-scores

Item RC (21): z-score conversion and comparison

This item described a situation in which a teacher wished to compare a particular student's relative ability on two different tests. Of the 11 students who answered this question, three mentioned z-scores, with various levels of understanding. S8 correctly noted that there were "two different distributions, each with mean and standard deviation" and that z-score conversion could be used to "standardise and see which is higher." S16 proposed the same idea in visual form: "use z-scores; look at the bell-curve and work out the number of standard deviations the score is from the mean in each category." S1 also mentioned z-scores, but justified this incorrectly by stating that the procedure permitted a comparison "with the population" (although it is possible that `population' is being misused here to mean `sample'). A fourth student proposed a different but equally correct idea: "percentiles, after standardising the scores."

S19 also mentioned percentile ranks, but for the data given in the problem, it would not be possible to obtain these without doing a z-score conversion first and then assuming that the scores are normally distributed. S18 suggested that if one assumed normality of distributions, one could calculate a "standardised t variant" [probably referring to T-scores] to make the comparison, but then displayed conceptual confusion with t-tests by stating that the procedure resembled "using t-score with one degree of freedom." S17 considered multiplying one set of scores by two, displaying the common layman's misconception that this would weight them equally, and ignoring the issue of the differing standard deviations. S11 missed the point entirely in suggesting that the "means of groups" should be compared.
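The comparison described by S8 and S16 can be sketched in a few lines of Python. This is only an illustration: the scores, means and standard deviations are hypothetical (they are not the figures given in the item), and the percentile step assumes normally distributed scores.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd

def percentile_rank(z):
    """Percentile rank implied by a z-score, ASSUMING a normal distribution."""
    return 100 * NormalDist().cdf(z)

# Hypothetical figures for one student on two tests (not the item's data).
z_test_1 = z_score(score=70, mean=60, sd=10)   # one SD above the mean
z_test_2 = z_score(score=80, mean=75, sd=10)   # half an SD above the mean

# The raw score is higher on test 2, but the relative standing is higher on test 1.
relatively_better_on = "test 1" if z_test_1 > z_test_2 else "test 2"
```

With these made-up figures the student's raw score is higher on the second test, yet the standardised comparison favours the first, which is the point that multiplying scores or comparing group means misses.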

Item SE (18): Effect size

This item presented a situation of a literature reviewer wishing to compare the treatment effects in two studies in which different measures of the dependent variable had been used. This proved to be among the most difficult items in the set, with only one of eleven students (S7) answering appropriately. He suggested using a "standard z score, because there were different scales of measures, and there was a need to obtain a standard basis of comparison."

S10 recognised that this was an example of "meta-analysis," but did not know which statistic to employ. Several students proposed using either one or two t-tests: "We're comparing the difference between two treatments; we want to know if each group is significantly different from the control group" (S3); "a t-test for independent samples, because we're comparing the means of two different treatments, with between-subject means, not a single subject with several scores" (S6); "to compare the means and see which is higher, how much variation there is between the control group and the comparison group for both Treatment A and B" (S14); "because we're comparing means" (S15). S12, entirely off the track, indicated that chi-squared should be used to find "the difference in the measures, with four different groups with two treatments, two-by-two."
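The idea behind S7's answer, placing both studies on a common standardised scale, can be sketched as follows. The sketch uses one common definition of effect size (the difference between treatment and control means divided by the control group's standard deviation, often called Glass's delta); dividing by a pooled standard deviation is another option. All numbers are hypothetical.

```python
def standardised_mean_difference(treatment_mean, control_mean, control_sd):
    """Treatment effect expressed in control-group standard deviation units."""
    return (treatment_mean - control_mean) / control_sd

# Hypothetical summary statistics from two studies using different measures.
effect_study_a = standardised_mean_difference(55, 50, 10)     # 5 points on a scale with SD 10
effect_study_b = standardised_mean_difference(108, 100, 20)   # 8 points on a scale with SD 20
```

Although the raw difference is larger in the second hypothetical study, its standardised effect (0.4) is smaller than that of the first (0.5). A t-test, by contrast, would only indicate whether each difference is statistically significant.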

Correlation, Regression, and Factor Analysis

Item TC(7): Spearman r

Item TC presents data on two variables, one already ranked (order of completion of a test), and the other a set of grades easily convertible into ordinal data. Although the appropriate statistical test for determining the strength of the association might seem to be the Spearman correlation, there are some subtleties in this item. If the Spearman formula were to be employed (considered as an acceptable answer to this item), the ordinal data would have to be converted into ranks (all ranked data are ordinal, but not all ordinal data are ranks). A more sophisticated answer would include the recognition that it doesn't matter much which correlation coefficient is used, since entering ranks into the Pearson formula normally results in precisely the same value of the correlation coefficient as entering them into the Spearman formula. An even more sophisticated student might add that with a sample size of 16 and only five grades available, numerous tied ranks would be inevitable, which would result in errors if the Spearman coefficient were used. The ideal procedure here would be to use the Pearson r on the ordinal data.
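The two claims in this paragraph (that the Pearson and Spearman formulas agree on untied ranks, and that they diverge when ranks are tied) can be checked with a short sketch. We assume here that "the Spearman formula" means the usual shortcut based on squared rank differences; the data are hypothetical and much smaller than the item's sample of 16.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_shortcut(rank_x, rank_y):
    """1 - 6*sum(d^2) / (n*(n^2 - 1)); exact only when there are no ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data without ties: both formulas give the same value.
rx, ry = average_ranks([1, 2, 3, 4, 5]), average_ranks([2, 1, 4, 3, 5])
no_ties = (pearson_r(rx, ry), spearman_shortcut(rx, ry))

# Hypothetical data with tied grades: the two values differ.
completion_order = average_ranks([1, 2, 3, 4, 5, 6])
grade_ranks = average_ranks([1, 1, 2, 2, 3, 3])   # grade codes, heavily tied
with_ties = (pearson_r(completion_order, grade_ranks),
             spearman_shortcut(completion_order, grade_ranks))
```

In the tied example the shortcut formula gives a slightly different value from the Pearson formula applied to the same ranks; with more ties, as in the item, the discrepancy would be expected to grow.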

No one offered a sophisticated answer. S13 simply answered "correlation, maybe Spearman's" because "the order of completion is ordinal." S12 chose Spearman's "ranking test ... because the order of completion is ordinal" without ever mentioning the term `correlation.' Two others (S8, S23) spoke of "correlation for ranked data" without naming Spearman; we classified these as correct (at the minimal level of acceptability).

The remaining eight students showed either partial understanding, or a variety of misconceptions and faulty reasoning. Students S1, S3, S5, and S9 recognised that it was a correlation problem, without knowing what kind; S5 said she "only knows one kind," while S9 proposed plotting a graph and checking for linearity (always a sensible procedure). S17 proposed using Pearson r, but for a confused reason: she had considered Spearman, but then rejected it because it can be used "only when Pearson's assumptions can't be satisfied." If there were an "insufficient sample size," she asserted, Spearman's should be used. Even if her reasoning was faulty, this student was the only one to even mention "assumptions" in the context of this problem. S16 was way off-beam, proposing ANOVA "because it's not chi-square and not a t-test."

Item DM(2): Pearson r

This item, concerned with the relationship between rates of drinking and the tempo of country-western music in a bar, presented ratio-scale data for which Pearson r was clearly appropriate. Of the eleven students who attempted this item, four (S5, S8, S20, S23) specifically mentioned the Pearson statistic. S20 showed deeper understanding than most, by commenting that the amount of variance could be found by obtaining r-squared. S23 had considered linear regression, but then realised that this procedure would just show where the mean points lie, without giving the strength of the association. In contrast, S5 weakened her answer by asserting (incorrectly) that "the test has already been done"; she was probably referring to the information in the item that a scatterplot had shown a linear relationship, and was possibly confusing the establishment of a linear relationship with the obtaining of a value of r which described the strength of that relationship.

S2 said that she did not know what test to use, suggested that the reference to `association' in the item implied correlation, but then argued (along the same lines as S5) that "the relationship has already been found to be linear, so that correlation is not what is required." S1 (as he did for the Spearman problem) simply stated that correlation was the appropriate test, without specifying the type of correlation coefficient; his explanation ("in order to plot two things against each other") signals a failure to recognise that correlational procedures yield a numerical value, and not just a graphical representation. S11 spoke about scatter plots and linearity (given in the problem) without mentioning correlation; S15 mentioned linear regression, probably triggered by the reference to `linear relation' in the item. All of these students have failed to grasp that correlational procedures yield a numerical coefficient which describes the strength of the relationship.

S16 suggested using the t-test, on the spurious grounds that one variable [rate of drinking] was given as a set of mean values; S21, similarly, referred to the independent t-test. Both of these students displayed a total inability to distinguish research questions about relationships from those involving differences. S21 displayed further confusion by claiming that if more than two variables had been involved, ANOVA could have been used. S4 misinterpreted the sample data, claiming (inaccurately) that "there's no point in doing a test ... there is no real relationship between the two variables ... the mean number of sips doesn't seem to fluctuate much."
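The distinction that S23 drew, and that several other students missed, is that a regression line shows where the points tend to lie while r describes how strongly they cluster around it. A minimal sketch, using hypothetical data rather than the item's, shows two data sets with the same fitted slope but different values of r and r-squared.

```python
from math import sqrt

def slope_and_r(x, y):
    """Least-squares slope of y on x, and the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Hypothetical data: the same x values paired with a tight and a loose set of y values.
tempo = [1, 2, 3, 4]
sips_tight = [1, 2, 3, 4]
sips_loose = [2, 1, 2, 5]

slope_tight, r_tight = slope_and_r(tempo, sips_tight)
slope_loose, r_loose = slope_and_r(tempo, sips_loose)
variance_explained_loose = r_loose ** 2   # proportion of variance, as S20 noted
```

Both lines have a slope of 1, but r is 1.0 for the first set and about 0.75 for the second, so only the coefficient (not the line or the scatterplot) quantifies the strength of the relationship.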

Item GI(19): Multiple correlation

This item called for a statistic which would describe the strength of the relationship between one dependent variable and two predictor variables. It was the most difficult item in the set, with no one offering a correct answer. Two students (S7, S8) were in the right conceptual area by proposing "multiple regression" or "regression analysis"; S7 stated that this would allow one to find "the strength of the relation and the combined effect," while S8 argued that we were "looking at the strength of the relationship, that's what correlations and regressions allow you to do, you can figure the combined effect." It is true that multiple regression analyses always compute the multiple correlation, but neither answer made any specific mention of this concept.

Other answers demonstrated various misconceptions. S14 proposed "correlation," but was unable to elaborate: "I'm not sure how the variables would be combined." S11 asserted that one should "start with correlation, and then use hierarchical regression, but we're not looking at prediction, and correlation can't be used with combined effects." S15 correctly identified the nature of the research question ("we are looking at the strength of the relationship, not measuring differences"), but proposed the use of chi-squared. S16 answered ANOVA, "because there are two variables." S4 also proposed using ANOVA, but recognised a problem in that "the variables are on different scales" while S22 proposed combining two of the variables (in some unspecified way) and then using a t-test for "non-independent samples" to "look for a relationship between two variables in one sample." The answers of these last two students reveal a fundamental inability to identify the nature of the research question.
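For the case of exactly two predictors, the multiple correlation can be computed directly from the three pairwise Pearson correlations, which may help to show how the variables are "combined." The sketch below uses the standard two-predictor formula; the correlation values are hypothetical.

```python
from math import sqrt

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of y with two predictors.

    r_y1, r_y2: correlations of the dependent variable with each predictor.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

# Hypothetical correlations.
uncorrelated_predictors = multiple_r(0.6, 0.5, 0.0)   # predictors carry separate information
overlapping_predictors = multiple_r(0.6, 0.5, 0.5)    # predictors partly redundant
```

When the predictors are uncorrelated, the squared multiple correlation is simply the sum of the two squared correlations; when they overlap, the combined value is smaller than that sum.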

Item HT(5): Simple regression

This item required the prediction of a new value from a linear relationship between two continuous variables. This was the easiest item in the set. Seven of the 13 students who attempted this item were able to correctly name the appropriate procedure as simple (or linear) regression, and to justify their answer by referring to the need to predict a missing score from other information (S1, S8, S9, S15, S20, S21, S23). S12 and S17 displayed some qualitative understanding by referring to "a scatter plot and then read off the graph" or a "line of best fit based on a certain score" without naming the procedure or displaying any understanding that a statistical formula was involved. S5 proposed using "hierarchical regression" and displayed misunderstanding of the research design by adding that ANOVA would also work "because it involves interactions." S16 offered an entirely irrelevant answer by proposing that a t-test should be used "to find an association between the factors" (which displays equally deep confusion about the purpose of t-tests!). S4 didn't know what procedure to use, claiming that there was "missing data ... a gap in the information."
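The "statistical formula" that S12 and S17 did not mention is the least-squares line. A minimal sketch, with hypothetical data, of fitting the line and predicting a new value:

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def predict(x_new, slope, intercept):
    return intercept + slope * x_new

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept = fit_line(x, y)
predicted_y = predict(6, slope, intercept)   # prediction for a new case with x = 6
```

Reading a value off a hand-drawn line of best fit approximates the same calculation; the formula simply makes the line, and hence the prediction, reproducible.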

Item AA(10): Multiple regression

This item, calling for a prediction of a value of a dependent variable given values of two predictor variables, was a straightforward multiple regression problem; it was answered correctly by four of the 12 students who attempted it (S8, S9, S10, S18). S5 mentioned both multiple regression and ANOVA, but displayed a misconception about what regression does when she stated that the requirement in the item for making a prediction about "another child" made the item ambiguous: "usually you predict for the same subjects," which entirely misses the point of what regression analyses are for.

Four other students (S1, S2, S6, S17) referred to "regression" without mentioning "multiple," but demonstrated understanding of the procedure. S6, for example, said, "You need to predict based on some scores already given," although, like S5, this student expressed puzzlement over the fact that "the question revolves around a completely different child; usually you predict from the same subject's scores."

S14 proposed Pearson r, noting, with some degree of uncertainty, that perhaps "the two variables should be combined first and then compared with the third." S13 proposed "percentiles." S22 initially suggested "t-test for independent samples," displaying a complete misreading of the research situation by asserting that "each sample is different, so independent"; apparently, this student has confused `variables' and `samples'; however, this initial suggestion was later withdrawn.

Item YM(20): Factor analysis

This item, which required students to propose a technique for investigating underlying patterns of relationships among five psychological variables, was quite difficult. This is perhaps not surprising, given that this procedure is probably not taught in most introductory statistics courses. Only one of 12 students (S5) specifically referred to factor analysis as an appropriate procedure. (This same student displayed a basic misconception in the AA item on multiple regression and performed fairly poorly on most other items.) Another student (S18), among the best performers with four out of eight items completely correct, displayed good partial understanding of what was involved in this item by proposing to obtain a correlation matrix: "plot the five variables against each other and look for any big relationships" (which is basically what factor analysis does). S23 also proposed obtaining a correlation matrix. S11 similarly suggested finding "correlations between all the variables to find significance in patterns," but then went on to refer to "hierarchical regression" (this seemed to be a favourite answer to several items). Other students failed to recognise that in this item, there were no dependent variables, and proposed irrelevant solutions such as multiple correlation (S17), ANOVA (S4), and MANOVA (S1).

Item II(13): Fisher z transformation

This item, calling for a comparison between two non-zero correlation coefficients, was extremely difficult, and was answered correctly by only one student (S18). Even this student had an imperfect understanding of why the Fisher test is required; in his rather confused explanation, he stated that we are "assuming normal distribution of scores in order to work with distributions of correlations, instead of knowing the distribution of r to see whether 0.58 falls within the transformed distribution." (This misses the point that the sampling distribution of r when samples are drawn from a population displaying a non-zero correlation between two variables is skewed; the Fisher z normalises this distribution and thus permits probability inferences to be made.)

All other answers were incorrect. S22 suggested using a "correlation matrix ... to show the significant differences in the variables, by comparing the variables together to determine differences and their significance." S10 proposed that one should "compare the means and then examine the correlations, and then look at the statistical significance of the correlations and the means of the groups." S7, S9 and S14 suggested that a t-test be used. S3 and S17 nominated chi-squared (S3 actually considered using Fisher z, but then rejected this idea). In several of these explanations, we observe an interesting phenomenon, namely that of "key words" triggering off associations with irrelevant statistical tests. S9, explaining her choice of t-test, stated that we were "looking for a significant difference between two values." S3 justified her choice of chi-squared with "You're looking at observed and expected frequencies; we need to compare the observed correlation with an expected correlation value." S17 offered a less elaborate answer, but also spoke about observed scores and expected values. S19 did not offer an answer, but mentioned chi-squared as a possibility, noting that what was required was something that would show "the difference between an observed and an expected value." The difficulty with this item is not surprising: the concept is probably not emphasised in most introductory courses.
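The Fisher procedure can be sketched as follows. The item's exact design is not reproduced here: the sketch assumes two correlations obtained from independent samples, and the values of r and n are hypothetical. Each r is transformed with the inverse hyperbolic tangent (the Fisher z), and the difference is referred to the normal distribution.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z(r):
    """Fisher z transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return atanh(r)

def compare_independent_correlations(r1, n1, r2, n2):
    """Normal-theory test of the difference between two independent correlations."""
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_stat = (fisher_z(r1) - fisher_z(r2)) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_two_sided

# Hypothetical values: r = 0.60 and r = 0.30, each from a sample of 53.
z_stat, p_value = compare_independent_correlations(0.60, 53, 0.30, 53)
```

The same transformation serves when an observed correlation is compared with a hypothesised non-zero population value; in that case the standard error is 1 / sqrt(n - 3).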

Differences Between Groups: t-tests

Three items tested the ability to recognise situations requiring the use of the t statistic, as a one-sample test, or for comparing independent samples, or for comparing correlated samples.

Item RA(11): t-test, single sample

Five of the 12 students who attempted this item answered it correctly, four of them (S8, S9, S13, S18) recognising that the situation called for a comparison between a sample mean and a population mean, and three (S13, S18, S21) commenting that the [sample] standard deviation was known. Only one of these students (S13) was able to label the test specifically as a "t-test for one sample." S18 displayed deeper understanding by noting that he had considered the z-test, but rejected this because the population standard deviation was not given, adding however that since the reading test was a standardised one, perhaps such information was available. Another student (S20) considered using the t-test to compare the sample and population means, but then veered away from this answer and argued that "the number of normal standard deviations the score is from the mean" was all that was necessary (essentially, the concept of effect size, without the label).

Others displayed a wide range of misunderstandings. Two students considered the situation as a comparison between two independent groups (S3) or two populations (S17). The latter comment indicates a misunderstanding of the precise technical meaning of `population.' Two students (S6, S14) thought that the provision of the standard deviation information indicated that the data had to be transformed to z-scores. One student (S19) confused t-tests with chi-squared, and displayed further misunderstanding by asserting that the large standard deviation meant that the distribution was probably skewed and that the mean was therefore inappropriate for comparing scores.
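The calculation that the correct answers point to is short. In this sketch the summary figures are hypothetical; the function returns the t statistic and its degrees of freedom, which would then be compared with a tabled critical value.

```python
from math import sqrt

def one_sample_t(sample_mean, sample_sd, n, population_mean):
    """t statistic for comparing a sample mean with a known population mean."""
    standard_error = sample_sd / sqrt(n)
    t_stat = (sample_mean - population_mean) / standard_error
    return t_stat, n - 1   # statistic and degrees of freedom

# Hypothetical figures: 36 students, mean 104, SD 12, against a population mean of 100.
t_stat, df = one_sample_t(sample_mean=104, sample_sd=12, n=36, population_mean=100)
```

If the population standard deviation were available, as S18 suggested it might be for a standardised test, the same formula with that value in place of the sample standard deviation would give a z-test.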

Item AE(15): t-test, independent samples

This item, requiring a comparison to be made of the reaction times of two independent groups, was one of the (relatively) easy items, and was answered correctly by seven of the 12 students who attempted it. S7 was able to articulate clearly the grounds for his choice: the "clear distinction between the dependent and independent variables ... the interval and categorical" nature of these two variables, noting also that the data were "not paired" and therefore the t-test for paired samples was not appropriate. Others (S3, S9, S13, S21, S23) were correct, but offered less elaborate explanations (e.g., "comparing two groups"). S10 did not mention t-tests, but correctly proposed using ANOVA, pointing out that there was "only one level of condition," and that the test would assess "the contribution of one variable to the outcome."

Others displayed various levels of confusion. S6 also suggested using ANOVA, but justified this by asserting that the data were "nominal"; t-tests also came to mind for this student, because "they are appropriate for small samples." S1 answered "t-test," without any elaboration of the type required. S15 suggested using the t-test for matched samples, expressing uncertainty as to whether the samples were matched, but arguing that they ought to be: "comparing the effects would require a matched procedure." S19 proposed using chi-squared, displaying total misunderstanding of the nature of the data.

Item CF(6): t-test for correlated samples

Of the 12 students who attempted this item, five gave fully correct answers. Four of these students (S8, S15, S17, S23) made specific reference to "paired samples," or "matched samples" or "matched pairs" and identified the t-test as appropriate for testing the significance of difference between groups. S17, however, labelled the test as a t-test for "dependent samples." S15 articulated the research question clearly: the "need to find a simple difference between the cat foods, and not the litters," and rejected using ANOVA for this reason. S8 correctly explained that the research design allowed for variability to be controlled by comparing two groups of paired subjects, but also displayed some imprecision about what was being compared by stating that the test allowed one to look for "significant differences between paired scores and zero." S23 commented correctly that the layout of the data in the table made it look like an ANOVA situation. The fifth student (S9) also considered using ANOVA, but rejected this on the (spurious) grounds that "no interaction was involved." S1 and S2 were partly correct, mentioning t-tests, without identifying the specific type needed in this situation, the latter student admitting to difficulty in recalling the names of various tests.

S12 explicitly nominated the wrong type of t-test ("non-matched pairs") and backed this up with a confused justification: the research is "comparing two things across two sets of subjects ... ten pairs of kittens eating different foods, so not being compared against each other, so not paired." S13 had difficulty in deciding how the kittens had been paired; she considered the research design to involve a comparison of ten independent samples of kittens, and proposed using ANOVA. (This would answer the question of whether the pairs of kittens differed from one another in weight-gain, but not whether the type of cat food influenced their growth.) S16 likewise chose ANOVA, justifying this with the garbled argument that this would "analyse the covariance between cats under different conditions." S5 saw similarities between this item and the previous one (HT), which was a linear regression problem. (The only similarity is that in both the data are presented in a flat horizontal table!)
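Computationally, the t-test for correlated samples works on the within-pair differences, which is why pairing controls for variability between litters. A minimal sketch with hypothetical weight gains for five pairs (the item itself involved ten pairs):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """t statistic and degrees of freedom for matched pairs."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1

# Hypothetical weight gains; position i in each list is one matched pair.
gain_food_a = [5, 7, 6, 9, 8]
gain_food_b = [4, 5, 5, 6, 5]
t_stat, df = paired_t(gain_food_a, gain_food_b)
```

Treating the same ten columns as independent samples, as S13 did, discards the pairing and answers a different question.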

ANOVA

Four items involved the use of analysis of variance, in a one-way design, in a two-way design, with repeated measures, and with covariance adjustment.

Item LR(9): One-way ANOVA

Six of the 12 students attempting this item (comparing reaction times of three different groups under differing ambient light conditions) correctly identified this as a one-way ANOVA design. S8 identified the salient features of the situation: "parametric ratio data, one independent variable, three levels." S1 justified his choice on the grounds that one was making "several comparisons, ANOVA is quicker than t-tests." S14, S17 and S23 made similar comments about comparing three groups, with S17 noting that the situation involved studying "the effect of one variable on another, no rows down the side, so one-way." This student also referred to assumptions underlying the use of ANOVA, and displayed a misconception by arguing that the small sample size might make a nonparametric test more appropriate. S9 answered correctly without being sure of the reasons for her choice. S10 correctly identified the item as a case of ANOVA, but misunderstood the conventions for labelling such designs by calling it a "2 × 3 ANOVA."

The remaining students did not know how to proceed, although S16 proposed using the t-test. However, this procedure was not understood either: she claimed that this would permit "finding out how far from the average the scores are, based on the bell-shaped curve." This student had also displayed little understanding when attempting the t-test for correlated samples item; in the present item, she admitted that her knowledge of t-tests was limited.
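For reference, the one-way ANOVA F ratio compares the variability between group means with the variability within groups. A minimal sketch with three small hypothetical groups (not the item's reaction-time data):

```python
def one_way_anova_f(groups):
    """F ratio with its between-groups and within-groups degrees of freedom."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((s - m) ** 2 for g, m in zip(groups, group_means) for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores for one independent variable with three levels.
f_ratio, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
```

There is one independent variable with three levels, so the design is one-way; the number of levels determines the between-groups degrees of freedom, not the name of the design.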

Item MA(1): Two-way ANOVA

This item, involving the effects of birth order and gender on mathematics achievement, was correctly answered by five out of 12 students, all of whom said "two-way" (or "2 × 2") ANOVA, and justified their answer. S23 said there were "four means to compare, with two divisions, gender and birth order"; S4 mentioned that the situation required "looking at two variables at the same time"; S8 said that it was possible to "look for main effects," and that 2 × 2 ANOVA was "simpler than doing two t-tests"; S15 spoke a little less accurately of "comparing two different groups."

S1 identified the situation as involving ANOVA without specifically naming the type of design, but mentioned that there were "two scales at once," which he identified as gender and birth order. Three students (S12, S16, S19) seemed to be influenced by the 2 × 2 layout of the table, but failed to consider the nature of the data, and proposed using chi-squared. S12 justified her answer with an irrelevancy: "in order to make multiple comparisons." S16 justified her answer by pointing to "the way the data is presented ... it looks like that, there are two variables with one consequence." The confusion displayed by these three students indicates the importance of learning to identify the nature of the data before deciding upon a statistical procedure; these students were simply unaware of the irrelevance of chi-squared to problems involving scores as the dependent variable. S11 proposed using regression, because the situation required "prediction, how much one factor affects another."

Item ME(17): Repeated measures ANOVA

This item presented data on the performance on five successive trials of three independent groups exposed to differing treatment conditions (differing kinds of information). This was one of the most difficult items, with only one student (S15) identifying it as a repeated measures ANOVA design because we are "comparing three groups over five occasions."

S7 was partly correct, recognising that ANOVA was needed, but uncertain of how to deal with the five trials; he proposed finding the mean performance over the five trials. S8 suggested using one-way ANOVA on the data of the last trial only, and mentioned "repeated measures" but was uncertain about whether this was correct. S4 also proposed using ANOVA, but stated, incorrectly, that there were "unequal replications," possibly a result of misunderstanding the data presentation. The complete set of data was not presented in the table, and he interpreted this to mean that there were "missing data"; he commented, however, that there was a way of dealing with this problem. S3 could not decide, but mentioned ANOVA, in particular "3-way ANOVA," and then rejected this for the irrelevant reason that "there's only one dependent variable"; she also mentioned "one-way ANOVA because there is only one dependent variable." Both suggestions indicate misunderstanding of the meaning of "one-way" and "3-way." S1 answered, "t-test, in order to compare"; S14 also selected the t-test, without elaboration. S10 proposed using "correlation to find the relationship between the number of errors and the [amount of ?] information," while S11 suggested regression "because we are looking for a relation."

A few students in addition to S4 found the fact that only a few scores were listed in the table to be a source of confusion: "the information is presented obscurely" (S3); "don't know what the data mean" (S1); "there appears to be missing data, it's misleading" (S15).

Item MU(14): ANCOVA

This item presented data on the essay-writing ability of students in various university faculties, together with measures of their prior ability; the research design required was a one-way ANOVA with a covariance adjustment. This procedure is probably not included in the curriculum of most introductory statistics courses. Only one student (S3) gave a correct answer, and explained that the covariate allowed one to "take account of prior differences and then compare the two sets of scores and two faculties." (This last comment indicates a partial misunderstanding of the situation.)

S22 proposed using MANOVA, which could be employed even if it is unnecessary (there being only one dependent variable); however, her rather confused explanation, with references to "factorial analysis ... before/after testing situation" displayed little understanding of the research problem. Other students proposed using correlation or MANOVA (S1), hierarchical regression (S2), z-scores followed "with some kind of ANOVA" (S7, S15), ANOVA (S10), comparing means and if they were significant, correlations (S11), creating "a formula to control for Y variable (see how much X varies when Y varies) and then use a t-test for independent samples to make comparisons" (S14) or "t-test for independent samples" (S21).

Item DV(16): Post-hoc comparison of means

In this item, data from a 2 × 3 factorial ANOVA design were presented, and the information given that the ANOVA had yielded a significant F ratio. The item required the nomination of any post-hoc test for comparing the means of the six cells. Only two of 12 students were able to answer this; S7 named Tukey's test (noting that "we have already done an ANOVA and have the F ratio"), as did S15, who added that this was "a post-hoc test." Two students proposed the use of the t-test (S23: "independent samples"; S14: "don't know which kind"); this answer demonstrates a lack of awareness of the problem of multiple comparisons which tests such as Tukey's are designed to overcome. Four students (S3, S4, S10, S22) answered "ANOVA," which simply repeats information already given in the item. (S10 seemed to be unaware that `ANOVA' and `analysis of variance' were synonymous, and admitted that she was confused about the meaning of `F-ratio.')
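The multiple-comparisons problem can be made concrete with a rough calculation. The sketch below counts the pairwise comparisons among six cell means and estimates the chance of at least one false positive if each were tested at the 0.05 level. It assumes, for simplicity, that the tests are independent; pairwise tests on the same cells are not, so the figure is only indicative.

```python
from math import comb

def familywise_error_rate(n_means, alpha=0.05):
    """Number of pairwise comparisons and the approximate familywise error rate.

    ASSUMPTION: the comparisons are treated as independent tests.
    """
    n_pairs = comb(n_means, 2)
    return n_pairs, 1 - (1 - alpha) ** n_pairs

n_pairs, fwer = familywise_error_rate(6)   # six cells in a 2 x 3 design
```

Under this simplifying assumption, fifteen unprotected t-tests would give better than even odds of at least one spurious "significant" difference, which is the problem that procedures such as Tukey's are designed to control.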

Comparisons of Frequencies Using Chi-squared

Item SH(12): Chi-squared test of association

This item, which presented frequency data on the association between smoking and dying, was correctly answered by six of the eleven students who attempted it. These students (S3, S6, S7, S18, S20, S21) were all able to articulate reasons for their choice by mentioning terms such as "frequency test," "dichotomised categories," "association between categorical data," "2 × 2 table ... sufficient items in the cells" or "nominal, dichotomous data."

The other students displayed a variety of incorrect responses. S3 answered "correlation," explaining that this was how one measured "association." There are of course correlational techniques for determining the strength of association between categorical data, but she made no mention of the nature of the data to indicate that this was what she had in mind. S8 and S9 also referred to correlation in order to measure association; S8 noted that the data were categorical, briefly considered chi-squared, and then rejected this idea, arguing that the data did "not suggest that, you need proportions, and anyway, this would not tell you about association, only about significance, whether it was a chance event." S13 had some sense of the layout of the data, mentioning "cross-tabulation, correlation, 2 × 2 contingency table," but was unable to propose a specific statistical test.

The remaining two students displayed no understanding of the nature of the data or the research question being asked. S1 proposed using one-way ANOVA, admitting that "the layout of the table" was difficult to understand: "the numbers didn't make sense." S14 proposed using a t-test (he didn't know which type) "in order to find a significant difference to see if the association is significant."
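The chi-squared test of association for a 2 × 2 table compares each observed frequency with the frequency expected if the two classifications were independent. A minimal sketch with hypothetical counts (no continuity correction is applied):

```python
from math import sqrt
from statistics import NormalDist

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2 x 2 frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, chi-squared is the square of a standard normal deviate.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value

# Hypothetical frequencies: rows are two groups, columns are two outcomes.
chi2, p_value = chi_squared_2x2([[30, 20], [20, 30]])
```

The inputs are counts in categories, not scores, which is the feature of the data that distinguishes this situation from those calling for t-tests or ANOVA.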

Item LB(3): Goodness-of-fit chi-squared

This item called for the use of chi-squared to evaluate whether a frequency distribution (light-bulb life) fitted a normal distribution. This was a difficult item, with only one of 12 students (S18) giving a detailed, correct answer. He recognised the small frequencies in some of the class intervals, and proposed forming the intervals "into eight categories" and then using chi-squared "to compare the expected number in each category with the actual number."

Several students proposed using visual, graph-plotting techniques, by hand or by computer, to compare the distributions. S19 pointed to "the need to compare with a bell-shaped curve to see if it conforms sufficiently" and said that this could be plotted by hand, or through the use of SPSS. Others (S1, S4, S5, S8, S15, S23) made similar references to histograms and visual comparisons. S9 didn't know how to proceed, but mentioned "frequency distribution ... compare with a normal curve (bell-shaped curve) to check that it's not skewed, need to find means first." Two students (S16, S17) suggested using t-tests because of the "normal distribution." S17 had initial difficulty in understanding the meaning of the term "frequency" in the context of this item; she made some reference to comparing "expected values to actual values," but then indicated a lack of understanding of the meaning of these terms by talking vaguely about "creating expected values in the future."
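The comparison of "the expected number in each category with the actual number" described by S18 can be sketched as below. Everything numerical here is hypothetical (the cut points, the observed frequencies, and the mean and standard deviation of 1000 and 150 hours); none of it comes from the item itself. The sketch assumes that the mean and standard deviation were estimated from the sample, so two further degrees of freedom are lost.

```python
from statistics import NormalDist

def normal_goodness_of_fit(cut_points, observed, mean, sd, n_estimated=2):
    """Chi-squared goodness-of-fit of grouped frequencies to a normal curve.
    cut_points are the interior class boundaries, so there is one more
    category than cut points (the two end categories are open-ended)."""
    dist = NormalDist(mean, sd)
    n = sum(observed)
    cdf = [0.0] + [dist.cdf(x) for x in cut_points] + [1.0]
    # Expected frequency = n * probability of falling in each category
    expected = [n * (cdf[i + 1] - cdf[i]) for i in range(len(observed))]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - n_estimated
    return chi2, df, expected

# Eight hypothetical categories of bulb life (hours)
cut_points = [700, 800, 900, 1000, 1100, 1200, 1300]
observed = [7, 15, 40, 58, 61, 37, 17, 5]
chi2, df, expected = normal_goodness_of_fit(cut_points, observed, 1000, 150)

# 11.07 is the 5% critical value of chi-squared with 5 degrees of freedom
print(df, chi2 < 11.07)
```

With eight categories and two estimated parameters there are five degrees of freedom; a statistic below the critical value gives no grounds for rejecting normality for these illustrative figures. Merging sparse class intervals, as S18 proposed, keeps the expected frequencies from becoming too small for the approximation to hold.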

Nonparametric Statistics

Item GB(8): Sign test

This item, containing data on the preponderance of male or female births in successive years, could have been answered correctly by mentioning the sign test, or the binomial distribution. Chi-squared would also have been acceptable. This was the second most difficult item in the set, with only one of 12 students offering a correct answer. S21 recognised the "particular data type, male vs. female, rather than numbers," proposed labelling these "with plus or minus to indicate preponderance" but was "not sure in a sign test if you can assign the plusses and minuses yourself, or if they should be inherently assigned."

Everyone else had little or no idea of how to proceed. S6 recognised that the data were categorical, while S13 proposed an unspecified "probability" test. S9, prompted by the item's request for a test of statistical significance, proposed using a t-test, as did S10 and S16, while S1 used a similar argument to defend his choice of one-way ANOVA, noting [quite incorrectly] that there were "a lot of data to look at, so the t-test would be time-consuming." S5 exemplified the puzzlement of students unfamiliar with tests on categorical data: she could "see that you need to do a statistical test for significance, there are two groups ... but I don't know what. We need more information to answer the question, e.g., percentage or numbers of males and females; nothing is indicated by the data presented." S8 similarly stated that "the question is difficult, not sure what it's getting at" and expressed the wish for numerical data: "one could analyse the numbers of boys and girls in each year, and figure out the number of years when male births predominate, and see whether they are significantly different from what's expected by chance, 50%," but saw that this was not consistent with the nature of the data provided. S5 similarly asserted that "we don't have enough information to find out." The responses to this item point to a widespread lack of understanding of the nexus between data type and appropriate statistical test.
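The sign test that this item called for amounts to a binomial calculation on the count of years in which one sex predominated. The sketch below uses hypothetical figures (male births predominating in 15 of 20 years), not the data given in the item, and the function name is illustrative.

```python
from math import comb

def sign_test_p_value(n_plus, n_minus):
    """Exact two-sided sign test: probability, under a 50:50 chance of a
    plus or a minus, of a split at least as lopsided as the one observed."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    # Upper tail of the binomial(n, 0.5) distribution from k to n
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical: 15 'plus' years (more boys) and 5 'minus' years (more girls)
p_value = sign_test_p_value(15, 5)
print(round(p_value, 4))  # 0.0414
```

Only the direction of the difference in each year enters the calculation, which is why the test can be applied even when, as S5 and S8 observed, the actual numbers of births are not supplied.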

Item AB(4): Mann-Whitney U

This item, comparing the behaviour of an experimental and a control group where the data were ranks, called for the use of the nonparametric Mann-Whitney U test. Since such statistical tests are probably not emphasised much in introductory statistics courses, it is not surprising that this item also proved difficult. Of 12 students, only two answered this item correctly. S13 and S15 both noted that "ranked data" had been presented, with the latter student adding that "you need a nonparametric test." S8 noted correctly that the data were ranks and that a nonparametric test was required, but could not name it.

S4, S9 and S20 realised that the data were ranks but could not name a procedure. S9 had considered t-test or ANOVA but stated that she "couldn't recall what a t-test does precisely" and also rejected ANOVA for the wrong reason: "because there would have to be an interaction which doesn't come into this problem."

For three students, the ranked nature of the data triggered associations with the Spearman r, the nature of the data taking precedence in these students' minds over the research question. S21 said that "you need a measure of correlation"; S23 proposed a "correlation for ranked data, don't know the name, to find a relation between watching aggressive acts and performing aggressive behaviours" [a misreading of the research problem]. S12 confused two ideas: she observed that the data were ranked, proposed "Spearman's ranking test," and then argued that the situation required "looking for differences between groups, everything is assigned a plus or minus."
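The distinction between the Mann-Whitney U test (a comparison of two groups) and a rank correlation may be clearer from a small worked sketch. The ranks below are hypothetical and are not the data from the item; the function names are illustrative, and the exact p-value is obtained by enumerating every possible split of the pooled ranks, which is practical only for small groups.

```python
from itertools import combinations

def mann_whitney_u(a, b):
    """U for each group: the number of (a, b) pairs in which the value from
    that group is the larger, with ties counted as one half."""
    u_a = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

def exact_two_sided_p(a, b):
    """Proportion of all possible group assignments of the pooled values
    whose smaller U is no greater than the observed smaller U."""
    pooled = a + b
    observed = min(mann_whitney_u(a, b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        extreme += min(mann_whitney_u(group_a, group_b)) <= observed
    return extreme / total

# Hypothetical ranks (1 to 10) for an experimental and a control group
experimental = [1, 2, 3, 5, 8]
control = [4, 6, 7, 9, 10]
u_exp, u_con = mann_whitney_u(experimental, control)
p_value = exact_two_sided_p(experimental, control)
print(u_exp, u_con, round(p_value, 3))  # 4.0 21.0 0.095
```

The statistic depends only on how the ranks of one group are interleaved with those of the other; no pairing of observations is involved, which is what separates this situation from one calling for Spearman's coefficient.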
