A Brief Note on Selecting and Reporting the Right Statistical Test
Selecting the appropriate statistical technique for hypothesis testing hinges on both the study design and the nature of the data. This cheat sheet encapsulates the most prevalent pairings. If your intended combination is not covered, consulting an expert is advisable. While there are myriad techniques for each pairing, this guide focuses on those most often employed in Human-Computer Interaction (HCI) research. Their inclusion is based on their simplicity and/or their reliability, as they tend to provide the highest power for a specified significance level. While I am not a trained statistician, I frequently incorporate statistics into my research. I welcome any feedback or comments; please don't hesitate to reach out via email.
If the information on this page proves helpful, or if you wish to reinforce to reviewers that you have selected the appropriate statistical tests, kindly consider referencing this page using the citation provided below.
Ahmed Sabbir Arif. 2017. A Brief Note on Selecting and Reporting the Right Statistical Test. University of California, Merced, United States. https://www.theiilab.com/notes/HypothesisTesting.html
Terminology
- Independent Variables (\( \textit{IV} \)), often termed as "factors," represent the specific conditions that a researcher modifies to study their impact on the dependent variables, such as experimenting with various text entry techniques.
- Conditions refer to the distinct variations or levels within a single independent variable (\( \textit{IV} \)), like diverse methods of text input.
- Dependent Variables (\( \textit{DV} \)) are the outcomes or measurements a researcher is keen on observing, for instance, typing speed in words per minute or the error rate.
- The degrees of freedom (\( \textit{df} \)) denote the count of values in a statistical analysis that are free to vary.
Appropriate Statistical Tests
| Data Type | Normality Assumption | Sphericity Assumption | Statistical Test |
|---|---|---|---|
| Nominal Data | — | — | Non-parametric Test |
| Ordinal Data | — | — | Non-parametric Test |
| Interval Data | Yes | Yes | Parametric Test |
| Interval Data | Yes | No | Non-parametric Test |
| Interval Data | No | — | Non-parametric Test |
| Ratio Data | Yes | Yes | Parametric Test |
| Ratio Data | Yes | No | Non-parametric Test |
| Ratio Data | No | — | Non-parametric Test |
Normality Tests
Researchers evaluate the null hypothesis, denoted \( H_0 \), which posits that a specific data sample originates from a normally distributed population. Four widely used tests for gauging data normality are listed below (see Razali and Wah, 2011, for a power comparison); a usage sketch follows the list:
- Shapiro-Wilk test
- Anderson-Darling test
- D'Agostino's K-squared test
- Martinez-Iglewicz test
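As a minimal sketch of how such a check might be run, assuming Python with SciPy and a hypothetical sample of entry-speed measurements, the Shapiro-Wilk test is a one-liner:

```python
# Shapiro-Wilk normality check with SciPy; the `wpm` sample is hypothetical.
from scipy import stats

wpm = [22.1, 24.8, 23.5, 25.0, 21.7, 24.2, 23.9, 22.6, 25.4, 23.1]

w, p = stats.shapiro(wpm)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# If p > .05, we fail to reject H0: the sample is consistent with a
# normally distributed population, so a parametric test may be appropriate.
```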
How to Report: Examples
- The Shapiro-Wilk test indicated that the sample likely originates from a normally distributed population.
- Using the Martinez-Iglewicz test, it was determined that the residuals of the response variable follow a normal distribution.
Non-Parametric Tests
Non-parametric tests are typically applied to data from nominal categories, questionnaires, and rating scales.
| Variable | Study Design | IV with 2 Conditions | IV with 2+ Conditions |
|---|---|---|---|
| Nominal | Within-subjects (correlated observations) | McNemar's Test | Chi-Squared (\( \chi^2 \)) Test |
| Nominal | Between-subjects (independent observations) | Chi-Squared (\( \chi^2 \)) Test or Fisher's Exact Test | Chi-Squared (\( \chi^2 \)) Test |
| Ordinal and Non-Normal Quantitative | Within-subjects (correlated observations) | Wilcoxon Signed-Rank Test | Friedman Test |
| Ordinal and Non-Normal Quantitative | Between-subjects (independent observations) | Mann-Whitney \( U \) Test | Kruskal-Wallis Test |
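For reference, here is a minimal sketch of how the four quantitative tests above might be run, assuming Python with SciPy; the rating data are hypothetical:

```python
# Common non-parametric tests with SciPy; all ratings below are hypothetical
# 7-point scale responses from the same ten participants.
from scipy import stats

cond_a = [3, 5, 4, 6, 2, 5, 4, 3, 6, 5]
cond_b = [4, 6, 5, 7, 4, 6, 5, 5, 7, 6]
cond_c = [2, 4, 3, 5, 2, 4, 3, 2, 5, 4]

# Two within-subjects conditions: Wilcoxon Signed-Rank Test.
stat, p = stats.wilcoxon(cond_a, cond_b)

# 2+ within-subjects conditions: Friedman Test.
chi2, p = stats.friedmanchisquare(cond_a, cond_b, cond_c)

# Two between-subjects groups: Mann-Whitney U Test.
u, p = stats.mannwhitneyu(cond_a, cond_b)

# 2+ between-subjects groups: Kruskal-Wallis Test.
h, p = stats.kruskal(cond_a, cond_b, cond_c)
```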
How to Report: Examples
Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.
- A [name of the test] identified a significant difference between the [independent variables].
- There was a significant effect of [independent variable] on [dependent variable].
- A [name of the test] identified a significant effect of [independent variable] on [dependent variable].
- A [name of the test] on the data identified significance with respect to the [dependent variable].
| Statistical Procedure | Example |
|---|---|
| McNemar's Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \) |
| Chi-Squared Test | \( \chi^2 = 26.7, \textit{df} = 2, p \lt .05 \) |
| Fisher's Exact Test | \( p \lt .001 \) |
| Wilcoxon Signed-Rank Test | \( z = -3.06, p \lt .005 \) |
| Friedman Test | \( \chi^2 = 20.67, \textit{df} = 2, p \lt .0005 \) |
| Mann-Whitney U Test | \( U = 75.5, z = -0.523, p \lt .05 \) |
| Kruskal-Wallis Test | \( H = 4.61, \textit{df} = 1, p \lt .05 \) |
Traditionally, \( p \)-values are rounded to one of the following values: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are only noted when the null hypothesis \( H_0 \) is on the brink of acceptance, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point.
Parametric Tests
Parametric tests are usually performed on normally distributed quantitative data.
| Variable | Study Design | IV with 2 Conditions | IV with 2+ Conditions |
|---|---|---|---|
| Normal Quantitative | Within-subjects (correlated observations) | Paired Samples T-test | Repeated-measures ANOVA |
| Normal Quantitative | Between-subjects (independent observations) | Independent Samples T-test | Between-subjects ANOVA |
| Normal Quantitative | Mixed (within- and between-subjects) | Mixed-design ANOVA | Mixed-design ANOVA |
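A minimal sketch of these tests, assuming Python with SciPy and hypothetical entry-speed samples:

```python
# Common parametric tests with SciPy; `speed_a`/`speed_b`/`speed_c` are
# hypothetical words-per-minute samples.
from scipy import stats

speed_a = [24.1, 25.3, 23.8, 26.0, 24.7, 25.5, 23.9, 24.8]
speed_b = [26.2, 27.1, 25.9, 28.0, 26.5, 27.3, 25.8, 26.9]
speed_c = [22.4, 23.0, 21.8, 23.9, 22.6, 23.3, 21.9, 22.8]

# Two within-subjects conditions (same participants): paired-samples t-test.
t, p = stats.ttest_rel(speed_a, speed_b)

# Two between-subjects groups (different participants): independent-samples t-test.
t, p = stats.ttest_ind(speed_a, speed_b)

# 2+ between-subjects groups: one-way between-subjects ANOVA.
f, p = stats.f_oneway(speed_a, speed_b, speed_c)
```

For repeated-measures and mixed-design ANOVA, a dedicated package such as statsmodels (e.g., its AnovaRM class) is typically used, as SciPy does not cover those designs.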
Different Types of Analysis of Variance (ANOVA) & When to Use Them
It is generally not advised to use more than two independent variables in total, as doing so increases the number of conditions, making them challenging to manage. The following table shows the appropriate parametric test for each combination of within- and between-subjects independent variables; a worked sketch follows the table.
Rows indicate the number of within-subjects (correlated observations) independent variables; columns indicate the number of between-subjects (independent observations) independent variables.

| Within \ Between | 0 | 1 | 2 | 2+ |
|---|---|---|---|---|
| 0 | NA | One-way Between-subjects ANOVA | Two-way Between-subjects ANOVA | MANOVA |
| 1 | One-way Repeated-measures ANOVA | Two-way Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA |
| 2 | Two-way Repeated-measures ANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA |
| 2+ | MANOVA | Mixed-design ANOVA | Mixed-design ANOVA | Mixed-design ANOVA |
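As a minimal sketch, assuming Python with pandas and statsmodels, and hypothetical factors `technique` and `posture`, a two-way between-subjects ANOVA can be computed as follows:

```python
# Two-way between-subjects ANOVA with statsmodels; the data frame below
# is hypothetical (12 participants, 2 between-subjects factors).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "technique": ["tap"] * 6 + ["swipe"] * 6,
    "posture":   ["sit", "sit", "sit", "stand", "stand", "stand"] * 2,
    "wpm": [24.1, 25.0, 23.6, 21.3, 22.0, 20.8,
            27.2, 28.1, 26.5, 24.8, 25.3, 24.1],
})

# Fit a linear model with both main effects and their interaction,
# then produce the ANOVA table (Type II sums of squares).
model = ols("wpm ~ C(technique) * C(posture)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```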
How to Report: Examples
Model your statements after the examples below, with results provided in parentheses, as shown in the table beneath.
- There was a significant effect of [independent variable] on [dependent variable].
- A(n) [name of the test] identified a significant effect of [independent variable] on [dependent variable].
| Statistical Procedure | Example |
|---|---|
| Paired or Independent-samples T-test | \( t_{54} = 5.43, p \lt .001 \) or, equivalently, \( t = 5.43, \textit{df} = 54, p \lt .001 \) |
| All Variants of ANOVA | \( F_{1,11} = 38.65, p \lt .0001 \) or, equivalently, \( F(1,11) = 38.65, p \lt .0001 \) |
Traditionally, \( p \)-values are rounded to one of the following values: \( \{.05, .01, .005, .001, .0005, .0001\} \) (MacKenzie, 2015). Exact \( p \)-values are only noted when the null hypothesis \( H_0 \) is on the brink of acceptance, for example, \( p = .051 \). Digits should never be italicized. Since \( p \)-values always fall between \( 0 \) and \( 1 \), there is no need for a \( 0 \) before the decimal point. The subscript values represent the \( \textit{df} \).
Effect Size
It is advisable to include the effect size when presenting statistically significant results, such as \( F_{1,11} = 38.6, p < .0001, \eta^2 = 0.4 \). Effect size serves as a quantitative indicator of the magnitude of a particular phenomenon, aimed at answering a specific research question. Within the realm of statistical tests, it gauges the intensity of the association between independent and dependent variables, representing it on a numerical scale (Arif, 2021).
The following table outlines prevalent effect size measures tailored for various statistical significance tests. It also provides the respective equations for computing them and offers guidelines for interpreting the resulting values.
| Type | Statistical Procedure | Effect Size Measure | Interpretation |
|---|---|---|---|
| Non-parametric | McNemar's Test | Odds ratio: \( OR = \max\left( \frac{A}{B}, \frac{B}{A} \right) \), where \( A \) is the probability (count) of the event occurring and \( B \) is the probability (count) of the event not occurring; \( OR = 1 \) to \( \infty \) | \( OR = 1.68 \) small, \( OR = 3.47 \) medium, \( OR = 6.71 \) large (Chen et al., 2010) |
| Non-parametric | Chi-Squared Test | Phi coefficient: \( \phi = \sqrt{\frac{\chi^2}{n}} \), where \( n \) is the number of observations; \( \phi = 0 \) to \( 1 \) | \( \phi = 0.1 \) small, \( \phi = 0.3 \) medium, \( \phi = 0.5 \) large (Cohen, 1988) |
| Non-parametric | Fisher's Exact Test | Odds ratio: \( OR = \max\left( \frac{A}{B}, \frac{B}{A} \right) \), where \( A \) is the probability (count) of the event occurring and \( B \) is the probability (count) of the event not occurring; \( OR = 1 \) to \( \infty \) | \( OR = 1.68 \) small, \( OR = 3.47 \) medium, \( OR = 6.71 \) large (Chen et al., 2010) |
| Non-parametric | Wilcoxon Signed-Rank Test | Pearson's correlation coefficient: \( r = \frac{n (\sum{xy}) - (\sum{x})(\sum{y})}{\sqrt{[n \sum{x^2} - (\sum{x})^2][n \sum{y^2} - (\sum{y})^2]}} \), where \( x \) and \( y \) are the variables and \( n \) is the total number of pairs; alternatively, \( r = \frac{\sum{Z_x Z_y}}{n} \), where \( Z_x \) and \( Z_y \) are \( z \) scores; \( r = -1 \) to \( 1 \) | \( r = 0.1 \) small, \( r = 0.3 \) medium, \( r = 0.5 \) large (Cohen, 1988) |
| Non-parametric | Friedman Test | Kendall's coefficient of concordance: \( W = \frac{12S}{m^2 (n^3 - n)} \), where \( S \) is the sum of squared deviations, \( m \) is the number of judges (raters), and \( n \) is the total number of objects being ranked; \( W = 0 \) to \( 1 \) | \( W = 0.1 \) small, \( W = 0.3 \) medium, \( W = 0.5 \) large (Cohen, 1988) |
| Non-parametric | Mann-Whitney U Test | Rank-biserial correlation: \( r = 1 - \frac{2U}{n_1 n_2} \), where \( U \) is the smaller of the two \( U \) values, and \( n_1 \) and \( n_2 \) are the numbers of observations in groups 1 and 2; \( r = 0 \) to \( 1 \) | \( r = 0.56 \) small, \( r = 0.64 \) medium, \( r = 0.71 \) large (Ruscio, 2008) |
| Non-parametric | Kruskal-Wallis Test | Eta-squared: \( \eta^2 = \frac{H - k + 1}{n - k} \), where \( H \) is the \( H \) statistic obtained in the test, \( k \) is the number of groups, and \( n \) is the number of observations; \( \eta^2 = 0 \) to \( 1 \) | \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b) |
| Parametric | Paired or Independent-samples T-test | Cohen's \( d = \frac{\bar{x}_1 - \bar{x}_2}{s} \), where \( \bar{x}_1 - \bar{x}_2 \) is the difference between the two means and \( s \) is the pooled standard deviation; \( d = 0 \) to \( \infty \) | \( d = 0.2 \) small, \( d = 0.5 \) medium, \( d = 0.8 \) large (Cohen, 1992a) |
| Parametric | All Variants of ANOVA | Eta-squared: \( \eta^2 = \frac{SS_{Treatment}}{SS_{Total}} \), where \( SS_{Treatment} \) is the sum of squares for the effect of interest and \( SS_{Total} \) is the total sum of squares for all effects, interactions, and errors; \( \eta^2 = 0 \) to \( 1 \) | \( \eta^2 = 0.01 \) small, \( \eta^2 = 0.06 \) medium, \( \eta^2 = 0.14 \) large (Cohen, 1992b) |
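Since most of these measures reduce to simple arithmetic, they are easy to compute by hand. Here is a minimal sketch for Cohen's \( d \) and eta-squared, assuming Python with NumPy and two hypothetical samples:

```python
# Computing Cohen's d and eta-squared from two hypothetical samples.
import numpy as np

a = np.array([24.1, 25.3, 23.8, 26.0, 24.7, 25.5])
b = np.array([26.2, 27.1, 25.9, 28.0, 26.5, 27.3])

# Cohen's d: difference between the means over the pooled standard deviation.
n1, n2 = len(a), len(b)
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                    / (n1 + n2 - 2))
d = (b.mean() - a.mean()) / pooled_sd

# Eta-squared: SS_treatment / SS_total for a one-way design.
combined = np.concatenate([a, b])
grand_mean = combined.mean()
ss_treatment = n1 * (a.mean() - grand_mean) ** 2 + n2 * (b.mean() - grand_mean) ** 2
ss_total = ((combined - grand_mean) ** 2).sum()
eta_squared = ss_treatment / ss_total

print(f"d = {d:.2f}, eta^2 = {eta_squared:.2f}")
```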
Post-hoc Multiple Comparisons Tests
While global hypothesis tests indicate whether groups differ overall, post-hoc tests discern which specific groups (or conditions) differ. Post-hoc analyses are conducted after the experiment concludes.
It is customary to conduct multiple comparisons tests only if the null hypothesis \( H_0 \) concerning homogeneity is rejected. Notably, Hsu (1996, p. 177) posits that the outcomes of most multiple comparisons tests remain valid even if the global hypothesis test does not identify an overall statistically significant difference in group means. An exception is the Fisher LSD (least significant difference) test, which is infrequently used in contemporary research; it operates under the premise that \( H_0 \) of homogeneity has already been rejected. Nonetheless, finding statistical significance in a post-hoc analysis is improbable when the global test does not establish overall significance.
| Statistical Procedure | Post-hoc Multiple-Comparison Test |
|---|---|
| Chi-Squared Test | Bonferroni Procedure |
| Friedman Test | Games-Howell Test |
| Kruskal-Wallis Test | Dunn's Test |
| ANOVA | Tukey-Kramer Test, Newman-Keuls Method, or Bonferroni Procedure |
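As a minimal sketch, assuming Python with statsmodels and hypothetical data from three between-subjects groups, a Tukey post-hoc comparison (statsmodels' implementation also handles unequal group sizes, i.e., the Tukey-Kramer case) looks like this:

```python
# Post-hoc Tukey pairwise comparisons with statsmodels; data are hypothetical.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

wpm = np.array([24.1, 25.0, 23.6, 27.2, 28.1, 26.5, 21.3, 22.0, 20.8])
group = np.array(["tap"] * 3 + ["swipe"] * 3 + ["gaze"] * 3)

result = pairwise_tukeyhsd(endog=wpm, groups=group, alpha=0.05)
print(result)  # one row per pair: mean difference, adjusted p-value, reject H0?
```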
How to Report: Examples
Model your statements after the examples below.
- A post-hoc [name of the test] revealed that [condition/s] and [condition/s] differed significantly at \( p \lt .05 \).
- A post-hoc [name of the test] identified the following significantly different groups: [condition/s], [condition/s], ….
- A post-hoc [name of the test] suggested that [condition/s] performed significantly better than [condition/s] in terms of [dependent variable/s].
- A post-hoc [name of the test] suggested that [condition/s] was/were significantly different from [condition/s].
Related Notes
- Effect Size, Sample Size, & Statistical Power in HCI Research
- Practical Statistics for HCI by Jacob O. Wobbrock
- Statistical Methods for HCI Research by Koji Yatani
- Analysis of Variance Explained (and a tool to do it!) by I. Scott MacKenzie
References
- Ahmed Sabbir Arif. 2021. Statistical Grounding. Intelligent Computing for Interactive System Design: Statistics, Digital Signal Processing, and Machine Learning in Practice (1st ed.). Association for Computing Machinery, New York, NY, USA, 59–99. https://doi.org/10.1145/3447404.3447410
- Henian Chen, Patricia Cohen, and Sophie Chen. 2010. How Big is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies. Communications in Statistics - Simulation and Computation 39, 4, 860–864.
- Jacob Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd Ed.). Lawrence Erlbaum, Hillsdale, NJ, USA, 283.
- Jacob Cohen. 1992a. A Power Primer. Psychological Bulletin 112, 1, 155–159.
- Jacob Cohen. 1992b. Statistical Power Analysis. Current Directions in Psychological Science 1, 3, 98–101.
- Jason Hsu. 1996. Multiple Comparisons: Theory and Methods (Guilford School Practitioner). Chapman and Hall/CRC, London, UK.
- I. Scott MacKenzie. 2015. How to Report an F-Statistic. Retrieved February 23, 2017 from http://www.yorku.ca/mack/RN-HowToReportAnFStatistic.html
- Nornadiah Razali and Yap Bee Wah. 2011. Power Comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling Tests. Journal of Statistical Modeling and Analytics 2, 1, 21–33.
- John Ruscio. 2008. A Probability-based Measure of Effect Size: Robustness to Base Rates and Other Factors. Psychological Methods 13, 1, 19–30.