Effect Size, Sample Size, & Statistical Power in HCI Research
This page attempts to answer the statistics questions that my students and peers ask most often (listed below). Since I answered most of these questions in my PhD dissertation (in the context of HCI research), I thought it might be useful to include the most relevant parts on this page. Note that I am not a statistician, but I frequently use statistics in my research. Please feel free to email me your feedback and comments.
- How to calculate/determine sample size for a user study?
- What is effect size?
- What does a small effect size mean?
- What is the power of the study?
- What is statistical power?
- How do I calculate statistical power?
- How does effect size affect power?
- How do you increase the power of a statistical test?
The content of this page has been extended to a book chapter. If you find the information useful or want to convince the reviewers that you used the correct statistical tests, please cite the chapter, as follows.
Ahmed Sabbir Arif. 2021. Statistical Grounding. Intelligent Computing for Interactive System Design: Statistics, Digital Signal Processing, and Machine Learning in Practice (1st ed.). Association for Computing Machinery, New York, NY, USA, 59–99. https://doi.org/10.1145/3447404.3447410
Appendices
The appendices are sorted alphabetically.
A1. Effect Size (\(\eta^2\))
Eta-squared (\(\eta^2\)) is a measure of effect size (Cohen, 1988). It is intended for ANOVAs, whereas Cohen's \(d\) was designed for t-tests. \(\eta^2\) describes the proportion of the total variance in the dependent variable that is explained by an independent variable. Stated simply, it tells us how much an independent variable has affected the dependent variable in an empirical study. It ranges between 0 and 1. \(\eta^2\) is a biased estimator of the variance explained in the population: on average it overestimates that variance, and the bias decreases as the sample size (\(N\)) gets larger. Cohen (1988) offered conservative threshold criteria for \(\eta^2\), where \(\eta^2\) = 0.0099 constitutes a small, \(\eta^2\) = 0.0588 a medium, and \(\eta^2\) = 0.1379 a large effect (see the note on other effect size measures at the end of this page).
The following equation was used in this dissertation to calculate \(\eta^2\).
Equation (20)
$$\eta^2 = \frac{SS_{Treatment}}{SS_{Total}}$$
Here, \(SS_{Treatment}\) is the sum of squares for the effect of interest, and \(SS_{Total}\) is the total sum of squares for all effects, interactions, and errors in the ANOVA.
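As a minimal sketch of Equation (20), the snippet below computes \(\eta^2\) for a one-way between-subjects design, where \(SS_{Total}\) is simply the sum of \(SS_{Treatment}\) and the error sum of squares. The function name `eta_squared` and the data are hypothetical and only illustrate the ratio; a repeated-measures or factorial ANOVA partitions \(SS_{Total}\) into more terms.

```python
from statistics import fmean


def eta_squared(groups):
    """Return eta-squared = SS_treatment / SS_total for a one-way design.

    `groups` is a list of lists, one list of observations per condition.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    # Sum of squares for the effect of interest (between conditions).
    ss_treatment = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_treatment / ss_total


# Hypothetical entry speeds (words per minute) for three techniques.
speeds = [
    [20.0, 22.0, 24.0],
    [26.0, 28.0, 30.0],
    [23.0, 25.0, 27.0],
]
print(round(eta_squared(speeds), 3))  # SS_treatment = 54, SS_total = 78 -> 0.692
```

With these made-up numbers the conditions differ far more than the participants within them, so \(\eta^2\) is much larger than Cohen's threshold for a large effect.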
A2. Phrase Set
Almost all text entry user studies present participants with preselected short phrases of text, which are retrieved randomly from a set and are presented to participants one at a time to enter (see Section 2.1).
During the studies reported in this document, participants entered short English phrases from MacKenzie and Soukoreff's (2003) phrase set. The phrases in the set are moderate in length (28.61 characters on average), easy to remember, and representative of the English language. They do not contain numeric or special characters. MacKenzie and Soukoreff (2003) argued that it is best to exclude these characters from the interaction, as they do not help differentiate the techniques. A few phrases contained uppercase characters, which were converted to lowercase for the same reason, since with the investigated techniques these characters are input using the same keys. The corpus's high correlation with character frequencies in the English language has also encouraged researchers to use it in almost all recent text entry studies (see Section 2.4). The corpus is available online at http://www.yorku.ca/mack/PhraseSets.zip.
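The preparation described above (lowercasing the phrases and presenting them in random order, one at a time) could be sketched roughly as follows. The function names and the sample phrases are hypothetical stand-ins, not taken from the actual phrase set, and the snippet does not reflect the software used in the studies.

```python
import random


def prepare_phrases(raw_phrases, seed=None):
    """Lowercase, trim, and shuffle phrases for one-at-a-time presentation."""
    phrases = [p.strip().lower() for p in raw_phrases if p.strip()]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(phrases)
    return phrases


def mean_length(phrases):
    """Average phrase length in characters, spaces included."""
    return sum(len(p) for p in phrases) / len(phrases)


# Hypothetical phrases standing in for lines read from the corpus file.
sample = ["The quick brown fox", "Typing on small keys", "see you later"]

for phrase in prepare_phrases(sample, seed=1):
    print(phrase)
print(round(mean_length(sample), 2))
```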
A3. Sample Size (\(N\))
It is possible to calculate the power of statistical tests prior to a study (a priori) to determine the sample size (\(N\)), that is, the number of participants required. However, a priori power analysis is rarely done in human-computer interaction research, since it requires knowing the variance in a sample and the difference in the means on the dependent variable (effect size) before the data are collected (MacKenzie, 2013). Thus, the recommended procedure is to study the existing literature (MacKenzie, 2013). If a similar study reports statistically significant results with a particular number of participants, then using that many participants is a reasonable choice. The user studies reported in this document follow this recommendation. Section 2.4.3, particularly Table 1, presents results from existing text entry studies similar to the ones reported here, along with their sample sizes.
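When an expected effect size is available, for instance from a similar published study, an a priori calculation can be sketched as below. This is only a rough normal approximation for comparing two independent groups, not the procedure used in the dissertation. The function name `approx_total_n` and its default values are illustrative assumptions, and the effect size is expressed as Cohen's \(f\) (for two groups, \(f = d/2\)).

```python
import math
from statistics import NormalDist


def approx_total_n(f, alpha=0.05, power=0.80):
    """Approximate total N for comparing two independent groups.

    Normal approximation: N ~= ((z_{1 - alpha/2} + z_{power}) / f) ** 2,
    where f is Cohen's f. This is an assumption-laden shortcut, not an
    exact noncentral-distribution calculation.
    """
    if f <= 0:
        raise ValueError("effect size f must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = z.inv_cdf(power)              # quantile for the desired power
    return math.ceil(((z_alpha + z_power) / f) ** 2)


# Cohen's conventional small, medium, and large values of f.
for f in (0.10, 0.25, 0.40):
    print(f, approx_total_n(f))
```

Exact calculations based on noncentral distributions, such as those performed by dedicated power analysis software, typically return slightly larger values, so the output is best read as a lower bound for planning.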
A4. Statistical Power (\(1-\beta\))
This work used post-hoc power analysis, as motivated in Appendix A3.
Cohen (1992) defined: "The power of a statistical test of a null hypothesis (\(H_0\)) is the probability that the \(H_0\) will be rejected when it is false, that is, the probability of obtaining a statistically significant result".
Statistical power analysis exploits the mathematical relationship between the four variables in statistical inference: power (\(1-\beta\)), false positive rate (\(\alpha\)), sample size (\(N\)), and effect size (\(f\)). The relationship permits one to determine the value of any one variable when the other three are known. Based on this, post-hoc power analysis computes \(1-\beta\) for specified \(\alpha\), \(N\), and \(f\). The false positive rate \(\alpha\) is also referred to as the probability of a Type I error, while \(\beta\) is referred to as the false negative rate or the probability of a Type II error.
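To illustrate how fixing three of the four variables determines the fourth, the sketch below approximates power from \(\alpha\), \(N\), and \(f\). It assumes a two-sided comparison of two independent groups and a normal approximation, so it will not reproduce the repeated-measures values obtained from a dedicated package; the function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(f, n_total, alpha=0.05):
    """Approximate power (1 - beta) for comparing two independent groups.

    Normal approximation with shift f * sqrt(N), where f is Cohen's f and
    N is the total number of participants across both groups.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    shift = f * math.sqrt(n_total)          # approximate noncentrality
    # Probability of landing beyond either critical value when H0 is false.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Power grows with sample size for a fixed effect size and alpha.
for n in (12, 24, 48, 96):
    print(n, round(approx_power(0.40, n), 2))
```

When \(f\) is zero the function returns \(\alpha\), which matches the definition of the false positive rate.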
Cohen (1992) suggested a desired power of .80, that is, \(\beta\) = .20, when no other basis for setting the value is available. The rationale is that a false positive claim (Type I error) is generally more misleading than a false negative claim (Type II error). As the convention for the significance level in HCI is \(\alpha\) = .05, a desired power of .80 (\(\beta\) = .20) makes a Type II error four times as likely as a Type I error (.20/.05 = 4), which is a reasonable reflection of their relative importance (Cohen, 1992).
This work calculated the power of each statistical test using the G*Power software package (Cunningham and McCrum-Gardner, 2007). For this purpose, the correlation among the repeated measures, \(\alpha\), \(N\), and \(f\) were calculated individually for each test and entered into the package to obtain the statistical power, \(1-\beta\). Cohen's \(f\) was calculated from \(\eta^2\) (see Appendix A1 above) using the following equation (Faul et al., 2007).
Equation (21)
$$f=\sqrt{\frac{\eta^2}{1-\eta^2}}$$
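As a quick check of Equation (21), the sketch below converts the \(\eta^2\) thresholds listed in Appendix A1 to Cohen's \(f\). The function name `cohens_f` is an illustrative choice.

```python
import math


def cohens_f(eta_sq):
    """Convert eta-squared to Cohen's f via f = sqrt(eta^2 / (1 - eta^2))."""
    if not 0.0 <= eta_sq < 1.0:
        raise ValueError("eta-squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1.0 - eta_sq))


# The small, medium, and large eta-squared thresholds from Appendix A1.
for label, eta_sq in [("small", 0.0099), ("medium", 0.0588), ("large", 0.1379)]:
    print(label, round(cohens_f(eta_sq), 2))
# small 0.1
# medium 0.25
# large 0.4
```

The results correspond to Cohen's conventional \(f\) values of 0.10, 0.25, and 0.40, which shows that the two sets of thresholds describe the same effects on different scales.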
References
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum, Hillsdale, NJ, USA, 283.
- Cohen, J. (1992). Statistical power analysis. Current Directions in Psychological Science 1, 3, 98-101.
- Cunningham, J. B. and McCrum-Gardner, E. (2007). Power, effect and sample size using GPower: practical issues for researchers and members of research ethics committees. Evidence Based Midwifery 5, 132-136.
- Faul, F., Erdfelder, E., Lang, A., and Buchner, A. (2007). G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 39, 2, 175-191.
- MacKenzie, I. S. (2013). Designing HCI Experiments. In Human-computer interaction: an empirical research perspective. Morgan Kaufmann, Waltham, MA, USA, 157-189.
- MacKenzie, I. S. and Soukoreff, R. W. (2003). Phrase sets for evaluating text entry techniques. In CHI '03 Extended Abstracts on Human Factors in Computing Systems (CHI '03). ACM, New York, NY, 754-755.
Notes
- The threshold values that distinguish between small, medium, and large effects are different for other effect size measures such as Cohen's \(d\), \(f\), \(f^2\), Glass's \(\Delta\), and Hedges's \(g\). The following Wikipedia page provides more information and the corresponding threshold values for these measures: http://en.wikipedia.org/wiki/Effect_size.
- The phrase set is available at http://www.yorku.ca/mack/PhraseSets.zip