Tutorial Files
Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains a hypothetical sample of 30 students who were exposed to one of two learning environments (offline or online) and one of two methods of instruction (classroom, i.e. one to many, or personal tutor, i.e. one to one), then tested on a math assessment. Possible math scores range from 0 to 100 and indicate how well each student performed on the assessment.
Beginning Steps
To begin, we need to read our dataset into R and store its contents in a variable.
- > #read the dataset into an R variable using the read.csv(file) function
- > dataTwoWayUnequalSample <- read.csv("dataset_ANOVA_TwoWayUnequalSample.csv")
- > #display the data
- > dataTwoWayUnequalSample
The first ten rows of our dataset
Unequal Sample Sizes
In our study, 16 students participated in the online environment, whereas only 14 participated in the offline environment. Further, 20 students received classroom instruction, whereas only 10 received personal tutor instruction. As such, we should take action to compensate for the unequal sample sizes in order to retain the validity of our analysis. Generally, this comes down to examining the correlation between the factors and the causes of the unequal sample sizes, then choosing whether to use weighted or unweighted means, a decision that can drastically impact the results of an ANOVA. This tutorial will demonstrate how to conduct ANOVA using both weighted and unweighted means; the ultimate choice between them is left to you and your specific circumstances.
Weighted Means
First, let's suppose that we decided to go with weighted means, which take into account the correlation between our factors that results from having treatment groups with different sample sizes. A weighted mean is calculated by simply adding up all of the values and dividing by the total number of values. Consequently, we can easily derive the weighted means for each treatment group using our subset(data, condition) and mean(data) functions.
- > #use subset(data, condition) to create subsets for each treatment group
- > #offline subset
- > offlineData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$environment == "offline")
- > #online subset
- > onlineData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$environment == "online")
- > #classroom subset
- > classroomData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$instruction == "classroom")
- > #tutor subset
- > tutorData <- subset(dataTwoWayUnequalSample, dataTwoWayUnequalSample$instruction == "tutor")
- > #use mean(data) to calculate the weighted means for each treatment group
- > #offline weighted mean
- > mean(offlineData$math)
- > #online weighted mean
- > mean(onlineData$math)
- > #classroom weighted mean
- > mean(classroomData$math)
- > #tutor weighted mean
- > mean(tutorData$math)
The weighted means for the environment and instruction conditions
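The subset-and-mean pattern above can be collapsed into a single call with tapply(), which applies mean() within each factor level. The snippet below is a sketch on a small made-up data frame (the tutorial's CSV would be used in practice), so the numbers are illustrative only.

```r
# small stand-in data frame; in practice use dataTwoWayUnequalSample
d <- data.frame(
  environment = c("offline", "offline", "online", "online", "online"),
  instruction = c("classroom", "tutor", "classroom", "classroom", "tutor"),
  math        = c(80, 90, 70, 60, 100)
)

# weighted means per environment level: every score counts equally,
# so levels with more observations carry more weight
tapply(d$math, d$environment, mean)   # offline 85, online 76.67

# weighted means per instruction level
tapply(d$math, d$instruction, mean)   # classroom 70, tutor 95
```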
ANOVA using Type I Sums of Squares
When applying weighted means, it is suggested that we use Type I sums of squares (SS) in our ANOVA. Type I happens to be the default SS used in our standard anova(object) function, which will be used to execute our analysis. Note that in the case of two-way ANOVA, the ordering of our independent variables matters when using weighted means. Therefore, we must run our ANOVA two times, once with each independent variable taking the lead. However, the interaction effect is not affected by the ordering of the independent variables.
- > #use anova(object) to execute the Type I SS ANOVAs
- > #environment ANOVA
- > anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample))
- > #instruction ANOVA
- > anova(lm(math ~ instruction * environment, dataTwoWayUnequalSample))
The Type I SS ANOVA results. Note the differences in main effects based on the ordering of the independent variables.
These results indicate statistically non-significant main effects for both the environment and instruction variables, as well as a non-significant interaction between them.
Unweighted Means
Now let's turn to using unweighted means, which essentially ignore the correlation between the independent variables that arises from unequal sample sizes. An unweighted mean is calculated by taking the average of the individual group means. Thus, we can derive our unweighted means by summing the cell means within each level of an independent variable and dividing by the number of cells. For instance, to find the unweighted mean for the offline environment, we will add the means of the offline classroom and offline tutor groups, then divide by two.
- > #use mean(data) and subset(data, condition) to calculate the unweighted means for each treatment group
- > #offline unweighted mean = (classroom offline mean + tutor offline mean) / 2
- > (mean(subset(offlineData$math, offlineData$instruction == "classroom")) + mean(subset(offlineData$math, offlineData$instruction == "tutor"))) / 2
- > #online unweighted mean = (classroom online mean + tutor online mean) / 2
- > (mean(subset(onlineData$math, onlineData$instruction == "classroom")) + mean(subset(onlineData$math, onlineData$instruction == "tutor"))) / 2
- > #classroom unweighted mean = (offline classroom mean + online classroom mean) / 2
- > (mean(subset(classroomData$math, classroomData$environment == "offline")) + mean(subset(classroomData$math, classroomData$environment == "online"))) / 2
- > #tutor unweighted mean = (offline tutor mean + online tutor mean) / 2
- > (mean(subset(tutorData$math, tutorData$environment == "offline")) + mean(subset(tutorData$math, tutorData$environment == "online"))) / 2
The unweighted means for the environment and instruction conditions
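The same unweighted means can be read off a table of cell means: tapply() with two grouping factors returns the 2 x 2 matrix of cell means, and rowMeans()/colMeans() average across it. This is a sketch on a small made-up data frame (not the tutorial's CSV), so the numbers are illustrative only.

```r
# small stand-in data frame; in practice use dataTwoWayUnequalSample
d <- data.frame(
  environment = c("offline", "offline", "online", "online", "online"),
  instruction = c("classroom", "tutor", "classroom", "classroom", "tutor"),
  math        = c(80, 90, 70, 60, 100)
)

# 2 x 2 matrix of cell means (rows = environment, columns = instruction)
cellMeans <- tapply(d$math, list(d$environment, d$instruction), mean)

# unweighted means ignore cell sizes: each cell mean counts exactly once
rowMeans(cellMeans)   # unweighted environment means
colMeans(cellMeans)   # unweighted instruction means
```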
ANOVA using Type III Sums of Squares
When applying unweighted means, it is suggested that we use Type III sums of squares (SS) in our ANOVA. Type III SS can be set using the type argument in the Anova(mod, type) function, which is a member of the car package.
- > #load the car package (install first, if necessary)
- > library(car)
- > #use the Anova(mod, type) function to conduct the Type III SS ANOVA
- > Anova(lm(math ~ environment * instruction, dataTwoWayUnequalSample), type = "III")
The Type III SS ANOVA results.
Once again, our ANOVA results indicate statistically non-significant main effects for both the environment and instruction variables, as well as a non-significant interaction between them. However, it is worth noting that both the means and p-values differ between the unweighted-means/Type III SS approach and the weighted-means/Type I SS approach. In certain cases, this difference can be quite pronounced and lead to entirely different outcomes between the two methods. Hence, choosing the appropriate means and SS for a given analysis is a matter that should be approached with careful consideration.
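One caveat worth adding: Type III sums of squares from car's Anova() are only meaningful when the factors use sum-to-zero contrasts, and R's default is treatment (dummy) contrasts. A minimal sketch of the usual fix, assuming the tutorial's model and data:

```r
# R defaults to treatment contrasts; Type III tests expect sum-to-zero
# contrasts, so set them before fitting the model
options(contrasts = c("contr.sum", "contr.poly"))

# then refit and rerun the Type III ANOVA exactly as above:
# model <- lm(math ~ environment * instruction, dataTwoWayUnequalSample)
# Anova(model, type = "III")
```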
Pairwise Comparisons
Note that since our independent variables contain only two levels, there is no need to conduct follow-up comparisons. However, should you reach this point with a statistically significant independent variable of three or more levels, you could conduct pairwise comparisons in the same manner as demonstrated in the Two-Way ANOVA with Comparisons tutorial.
Complete Two-Way ANOVA with Unequal Sample Sizes Example
To see a complete example of how two-way ANOVA with unequal sample sizes can be conducted in R, please download the two-way ANOVA with unequal sample sizes example (.txt) file.
Rarely is a customer population made up of a homogenous group of customers who share the same attributes.
Consequently, our samples contain a mix of customers who may or may not reflect the composition of the customer population.
There are a number of variables that affect how customers think and behave toward products and services. One of the most common variables that impacts our measurements is prior experience. More than gender, age, income, and occupation, prior experience with products, software, and websites has a major impact on customer attitudes and behavior.
We see this in usability tests and surveys measuring brand attitudes. In general, the more experience study participants have had, the better their performance on tasks and the more positive their attitudes toward the product or service being tested.
So in any sample of participants in a research study, you'll want at least to measure participants' prior experience. Even if you aren't planning on using this measure, collect it anyway; it often comes in handy at analysis time.
One way researchers control for prior experience is to match the experience level of the sample with the experience level of the population. If you believe, for example, that 60% of your website visitors use the site weekly and the other 40% use it less, you can recruit participants to match that composition. You can then compute confidence intervals and run statistical comparisons (between, say, two design alternatives) and draw conclusions as to which design users perform better on or prefer. Most of our clients choose this method—matching the sample to the population—because, when you explain it to stakeholders, it makes sense to them.
You can't always match your sample to the population, though. Even if, for example, your data shows that 30% of your mobile website users have not accessed your website in the last year, it may be difficult to find these users to participate in a study. When you need to determine which design is preferred, or to make any comparison, you don't want the decision to be based on the improper composition of your sample.
With unbalanced samples, two approaches can mitigate and control for the effects of prior experience on your outcome measures: a weighted t-test and a Type I ANOVA. The Analysis of Variance (ANOVA) is the statistical procedure you use to compare more than two means at once. More importantly, it enables you to see the effects of multiple variables simultaneously. The ANOVA is more computationally intensive than the t-test and usually requires specialized software, such as SPSS, R, or Minitab, to conduct. You’ll also generally want the help of a statistician to assist with the setup and analysis of ANOVA results.
About the Weighted t-Test
A relatively simple method for handling weighted data is the aptly named weighted t-test. When comparing two groups with continuous data, the t-test is the recommended approach. The t-test works for large and small sample sizes and uneven group sizes, and it’s resilient to non-normal data. (We cover it extensively in Chapter 5 of Quantifying the User Experience.) While the t-test is a “workhorse” of statistical analysis, it only considers one variable when determining statistical significance. This means that you can’t compare participants’ attitudes on Design A vs Design B AND factor in their prior experience (say low experience and high experience) with your product.
However, the weighted version of the t-test does factor in a second variable. It adjusts the means and standard deviations based on how much weight to give each respondent. Participants who should account for, say, 60% of the population have scores that are weighted at 60%, even if they make up, say, only 20% of your sample. You can see the computation notes in the paper by Bland and Kerry.
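As a sketch of the mechanics, a weighted mean and a weighted two-sample comparison can be computed directly. This is not the exact Bland and Kerry procedure; it is a simplified illustration that uses the Kish effective sample size with a Welch-style standard error, and the function names are our own.

```r
# hypothetical helper: weighted mean, weighted variance, and the Kish
# effective sample size for a vector of scores x with weights w
weighted_stats <- function(x, w) {
  wm    <- sum(w * x) / sum(w)            # weighted mean
  wv    <- sum(w * (x - wm)^2) / sum(w)   # weighted variance
  n_eff <- sum(w)^2 / sum(w^2)            # Kish effective sample size
  list(mean = wm, var = wv, n_eff = n_eff)
}

# hypothetical two-sample comparison: Welch-style standard error built
# from the effective sample sizes; returns a two-sided p-value
weighted_t_test <- function(xA, wA, xB, wB) {
  a  <- weighted_stats(xA, wA)
  b  <- weighted_stats(xB, wB)
  se <- sqrt(a$var / a$n_eff + b$var / b$n_eff)
  t  <- (a$mean - b$mean) / se
  df <- a$n_eff + b$n_eff - 2
  2 * pt(-abs(t), df)
}

# sanity check: with equal weights the weighted mean is the ordinary mean
weighted_stats(c(8, 9, 10), c(1, 1, 1))$mean   # 9
```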
Using the Weighted t-Test
Here’s how the weighted t-test works.
We recently examined how users of an online retail website would react to a different design of product information. We presented two variants and wanted to see which one was statistically preferred on a number of dimensions, including comprehension and ease. A total of 857 qualified participants were randomly assigned to Design A or Design B. We assessed comprehension and ease of use using ten-point scales.
The mean, standard deviation, and sample size for both groups on a confidence question are shown in Table 1 below.
| Design Variant | Mean | StDev | N |
Table 1. Unweighted mean scores for two design variants tested.
Even though Design A had a nominally higher mean score (8.58 vs 8.37), using a standard t-test to compare the means, we find no significant difference at the alpha = .05 level of significance (p = 0.095).
However, we know that prior experience has a major impact on attitudes toward interfaces, and packed within both samples are four groups of participants, each with progressively more experience with the website.
Not only did the sample contain heterogeneous subgroups of experience, it was not proportionally representative of the population's experience breakdown. Table 2 shows the breakdown of the sample in Design A and Design B compared to the makeup of the user population.
| Experience Level | Design A | Design B | Population |
Table 2. Experience level for the sample of customers assigned to Design A and B, compared to the population composition.
The biggest difference is seen with experience level 4. While this group makes up half of the population, it only comprises between 41% and 42% of the sample in Design A and B.
These groups also have differing opinions about the designs they were exposed to. Table 3 shows that one of the biggest differences in attitudes was for Experience Level 4, which rated Design A .39 points higher than B. What’s more, the smallest subgroup preferred Design B over A.
| Experience Level | Design A | Design B | Difference | Population |
Table 3. The mean responses to a confidence question (higher is better), the difference in means by experience level (1 to 4) and the population composition of that experience level.
The weighted t-test creates a composite mean and standard deviation to proportionally account for the subgroup size. The updated means and standard deviations are shown in Table 4 with the original data.
Table 4. Weighted means and standard deviations for Design A and B, shown alongside the original unweighted values.
The results of the weighted t-test generate a p-value of .03, which is statistically significant at the alpha = .05 level of significance. You won't always see differences in significance values between the weighted and unweighted approaches; it depends both on how disproportionate your sample is and on how much the lower-weighted groups differ from the higher-weighted groups.
With these results we can conclude both that Design A had higher ratings and that the rating difference wasn’t attributable to incorrectly proportioned sample sizes. You can also use the approach for any mediating variable (such as geography, gender, occupation), and not just for prior experience.
A quick note of caution: you should have a good reason and actual data to support using weights. Don’t just weight your data to achieve statistical significance. While many variables in your sample will differ from the population, many won’t have a large enough effect (if any effect at all) to justify weighting.
A few things to remember about weighted data:
- While many variables could affect our measures, participants’ prior experience with a product is one of the most salient.
- The weighted t-test is a statistical test that re-balances your sample: it adjusts means and standard deviations to generate p-values based on the correct population representation.
- Using the weighted statistical test versus an unweighted statistical test doesn’t necessarily yield different conclusions.
- Have a good reason and actual data to support weighting your sample.