Every user experience (UX) designer wants to know that their decisions are actually moving users toward an interaction. Too often, however, we rely on our own instinct and never validate how effective the design really is.
Todd Yellin, VP of Product Innovation at Netflix, spoke to a packed session at South by Southwest Interactive (SXSW) about the lessons Netflix has learned in 10 years of A/B testing. While the session didn’t dive into any powerful new mechanisms for A/B testing, Yellin did provide plenty of practical examples and advice on how Netflix has used A/B testing to make strategic decisions based on quantified user behavior rather than design instinct.
What is A/B testing?
At its core, A/B testing means testing design changes against specific results: different variants are shown to randomly assigned users, and measurements are put in place to determine which sample is most successful at moving toward a defined goal or taking a specific action.
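To make that concrete, here is a minimal sketch (not Netflix’s implementation; the event log and field names are invented) of how an A/B test typically works: each user is deterministically bucketed into a variant, and each bucket’s rate of reaching the goal is compared.

```python
import hashlib
from collections import defaultdict

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Hash the user id so each user always lands in the same bucket."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return variants[digest % len(variants)]

# Hypothetical event log: (user_id, reached_goal) -- e.g. "started playback"
events = [("u1", True), ("u2", False), ("u3", True), ("u4", False), ("u5", True)]

totals, successes = defaultdict(int), defaultdict(int)
for user_id, reached_goal in events:
    variant = assign_variant(user_id)
    totals[variant] += 1
    successes[variant] += int(reached_goal)

for variant in sorted(totals):
    print(f"Variant {variant}: {successes[variant] / totals[variant]:.0%} reached the goal")
```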
As Yellin talked through their testing practices, I came to a few conclusions.
1 – Always know what you are measuring toward:
When A/B testing, the Netflix design team always measured against 2 metrics:
- What impact did this decision have on our user accounts? If it damaged user retention, if we lost users, it was not a good design decision.
- What impact did this decision have on user viewing? Accounts are already paid for, so the question is whether people are watching more; increased viewing was a success.
In some cases, Netflix would identify a tertiary measurement, but it was only put into place after those two metrics came back flat.
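As a rough illustration of that two-metric guardrail (the numbers and field names here are invented, not Netflix data), a variant only “wins” if retention doesn’t fall and viewing goes up:

```python
# Hypothetical per-variant summaries pulled from an experiment dashboard.
control   = {"retention_rate": 0.93, "avg_view_hours": 4.1}
treatment = {"retention_rate": 0.93, "avg_view_hours": 4.4}

def is_winner(control, treatment):
    """Treatment wins only if retention is not hurt and viewing increases."""
    retention_ok = treatment["retention_rate"] >= control["retention_rate"]
    viewing_up = treatment["avg_view_hours"] > control["avg_view_hours"]
    return retention_ok and viewing_up

print("Ship the treatment" if is_winner(control, treatment) else "Keep the control")
```

In practice each comparison would also be checked for statistical significance rather than judged on raw numbers alone.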
2 – A/B Testing is the great democratizer
Disagreements come up in the design process, and quite often the loudest or most senior person in the room wins the argument. Quite often that person is also the least qualified to make the design decision. A/B testing your design decisions gives your users a voice at the table.
3 – Leverage the data you collect (and collect only what you’ll leverage):
Data, Yellin explains, is “piles of excrement, with a little bit of gold.” Rather than collecting mounds and mounds of user data, Netflix focused on:
- Age
- Gender
- Location
But even then, they increasingly found that age and gender mattered less than actual viewing habits; they are demographics, but they proved largely useless for content discovery.
Organizations should determine what data ACTUALLY matters to their user experience and personalization, and make that data easy to collect. Netflix made its key demographic questions part of the credit-card form. They were clear that the data was not needed for the payment itself and were transparent about how it would be used, but they realized that folding it into a larger, already painful process made it easier to collect.
4 – Don’t listen to your users… watch them…
Yellin shared a real-life example from the Netflix design table. Many passionate users were writing… calling… pleading for the ability to give ratings in 1/2-star increments. Until that point, users could only give whole stars from 1 to 5. Netflix heard from thousands of users who said 1/2-star ratings would really improve the accuracy of their ratings.
So… they tested it.
While the loud users appreciated it, the silent majority did not.
Netflix looked at its core metrics of user retention and view time and saw no statistical impact from this decision. This is where they added a third metric: actually completing the review process. Among users given the ability to rate in 1/2-star increments, review completion dropped significantly. They dropped the 1/2 stars.
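A hedged sketch of the kind of check behind a call like that (the counts below are made up; Netflix’s actual numbers and tooling weren’t shared): a two-proportion z-test comparing review-completion rates between the whole-star and half-star groups.

```python
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two completion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented counts: users who started a rating vs. users who completed it.
z, p = two_proportion_z_test(success_a=4_200, n_a=5_000,   # whole-star group
                             success_b=3_800, n_b=5_000)   # half-star group
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value suggests the drop is not noise
```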
5 – The smartest mind at your design table is still an idiot.
Yellin showed a very specific example of 3 treatments of cover art for the Breaking Bad series. The third option was a very compelling close-up of the main character, Walter White. The other two… just weren’t as compelling. Yellin asked the room at SXSW which they thought would perform best, and this room of design professionals from around the world overwhelmingly agreed that the compelling face shot of White would win. Yellin said his design team had agreed too. They A/B(/C) tested the artwork and found that it wasn’t even close: the winner was a far less compelling image of an RV in the desert.
Guess: Which Breaking Bad button leads to more plays? A/B/C? (You are almost certainly wrong. It's B.) #sxsw15 pic.twitter.com/ibDyrqlQNG
— Jochen Wegner (@Jochen) March 14, 2015
The smartest person on your design team is still less smart than your user behavior.
Yellin is quick to point out that it isn’t wise to test EVERY design decision. Small incremental changes are probably not worth the investment and potential user frustration to test. There is a point at which designers still need to be empowered to make design decisions in the absence of empirical evidence. But continued testing and analysis of user behavior can help those designers make better decisions when the data isn’t there.
Photo: Flickr: Mike K: CC