A guide to passing the A/B test interview question in tech companies

Hey all,

I’m a Sr. Analytics Data Scientist at a large tech firm (not FAANG) and I conduct about three interviews per week. I wanted to share my advice on how to pass A/B test interview questions, as this is an area where I commonly see candidates get dinged. Hope it helps.

Product analytics and data scientist interviews at tech companies often include an A/B testing component. Here is my framework on how to answer A/B testing interview questions. Please note that this is not necessarily a guide to design a good A/B test. Rather, it is a guide to help you convince an interviewer that you know how to design A/B tests.

A/B Test Interview Framework

Imagine that during the interview you get asked, “Walk me through how you would A/B test this new feature.” This framework will help you answer these types of questions.

Phase 1: Set the context for the experiment. Why do we want to A/B test, what is our goal, and what do we want to measure?

  1. The first step is to clarify the purpose and value of the experiment with the interviewer. Is it even worth running an A/B test? Interviewers want to know that the candidate can tie experiments to business goals.
  2. Specify exactly what the treatment is and what hypothesis you are testing. Too often I see candidates fail to spell these out. It’s important to state them explicitly for your interviewer.
  3. After specifying the treatment and the hypothesis, you need to define the metrics that you will track and measure.
  • Success metrics: Identify two or three candidate success metrics, then narrow them down to one and propose it to the interviewer to get their thoughts.
  • Guardrail metrics: These are metrics you do not necessarily want to improve, but that you definitely do not want to harm. Come up with 2-4 of these.
  • Tracking metrics: Tracking metrics help explain the movement in the success metrics. Come up with 1-4 of these.

Phase 2: How do we design the experiment to measure what we want to measure?

  1. Now that you have your treatment, hypothesis, and metrics, the next step is to determine the unit of randomization for the experiment, and when each unit will enter the experiment. You should pick a unit of randomization such that you can measure your success metrics, avoid interference and network effects, and consider the user experience.
  • As a simple example, let’s say you want to test a treatment that changes the color of the checkout button on an ecommerce website from blue to green. How would you randomize this? You could randomize at the user level and say that every person that visits your website will be randomized into the treatment or control group. Another way would be to randomize at the session level, or even at the checkout page level.
  • When each unit will enter the experiment is also important. Using the example above, you could have a person enter the experiment as soon as they visit the website. However, many users will not get all the way to the checkout page, so you will end up with a lot of users who never even get a chance to see your treatment, which will dilute your experiment. In this case, it might make sense to have a person enter the experiment once they reach the checkout page. You want to choose your unit of randomization and when units enter the experiment such that you have minimal dilution. In a perfect world, every unit would have the chance to be exposed to your treatment.
  2. The next step is to conduct a power analysis to determine the number of observations required and how long to run the experiment. You can either state that you would conduct a power analysis using an alpha of 0.05 and power of 80%, or ask the interviewer if the company has standards you should use.
  • I’m not going to go into how to calculate power here, but know that in any A/B test interview question, you will have to mention power. For some companies, and in junior roles, just mentioning it will be good enough. Other companies, especially for more senior roles, might ask you more specifics about how to calculate power (a minimal sketch of the calculation appears after this list).
  3. Next, you need to determine which statistical test(s) you will use to analyze the results. Is a simple t-test sufficient, or do you need quasi-experimental techniques like difference-in-differences? Do you require heteroskedasticity-robust standard errors or clustered standard errors?
  • The t-test and z-test of proportions are two of the most common tests.
  • If your unit of randomization is larger than your analysis unit, you may need to adjust how you calculate your standard errors.
  • You might be thinking, “why would I need to use difference-in-differences in an A/B test?” In my experience, this comes up when doing geography-based randomization on a relatively small sample. Let’s say that you want to randomize by city in the state of California. Even though you are randomizing which cities end up in the treatment and control groups, it’s likely that your two groups will have pre-existing differences. A common solution is to use difference-in-differences. I’m not saying this is right or wrong, but it’s a common solution that I have seen in tech companies.
  4. Final considerations for the experiment design:
  • Are you testing multiple metrics? If so, account for that in your analysis. A really common academic answer is the Bonferroni correction. I’ve never seen anyone use it in real life, though, because it is too conservative. A more common approach is to control the false discovery rate; you can google this, or see the Benjamini-Hochberg example in the sketch after this list. Alternatively, the book Trustworthy Online Controlled Experiments by Ron Kohavi discusses how to do this (note: this is an affiliate link).
  • Do any stakeholders need to be informed about the experiment?
  • Are there any novelty effects or change aversion that could impact interpretation?
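
To make the power analysis, test selection, and multiple-metrics points above concrete, here is a minimal sketch in Python using statsmodels. The baseline rate, minimum detectable effect, result counts, and p-values are all made-up numbers for illustration; swap in your own metric and whatever alpha/power standards your company uses.

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest
from statsmodels.stats.multitest import multipletests

# --- Power analysis: how many units per arm? ---
# Hypothetical inputs: a 10% baseline checkout rate and a minimum
# detectable effect of +1 percentage point (absolute).
baseline_rate = 0.10
target_rate = 0.11
effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # 1 - beta
    ratio=1.0,             # equal-sized treatment and control groups
    alternative="two-sided",
)
print(f"Need roughly {int(np.ceil(n_per_arm)):,} units per arm")

# --- Analysis: z-test of proportions on (hypothetical) results ---
conversions = np.array([1_210, 1_050])     # [treatment, control] successes
sample_sizes = np.array([10_500, 10_400])  # [treatment, control] exposed units
z_stat, p_value = proportions_ztest(conversions, sample_sizes)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# --- Multiple metrics: control the false discovery rate (Benjamini-Hochberg) ---
metric_p_values = [0.004, 0.030, 0.060]    # hypothetical p-values, one per metric
reject, p_adjusted, _, _ = multipletests(metric_p_values, alpha=0.05, method="fdr_bh")
print(dict(zip(["metric_a", "metric_b", "metric_c"], reject)))
```

For a junior role, just naming these pieces is usually enough in the interview; the sketch is here mainly so you can check your own understanding of what each step actually computes.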

Phase 3: The experiment is over. Now what?

  1. After you “run” the A/B test, you now have some data. Consider what recommendations you can make from it. What insights can you derive to take actionable steps for the business? Speaking to this will earn you brownie points with the interviewer.
  • For example, can you think of some useful ways to segment your experiment data to determine whether there were heterogeneous treatment effects?
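
As a rough illustration of the segmentation point above, one common way to check for heterogeneous treatment effects is to interact the treatment flag with a segment variable in a regression. The file and column names below (experiment_results.csv, converted, treatment, platform) are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export: one row per experiment unit with a 0/1 outcome
# (`converted`), a 0/1 assignment flag (`treatment`), and a segment
# such as platform ("ios", "android", "web").
df = pd.read_csv("experiment_results.csv")

# Linear probability model with robust standard errors; the
# treatment:platform interaction terms indicate whether the lift
# differs across segments.
model = smf.ols("converted ~ treatment * C(platform)", data=df).fit(cov_type="HC1")
print(model.summary())
```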

Common follow-up questions, or “gotchas”

These are common questions that interviewers will ask to see if you really understand A/B testing.

  • Let’s say that you are mid-way through running your A/B test and the performance starts to get worse. It had a strong start but now your success metric is degrading. Why do you think this could be?
    • A common answer is the novelty effect.
  • Let’s say that your A/B test is concluded and your chosen p-value cutoff is 0.05. However, your success metric has a p-value of 0.06. What do you do?
    • Some options are to extend the experiment or to run it again.
    • You can also say that you would discuss the risk of a false positive with your business stakeholders. It may be that the treatment doesn’t have much downside, so the company is OK with rolling out the feature, even if there is no true improvement. However, this is a discussion that needs to be had with all relevant stakeholders and as a data scientist or product analyst, you need to help quantify the risk of rolling out a false positive treatment.
  • Your success metric was stat sig positive, but one of your guardrail metrics was harmed. What do you do?
    • Investigate the cause of the drop in the guardrail metric. Once the cause is identified, work with the product manager or business stakeholders to update the treatment so that the guardrail is hopefully no longer harmed, and run the experiment again.
    • Alternatively, see if there is a segment of the population where the guardrail metric was not harmed. Release the treatment to only this population segment.
  • Your success metric ended up being stat sig negative. How would you diagnose this?

I know this is really long, but honestly, most of the steps I listed could be an entire blog post on their own. If you don’t understand something, I encourage you to do some more research on it, or get the book I linked above (I’ve read it three times through myself). Lastly, don’t feel like you need to be an A/B test expert to pass the interview. We hire folks who have no A/B testing experience but can demonstrate a framework for designing A/B tests such as the one I just laid out. Good luck!

6 Likes

I also conduct A/B testing interviews frequently, and this is a generally excellent guide! In addition, I frequently ask about Bayesian approaches to A/B testing, how to handle pre-experiment imbalance (such as with CUPED, though I don’t weight this heavily), and how to frame A/B test analysis as a regression problem.
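
For anyone unfamiliar with the regression framing or CUPED, here is a rough sketch of one common setup; the file and column names (ab_test_users.csv, y, y_pre, treatment) are invented for illustration, and this is just a sketch, not the only way to do it.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-user data: outcome `y`, 0/1 `treatment` flag, and the same
# metric measured in the pre-experiment period, `y_pre`.
df = pd.read_csv("ab_test_users.csv")

# 1) A/B analysis as a regression: the coefficient on `treatment` is the
#    difference in means, with robust standard errors.
plain = smf.ols("y ~ treatment", data=df).fit(cov_type="HC1")

# 2) CUPED: use the pre-period metric as a control covariate to cut variance.
theta = df["y"].cov(df["y_pre"]) / df["y_pre"].var()
df["y_cuped"] = df["y"] - theta * (df["y_pre"] - df["y_pre"].mean())
cuped = smf.ols("y_cuped ~ treatment", data=df).fit(cov_type="HC1")

print("plain:", plain.params["treatment"], plain.bse["treatment"])
print("cuped:", cuped.params["treatment"], cuped.bse["treatment"])  # usually a smaller SE
```

The CUPED adjustment does not change the expected treatment effect; it only removes pre-experiment variance, which is why the standard error typically shrinks.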

Assume that your A/B test has been completed and that the 0.05 p-value threshold was selected. But the p-value for your success metric is 0.06. How do you proceed?

If I asked that, I would expect a response along the lines of “who cares lol, just launch it.”

Another question: does someone higher up need this experiment to be a success for their promo doc? How do we keep reinterpreting a negative result until it looks good?

5 Likes

I’d hope for an answer like “who cares lol, just launch it.”

I feel like this is a trap question, depending on the mood and the personality of the interviewer. That answer of “yeah, go for launch” even when the p-value is above 0.05 can give the interviewer an opportunity to scratch you out.

4 Likes

If I were interviewing, I would expect an answer like “given the threshold was adequately set and no data issues exist affecting the statistic, we wouldn’t launch, but one could question whether that threshold is appropriate for the business.”

The issue I have with “close enough” answers is that they make decisions fuzzy and unrepeatable. In the OP, it seems the alpha should just be 0.1 if we are going to accept 0.06 anyway.

2 Likes

In real scenarios, there is always going to be fuzziness. 0.05 is an arbitrary cutoff anyway. I’d rather have someone who can consider the context and adapt to it. If rolling out the change has limited downsides, it’s fine to discuss adjusting the threshold, and perhaps to discuss whether this was the correct threshold to begin with.

2 Likes

Yeah, I’m also confused by this one…

1 Like

It’s classic “the difference between stat sig and not stat sig is not itself stat sig”. Reducing business decisions to the results of a statistical test is itself a very fuzzy process, so there’s no reason to pretend to be so rigorous about stat sig thresholds in most real cases.

If you were conducting a ton of A/B tests with very similar methodologies, powers, and very clear cost functions, then a binary threshold can be justified. In reality those things are rarely clear, and they certainly don’t justify anything special about 0.05 precisely.