How to crack data science interviews at companies like Google, Amazon etc?

12 min read · Mar 17, 2021

Thank you for stopping by. I went through the experience of becoming a better ‘data-driven decision-making’ professional. I felt this profession aligned with my long-term goal of staying close to the business while directing a technical team. The story goes like this.

The Context:

During the first stint of my professional life, I was told what had to be done: once someone set the goalposts, I had to figure out how to shoot. The questions posed to a data-driven decision maker, however, were vague. Vague questions bother me. Normally an answer is the response to a question, but in this field the response to a question was often another question, and it took me some time to come to terms with that.

A problem:

In the process of learning about solving business problems using data, I saw that problems are complex, but solutions are simple. So what does it take to find a simple solution to a complex problem? Simplify the problem. Simple, right? I realized that in any interview, spending an additional minute or two on simplifying the problem pays higher dividends in the long run. It helps you control the pace of the interview. Slowing the process down initially helps you choose the right direction and avoid obvious pitfalls and traps. It took more than a handful of interviews and case discussions for me to realize that this is the most important part of any problem-solving process.

Combinatorics:

Once the problem is simplified, the hypotheses come next. Brainstorm! Write down the possible reasons for the problem to exist in the first place. For example, say an e-commerce business uses 2 kinds of box-folding machines and you want to know which one is better. A set of factors/levers can influence the overall performance of the machines: to name a few, the skill of the person operating the machine, the demand on a particular machine in a particular hour, and the size of the boxes required. To find the better of the 2 machines, we need to control these levers so that they influence both machines in a similar way. In other words, level the playing field so that we find the true winner. Write down the possible factors that might influence the outcome while your hands are not yet tied, that is, before you see the data. Data constraints are real: the granularity, availability, and quality of the data limit your vision, and that handicaps your reasoning. Beware, too, that different combinations of these factors influence the problem differently.
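One way to picture "levelling the playing field" is to compare the two machines only within groups where the other factors match. The sketch below does that with made-up numbers; the factor names, levels, and throughput values are assumptions for illustration, not data from any real system.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (machine, operator_skill, box_size, boxes_per_hour)
observations = [
    ("A", "novice", "small", 100),
    ("A", "expert", "small", 140),
    ("A", "expert", "large", 90),
    ("B", "novice", "small", 110),
    ("B", "expert", "small", 135),
    ("B", "expert", "large", 100),
    ("B", "expert", "large", 104),
]

def compare_within_strata(rows):
    """Return {(skill, size): mean(B) - mean(A)} for strata where both machines appear."""
    by_stratum = defaultdict(lambda: defaultdict(list))
    for machine, skill, size, rate in rows:
        by_stratum[(skill, size)][machine].append(rate)
    diffs = {}
    for stratum, per_machine in by_stratum.items():
        # Only compare like with like: skip strata that lack one of the machines.
        if "A" in per_machine and "B" in per_machine:
            diffs[stratum] = mean(per_machine["B"]) - mean(per_machine["A"])
    return diffs

stratum_diffs = compare_within_strata(observations)
for stratum, diff in stratum_diffs.items():
    print(stratum, diff)
```

In this toy data the sign of the difference changes from one combination of factors to another, which is the kind of interaction the paragraph above warns about.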

Solution:

Finally, the solution will always be obvious once you have spent enough time simplifying the problem and reasoning about the factors that influence it. In data science language, start by building the solution in its simplest form (the Occam’s razor principle), then think through the pros and cons of your solution and see whether you can reinforce it by adding more checks or by employing multiple methods to be safe.

The above method is a guideline which helps us tackle any business problem.

Enough of the abstract, let’s talk experiences. My interview preparation went on for about a year, for both internship and full-time roles after my graduate studies. Typically I started with roles in my dream companies that aligned with my goal (dream big, Google big) and tried connecting my prior experience or academic projects with the role and responsibilities. To keep my hands warm, I solved a LeetCode problem every day. The rest of this post shares the list of questions that I would refer to just before any interview. It is by no means exhaustive, so feel free to add to it if you wish.

Statistics :

Start with basic knowledge of random variables, summary statistics, probability, Bayes’ theorem, the normal distribution, the concepts of population and sample, and sampling distributions. Then move on to statistical hypothesis testing concepts (t-tests, chi-square test, ANOVA etc.): confidence intervals, p-values, type 1 and type 2 errors, and power analysis.

List of sample questions :

What is hypothesis testing? Explain in layman terms.
Define sampling distribution and standard error.
What is Central Limit Theorem? Why is it important?
What is a null hypothesis?
What is a random sample?
What is selection bias?
What is a standard deviation?
What is a p-value? Explain in layman terms.
What is a confidence interval?
How do you reject a null hypothesis?
How does sample size affect the p-value vs. the confidence interval?
Define Type-1 and Type-2 errors.
What is the power of a test?
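For the questions on standard error, confidence intervals, and sample size, a small numeric sketch can help build intuition. It uses a normal approximation (a z-test with the sample standard deviation treated as known), which is a simplifying assumption; the sample mean, null mean, and standard deviation are invented values.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation CI for a mean: mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sample_sd / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

def two_sided_p_value(sample_mean, null_mean, sample_sd, n):
    """Two-sided z-test p-value for H0: population mean == null_mean."""
    z = (sample_mean - null_mean) / (sample_sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed mean and spread, increasing sample size (illustrative values).
for n in (25, 100, 400):
    low, high = mean_confidence_interval(10.5, 2.0, n)
    p = two_sided_p_value(10.5, 10.0, 2.0, n)
    print(f"n={n:>3}  CI=({low:.2f}, {high:.2f})  p={p:.4f}")
```

With the observed effect held fixed, quadrupling the sample size halves the width of the interval and shrinks the p-value, because the standard error scales with 1/sqrt(n).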

Machine Learning :

Correlation, multicollinearity, regression methods, assumptions, advanced methods, performance measurement, prediction, inference etc.

List of some sample questions :

What is overfitting and underfitting?
What is the difference between supervised and unsupervised learning?
What is the difference between Bagging and Boosting?
When should we use log transformations?
When should we use Cosine distance and when should we use Euclidean, Manhattan, Jaccard distance respectively?
How can you remove multicollinearity?
How will you use VIF to remove multicollinearity in data? What VIF values would you use as the cutoff for removal?
Explain Linear Regression. What do the terms p-value, coefficient and r-squared value mean and what is their significance?
Why use a bias term in Linear Regression?
What are the assumptions of Linear Regression?
How would you improve a classification model that suffers from low precision?
Why do we have L1 and L2 regularization but not L0 and L4?
What is the beta coefficient of a multivariate regression? How do you derive it and what is the non-closed form?
Can you explain how to interpret the confidence interval of a logistic regression model?
How do you handle class imbalance?
How do you find thresholds for a classifier?
What’s the difference between logistic regression and support vector machines? What’s an example of a situation where you would use one over the other?
What is “random” in a random forest? If you use logistic regression instead of a decision tree in random forest, how will your results change?
Explain AUC-ROC curve
Let’s say you have a categorical variable with thousands of distinct values, how would you encode it?
How does K-means work? What kind of distance metric would you choose? What if different features have different dynamic ranges?
What are generative and discriminative algorithms? What are their strengths and weaknesses? Which type of algorithms are usually used and why?
How does a logistic regression model know what the coefficients are?
Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?
Is random weight assignment better than assigning same weights to the units in the hidden layer?
Why is gradient checking important?
Describe the criterion for a particular model selection. Why is dimension reduction important?
If you can build a perfect (100% accuracy) classification model to predict some customer behavior, what will be the problem in application?
What’s the difference between MLE and MAP inference?
How do you deal with sparse data?
What are some situations where a linear model fails?
Do you think 20 decision trees are better than a large one? Why? Why not?
Give an example of a scenario where you would use Naive Bayes over another classifier?
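For the questions on low precision and on finding classifier thresholds, one common approach is to sweep candidate thresholds over the predicted scores and pick the one that optimizes a chosen metric. The sketch below uses F1 as that metric, which is only one possible choice (a business cost function may be more appropriate), and the labels and scores are toy values.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold; return (best_f1, threshold)."""
    best = (0.0, None)
    for t in sorted(set(scores)):
        p, r = precision_recall(y_true, scores, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best[0]:
            best = (f1, t)
    return best

# Toy labels and model scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
best_f1, best_threshold = best_f1_threshold(y_true, scores)
print(best_threshold, round(best_f1, 3))
```

Raising the threshold generally trades recall for precision, so a model with low precision can sometimes be improved simply by moving the cutoff, before any retraining.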

SQL and Python:

For Data Analyst/Data Scientist interviews, my recommendation is to brush up SQL to a medium-high level and Python to an easy-medium level on LeetCode or similar platforms. Part of solving the question is sharing your thought process, so practice articulating it while solving. Ask clarifying questions before starting to write the code, and give the reasoning behind the calls you make while solving the problem.

List of sample theoretical SQL questions :

When will ROW_NUMBER and RANK give different results? Give an example.
Is it possible for LEFT JOIN and FULL OUTER JOIN to produce the same results? Why or why not?
Why would I use DENSE_RANK instead of RANK? What about RANK instead of DENSE_RANK?
What happens if I GROUP BY a column that is not in the SELECT statement? Why does this happen?
LAG and LEAD are especially useful in what type of scenarios?
For dealing with NULL values, why would I choose to use IFNULL vs. CASE WHEN?
Do temp tables make your code cleaner and faster, one of the two, or none? Why?
When is a subquery a bad idea? A good idea?
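For the ROW_NUMBER / RANK / DENSE_RANK questions, it can help to see the three numbering schemes side by side. The sketch below emulates them in plain Python for an ORDER BY score DESC window; it is an illustration of the semantics, not how a database implements them, and the scores are made up.

```python
def window_ranks(scores):
    """Emulate ROW_NUMBER, RANK and DENSE_RANK over ORDER BY score DESC.

    Returns a list of (score, row_number, rank, dense_rank) tuples.
    """
    ordered = sorted(scores, reverse=True)
    rows = []
    rank = dense_rank = 0
    previous = object()  # sentinel that never equals a real score
    for row_number, score in enumerate(ordered, start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the current row position
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = score
        rows.append((score, row_number, rank, dense_rank))
    return rows

for row in window_ranks([85, 90, 70, 85]):
    print(row)
```

The three only diverge when there are ties: tied rows share a RANK and a DENSE_RANK but still get distinct ROW_NUMBER values, and RANK leaves a gap after the tie while DENSE_RANK does not. In a real database, the ROW_NUMBER order among tied rows is not guaranteed unless the ORDER BY breaks the tie.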

List of sample SQL Coding Problems :

Create a new column by extracting the last 2 characters of a column that contains 7-digit IDs stored as strings
SQL use-case for practise : https://towardsdatascience.com/sql-case-study-investigating-a-drop-in-user-engagement-510b27d0cbcc
Find the cumulative sum of the top 10 most profitable products of the last 6 months for customers in Seattle
SQL problems on Mode : https://mode.com/sql-tutorial/a-drop-in-user-engagement/
Challenging SQL problem : https://www.youtube.com/watch?v=sJTa7HNFN2I , https://www.youtube.com/watch?v=1gziHPyvAAk
Provided a table with user_id and dates they visited the platform, find the top 100 users with the longest continuous streak of visiting the platform as of yesterday
Given 2 tables, one with the phone numbers that Facebook sends the confirmation message to and another one with the phone numbers that confirmed the verification, write a SQL query to calculate the confirmation percentage
Given a table containing date, post_id, relationship (e.g. Friend, Group, Page), interaction (like, share etc.), and a table containing poster id and post id, calculate: how many likes were made on friend posts yesterday
Given a table with detailed customer complaint tickets of different types, calculate the share of processed tickets within each type
Provided a table with page_id, event timestamp and a flag for a state (which is on/off), find the number of pages that are currently on
Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like
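As a worked example for the confirmation-percentage problem above, here is a minimal sketch run against an in-memory SQLite database from Python. The table names (sent, confirmed), the column name (phone_number), and the rows are all assumptions, since the problem statement does not give a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per phone number in each table.
    CREATE TABLE sent (phone_number TEXT PRIMARY KEY);
    CREATE TABLE confirmed (phone_number TEXT PRIMARY KEY);
    INSERT INTO sent VALUES ('111'), ('222'), ('333'), ('444');
    INSERT INTO confirmed VALUES ('111'), ('333'), ('444');
""")

# LEFT JOIN keeps every number that was sent a message; COUNT(c.phone_number)
# counts only the rows that found a match in the confirmed table.
query = """
    SELECT 100.0 * COUNT(c.phone_number) / COUNT(s.phone_number) AS confirmation_pct
    FROM sent AS s
    LEFT JOIN confirmed AS c ON c.phone_number = s.phone_number
"""
confirmation_pct = conn.execute(query).fetchone()[0]
print(confirmation_pct)
```

In an interview it is worth asking the clarifying questions first: can a number appear more than once in either table, and can a confirmation exist without a matching sent row? Either would change the query.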

List of sample python coding problems:

Write Python code to find whether the entries in a list have duplicate characters. What is the computational complexity of the code?
Given an array of numbers & a target value, return the indexes of two numbers such that their absolute difference is equal to the target
Given two dates D1 & D2, count the number of days and months between them
Find the 1st missing positive number in a sorted list (must do in O(1) memory & O(n) time)
Given an array a, return the indices i,j that minimize |a[i] -a[j]|
Write a function to sample from a multinomial distribution
Given an array of words and a max width parameter, format the text such that each line has exactly X characters
Write a query to randomly sample a row from a table with 100 million rows
Describe efficient ways to merge a given k sorted arrays of size n each.
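For the last problem in the list, merging k sorted arrays, a standard approach is to keep a min-heap holding the current head of each array. The sketch below is one such implementation; the function name is my own.

```python
import heapq

def merge_k_sorted(arrays):
    """Merge k sorted lists into one sorted list using a min-heap of heads."""
    # Each heap entry is (value, index of source array, position in that array).
    heap = [(array[0], i, 0) for i, array in enumerate(arrays) if array]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, array_index, position = heapq.heappop(heap)
        merged.append(value)
        next_position = position + 1
        if next_position < len(arrays[array_index]):
            next_value = arrays[array_index][next_position]
            heapq.heappush(heap, (next_value, array_index, next_position))
    return merged

print(merge_k_sorted([[1, 4, 9], [2, 3, 10], [], [0, 8]]))
```

The heap never holds more than k entries, so with k arrays of size n the running time is O(nk log k), compared with O(nk log(nk)) for concatenating and sorting. Python's standard library also provides heapq.merge, which does the same thing lazily.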

Business case questions:

How would the change of prime membership fee for Amazon affect the market?
When users are navigating through the Amazon website, they are performing several actions. What is the best way to model if their next action would be a purchase?
Due to engineering constraints, the company can’t AB test a feature before launching it. How would you analyze how the feature is performing?
How do you figure out when a small uptick in sales is a fad? At what point should you consider it a trend? How does a trend take off?
We’ve all heard the tale that supermarkets have daily necessities at the back of the store to get customers to walk through the store and buy more things than they need. Is such a layout actually more profitable?
If you placed bread and butter next to each other, would you get higher sales than if you placed them on opposite shelves? What about in adjacent aisles?
If you want to trial a VR interactive experience in your retail stores, how do you pick the stores?
How do you derive more value out of flagging customer relationships?
How do you make an email newsletter more relevant?
How do you assess when an employee’s expense receipts amount to an attempt to defraud the company? How do you curb such behaviour?
Is Black Friday an inconvenient tradition, or does it actually add any value to a customer-focused company?
How would you improve engagement on Facebook?
Lyft rider cancel rates are up 5%. What could be the reason?
Suppose your dashboard was giving results till Friday but you came to work on Monday and noticed that the dashboard is not working. How would you find and fix the issue?
You launch a new product in a new region. How do you predict whether it will grow or fail?

a) If your product is not growing, how will you find the factors impacting it?

b) How will you compare it against similar products?

Behavioral Questions

Use the STAR approach to answer the questions.

STAR stands for: Situation, Task, Action and Result.

Situation: Describe the situation/problem you were in, and provide necessary context (Your role, the team, the organization, the market, and also the blockers).

Task: Explain your responsibility and what you decided to do about it.

Action: Then step through how you went about implementing your solution.

Result: Finally, summarize with an analysis of your actions, highlighting the positive impact (share numbers) it had for your team, department, and organization, and emphasize what you learned.

List of sample behavioral questions:

1. Explain a scenario when you put your customer first

2. How do you deal with ambiguity? Do you wait until things are clear?

3. Explain a scenario when you disagreed with your team members? Walk me through what happened?

4. Tell me about a project where you used metrics to capture business insights

5. How did you respond to negative feedback?

6. Tell me about a time that you performed work outside of your role

7. When you commit to a deadline in a project and you cannot finish, how do you communicate this to customers?

Follow-up questions

a) What was the reason that was attributed to missing the deadline?

b) What is the solution or fix to ensure that it won’t happen again?

8. Any example of what you did even before your customers asked for it?

9. Give me an example when you took a decision without consulting your manager?

Follow-up questions

a) How do you justify that you have to skip manager approval?

b) What is the risk involved and what could be the impact if you had to wait for manager approval?

c) Were customers happy with your decision?

10. Given product sales information on the Amazon shopping website, what would be your approach to deciding which product’s price to cut to improve sales?

Follow-up questions

a) How do you measure your success, like which product price cuts did better on quantity of sales, profit?

11. How do you design your dashboards to show/measure your success?

12. Give me an example when you were the outlier when the whole team was taking one decision?

13. Give me an example when you suggested a solution to a customer and had great success?

14. What is the proudest moment of your career, when you not only met the goal but exceeded it?

15. Give me an example when you worked on something which is NOT a regular part of your job or went beyond your responsibility and had great success?

Follow-up questions:

a) What exactly did you do and how did it impact the customers? How did you test your work and measure your success?

Additional references:

1. Solve SQL questions (given at the end) from here for practise

https://medium.com/better-programming/the-data-science-interview-study-guide-c3824cb76c2e

2. ML Glossary for Revision : https://ml-cheatsheet.readthedocs.io/en/latest/

3. Market Basket Analysis Revision : https://medium.com/swlh/a-tutorial-about-market-basket-analysis-in-python-predictive-hacks-497dc6e06b27

4. Web Analytics basics :

https://www.kaushik.net/avinash/impact-matrix-digital-analytics-framework/

https://www.kaushik.net/avinash/sitemap/

https://www.thinkwithgoogle.com/marketing-strategies/data-and-measurement/business-advertising-metrics/

5. ETL process info :

https://medium.com/hashmapinc/etl-understanding-it-and-effectively-using-it-f827a5b3e54d

6. SQL Fine-tuning :

https://www.sisense.com/blog/8-ways-fine-tune-sql-queries-production-databases/

Frame the solutions in your head, and practice presenting the solution to your friend and see how smoothly it flows. It’s always a good idea to let the friend ask followup questions and see how you tackle them. All the best.

7. Product Cases Practice : Watch Product Manager Interviews for Root Cause Analysis, Market sizing, improving products and others to get an idea of how to approach a problem. The use-cases asked in Product Management and Data Scientist interviews are very similar and the same template can be used for answering the questions

Sample : https://www.youtube.com/watch?v=DSV-vuvmIro

8. For concepts in Statistics : Refer to MarinStatsLectures on Statistics (https://www.statslectures.com/) and Khan Academy (https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample)

Nikita Goswami,

Data Analyst, Scaled Web properties Team, Google
