WooliesX Data Science Test - Machine learning and statistics
Question 1
We are measuring the brightness of a star with a photon detector that produces a
luminosity score. We point it at a particular star and take a large number of readings.
Unfortunately, the readings are noisy and we observe that some readings indicate
the star has negative brightness. Would you discard the negative readings? What
effect would this have on the data and on the conclusions we draw from it?
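A minimal simulation sketch of the effect asked about above, not a model answer. It assumes the detector noise is zero-mean Gaussian; the true brightness, noise level and sample size are hypothetical values chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical values, not taken from the question
TRUE_BRIGHTNESS = 1.0   # assumed true luminosity score
NOISE_SD = 2.0          # assumed standard deviation of the detector noise
N_READINGS = 100_000

# Simulate noisy readings: true value plus zero-mean Gaussian noise
readings = [random.gauss(TRUE_BRIGHTNESS, NOISE_SD) for _ in range(N_READINGS)]

# Discard the negative readings, as the question proposes
kept = [r for r in readings if r >= 0]

mean_all = statistics.fmean(readings)
mean_kept = statistics.fmean(kept)

print(f"mean of all readings:      {mean_all:.3f}")
print(f"mean after discarding < 0: {mean_kept:.3f}")
print(f"fraction discarded:        {1 - len(kept) / len(readings):.3f}")
```

Under these assumptions the mean of all readings stays close to the true value, while the mean of the remaining readings is pulled upward, because only one tail of the noise has been removed.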
Question 2
You have fitted a GBM model and are happy with its accuracy. How will you explain,
in business terms, to your stakeholders what the model is doing? What insights can
you draw from the model?
Question 3
Imagine you train a predictive model on the same dataset twice: once with
XGBoost and once with a random forest (not eXtreme boosting). In
which case do you expect the trees to be deeper?
Question 4
Assume you have built a classification model which has an accuracy of 90% on the
test set. Under what circumstances could this still be a bad model?
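One circumstance worth illustrating is class imbalance. The sketch below uses a hypothetical test set with 10% positives and a trivial model that always predicts the majority class. The labels and predictions are made up for illustration only.

```python
# Hypothetical imbalanced test set: 900 negatives, 100 positives
y_true = [0] * 900 + [1] * 100

# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual positives that the model finds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"recall:   {recall:.2f}")
```

Here the accuracy is 90% even though the model never identifies a single positive case, which is why metrics such as recall, precision or AUC are usually checked alongside accuracy.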
Question 5
You are asked to build a propensity-to-purchase model using XGBoost, and you
have 40k customer features in the feature bank. Given that it is not feasible to
productionise a model with this many features, how do you quantitatively reduce the
number of features to something feasible (say 500 features)?
Question 6
What are the advantages of a model like XGBoost over logistic regression? What are
the disadvantages?
Question 7
If you have a dataset that is larger than the amount of RAM in your
computer, list at least three ways to fit a model on this data.
Question 8
You have made a very powerful predictive model for customers' weekly sales. What
is your favorite method of explaining the importance of the features in your model?
Does this method consider interactions between features? If the feature is
categorical, does this method work better with one-hot encoding or label encoding?
Does this method explain the direction of the effect of the feature on the target
variable (direct or inverse)?
Question 9
How do you compare one-hot encoding and label encoding? When would one-hot
encoding work better? And when would it be the other way around? Are there any
other approaches to encoding?
Question 10
You are developing a GBM model to predict customers' weekly spend in
supermarkets. From the data you collected, you realised that about 30% of the
target values were zeros, i.e. 30% of customers had zero weekly spend in the past.
State your plan for modelling.
Question 11
A promotion offer was sent to two groups of customers, Group A and Group B,
consisting of 1180 and 5740 customers, respectively. The redemption rate was 21%
for Group A and 25% for Group B. Determine whether the two redemption rates are
significantly different. Report the associated p-value. State any assumptions you
may make.
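One common way to approach this is a pooled two-proportion z-test, sketched below using only the standard library. It assumes independent customers, no overlap between the groups, and samples large enough for the normal approximation. The variable names are illustrative, and other tests (for example a chi-squared test) would also be reasonable.

```python
import math

# Figures from the question
n_a, rate_a = 1180, 0.21   # Group A: size and redemption rate
n_b, rate_b = 5740, 0.25   # Group B: size and redemption rate

# Pooled redemption rate under the null hypothesis of equal rates
pooled = (n_a * rate_a + n_b * rate_b) / (n_a + n_b)

# Standard error of the difference in proportions under the null
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

# z statistic and two-sided p-value from the standard normal distribution
z = (rate_a - rate_b) / se
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

With these inputs the z statistic is about -2.9 and the two-sided p-value is below 0.01, so under the stated assumptions the difference would be judged significant at the usual 5% level.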
Question 12
You have a friend who randomly decides whether he goes out for a drink on Friday
nights with probability of going out being 90%. If he goes out, he randomly chooses
from three bars, A, B and C, with equal probabilities. Suppose you are trying to find
him on a Friday night, and you have checked Bar A and B and he is not in either of
those two. What is the probability that you will find him in Bar C? Apply Bayes'
rule and show your steps.
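The Bayes' rule steps can be sketched with exact fractions as follows. This assumes that a check of a bar is perfect (if he is there, you find him) and that he does not move between bars; the dictionary names are illustrative.

```python
from fractions import Fraction

p_out = Fraction(9, 10)  # probability he goes out at all

# Step 1: prior probability of each possible location
prior = {
    "A": p_out / 3,
    "B": p_out / 3,
    "C": p_out / 3,
    "home": 1 - p_out,
}

# Step 2: likelihood of the evidence "not found in A or B" given each location
likelihood = {"A": 0, "B": 0, "C": 1, "home": 1}

# Step 3: total probability of the evidence
evidence = sum(prior[h] * likelihood[h] for h in prior)

# Step 4: Bayes' rule for the hypothesis "he is in Bar C"
posterior_c = prior["C"] * likelihood["C"] / evidence

print(f"P(not in A or B) = {evidence}")
print(f"P(in C | not in A or B) = {posterior_c}")
```

The evidence term is 3/10 + 1/10 = 2/5, so the posterior is (3/10) / (2/5) = 3/4.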