Data science is among the highest-paid jobs in the IT industry: data scientists earn around $120,000 a year. But did you know that a very strong skill set is required to build such a career?
Do you have any idea how difficult data science interviews can be? That is why I have compiled some of the most frequently asked questions for data science candidates.
1. What is root cause analysis?
Root cause analysis (RCA) is a method for identifying the root causes of problems. A factor is a root cause if removing it from the problem or fault prevents the unwanted event from recurring.
A causal factor is different: it affects an event's outcome, but it is not the root cause. Industrial accidents, software testing, healthcare, and project management are the main areas that use this analysis.
2. What is a resampling method and why is it useful?
Classical parametric tests compare observed statistics with theoretical distributions. Resampling, by contrast, is a data-driven methodology based on repeated sampling from within the same sample.
Resampling refers to methods for achieving the following:
1) Exchanging labels on data points when performing significance tests such as randomization and re-randomization (permutation) tests.
2) Estimating the precision of sample statistics (medians, percentiles) by using subsets of the available data or drawing randomly with replacement.
3) Validating models on random subsets of the data (cross validation, bootstrapping).
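As a small illustration of point 2, the variability of a sample median can be estimated with the bootstrap. This is only a sketch in plain Python; the sample data is invented:

```python
import random
import statistics

random.seed(42)

# Invented sample; any numeric data would do.
sample = [12, 15, 9, 22, 18, 30, 14, 17, 25, 11]

# Bootstrap: draw many resamples with replacement and
# recompute the statistic to estimate its variability.
medians = []
for _ in range(2000):
    resample = [random.choice(sample) for _ in sample]
    medians.append(statistics.median(resample))

# The standard deviation of the bootstrap medians approximates
# the standard error of the sample median.
se_median = statistics.stdev(medians)
```

The same loop works for any statistic (mean, percentile, correlation) by swapping out `statistics.median`.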
3. What is cross validation?
Cross validation evaluates how the outcomes of a particular statistical analysis will generalize to an independent dataset. It is primarily used when the goal is prediction and one wants to estimate how well the model will perform in the real world. To limit the problem of overfitting, the model is evaluated on data that was held out of the training phase.
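The mechanics can be sketched in plain Python: split the data into k folds, then evaluate on each fold after training on the rest. `k_fold_indices` is a hypothetical helper, not a standard library function:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation."""
    # Distribute n items across k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Each of the k folds serves as the held-out test set exactly once.
for train, test in k_fold_indices(10, 3):
    pass  # fit the model on `train`, score it on `test`
```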
4. Explain selection bias
Selection bias is the error introduced when the sample is a non-random subset of the population. Consider a sample of 100 test cases made up of a 55/25/15/5 split across 4 classes that actually occur in equal numbers in the population: a model trained on it would most likely make the false assumption that the skewed class frequencies are a deciding predictive factor.
Avoiding non-random samples altogether is the best antidote to this bias. When that is not possible, boosting, resampling, and weighting are strategies for coping with the situation.
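As one concrete resampling strategy, the minority classes can be oversampled with replacement until the split is balanced. A minimal sketch, using the 55/25/15/5 example above:

```python
import random
from collections import Counter

random.seed(0)

# The skewed 55/25/15/5 sample of 4 classes from the example.
labels = ["a"] * 55 + ["b"] * 25 + ["c"] * 15 + ["d"] * 5

counts = Counter(labels)
target = max(counts.values())  # bring every class up to the majority count

# Oversample each class with replacement until all classes are equal.
balanced = []
for cls in counts:
    members = [x for x in labels if x == cls]
    balanced += random.choices(members, k=target)
```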
5. Why is A/B testing used?
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is useful for reducing bounce rates, enhancing customer engagement, easing analysis, achieving higher conversion rates, and so on.
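A typical analysis of an A/B test compares conversion rates with a two-proportion z-test. A sketch using only the standard library; the conversion counts here are invented:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Variant A: 120/2400 conversions; variant B: 160/2400.
z, p_value = two_proportion_ztest(120, 2400, 160, 2400)
```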
6. What are the benefits of using random forest?
This technique combines several weak learners into one strong learner.
1) The same random forest algorithm can be used for both regression and classification tasks.
2) Using a random forest for classification helps avoid the problem of overfitting.
3) The algorithm can be used for feature engineering: out of all the available features, it can identify the most important features in the dataset.
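Point 3 can be illustrated with scikit-learn (assuming it is installed); the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, n_redundant=0,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importance of each feature; higher means the
# feature contributed more to the trees' splits.
importances = clf.feature_importances_
```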
7. What is logistic regression?
Also known as the logit model, this method is used to predict a binary outcome from a linear combination of predictor variables. The method is popular because its results are easy to interpret.
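A minimal sketch with scikit-learn (assuming it is installed); the data is simulated so that the true log-odds are 2x:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: one predictor, binary outcome that becomes
# more likely as x grows (true coefficient = 2).
x = rng.normal(size=(200, 1))
prob = 1 / (1 + np.exp(-2 * x[:, 0]))
y = (rng.random(200) < prob).astype(int)

model = LogisticRegression().fit(x, y)

# exp(coefficient) is the odds ratio per one-unit increase in x,
# which is why logit models are easy to interpret.
odds_ratio = np.exp(model.coef_[0][0])
```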
8. What do you know about feature vectors?
A feature vector is an n-dimensional vector of numerical features used to represent an item. In machine learning, feature vectors represent numeric or symbolic characteristics (called features) of an object in a mathematical, easily analyzable way.
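For example, a hypothetical house with mixed numeric and symbolic traits can be encoded as a 3-dimensional feature vector:

```python
# An object described by numeric and symbolic traits.
house = {"area_m2": 120.0, "bedrooms": 3, "has_garden": True}

# Encode it as a fixed-order numerical feature vector; the
# boolean trait becomes 0.0/1.0.
feature_vector = [
    house["area_m2"],
    float(house["bedrooms"]),
    1.0 if house["has_garden"] else 0.0,
]
```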
9. Do you have an idea about statistical power?
The statistical power (or sensitivity) of a binary hypothesis test is the probability that the test will correctly reject the null hypothesis (H0) when the alternative hypothesis (H1) is true.
To put it another way, statistical power is the probability that a study will detect an effect when there is an effect to detect. The higher the statistical power, the less likely you are to make a type II error.
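As a sketch, the power of a two-sided one-sample z-test at alpha = 0.05 can be computed with the standard library alone; `power_one_sample_z` is a hypothetical helper name:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_one_sample_z(effect_size, n):
    """Power of a two-sided one-sample z-test at alpha = 0.05."""
    z_crit = 1.959963984540054       # Phi^-1(0.975)
    shift = effect_size * math.sqrt(n)  # mean of the z statistic under H1
    # Probability that |z| exceeds the critical value when H1 is true.
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)
```

With effect size 0.5 and n = 50 the power is about 0.94; doubling n pushes it above 0.99, which is one reason larger studies make type II errors less likely.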
10. How can you ensure that a modification to an algorithm is actually an improvement?
Consider a scenario where you come up with several ideas with improvement potential. Before these ideas are implemented, you have to provide supporting data, and most often only limited results are submitted, which can be affected by selection bias.
There are a few guidelines that help determine whether the modification to the algorithm is actually an improvement:
1) The test data used for the performance comparison should be free of selection bias.
2) Verify whether the results reflect a local maximum/minimum or the global maximum/minimum.
3) The test data should have enough variety to closely resemble real-life data.
4) The test environment should be identical, with no variation between the original algorithm and the new algorithm, when comparing performance.
5) Repeated tests should produce similar results.
11. Which is better huge number of false positives or a huge number of false negatives?
In medical testing, a false negative may give the patient and the doctor the incorrect message that a disease is absent when it is actually present. This is obviously dangerous to the patient, who may go without adequate treatment, so in this case it is better to tolerate many false positives.
In spam filtering, a false positive occurs when the filter wrongly classifies a genuine email message as spam and stops its delivery. In this case, however, false negatives are preferred over false positives.
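The two error types are just the off-diagonal cells of a confusion matrix. A sketch with invented labels (1 = the positive class, e.g. disease present or spam):

```python
# Invented ground truth and predictions (1 = positive class).
actual    = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

# False positive: predicted positive, actually negative.
false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
# False negative: predicted negative, actually positive.
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
```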
12. What are the assumptions required for linear regression?
There are four major assumptions:
1) The residuals of the data are normally distributed and independent of each other.
2) There is a linear relationship between the regressors and the dependent variable. This is another way of saying that your model actually fits the data.
3) Homoscedasticity, which means the variance around the regression line is the same for all values of the predictor variable.
4) There is minimal multicollinearity between the explanatory variables.
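Some of these assumptions can be checked directly from the residuals. A NumPy sketch on simulated data that truly is linear:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a genuinely linear model: y = 3x + 2 + noise.
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With a well-specified linear model, the residuals average to
# zero and show no remaining correlation with the predictor.
mean_resid = residuals.mean()
corr_with_x = np.corrcoef(x, residuals)[0, 1]
```

In practice one would also plot the residuals against the fitted values to eyeball homoscedasticity, and use a Q-Q plot or normality test for assumption 1.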
You can check out this resource for more data science interview questions.