🔥 *20 Data Science Interview Questions (with Detailed Answers)*
*1. What is Data Science*
A multidisciplinary field that extracts insights from structured and unstructured data using statistics, machine learning, and domain expertise.
*2. What is the difference between supervised and unsupervised learning*
• Supervised: Uses labeled data (e.g., regression, classification)
• Unsupervised: Uses unlabeled data (e.g., clustering, dimensionality reduction)
*3. What is overfitting in machine learning*
When a model learns the noise and incidental details of the training data, so it scores well on training data but performs poorly on unseen data.
Solution: Use regularization, cross-validation, or simpler models.
*4. What is the bias-variance tradeoff*
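• Bias: Error from an overly simplistic model that misses real patterns (underfitting)
• Variance: Error from a model that is too sensitive to the particular training sample, usually because it is too complex (overfitting)
Goal: Balance the two to minimize total error on unseen data.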
*5. What is the difference between classification and regression*
• Classification: Predicts categories (e.g., spam or not)
• Regression: Predicts continuous values (e.g., house price)
*6. What is feature engineering*
Creating new input features from raw data to improve model performance.
Examples: Binning, encoding, scaling, interaction terms.
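A minimal sketch of three of these transformations in plain Python. The helper names, the age cut-offs, and the sample values are made up for illustration; in practice a library such as pandas or scikit-learn would usually do this.

```python
def min_max_scale(values):
    # Scaling: map values linearly onto the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    # Encoding: one 0/1 column per category, in sorted category order.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def bin_age(age):
    # Binning: turn a number into a coarse category (cut-offs are illustrative).
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

ages = [12, 30, 70]
scaled = min_max_scale(ages)
encoded = one_hot(["red", "blue", "red"])
age_groups = [bin_age(a) for a in ages]
```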
*7. What is the purpose of cross-validation*
To evaluate model performance on unseen data by splitting data into training and validation sets multiple times.
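The repeated splitting can be sketched as a k-fold index generator. This is a simplified version with no shuffling or stratification, and the function name is an assumption for this example.

```python
def k_fold_indices(n_samples, k):
    # Split indices 0..n_samples-1 into k folds; each fold is used once
    # as the validation set while the remaining folds form the training set.
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The model is then trained once per fold, and the validation scores are averaged.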
*8. What is a confusion matrix*
A table showing true positives, false positives, true negatives, and false negatives.
Used to evaluate classification models.
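A small sketch of how the four cells are counted for a binary classifier. The labels below are made-up example data, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count the four cells of a binary confusion matrix.
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```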
*9. What are precision, recall, and F1-score*
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F1-score: Harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall)
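The three formulas can be computed directly from confusion-matrix counts. The counts below are invented for illustration, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    # Guard against division by zero when there are no predicted
    # or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```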
*10. What is the difference between bagging and boosting*
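• Bagging: Trains models independently (in parallel) on bootstrap samples and averages or votes their predictions (e.g., Random Forest)
• Boosting: Trains models sequentially, each one correcting the errors of the previous ones (e.g., XGBoost)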
*11. What is PCA (Principal Component Analysis)*
A dimensionality reduction technique that transforms features into uncorrelated principal components, ordered by how much variance each captures, so the first few components retain most of the variance.
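A rough sketch of the idea in plain Python: build the covariance matrix, then find the first principal component with power iteration. Power iteration is swapped in here to avoid dependencies; real PCA implementations typically use an eigendecomposition or SVD. The data and function names are illustrative, and the code assumes the data is not constant.

```python
import math

def covariance_matrix(points):
    # Sample covariance matrix of the mean-centered data.
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=100):
    # Power iteration: repeatedly multiply by the covariance matrix and
    # normalize; the vector converges to the top eigenvector.
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]
cov = covariance_matrix(points)
pc1, variance_along_pc1 = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_along_pc1 / total_variance
print(pc1, explained_ratio)
```

Because the two features are almost perfectly correlated, one component captures nearly all of the variance, so the second dimension could be dropped with little loss.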
*12. What is the difference between parametric and non-parametric models*
• Parametric: Assumes a fixed functional form with a fixed number of parameters (e.g., linear regression)
• Non-parametric: Assumes no fixed form; model complexity grows with the amount of data (e.g., k-NN)
*13. What is the purpose of regularization*
To prevent overfitting by adding a penalty on large coefficients to the loss function.
Types: L1 (Lasso), L2 (Ridge)
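To see the penalty at work, here is a sketch of L2 (Ridge) regression for the simplest possible case: one feature and no intercept, where minimizing Σ(y − w·x)² + λ·w² has the closed-form solution w = Σxy / (Σx² + λ). The data and the λ value are made up.

```python
def ridge_slope(xs, ys, lam):
    # Closed-form ridge solution for a single feature without intercept.
    # lam = 0 gives ordinary least squares; larger lam shrinks w toward 0.
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
w_ols = ridge_slope(xs, ys, lam=0.0)
w_ridge = ridge_slope(xs, ys, lam=10.0)
print(w_ols, w_ridge)  # 2.0 1.5
```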
*14. What is the Central Limit Theorem*
The sampling distribution of the mean of independent samples approaches a normal distribution as sample size increases, regardless of the population's distribution, provided the population has finite variance.
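A quick simulation can make this concrete. The sketch below draws from an exponential distribution (strongly skewed, mean 1, standard deviation 1) and checks that the sample means cluster around 1 with a spread close to 1/√n. The seed, sample size, and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Mean of n independent draws from an exponential distribution.
    return sum(random.expovariate(1.0) for _ in range(n)) / n

n = 50
means = [sample_mean(n) for _ in range(2000)]
mean_of_means = statistics.mean(means)
std_of_means = statistics.stdev(means)
expected_std = 1.0 / n ** 0.5  # population std / sqrt(n)
print(mean_of_means, std_of_means, expected_std)
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are far from normal.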
*15. What is hypothesis testing*
A statistical method for deciding whether sample data provide enough evidence to reject an assumption (the null hypothesis) about a population.
Examples: t-test, chi-square test
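As an illustration, here is a sketch of a one-sample t-test computed by hand. The sample values and the hypothesized mean of 5.0 are invented, and the critical value 2.262 is the standard t-table entry for a two-sided test at α = 0.05 with 9 degrees of freedom.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    # t = (sample mean - hypothesized mean) / standard error of the mean
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu0) / std_err

sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.4, 5.0, 5.7, 5.5, 5.2]
t_stat = one_sample_t(sample, mu0=5.0)

T_CRITICAL = 2.262  # t-table value: two-sided, alpha = 0.05, df = 9
reject_null = abs(t_stat) > T_CRITICAL
print(round(t_stat, 3), reject_null)
```

Here the t-statistic exceeds the critical value, so the null hypothesis that the population mean is 5.0 would be rejected at the 5% level.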
*16. What is the difference between SQL and NoSQL databases*
• SQL: Relational, stores data in tables with a fixed schema (e.g., MySQL)
• NoSQL: Non-relational, flexible schema such as documents or key-value pairs (e.g., MongoDB)
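A rough side-by-side using only the Python standard library. SQLite stands in for a relational database, and a list of dicts stands in for a document store; this is not MongoDB's API, only an illustration of the schema difference. The table and field names are made up.

```python
import sqlite3

# Relational: every row must fit the declared table schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Leeds")],
)
rows = conn.execute("SELECT name FROM users WHERE city = ?", ("Pune",)).fetchall()
conn.close()

# Document style: each record can carry different fields.
documents = [
    {"_id": 1, "name": "Asha", "city": "Pune"},
    {"_id": 2, "name": "Ben", "tags": ["admin"]},  # no city, extra field
]
matches = [doc["name"] for doc in documents if doc.get("city") == "Pune"]

print(rows, matches)  # [('Asha',)] ['Asha']
```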
*17. What is the ROC curve and AUC*
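• ROC: Plots true positive rate (TPR) against false positive rate (FPR) across all classification thresholds
• AUC: Area under the ROC curve; measures the model’s ability to distinguish classes (0.5 = random guessing, 1.0 = perfect)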
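A short sketch of both ideas. `roc_point` gives one (FPR, TPR) point for a chosen threshold, and `auc` uses the pairwise interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The labels and scores are example values, and both classes are assumed to be present.

```python
def roc_point(y_true, scores, threshold):
    # One point on the ROC curve: predict positive when score >= threshold.
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)  # (FPR, TPR)

def auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_point(y_true, scores, threshold=0.5)
auc_value = auc(y_true, scores)
print(fpr, tpr, auc_value)  # 0.0 0.5 0.75
```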
*18. What is time series analysis*
Analyzing data points collected over time to identify trends, seasonality, and other patterns.
Techniques: ARIMA, seasonal decomposition, forecasting
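A simple moving average is a common first step, for example to estimate the trend in a classical decomposition. The sketch below is only that smoothing step, not ARIMA, and the series values are made up.

```python
def moving_average(series, window):
    # Average of each run of `window` consecutive points;
    # the output has len(series) - window + 1 values.
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]

series = [10, 12, 14, 16, 18]
smoothed = moving_average(series, window=3)
print(smoothed)  # [12.0, 14.0, 16.0]
```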
*19. What is the difference between batch and online learning*
• Batch: Trains on entire dataset
• Online: Trains incrementally as data arrives
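A toy contrast, using the mean as the "model" being learned. The batch version needs the whole dataset at once, while the online version updates its estimate one observation at a time without storing the data. The class name and data are illustrative.

```python
def batch_mean(data):
    # Batch: requires the entire dataset up front.
    return sum(data) / len(data)

class OnlineMean:
    # Online: keeps only a running estimate and a count.
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean update
        return self.mean

stream = [2, 4, 6, 8]
online = OnlineMean()
for x in stream:
    online.update(x)

print(batch_mean(stream), online.mean)  # 5.0 5.0
```

The same pattern applies to real models: stochastic gradient descent, for example, updates the weights after each new example or mini-batch.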
*20. What is the role of a data scientist in a business setting*
• Understand business problems
• Collect and clean data
• Build models
• Communicate insights
• Drive data-driven decisions
