Data Fundamentals - Part 2

Statistics and machine learning

Jun 16, 2022

When I talk to SharpestMinds mentors, I like to ask them about the mistakes they see aspiring data scientists make. One common answer: too much focus on tools and state-of-the-art algorithms and not enough on the fundamentals.

The fundamentals for data-related roles can be grouped into four broad categories. The importance—and the necessary depth—of each category will vary by role, company, and industry.

SQL
Statistics + machine learning
Programming / engineering
Communication / business acumen

This is part two in a series going through each of the categories above, with advice sourced from the SharpestMinds community.

Statistics + machine learning

A solid foundation of statistics and machine learning concepts is important for data scientists. A common mistake many beginners make, however, is to jump too quickly to new and exciting state-of-the-art neural nets before mastering the fundamentals.

SharpestMinds mentor Vejey likes to ensure that his mentees have a strong understanding of classic machine learning models before they even think about deep learning. His list, shown below, “basically covers all the variants of ML algorithms: regression, clustering, dimensionality reduction, classification, bagging, and boosting.”

Linear Regression (all major variants: Lasso, Ridge, ...)
Logistic Regression
Naive Bayes
KNN
PCA
K-Means Clustering
Support Vector Machine
Decision Tree
Random Forest
XGBoost

To ensure his mentees fully grok each algorithm, Vejey makes sure they can: (1) explain it to a general audience, (2) explain it to a technical audience, (3) implement it in code (using popular libraries like scikit-learn), and (4), if there is time, implement it from “scratch” (e.g., using NumPy).
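As a rough illustration of step (4), here is a minimal sketch of linear regression implemented from scratch with NumPy. It assumes ordinary least squares solved through the normal equations, which is only one of several ways to fit the model. The function names and the synthetic data are made up for this example and do not come from Vejey's material.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations: (X^T X) w = X^T y."""
    # Prepend a column of ones so the first weight acts as the intercept.
    X_b = np.column_stack([np.ones(len(X)), X])
    # Solve the linear system instead of inverting X^T X explicitly.
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def predict(X, weights):
    """Apply the fitted weights (intercept first) to new feature rows."""
    X_b = np.column_stack([np.ones(len(X)), X])
    return X_b @ weights


# Synthetic, noise-free data from y = 1 + 2*x1 - 3*x2 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

weights = fit_linear_regression(X, y)
print(np.round(weights, 3))  # intercept first, then one coefficient per feature
```

For step (3), the same data could be fitted with scikit-learn's `LinearRegression` and the coefficients compared against the from-scratch result.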

Once these fundamentals are down, his mentees might move on to deep learning, depending heavily on the role and domain they are interested in. Computer vision and NLP applications, for example, are more likely to employ deep learning algorithms such as ConvNets and Transformers, respectively. To understand these algorithms, Vejey uses the same four steps described above.

Of course, understanding all of these machine learning models requires a base knowledge of some underlying statistical concepts. I asked the SM Slack and LinkedIn which concepts matter most, and the most popular answers were:

Gradient descent and stochastic gradient descent (and the difference between the two)
Bias-variance tradeoff
Precision and recall
Regularization
Similarity/distance measures
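The first item on the list, the difference between gradient descent and stochastic gradient descent, can be sketched in a few lines. This is a minimal example under stated assumptions: a linear model without an intercept, a mean squared error loss, noise-free synthetic data, and hand-picked learning rates. All names and values are illustrative.

```python
import numpy as np


def mse_gradient(w, X, y):
    """Gradient of the mean squared error of the linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def batch_gradient_descent(X, y, lr=0.1, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # One update per epoch, using the gradient over the full dataset.
        w -= lr * mse_gradient(w, X, y)
    return w


def stochastic_gradient_descent(X, y, lr=0.02, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # One update per example, visiting the examples in shuffled order.
        for i in rng.permutation(len(y)):
            w -= lr * mse_gradient(w, X[i:i + 1], y[i:i + 1])
    return w


# Synthetic, noise-free data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_batch = batch_gradient_descent(X, y)
w_sgd = stochastic_gradient_descent(X, y)
print(np.round(w_batch, 3), np.round(w_sgd, 3))
```

The only difference between the two functions is how much data feeds each update: the batch version computes one gradient over all rows, while the stochastic version takes many cheaper, noisier steps, one per row.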

This list is by no means exhaustive, and each item rests on top of even more fundamental concepts from probability, linear algebra, and calculus. Understanding these concepts, along with how they are related to the algorithms listed above, would make a great base of knowledge for aspiring data scientists.
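To make one item from the list above concrete, precision and recall can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 as the positive class, and it returns 0.0 when a denominator is zero, which is a convention chosen for this example. The labels are invented.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here the model is right about two of its three positive predictions (precision of 2/3) but finds only two of the four actual positives (recall of 1/2).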

For other data roles—like data analysis, machine learning engineering, and data engineering—a robust understanding of statistics and ML theory is not always necessary. But according to Judy, a former director of data science and product management, “it depends on the company. Data analysts at smaller companies are typically not dealing with large amounts of data—unless their product is an ML product. They are more interested in how to accelerate the business, so the requirements might be more centered around how to improve KPIs or knowing your way around a dashboard.

“Companies with larger staff tend to be more sophisticated in their data science and will require analysts, especially if they plan to help with experimentation, to have some basic stats and probability understanding,” says Judy. “With ML engineering, it's closer to… software engineering, but at larger companies [that have to] process large amounts of data in real-time, a basic understanding of linear algebra may be important because you need to write efficient code.”

Here are some resources that the SharpestMinds community recommends for learning and mastering statistics and machine learning:

Landing a Data Job: The Course - A series of video lessons from the SharpestMinds team on some key data science and machine learning concepts
The scikit-learn user guide has good intros (with code) on many fundamental ML concepts
ZedStatistics - YouTube channel with lessons on stats
StatQuest with Josh Starmer - Another YouTube channel to learn statistics
Intro to Probability, Statistics, and Random Processes - A good book on probability and statistics for beginners
Practical Statistics for Data Scientists - A good book to brush up on basic concepts. Great material to prep for data science interviews!
Ken Jee - A great YouTube channel for stats and data science
Tina Huang - Another great YouTube channel for stats and data science

Mathematics for Machine Learning - Free e-book. According to an SM mentor who works at Meta, “The level of understanding of basic algorithms in this book is about the level of understanding you'd need for an ML eng/research scientist job at Meta.”

The SharpestMinds Newsletter
