Reading-Notes

Project maintained by eslamakram

Machine Learning

Bird’s Eye View of the machine learning workflow:

Our machine learning blueprint is designed around those 3 elements.

There are 5 core steps:

Exploratory Analysis

In many ways, training an ML model is like growing a startup. In both cases, you have too many tactics to choose from:

Should you clean your data more? Engineer features? Test new algorithms? Etc.

There’s a lot of trial and error, so how do you avoid chasing dead ends? The answer is “Exploratory Analysis.” (Which is just fancy-talk for “getting to know” your data.)
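As a rough illustration of "getting to know" your data, the sketch below computes a few summary statistics per column using only the Python standard library. The toy `rows` dataset and the `summarize` helper are hypothetical, made up for this example; in practice a dataframe library would usually do this job.

```python
import statistics

# Hypothetical toy dataset: each dict is one observation.
rows = [
    {"sqft": 850, "beds": 2, "price": 200000},
    {"sqft": 1200, "beds": 3, "price": 310000},
    {"sqft": None, "beds": 3, "price": 295000},  # one missing value
    {"sqft": 2000, "beds": 4, "price": 450000},
]


def summarize(rows, column):
    """Return basic summary statistics for one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# One summary per column; missing counts hint at where cleaning is needed.
summary = {col: summarize(rows, col) for col in rows[0]}
for col, stats in summary.items():
    print(col, stats)
```

Even a quick pass like this shows ranges, typical values, and missing entries, which helps decide where to spend effort next.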

Data Cleaning

Proper data cleaning is the “secret sauce” behind machine learning. Well, it’s not really a secret. It’s just a bit boring, so hardly anyone talks about it. But the truth is:

Better data beats fancier algorithms…

Feature Engineering

Algorithm Selection

Choose the best, most appropriate algorithms without wasting your time.

Model Training

Finally, train your models. This step is pretty formulaic once you’ve done the first 4.

Model Training

At last, it’s time to build our models!

It might seem like it took us a while to get here, but professional data scientists actually spend the bulk of their time on the steps leading up to this one:

Exploring the data.
Cleaning the data.
Engineering new features.

Again, that’s because better data beats fancier algorithms.

Split Dataset

Let’s start with a crucial but sometimes overlooked step: spending your data. Think of your data as a limited resource.

You can spend some of it to train your model (i.e. feed it to the algorithm). You can spend some of it to evaluate (test) your model. But you can’t reuse the same data for both!
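A minimal sketch of "spending" data this way, assuming a simple random split is appropriate. The `split_rows` helper, the 80/20 ratio, and the seed are illustrative choices, not something these notes prescribe.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test).

    Each row ends up in exactly one of the two sets, so no data is
    reused for both training and evaluation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for real observations
train, test = split_rows(data, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

The test set is then set aside until the very end, and only the training set is used while fitting and tuning.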

Model parameters

Model parameters are learned attributes that define individual models.

e.g. regression coefficients
e.g. decision tree split locations
They can be learned directly from the training data.

Hyperparameters

Hyperparameters express “higher-level” structural settings for algorithms.

e.g. strength of the penalty used in regularized regression
e.g. the number of trees to include in a random forest
They are decided before fitting the model because they can’t be learned directly from the training data.
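To make the distinction concrete, here is a small sketch using one-feature ridge regression with no intercept, chosen only because it has a closed-form solution. The `fit_ridge_1d` function and the toy data are assumptions for illustration. The coefficient `w` is a model parameter (learned from the data), while `penalty` is a hyperparameter (fixed before fitting).

```python
def fit_ridge_1d(xs, ys, penalty):
    """Fit y ~ w * x by minimizing sum((y - w*x)^2) + penalty * w^2.

    `penalty` is the hyperparameter: it is chosen before fitting.
    The returned coefficient `w` is the parameter: it is learned from
    the data.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x

w_unpenalized = fit_ridge_1d(xs, ys, penalty=0.0)
w_penalized = fit_ridge_1d(xs, ys, penalty=10.0)
print(w_unpenalized, w_penalized)
```

The same data yields a different learned coefficient for each penalty value, which is why the penalty has to be chosen by some procedure outside the fitting step itself.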

Fit and Tune Models

Now that we’ve split our dataset into training and test sets, and we’ve learned about hyperparameters and cross-validation, we’re ready to fit and tune our models.

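Basically, all we need to do is perform the entire cross-validation loop (hold out one fold of the training set, fit on the remaining folds, score on the held-out fold, and repeat for every fold) on each set of hyperparameter values we’d like to try, then keep the values with the best average score.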
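The tuning loop can be sketched roughly as follows, assuming k-fold cross-validation on the training set and the same hypothetical one-feature ridge model as a stand-in for a real algorithm. The function names, the toy data, and the candidate penalty values are all illustrative assumptions.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Fit y ~ w * x with an L2 penalty; returns the learned coefficient."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mean_squared_error(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_error(xs, ys, penalty, k=3):
    """Average held-out error over k folds for one hyperparameter value."""
    n = len(xs)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th row forms a fold
        fit_x = [xs[i] for i in range(n) if i not in held_out]
        fit_y = [ys[i] for i in range(n) if i not in held_out]
        val_x = [xs[i] for i in sorted(held_out)]
        val_y = [ys[i] for i in sorted(held_out)]
        w = fit_ridge_1d(fit_x, fit_y, penalty)
        errors.append(mean_squared_error(w, val_x, val_y))
    return sum(errors) / k


# Training data only; the test set stays untouched during tuning.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

candidates = [0.0, 1.0, 10.0, 100.0]  # hyperparameter values to try
cv_scores = {p: cross_val_error(train_x, train_y, p) for p in candidates}
best_penalty = min(cv_scores, key=cv_scores.get)

# Refit on the full training set with the winning hyperparameter value.
final_w = fit_ridge_1d(train_x, train_y, best_penalty)
print(best_penalty, final_w)
```

Only after this step would the held-out test set be used, once, to estimate how the chosen model performs on unseen data.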
