Machine Learning 101

What is Machine Learning?

Refer to this link

ML is NOT the solution to every problem. If simple rules work, use them instead.

Main types of ML

  • Supervised learning - you have data and labels. The algorithm learns by predicting labels and correcting mistakes.

    • Classification - categorizing samples (binary for 2 options, multi-class for 3+)
    • Regression - predicting numbers (e.g., “how many users will subscribe?”)
  • Unsupervised learning - you have data but no labels. The algorithm finds patterns, and you interpret them. Example: clustering customers into “summer buyers” and “winter buyers” for targeted promotions.

  • Transfer learning - reusing a trained model for a new task (e.g., adapting a car recognition model to identify dog breeds). Valuable because training models from scratch is expensive.

  • Reinforcement learning - the algorithm learns by trial and error within defined rules, earning rewards or penalties. Example: teaching an AI to play chess by updating its score based on moves.
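To make the supervised learning bullet concrete ("learns by predicting labels and correcting mistakes"), here is a minimal sketch of a perceptron in plain Python. The perceptron is just one simple algorithm chosen for illustration, and the data set (hours active and pages viewed, mapped to "subscribed or not") is made up.

```python
# Toy supervised learning: predict a label, then correct the weights
# whenever the prediction turns out to be wrong.

def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Fit weights by repeatedly fixing mistakes on the labeled samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                # Nudge the weights toward the correct answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias


# Hypothetical data: (hours_active, pages_viewed) -> subscribed (1) or not (0)
samples = [(1, 1), (2, 1), (1, 2), (4, 5), (5, 4), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

This only works out cleanly because the toy data is linearly separable; real problems usually need a library such as Scikit-learn and a held-out test set.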

Matching your problem

  • Supervised learning - you know inputs and outputs
  • Unsupervised learning - you have inputs but no known outputs
  • Transfer learning - your problem resembles an existing one

Typical Machine Learning Flow

Problem → Data → Evaluation → Features → Modeling → Experiments
  • Problem - What are we trying to solve?
  • Data - What data do we have?
  • Evaluation - What does success look like?
  • Features - Which variables should we feed into the model?
  • Modeling - Which model fits our problem best?
  • Experiments - What else can we try to improve results?

Evaluation

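Evaluation defines what success looks like. Common classification metrics are listed below, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives: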

  • Accuracy - How often is the model correct overall?
    $Accuracy = \frac {TP + TN} {TP + TN + FP + FN}$

  • Precision - When the model predicts positive, how often is it right?
    $Precision = \frac {TP} {TP + FP}$

  • Recall - Of all actual positives, how many did the model catch?
    $Recall = \frac {TP} {TP + FN}$
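The three formulas above translate directly into code. The sketch below counts TP, TN, FP, and FN (true/false positives/negatives) from two label lists and applies the formulas; the function names and the sample labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Apply the accuracy, precision, and recall formulas, guarding against /0."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground truth and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)   # (3, 5, 1, 1)
metrics = classification_metrics(*counts)
print(metrics)  # accuracy 0.8, precision 0.75, recall 0.75
```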

When to prioritize each:

  • High precision - Use when false positives are costly. Example: spam filters (don’t want important emails marked as spam).
    • The model becomes more conservative: when it reports a positive, it is very likely a true positive.
  • High recall - Use when false negatives are costly. Example: cancer detection (missing a case is worse than a false alarm).
    • The model becomes more sensitive. It may report some false positives, but it tries to catch every potential positive.
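One common way to trade precision against recall, assuming the model outputs a score rather than a hard label, is to move the decision threshold. The sketch below uses made-up scores and labels; `precision_recall_at` is a hypothetical helper, not a library function.

```python
def precision_recall_at(threshold, scores, labels):
    """Treat scores >= threshold as positive predictions, then measure both metrics."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

strict = precision_recall_at(0.75, scores, labels)   # precision 1.0, recall 0.6
lenient = precision_recall_at(0.25, scores, labels)  # precision 5/7, recall 1.0
print(strict, lenient)
```

The strict threshold behaves like the spam filter case (few false positives, some misses), while the lenient one behaves like the cancer screening case (no misses, more false alarms).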

Modeling

Modeling is the core of the ML workflow, consisting of three stages:

  • Training - train the model on data
  • Validation - tune the model’s hyperparameters
  • Test - verify the model’s performance

The dataset is split accordingly:

| Set        | Split     |
|------------|-----------|
| Training   | 70% – 80% |
| Validation | 10% – 15% |
| Test       | 10% – 15% |
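As a rough sketch, a three-way split can be done by shuffling the rows and slicing. The 70/15/15 fractions below are one choice within the ranges given above, and `split_dataset` is a hypothetical helper (libraries such as Scikit-learn provide their own splitting utilities).

```python
import random


def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows reproducibly, then slice into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test


# 100 placeholder rows standing in for a real data set
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before slicing matters when the data is ordered (for example by date or by class); for time series a chronological split is usually more appropriate.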

A key goal is generalization: a good model performs well on data it has never seen before, producing similar results across all three sets.

Watch out for these two failure modes:

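| Data Set | Underfitting (accuracy) | Overfitting (accuracy) |
|----------|-------------------------|------------------------|
| Training | 62%                     | 95%                    |
| Test     | 50%                     | 60%                    |
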
  • Underfitting - poor accuracy across the board; the model is too simple for the problem.
  • Overfitting - high training accuracy but poor test accuracy; the model memorized the training data instead of learning the underlying pattern.
Figure: Underfitting, Overfitting and Balanced
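A quick way to spot these failure modes is to compare training and test accuracy. The sketch below is a rule of thumb only: the `min_acceptable` and `max_gap` thresholds are arbitrary assumptions that depend on the problem, and `diagnose` is a hypothetical helper.

```python
def diagnose(train_acc, test_acc, min_acceptable=0.80, max_gap=0.10):
    """Rough heuristic: low training accuracy suggests underfitting,
    a large train/test gap suggests overfitting.
    Both thresholds are illustrative assumptions, not standard values."""
    if train_acc < min_acceptable:
        return "underfitting"
    if train_acc - test_acc > max_gap:
        return "overfitting"
    return "balanced"


# Accuracy pairs from the table above, plus one made-up balanced case
print(diagnose(0.62, 0.50))  # underfitting
print(diagnose(0.95, 0.60))  # overfitting
print(diagnose(0.90, 0.88))  # balanced
```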

Commonly Used Tools

Core Python Libraries:

  • NumPy - numerical computing with arrays and matrices
  • Pandas - data manipulation and analysis
  • Matplotlib/Seaborn - data visualization

ML Frameworks:

  • Scikit-learn - classical ML algorithms (regression, classification, clustering)
  • TensorFlow - Google’s deep learning framework
  • PyTorch - deep learning framework originally developed by Facebook (now Meta), popular in research
  • Keras - high-level neural network API (now multi-backend with TensorFlow, PyTorch and JAX)

Development Tools:

  • Jupyter Notebook - interactive coding environment
  • Google Colab - free cloud-based Jupyter notebooks with GPU access
  • Anaconda/Miniconda - Python distribution with package management

Optional but Popular:

  • MLflow - experiment tracking and model management
  • Hugging Face - hub and libraries for pre-trained models, best known for NLP tasks