What is Machine Learning?
ML is NOT the solution to every problem. If simple rules work, use them instead.
Main types of ML
- Supervised learning - you have data and labels. The algorithm learns by predicting labels and correcting mistakes.
- Classification - categorizing samples (binary for 2 options, multi-class for 3+)
- Regression - predicting numbers (e.g., “how many users will subscribe?”)
- Unsupervised learning - you have data but no labels. The algorithm finds patterns, and you interpret them. Example: clustering customers into “summer buyers” and “winter buyers” for targeted promotions.
- Transfer learning - reusing a trained model for a new task (e.g., adapting a car recognition model to identify dog breeds). Valuable because training models from scratch is expensive.
- Reinforcement learning - the algorithm learns by trial and error within defined rules, earning rewards or penalties. Example: teaching an AI to play chess by updating its score based on moves.
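To make the supervised idea concrete, here is a minimal sketch of a 1-nearest-neighbour classifier in plain Python. The data, labels, and the `predict_1nn` helper are all invented for illustration; the point is only that labelled examples drive the prediction.

```python
# Toy supervised learning: a 1-nearest-neighbour classifier.
# Labelled training examples are what the prediction is based on.

def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to `query`."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[closest]

# Invented data: feature = hours of study, label = pass/fail
xs = [1.0, 2.0, 8.0, 9.0]
ys = ["fail", "fail", "pass", "pass"]

print(predict_1nn(xs, ys, 1.5))  # near the "fail" examples -> "fail"
print(predict_1nn(xs, ys, 8.5))  # near the "pass" examples -> "pass"
```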
Matching your problem
- Supervised learning - you know inputs and outputs
- Unsupervised learning - you have inputs but no known outputs
- Transfer learning - your problem resembles an existing one
Typical Machine Learning Flow
Problem → Data → Evaluation → Features → Modeling → Experiments
- Problem - What are we trying to solve?
- Data - What data do we have?
- Evaluation - What does success look like?
- Features - Which variables should we feed into the model?
- Modeling - Which model fits our problem best?
- Experiments - What else can we try to improve results?
Evaluation
Evaluation defines what success looks like. Common metrics include:
- Accuracy - How often is the model correct overall?
  $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision - When the model predicts positive, how often is it right?
  $Precision = \frac{TP}{TP + FP}$
- Recall - Of all actual positives, how many did the model catch?
  $Recall = \frac{TP}{TP + FN}$
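The three formulas above translate directly into code. The confusion-matrix counts below are invented for illustration.

```python
# The three metrics computed from confusion-matrix counts
# (TP = true positives, TN = true negatives, etc.).

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Invented example counts
tp, tn, fp, fn = 40, 45, 5, 10
print(f"accuracy:  {accuracy(tp, tn, fp, fn):.2f}")  # 0.85
print(f"precision: {precision(tp, fp):.2f}")         # 0.89
print(f"recall:    {recall(tp, fn):.2f}")            # 0.80
```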
When to prioritize each:
- High precision - Use when false positives are costly. Example: spam filters (don’t want important emails marked as spam).
- The model acts more conservatively: when it reports a positive, it is very likely to be an actual positive.
- High recall - Use when false negatives are costly. Example: cancer detection (missing a case is worse than a false alarm).
- The model becomes more sensitive. It may report more false positives, but it tries its best to catch every potential positive.
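The precision/recall trade-off usually comes down to a decision threshold. The sketch below uses invented scores and labels: lowering the threshold catches more positives (higher recall) at the cost of more false alarms; raising it does the opposite.

```python
# How the decision threshold trades precision against recall.
# Scores (the model's predicted probability of the positive class)
# and labels are invented for illustration.

def precision_recall(scores, labels, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    return prec, rec

scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30]
labels = [True, True, False, True, False, False]

# Low threshold: all positives caught (recall 1.0), one false alarm.
print(precision_recall(scores, labels, 0.5))  # (0.75, 1.0)
# High threshold: no false alarms (precision 1.0), one positive missed.
print(precision_recall(scores, labels, 0.8))  # (1.0, 0.666...)
```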
Modeling
Modeling is the core of the ML workflow, consisting of three stages:
- Training - train the model on data
- Validation - tune the model’s hyperparameters
- Test - verify the model’s performance
The dataset is split accordingly:
| Set | Split |
|---|---|
| Training | 70% – 80% |
| Validation | 10% – 15% |
| Test | 10% – 15% |
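The split in the table can be sketched with the standard library alone (in practice, scikit-learn's `train_test_split` is the usual tool). The shuffle matters: splitting ordered data without shuffling can put all examples of one kind into a single set.

```python
# A 70/15/15 train/validation/test split, shuffled so the
# splits are not biased by the order of the data.
import random

def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=42):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

data = list(range(100))
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 70 15 15
```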
A key goal is generalization — a good model performs well on data it has never seen before, producing similar results across all three sets.
Watch out for these two failure modes:
| Data Set | Underfitting | Overfitting |
|---|---|---|
| Training | 62% | 95% |
| Test | 50% | 60% |
- Underfitting - poor accuracy across the board; the model is too simple for the problem.
- Overfitting - high training accuracy but poor test accuracy; the model memorized the training data instead of learning the underlying pattern.
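A rough diagnostic follows directly from the table: compare training and test accuracy. The thresholds below are illustrative choices, not standard values.

```python
# Flag the two failure modes by comparing train and test accuracy.
# `low` and `gap` are illustrative thresholds, not standard values.

def diagnose(train_acc, test_acc, low=0.7, gap=0.15):
    if train_acc < low:
        return "underfitting"  # weak even on data it has already seen
    if train_acc - test_acc > gap:
        return "overfitting"   # memorized the training data
    return "ok"                # generalizes reasonably well

print(diagnose(0.62, 0.50))  # underfitting (the table's left column)
print(diagnose(0.95, 0.60))  # overfitting (the table's right column)
print(diagnose(0.90, 0.88))  # ok
```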
Commonly Used Tools
Core Python Libraries:
- NumPy - numerical computing with arrays and matrices
- Pandas - data manipulation and analysis
- Matplotlib/Seaborn - data visualization
ML Frameworks:
- Scikit-learn - classical ML algorithms (regression, classification, clustering)
- TensorFlow - Google’s deep learning framework
- PyTorch - Facebook’s deep learning framework, popular in research
- Keras - high-level neural network API (now multi-backend with TensorFlow, PyTorch and JAX)
Development Tools:
- Jupyter Notebook - interactive coding environment
- Google Colab - free cloud-based Jupyter notebooks with GPU access
- Anaconda/Miniconda - Python distribution with package management
Optional but Popular:
- MLflow - experiment tracking and model management
- Hugging Face - pre-trained models for NLP tasks