Week 5 FAQs

Author

Pulkit Mangal


📚 Topics Covered in Week 5

  1. Supervised Learning
    • Regression
    • Classification
  2. Linear Regression
    • Simple Linear Model
    • Normal Equation
    • Gradient Descent
    • Stochastic Gradient Descent (SGD)
    • Kernel Regression
    • Probabilistic Perspective

πŸ” Visualization of Linear Regression

Linear regression is a fundamental statistical method that models the relationship between input features and a target variable by fitting a linear function. It assumes an approximately linear relationship between the data points and their corresponding labels.

Single Feature Regression

Linear regression with one feature

Two Features Regression

Linear regression with two features

For data points in \(\mathbb{R}^d\), the graph of the best-fit linear function is a hyperplane in \(\mathbb{R}^{d+1}\) (the \(d\) feature axes together with the target axis).


πŸ” Performance Metrics

For a given dataset, where \(\mathbf{x}_i\) is the \(i^{\text{th}}\) data point, \(y_i\) its true label, and \(f(\mathbf{x}_i)\) the model's prediction, we define the following error metrics (a short code sketch follows the list):

  • Sum of Squared Errors (SSE) \[ \text{SSE} = \sum_{i=1}^{n} \left[f(\mathbf{x}_i) - y_i \right]^2 \]

  • Mean Squared Error (MSE) \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left[f(\mathbf{x}_i) - y_i \right]^2 \]

  • Root Mean Squared Error (RMSE) \[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[f(\mathbf{x}_i) - y_i \right]^2} \]
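
Below is a minimal NumPy sketch of these three metrics (the function and variable names are illustrative, not part of the course material):

```python
import numpy as np

def regression_metrics(y_pred, y_true):
    """Return SSE, MSE and RMSE of predictions y_pred against true labels y_true."""
    residuals = y_pred - y_true        # f(x_i) - y_i
    sse = np.sum(residuals ** 2)       # Sum of Squared Errors
    mse = sse / len(y_true)            # Mean Squared Error
    rmse = np.sqrt(mse)                # Root Mean Squared Error
    return sse, mse, rmse

# Example usage
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(regression_metrics(y_pred, y_true))
```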


πŸ” Normal Equation

The Normal Equation provides a closed-form solution to the least squares problem. For a data matrix \(\mathbf{X}\) of shape \(d \times n\), whose columns are the data points \(\mathbf{x}_i\):

\[ \hat{\mathbf{w}} = \left(\mathbf{XX}^T\right)^{\dagger} \mathbf{Xy} \]

where \((\mathbf{XX}^T)^{\dagger}\) represents the Moore–Penrose inverse and \(\mathbf{y}\) is the true label vector.

If \(\mathbf{XX}^T\) is invertible then

\[ \hat{\mathbf{w}} = \left(\mathbf{XX}^T\right)^{-1} \mathbf{Xy} \]

For a single feature (with no intercept term), this simplifies to the scalar form:

\[ \hat{w} = \frac{\sum^{n}_{i=1} x_i y_i}{\sum^{n}_{i=1} x_i^2} \]
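
As a rough illustration of the closed-form solution, keeping the course convention that \(\mathbf{X}\) is \(d \times n\) with data points as columns (the synthetic data and names below are purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))                  # data matrix, one data point per column
w_true = np.array([2.0, -1.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)  # labels with a little Gaussian noise

# Normal Equation via the Moore–Penrose pseudo-inverse: w_hat = (X X^T)^† X y
w_hat = np.linalg.pinv(X @ X.T) @ X @ y
print(w_hat)                                 # close to w_true
```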


πŸ” Gradient Descent Method

Computing the Moore–Penrose inverse is computationally expensive for large datasets. Instead, Gradient Descent provides an iterative optimization approach to minimize the error function.

The gradient of the SSE loss \(L(\mathbf{w}) = \lVert \mathbf{X}^T\mathbf{w} - \mathbf{y} \rVert^2\) is given by:

\[ \nabla L(\mathbf{w}) = 2[\mathbf{XX}^T\mathbf{w} - \mathbf{Xy}] \]

The weight update rule is:

\[ \mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta \nabla{L(\mathbf{w})} \]

where \(L\) represents the loss function, and \(\eta\) is the learning rate.
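
A minimal sketch of this update rule for the SSE loss, assuming the same \(d \times n\) data-matrix convention (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=1e-3, num_iters=5000):
    """Minimise SSE = ||X^T w - y||^2 by gradient descent; X is d x n."""
    w = np.zeros(X.shape[0])
    for _ in range(num_iters):
        grad = 2 * (X @ X.T @ w - X @ y)  # gradient of SSE from the formula above
        w -= eta * grad                   # w^{t+1} = w^t - eta * grad L(w)
    return w

# Example on a small synthetic dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
y = X.T @ np.array([1.5, -0.5]) + 0.05 * rng.normal(size=50)
print(gradient_descent(X, y))             # approaches [1.5, -0.5]
```

Note that \(\eta\) must be small enough for the iterates to converge; too large a learning rate makes the updates diverge.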


πŸ” Stochastic Gradient Descent (SGD)

For large datasets, computing gradients over the entire dataset can be expensive. SGD mitigates this by computing gradients using a randomly selected subset of data at each iteration.

📺 Watch this video for a detailed explanation: SGD vs GD

⚠️ Note: As SGD updates are based on small subsets, it introduces more variance during convergence, as visualized below:

SGD vs GD
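
A mini-batch variant can be sketched as follows; the only change from the full-batch loop above is that each gradient is computed on a randomly drawn subset (the batch size, learning rate, and names are illustrative):

```python
import numpy as np

def sgd(X, y, eta=1e-2, batch_size=8, num_iters=2000, seed=0):
    """SGD for the SSE loss: each step uses a random mini-batch of columns of X."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        idx = rng.choice(n, size=batch_size, replace=False)  # random subset of points
        Xb, yb = X[:, idx], y[idx]
        grad = 2 * (Xb @ Xb.T @ w - Xb @ yb)  # gradient estimated on the batch only
        w -= eta * grad
    return w
```

Because each gradient is only an estimate, the iterates fluctuate around the minimum rather than settling exactly on it, which is the extra variance mentioned above.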

πŸ” Geometric Perspective

Our objective is to find \(\mathbf{w}\) such that:

\[ \mathbf{X}^T \mathbf{w} \approx \mathbf{y} \]

From a geometric viewpoint, \(\mathbf{X}^T \mathbf{w}\) is a linear combination of the feature vectors (the columns of \(\mathbf{X}^T\)). The best approximation of \(\mathbf{y}\) is therefore its orthogonal projection onto the subspace spanned by these columns.

Geometric view

Requiring the residual \(\mathbf{y} - \mathbf{X}^T\hat{\mathbf{w}}\) to be orthogonal to every column of \(\mathbf{X}^T\) gives \(\mathbf{X}\left(\mathbf{X}^T\hat{\mathbf{w}} - \mathbf{y}\right) = \mathbf{0}\), i.e. \(\mathbf{XX}^T\hat{\mathbf{w}} = \mathbf{Xy}\). This naturally leads back to the Normal Equation formulation.
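
A quick numerical check of this picture (illustrative code, not from the course material): the residual \(\mathbf{y} - \mathbf{X}^T\hat{\mathbf{w}}\) should be orthogonal to every column of \(\mathbf{X}^T\).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 20))             # d x n data matrix
y = rng.normal(size=20)                  # arbitrary targets

w_hat = np.linalg.pinv(X @ X.T) @ X @ y  # least-squares solution
residual = y - X.T @ w_hat               # component of y outside the span

print(X @ residual)                      # ~ zero vector: residual is orthogonal to the feature vectors
```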


πŸ” Kernel Regression

For non-linear relationships, the data points are first mapped to a feature space via \(\phi\), and the weight vector is expressed as a linear combination of the mapped points:

\[ \hat{\mathbf{w}} = \phi(\mathbf{X})\boldsymbol{\alpha} \]

where \(\boldsymbol{\alpha} = \mathbb{K}^{\dagger} \mathbf{y}\) and \(\mathbb{K}\) is the kernel (Gram) matrix with entries \(\mathbb{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\).

Prediction Formula:

\[ \hat{y} = \sum^{n}_{i=1} \hat{\alpha}_i \cdot k(\mathbf{x}_\text{test}, \mathbf{x}_i) \]

  • \(\hat{\alpha}_i\) represents the importance of the \(i^{\text{th}}\) data point in predicting the label.
  • \(k(\mathbf{x}_\text{test}, \mathbf{x}_i)\) measures the similarity between the test point and the \(i^{\text{th}}\) training point.
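
The prediction formula can be sketched end to end with an RBF kernel, which is one common choice of \(k\) (the kernel, its parameter \(\gamma\), and all names below are illustrative assumptions, not prescribed by the course):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def fit_alpha(X, y, gamma=1.0):
    """alpha = K^† y, where K_ij = k(x_i, x_j) and the columns of X are data points."""
    n = X.shape[1]
    K = np.array([[rbf_kernel(X[:, i], X[:, j], gamma) for j in range(n)]
                  for i in range(n)])
    return np.linalg.pinv(K) @ y

def predict(x_test, X, alpha, gamma=1.0):
    """y_hat = sum_i alpha_i * k(x_test, x_i)."""
    return sum(alpha[i] * rbf_kernel(x_test, X[:, i], gamma) for i in range(X.shape[1]))
```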

πŸ” Probabilistic Perspective

We assume that each label is generated by adding noise to the true relationship:

\[ y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i \]

where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\), leading to:

\[ y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2) \]

Using Maximum Likelihood Estimation (MLE), we derive:

\[ \min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 \]

which coincides with minimizing the SSE. This equivalence holds under the Gaussian noise assumption. If the noise is instead assumed to follow a Laplace distribution, MLE leads to minimizing the absolute error, which gives a more robust regression technique. Read more on Laplace distribution applications.
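
For completeness, the MLE step can be sketched as follows: with independent samples and \(y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}_i, \sigma^2)\), the log-likelihood is

\[ \ell(\mathbf{w}) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2 \]

Since the first term and \(\sigma^2\) do not depend on \(\mathbf{w}\), maximizing \(\ell(\mathbf{w})\) is equivalent to minimizing the sum of squared errors.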


💡 Need Help?

For technical queries, feel free to reach out via email: 📧 22f3001839@ds.study.iitm.ac.in