### Problem 1
Load the Boston housing dataset and perform linear regression on it using the Linear Least Squares method. Check the performance (MSE) on a selected testing subset.

Don't forget that if our model is $y = a_1\theta_1 + \ldots + a_n\theta_n + \theta_0$, we need to expand the data matrix with an additional column of ones to match the $n+1$ parameters (including $\theta_0$).

Useful function: `scipy.linalg.pinv`
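The column-of-ones trick can be sketched on synthetic data before touching the real dataset. This is only an illustration: the array names and toy coefficients are assumptions, and `np.linalg.pinv` stands in for `scipy.linalg.pinv` to keep the snippet NumPy-only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples, 3 features, known parameters (illustrative values).
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
true_bias = 4.0
y = X @ true_theta + true_bias

# Append a column of ones so the last parameter plays the role of theta_0.
A = np.hstack((X, np.ones((X.shape[0], 1))))

# Least-squares solution via the Moore-Penrose pseudoinverse.
theta = np.linalg.pinv(A) @ y
print(theta)
```

Since the toy targets are noise-free, the recovered vector should match the three slopes followed by the intercept.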

In [1]:
import numpy as np
import sklearn.datasets as ds
import scipy.linalg as la

In [2]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
data = ds.load_boston()

In [3]:
test_set = np.arange(100)
train_set = np.arange(100, data.data.shape[0])

In [4]:
# Append a column of ones so that the last parameter acts as theta_0
A = np.hstack((data.data[train_set,:], np.ones((train_set.shape[0],1))))
print(A.shape)

(406, 14)


In [5]:
# Solve the normal equations (A^T A) theta = A^T y using the pseudoinverse
B = A.T @ A
y = A.T @ data.target[train_set]
theta = la.pinv(B) @ y
print(theta)

[-1.09124839e-01  5.88996298e-02  5.32143114e-02  2.53136435e+00
 -2.11593358e+01  3.28228540e+00 -4.04396115e-03 -1.78329769e+00
  3.18774694e-01 -1.25180886e-02 -1.01186219e+00  9.28779352e-03
 -5.81756268e-01  4.44936476e+01]


In [6]:
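# Predict on the test subset (same column of ones appended)
y_pred = np.hstack((data.data[test_set,:], np.ones((test_set.shape[0],1)))) @ theta
y_real = data.target[test_set]
# Root-mean-square error, i.e. the square root of the MSE
rmse = np.sqrt(np.mean((y_pred - y_real)**2))
print(rmse)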

3.5498621468654714


In [7]:
print(y_real.min(), y_real.max(), y_real.mean())

12.7 43.8 22.309


### Problem 2
Load the *digits* dataset and select the images corresponding to digits 0 and 1. Set apart a random 10% of the data for evaluation. Use Linear Least Squares to train a naive classifier by regressing the pixel values of each digit onto its numerical value:
$$y = \sum_{i} a_i\theta_i + \theta_0$$
where $y$ is 0 or 1. Perform classification by thresholding the regressed value. Check the accuracy on the training and testing subsets.

**Make sure to fix random seed for reproducibility.**

In [8]:
import numpy as np
import sklearn.datasets as ds
import scipy.linalg as la

In [9]:
D = ds.load_digits()

In [10]:
# Note: this solution selects digits 7 and 1 (the problem statement asks for 0 and 1)
ind0 = np.where(D.target==7)[0]
ind1 = np.where(D.target==1)[0]
inds = np.union1d(ind0, ind1)

In [11]:
eval_set = np.random.choice(inds, int(0.1*inds.size))
train_set = np.setdiff1d(inds, eval_set)
eval_set = np.setdiff1d(inds, train_set)
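
Note that `np.random.choice` samples with replacement by default, so the call above can return duplicate indices and the evaluation set may end up smaller than 10%. The seed is also not fixed. A possible alternative is sketched below, assuming a seeded `np.random.default_rng` generator is acceptable. The seed value 42 and the stand-in index array are arbitrary choices, not part of the original solution.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (42 is an arbitrary choice)

# Hypothetical stand-in for the indices of the selected digit images.
inds = np.arange(200)

# Shuffle once, then cut: the first 10% go to evaluation, the rest to training.
perm = rng.permutation(inds)
n_eval = int(0.1 * inds.size)
eval_set = np.sort(perm[:n_eval])
train_set = np.sort(perm[n_eval:])

print(eval_set.size, train_set.size)
```

Because the two subsets come from one permutation, they are disjoint by construction and no `np.setdiff1d` cleanup is needed.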

In [12]:
A = np.hstack((D.data[train_set,:], np.ones((train_set.shape[0],1))))
B = A.T @ A
y = A.T @ D.target[train_set]
theta = la.pinv(B)@y

In [13]:
y_eval = np.hstack((D.data[eval_set,:], np.ones((eval_set.shape[0],1)))) @ theta
print(y_eval)

[0.51020613 6.63773496 5.8307016  6.83950527 6.61966882 0.29029398
 0.79622023 0.10698233 6.48237067 7.3669215  1.78478063 7.12936113
 8.07562656 1.02251406 0.92598856 7.24900515 0.89678487 0.94313639
 1.16117067 6.58177228 0.65655362 7.69397547 6.61171956 1.56542454
 1.96960137 7.96012215 6.74457881 1.47925404 0.85859511 2.81711423
 1.62588477 0.47017345 7.25048242]


In [14]:
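# Classify by thresholding at 4, the midpoint between the labels 1 and 7
y_pred = np.ones(D.target[eval_set].shape)
y_pred[y_eval>=4] = 7
# Accuracy: fraction of evaluation samples classified correctly
correct = np.sum(y_pred==D.target[eval_set])
acc = correct/eval_set.size
print(acc)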

1.0
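
The problem also asks for accuracy on the training subset, which the cells above do not report. A minimal sketch of how both accuracies could be computed with one helper is shown here on synthetic two-class data. The helpers `add_bias` and `classify` and the toy data are hypothetical, and `np.linalg.pinv` is applied directly to the data matrix instead of going through the normal equations.

```python
import numpy as np

def add_bias(X):
    """Append a column of ones for the intercept term."""
    return np.hstack((X, np.ones((X.shape[0], 1))))

def classify(X, theta, low, high):
    """Regress, then threshold at the midpoint between the two label values."""
    scores = add_bias(X) @ theta
    return np.where(scores >= (low + high) / 2, high, low)

# Synthetic stand-in for two well-separated classes labelled 1 and 7.
rng = np.random.default_rng(0)
X_low = rng.normal(loc=-2.0, size=(40, 5))
X_high = rng.normal(loc=2.0, size=(40, 5))
X = np.vstack((X_low, X_high))
labels = np.concatenate((np.full(40, 1), np.full(40, 7)))

# Fit on the training data, then score the same data for training accuracy.
theta = np.linalg.pinv(add_bias(X)) @ labels
train_acc = np.mean(classify(X, theta, 1, 7) == labels)
print(train_acc)
```

With the notebook's variables, the same `classify` call would be applied once to `D.data[train_set,:]` and once to `D.data[eval_set,:]`.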


### Problem 3
Generate an arbitrary nonlinear function on a sample of 10 points from $(0, 10]$ and add some Gaussian or discrete noise to the values. Solve the Least-Squares problem to fit the data points with polynomial models of various degrees. Report the RMSE of the obtained models. Plot points vs. model vs. true function. Useful functions: `np.vander`, `np.linspace`, `np.vectorize`
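
One way to approach this is sketched below. The function `f`, the noise level, and the list of degrees are arbitrary illustrative choices, and plotting is left out to keep the snippet NumPy-only.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Arbitrary nonlinear function chosen for illustration.
    return np.sin(x) + 0.1 * x**2

# 10 sample points in (0, 10]: starting the grid at 1 excludes 0.
x = np.linspace(1, 10, 10)
y_noisy = f(x) + rng.normal(scale=0.5, size=x.shape)

rmse = {}
for degree in (1, 2, 3, 5):
    # Vandermonde matrix with columns x^degree, ..., x^1, x^0.
    V = np.vander(x, degree + 1)
    theta = np.linalg.pinv(V) @ y_noisy
    # RMSE of the fitted polynomial against the noisy samples.
    rmse[degree] = np.sqrt(np.mean((V @ theta - y_noisy) ** 2))
    print(degree, rmse[degree])
```

The RMSE here is measured on the fitted points themselves, so it can only decrease as the degree grows. Comparing the model against `f` on a dense `np.linspace` grid would show where higher degrees start to overfit.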

### Problem 4

Try to solve Problem 1 (Boston housing dataset) by modeling the data with a second-degree polynomial:
$$y = \theta_0 + a_1\theta_1 + \ldots+ a_n\theta_n + a_1^2\theta_{11} + a_1a_2\theta_{12}+ \ldots +a_n^2\theta_{nn} = \theta_0 + \sum_{i=1}^{n}a_i\theta_i + \sum_{i=1}^{n}\sum_{j=1}^{n}a_ia_j\theta_{ij}$$
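
A minimal sketch of the quadratic feature expansion, on synthetic data, might look like this. The helper `quadratic_features` is a hypothetical name. It keeps only the products with $i \le j$, since $a_ia_j = a_ja_i$ makes $\theta_{ij}$ and $\theta_{ji}$ indistinguishable, which is an assumption about how the double sum is meant to be parameterized.

```python
import numpy as np

def quadratic_features(X):
    """Return the columns [1, a_i, a_i * a_j for i <= j] for every row of X."""
    n_samples, n = X.shape
    i_idx, j_idx = np.triu_indices(n)      # index pairs with i <= j
    cross = X[:, i_idx] * X[:, j_idx]      # all products a_i * a_j
    return np.hstack((np.ones((n_samples, 1)), X, cross))

# Toy data generated from a known quadratic model (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] * X[:, 2] + 0.5 * X[:, 2] ** 2

A = quadratic_features(X)
theta = np.linalg.pinv(A) @ y
print(A.shape)
```

For $n$ input features this gives $1 + n + n(n+1)/2$ columns, so the 13 Boston features would expand to 105 parameters.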