
2/15 Simple Linear Regression

Time: 2018-02-15 21:57:07

Tags: model, math, orm, git, step, module, fit, eval, cas

 

  • Regression: Predict a continuous response

    Linear regression

    Pros: fast, no tuning required, highly interpretable, well-understood

    Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

    Form of linear regression

    y = β0 + β1x1 + β2x2 + ... + βnxn

    • y is the response
    • β0 is the intercept
    • β1 is the coefficient for x1 (the first feature)
    • βn is the coefficient for xn (the nth feature)
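    The form above is just a weighted sum of the features, which can be sketched directly; all numbers here are made up for illustration:

```python
# Evaluate y = β0 + β1*x1 + ... + βn*xn for one observation.
def predict(intercept, coefs, features):
    return intercept + sum(b * x for b, x in zip(coefs, features))

# hypothetical intercept, coefficients, and feature values
print(predict(2.0, [0.05, 0.2], [100.0, 30.0]))  # 2.0 + 5.0 + 6.0 = 13.0
```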

    In this case:

    y = β0 + β1 × TV + β2 × Radio + β3 × Newspaper

    The β values are called the model coefficients. These values are "learned" during the model-fitting step using the "least squares" criterion. The fitted model can then be used to make predictions.

    import pandas as pd
    import seaborn as sns

    # allow plots to appear within the notebook
    %matplotlib inline
    
    # read CSV file directly from a URL and save the results
    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)  # use the first column as the index
    data.head()
    
    # visualize the relationship between the features and the response using scatterplots
    sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')
    
    from sklearn.model_selection import train_test_split

    # define X and y before splitting (use all three features to start)
    feature_cols = ['TV', 'Radio', 'Newspaper']
    X = data[feature_cols]
    y = data.Sales

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # default split is 75% for training and 25% for testing
    print(X_train.shape)
    print(y_train.shape)
    print(X_test.shape)
    print(y_test.shape)
    
    #Linear Regression in scikit-learn
    # import model
    from sklearn.linear_model import LinearRegression
    
    # instantiate
    linreg = LinearRegression()
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    #Interpreting model coefficients
    # print the intercept and coefficients
    print(linreg.intercept_)
    print(linreg.coef_)
    
    # pair the feature names with the coefficients
    list(zip(feature_cols, linreg.coef_))
    
    #Making predictions
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    #Feature selection
    # create a Python list of feature names
    feature_cols = ['TV', 'Radio']
    
    # use the list to select a subset of the original DataFrame
    X = data[feature_cols]
    
    # select a Series from the DataFrame
    y = data.Sales
    
    # split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    # compute the RMSE of our predictions
    import numpy as np
    from sklearn import metrics
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

     Output:

    [('TV', 0.046564567874150288),
     ('Radio', 0.17915812245088836),
     ('Newspaper', 0.0034504647111804065)]

    RMSE = 1.38790346994
     

    How do we interpret the TV coefficient (0.0466)?

    • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
    • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

    Important notes:

    • This is a statement of association, not causation.
    • If an increase in TV ad spending was associated with a decrease in sales, β1 (the TV coefficient) would be negative.
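    As a quick sanity check on the arithmetic in the interpretation above (using the TV coefficient reported earlier):

```python
# TV coefficient from the fitted model output above
tv_coef = 0.046564567874150288

# TV spending is measured in $1,000s, so an additional $1,000 is one "unit";
# the associated change in Sales is the coefficient itself, ~0.0466 units,
# i.e. about 46.6 items if one Sales unit is 1,000 items.
print(round(tv_coef * 1000, 1))  # prints 46.6
```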

    Model evaluation metrics for regression

    Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

    Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

    Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

                                MAE = (1/n) * Σ |y_i − ŷ_i|   (sum over i = 1 to n)

    Mean Squared Error (MSE) is the mean of the squared errors:

                                MSE = (1/n) * Σ (y_i − ŷ_i)²

    Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

                                RMSE = √( (1/n) * Σ (y_i − ŷ_i)² )
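    A minimal sketch computing all three metrics on example predictions (the true and predicted values here are made up):

```python
import numpy as np
from sklearn import metrics

# example true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

print(metrics.mean_absolute_error(true, pred))          # MAE  = 10.0
print(metrics.mean_squared_error(true, pred))           # MSE  = 150.0
print(np.sqrt(metrics.mean_squared_error(true, pred)))  # RMSE ≈ 12.247
```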

    Comparing these metrics:

    • MAE is the easiest to understand, because it's the average error.
    • MSE is more popular than MAE, because MSE "punishes" larger errors.
    • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

    The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, Newspaper is unlikely to be useful for predicting Sales and should be left out of the model.
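    For reference, the "least squares" criterion mentioned at the start can also be solved without scikit-learn. Here is a minimal sketch on synthetic data using numpy's lstsq; the features, coefficients, and noise level are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 2))                   # two synthetic features
true_coefs = np.array([0.05, 0.2])
y = 3.0 + X @ true_coefs + rng.normal(0, 0.1, size=n)  # intercept 3.0 plus noise

# prepend a column of ones so lstsq also estimates the intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)  # ≈ [3.0, 0.05, 0.2]: intercept and both coefficients recovered
```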


Original post: https://www.cnblogs.com/lowkeysingsing/p/8449824.html
