
2/15 Simple Linear Regression

Time: 2018-02-15 21:57:07

Tags: model, math, orm, git, step, module, fit, eval, cas

 

  • Regression: Predict a continuous response

    Linear regression

    Pros: fast, no tuning required, highly interpretable, well-understood

    Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

    Form of linear regression

    y = β0 + β1x1 + β2x2 + ... + βnxn

    • y is the response
    • β0 is the intercept
    • β1 is the coefficient for x1 (the first feature)
    • βn is the coefficient for xn (the nth feature)
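    The form above is just a weighted sum of the features, which can be sketched directly; all numbers here are made up for illustration:

```python
# Evaluate y = β0 + β1*x1 + ... + βn*xn for one observation.
def predict(intercept, coefs, features):
    return intercept + sum(b * x for b, x in zip(coefs, features))

# hypothetical intercept, coefficients, and feature values
print(predict(2.0, [0.05, 0.2], [100.0, 30.0]))  # 2.0 + 5.0 + 6.0 = 13.0
```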

    In this case:

    y = β0 + β1 × TV + β2 × Radio + β3 × Newspaper

    The β values are called the model coefficients. These values are "learned" during the model-fitting step using the "least squares" criterion. The fitted model can then be used to make predictions.

    import pandas as pd
    import seaborn as sns

    # allow plots to appear within the notebook
    %matplotlib inline
    
    # read CSV file directly from a URL and save the results
    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)  # use the first column as the index
    data.head()
    
    # visualize the relationship between the features and the response using scatterplots
    sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')
    
    from sklearn.model_selection import train_test_split

    # define X and y before splitting (use all three features to start)
    feature_cols = ['TV', 'Radio', 'Newspaper']
    X = data[feature_cols]
    y = data.Sales

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # default split is 75% for training and 25% for testing
    print(X_train.shape)
    print(y_train.shape)
    print(X_test.shape)
    print(y_test.shape)
    
    #Linear Regression in scikit-learn
    # import model
    from sklearn.linear_model import LinearRegression
    
    # instantiate
    linreg = LinearRegression()
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    #Interpreting model coefficients
    # print the intercept and coefficients
    print(linreg.intercept_)
    print(linreg.coef_)
    
    # pair the feature names with the coefficients
    list(zip(feature_cols, linreg.coef_))
    
    #Making predictions
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    #Feature selection
    # create a Python list of feature names
    feature_cols = ['TV', 'Radio']
    
    # use the list to select a subset of the original DataFrame
    X = data[feature_cols]
    
    # select a Series from the DataFrame
    y = data.Sales
    
    # split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # fit the model to the training data (learn the coefficients)
    linreg.fit(X_train, y_train)
    
    # make predictions on the testing set
    y_pred = linreg.predict(X_test)
    
    # compute the RMSE of our predictions
    import numpy as np
    from sklearn import metrics
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

     Output:

    [('TV', 0.046564567874150288),
     ('Radio', 0.17915812245088836),
     ('Newspaper', 0.0034504647111804065)]

    RMSE = 1.38790346994
     

    How do we interpret the TV coefficient (0.0466)?

    • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
    • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

    Important notes:

    • This is a statement of association, not causation.
    • If an increase in TV ad spending was associated with a decrease in sales, β1 (the TV coefficient) would be negative.
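    As a quick sanity check on the arithmetic in the interpretation above (using the TV coefficient reported earlier):

```python
# TV coefficient from the fitted model output above
tv_coef = 0.046564567874150288

# TV spending is measured in $1,000s, so an additional $1,000 is one "unit";
# the associated change in Sales is the coefficient itself, ~0.0466 units,
# i.e. about 46.6 items if one Sales unit is 1,000 items.
print(round(tv_coef * 1000, 1))  # prints 46.6
```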

    Model evaluation metrics for regression

    Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

    Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

    Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

                                MAE = (1/n) * Σ |y_i − ŷ_i|   (sum over i = 1 to n)

    Mean Squared Error (MSE) is the mean of the squared errors:

                                MSE = (1/n) * Σ (y_i − ŷ_i)²

    Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

                                RMSE = √( (1/n) * Σ (y_i − ŷ_i)² )
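    A minimal sketch computing all three metrics on example predictions (the true and predicted values here are made up):

```python
import numpy as np
from sklearn import metrics

# example true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

print(metrics.mean_absolute_error(true, pred))          # MAE  = 10.0
print(metrics.mean_squared_error(true, pred))           # MSE  = 150.0
print(np.sqrt(metrics.mean_squared_error(true, pred)))  # RMSE ≈ 12.247
```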

    Comparing these metrics:

    • MAE is the easiest to understand, because it's the average error.
    • MSE is more popular than MAE, because MSE "punishes" larger errors.
    • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

    The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, Newspaper is unlikely to be useful for predicting Sales and should be left out of the model.
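    For reference, the "least squares" criterion mentioned at the start can also be solved without scikit-learn. Here is a minimal sketch on synthetic data using numpy's lstsq; the features, coefficients, and noise level are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 2))                   # two synthetic features
true_coefs = np.array([0.05, 0.2])
y = 3.0 + X @ true_coefs + rng.normal(0, 0.1, size=n)  # intercept 3.0 plus noise

# prepend a column of ones so lstsq also estimates the intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)  # ≈ [3.0, 0.05, 0.2]: intercept and both coefficients recovered
```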


Original post: https://www.cnblogs.com/lowkeysingsing/p/8449824.html
