## Plotting confidence intervals of linear regression in Python

After a friendly tweet from @tomstafford who mentioned that this script was useful I’ve re-posted it here in preparation for the removal of my Newcastle University pages.

This script calculates and plots confidence intervals around a linear regression based on new observations. After I couldn’t find anything similar on the internet I developed my own implementation based on Statistics in Geography by David Ebdon (ISBN: 978-0631136880).

# linfit.py - example of confidence limit calculation for linear regression fitting. # References: # - Statistics in Geography by David Ebdon (ISBN: 978-0631136880) # - Reliability Engineering Resource Website: # - http://www.weibull.com/DOEWeb/confidence_intervals_in_simple_linear_regression.htm # - University of Glascow, Department of Statistics: # - http://www.stats.gla.ac.uk/steps/glossary/confidence_intervals.html#conflim import numpy as np import matplotlib.pyplot as plt # example data x = np.array([4.0,2.5,3.2,5.8,7.4,4.4,8.3,8.5]) y = np.array([2.1,4.0,1.5,6.3,5.0,5.8,8.1,7.1]) # fit a curve to the data using a least squares 1st order polynomial fit z = np.polyfit(x,y,1) p = np.poly1d(z) fit = p(x) # get the coordinates for the fit curve c_y = [np.min(fit),np.max(fit)] c_x = [np.min(x),np.max(x)] # predict y values of origional data using the fit p_y = z[0] * x + z[1] # calculate the y-error (residuals) y_err = y -p_y # create series of new test x-values to predict for p_x = np.arange(np.min(x),np.max(x)+1,1) # now calculate confidence intervals for new test x-series mean_x = np.mean(x) # mean of x n = len(x) # number of samples in origional fit t = 2.31 # appropriate t value (where n=9, two tailed 95%) s_err = np.sum(np.power(y_err,2)) # sum of the squares of the residuals confs = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((p_x-mean_x),2)/ ((np.sum(np.power(x,2)))-n*(np.power(mean_x,2)))))) # now predict y based on test x-values p_y = z[0]*p_x+z[0] # get lower and upper confidence limits based on predicted y and confidence intervals lower = p_y - abs(confs) upper = p_y + abs(confs) # set-up the plot plt.axes().set_aspect('equal') plt.xlabel('X values') plt.ylabel('Y values') plt.title('Linear regression and confidence limits') # plot sample data plt.plot(x,y,'bo',label='Sample observations') # plot line of best fit plt.plot(c_x,c_y,'r-',label='Regression line') # plot confidence limits plt.plot(p_x,lower,'b--',label='Lower confidence limit (95%)') plt.plot(p_x,upper,'b--',label='Upper confidence limit (95%)') # set coordinate limits plt.xlim(0,11) plt.ylim(0,11) # configure legend plt.legend(loc=0) leg = plt.gca().get_legend() ltext = leg.get_texts() plt.setp(ltext, fontsize=10) # show the plot plt.show()

Advertisements

Very useful script, I have been looking for something like this. Thanks!

(line 45: that should be + z[1], right?)

Second this – I was wondering why my confidence bounds were both lower than the data.

Works great for me now, and here is a comment that would have saved heaps of time!

agreed, line 45 correction: p_y = z[0] * p_x + z[1]

great and useful summarization, thank you kindly!

Hi Tom – thank you so much for this code. I was going crazy trying to figure out how to do this!

I edited your code a bit and tried to make it automatically select the appropriate t-value. If anyone wants to take a look it’s at:

https://github.com/HappyPenguin/STATISTICS/blob/master/CIs_LinearRegression.py

(All comments welcome, especially if I’ve made a mistake on the t stat selection!!)

Thanks Kirstie, sounds great will take a look.

Also,

# plot line of best fit

plt.plot(c_x,c_y,’r-‘,label=’Regression line’)

should be

# plot line of best fit

plt.plot(p_x,p_y,’r-‘,label=’Regression line’)

Actually, a better idea would be to fix it higher up: Change

# get the coordinates for the fit curve

c_y = [np.min(fit),np.max(fit)]

c_x = [np.min(x),np.max(x)]

to

# get the coordinates for the fit curve

c_x = [np.min(x),np.max(x)]

c_y = p(c_x)

Would you mind explaining the formula for computing confidence interval (confs = t * …)? Where does it come from? And I am a bit confused why lower and upper confidence limit should curve, not a straight line. Thanks

I think there might be an error in this:

confs = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((p_x-mean_x),2)/

((np.sum(np.power(x,2)))-n*(np.power(mean_x,2))))))

Specifically:

((np.sum(np.power(x,2)))-n*(np.power(mean_x,2)))))

Correct me if I’m wrong but it looks like you may have incorrectly simplified that.

Also, your equation appears to be missing the addition of the constant 1. Was there a specific reason for this?

Please see this as reference:

https://onlinecourses.science.psu.edu/stat414/node/298