# Linear regression model code in Jupyter with sklearn
What do you mean by "low numbers in my scores"?
In my R-squared result. I'm wondering what parameters I can change to increase it.
R^2 is a fairly limited metric and may not always be a good one to evaluate your model on.
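A minimal sketch of reporting error metrics alongside R^2, so the score is read in the units of the target. The numbers are made up for illustration and are not from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices, purely for illustration.
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 650_000.0])
y_pred = np.array([220_000.0, 330_000.0, 480_000.0, 700_000.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)            # average absolute error, in price units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalises large misses more heavily

print(f"R^2:  {r2:.3f}")
print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

A high R^2 can still hide errors of tens of thousands in price terms, which is why MAE or RMSE are often reported next to it.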
Could you show us what you've done?
And any plots of your data you've made?
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
dataset = pd.read_csv('kc_house_data.csv')
dataset.fillna(0, inplace=True)
for i in dataset.columns:
    dataset[i] = dataset[i].astype(float)
dataset['house_age'] = 2023 - dataset['yr_built']
dataset['reno_age'] = 2023 - dataset['yr_renovated']
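# Keep only two-digit renovation ages; anything else (e.g. 2023 when yr_renovated is 0) becomes 0.0
dataset['reno_age'] = dataset.reno_age.apply(lambda x: x if len(str(int(x)))==2 else 0.0)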
dataset.drop(['yr_built', 'yr_renovated'], axis=1, inplace=True)
X = dataset.iloc[ : , 1: ]
Y = dataset.iloc[ : , 0]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.1, random_state=3)
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
svregressor = SVR(kernel = 'linear' )
svregressor.fit(X_train, Y_train)
dtregressor = DecisionTreeRegressor()
dtregressor.fit(X_train, Y_train)
rfregressor = RandomForestRegressor()
rfregressor.fit(X_train, Y_train)
pred = regressor.predict(X_test)
predsv = svregressor.predict(X_test)
preddt = dtregressor.predict(X_test)
predrf = rfregressor.predict(X_test)
print(Y_test)
print(pred)
print(predsv)
print(preddt)
print(predrf)
r_square = metrics.r2_score(Y_test, pred)
r_squaresv = metrics.r2_score(Y_test, predsv)
r_squaredt = metrics.r2_score(Y_test, preddt)
r_squarerf = metrics.r2_score(Y_test, predrf)
print(r_square)
print(r_squaresv)
print(r_squaredt)
print(r_squarerf)
I can see a few issues with your code and your process
Have you done any explorations or visualisations of your data?
Nothing yet.
Where would you suggest I start with visualising?
Well I would first have a look at what columns you have and based on that, some data visualisations may be useful
For example, for numeric features, you may want to plot scatterplots to see if your data fits the assumptions for a linear regression
Or if there may be multicollinearity
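One rough way to check for multicollinearity is to look at the pairwise correlations between features. The sketch below uses synthetic data. The column names are stand-ins, and the 0.9 cutoff is an arbitrary assumption, not a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in features (not the real housing data).
sqft_living = rng.normal(2000, 500, n)
# Built to track sqft_living closely, so the two are strongly collinear.
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, n)
bedrooms = rng.integers(1, 6, n).astype(float)

features = pd.DataFrame(
    {"sqft_living": sqft_living, "sqft_above": sqft_above, "bedrooms": bedrooms}
)

corr = features.corr()
print(corr.round(2))

# Collect feature pairs whose absolute correlation exceeds the (assumed) cutoff.
threshold = 0.9
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Pairs that show up here are candidates to look at more closely. That doesn't automatically mean one of them has to be dropped.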
I have run sb.pairplot(dataset) and looked it over. Would you suggest dropping anything that doesn't show a linear relationship?
Not necessarily
You can always apply transformations to data if it's not linear
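As a sketch of what a transformation can do, here is a log transform applied to a target that grows exponentially with a feature. The data is synthetic and the choice of a log transform is an assumption for illustration. Which transform suits the housing data would depend on what the plots show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: y grows exponentially with x, so the raw relationship is curved.
x = rng.uniform(0, 5, 300).reshape(-1, 1)
y = np.exp(1.0 + 0.8 * x[:, 0] + rng.normal(0, 0.2, 300))

# Fit a straight line to the raw target.
raw_model = LinearRegression().fit(x, y)
r2_raw = r2_score(y, raw_model.predict(x))

# Fit a straight line to log(y), which is linear in x by construction.
log_model = LinearRegression().fit(x, np.log(y))
r2_log = r2_score(np.log(y), log_model.predict(x))

# Note: the two scores are measured on different scales (y vs log y),
# so they only indicate how well a straight line fits each version.
print(f"R^2 on raw target: {r2_raw:.3f}")
print(f"R^2 on log target: {r2_log:.3f}")
```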
It's kind of hard for me to suggest what you can do because I don't have the data nor can I see what you have/your outputs
Are you following any guides/tutorials for this project or are you just trying it out yourself
We have been asked to do the following:
Task 1: Regression Task (50% - 100 marks)
Create a regression model using algorithm 1 – state which algorithm and submit working source code (20 marks)
Create a regression model using algorithm 2 – state which algorithm and submit working source code (20 marks)
Report (60 marks) – you should use the points below as a guide, you may also write about different areas of what you have submitted regarding your code.
Showing accuracies of each model. Which did better? (10 marks)
Explanation of how the two models work (include references). Why did one perform better than the other for your dataset? (25 marks)
Suggest different methods to improve your models. E.g., Remove/add data? If so, which features would you build on? Think about visualising the data first and seeing which features do not help the prediction. (25 marks)
Task 2: Classification Task (predicting a class for new input) (50% - 100 marks)
Create a classification model using algorithm 1 – state which algorithm and submit working source code (20 marks)
Create a classification model using algorithm 2 – state which algorithm and submit working source code (20 marks)
Report (60 marks) – you should use the points below as a guide, you may also write about different areas of what you have submitted regarding your code.
Showing accuracies of each model. Which did better? (10 marks)
Explanation of how the two models work (include references). Why did one perform better than the other for your dataset? (25 marks)
Suggest different methods to improve your models. E.g. Remove/add data? If so, which features would you build on? Think about visualising the data first and seeing which features do not help the prediction. (25 marks)
I'm using the KC housing data from Kaggle for the regression.
The two models I was going to focus on would be SVR and random forest.
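For comparing those two, a hedged sketch using cross-validation on synthetic data from make_regression (not the housing data). Scaling the features before SVR is an assumption added here because SVR is sensitive to feature scale. The fold count and n_estimators are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem standing in for the real dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=3)

models = {
    # The scaler is fitted inside each CV fold, so no test-fold information leaks in.
    "SVR (scaled)": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=3),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated R^2 gives a more stable estimate than a single split.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Which model comes out ahead here says nothing about the housing data. The point is only the comparison setup.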