I am trying to implement linear regression in Python.
These are the steps I followed:
import pandas as p
import numpy as n
data = p.read_csv("...path\Housing.csv", usecols=[1]) # I want the first col
data1 = p.read_csv("...path\Housing.csv", usecols=[3]) # I want the 3rd col
x = data
y = data1
Then I try to obtain the coefficients, using the following:
regression_coeff = n.polyfit(x,y,1)
And then I get the following error:
raise TypeError("expected 1D vector for x")
TypeError: expected 1D vector for x
I am unable to get my head around this, as when I print x and y, I can very clearly see that they are both 1D vectors.
Can someone please help?
Dataset can be found here: DataSets
The original code is:
import pandas as p
import numpy as n
data = p.read_csv('...\housing.csv', usecols = [1])
data1 = p.read_csv('...\housing.csv', usecols = [3])
x = data
y = data1
regression = n.polyfit(x, y, 1)
This should work:
np.polyfit(data.values.flatten(), data1.values.flatten(), 1)
data is a dataframe and its values are 2D:
>>> data.values.shape
(546, 1)
flatten() turns it into a 1D array:
>>> data.values.flatten().shape
(546,)
which is needed for polyfit().
Simpler alternative:
df = pd.read_csv("Housing.csv")
np.polyfit(df['price'], df['bedrooms'], 1)
pandas.read_csv() returns a DataFrame, which is two-dimensional, while np.polyfit() wants a 1D vector for both x and y for a single fit. You can simply convert the output of read_csv() to a pd.Series, which matches the np.polyfit() input format, by calling .squeeze():
data = pd.read_csv('../Housing.csv', usecols = [1]).squeeze()
data1 = pd.read_csv('../Housing.csv', usecols = [3]).squeeze()
Python is telling you that the data is not in the right format: x must be a 1D array, but in your case it is a 2D pandas DataFrame.
You can convert your data to a NumPy array and squeeze out the extra dimension to fix the problem.
import pandas as pd
import numpy as np
# each read_csv call returns a single-column DataFrame, i.e. 2D data
data = pd.read_csv('../Housing.csv', usecols = [1])
data1 = pd.read_csv('../Housing.csv', usecols = [3])
# convert to NumPy arrays and drop the extra dimension: (546, 1) -> (546,)
data = np.squeeze(np.array(data))
data1 = np.squeeze(np.array(data1))
x = data
y = data1
regression = np.polyfit(x, y, 1)
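If it helps, np.polyfit with degree 1 returns the coefficients highest power first, so you can unpack the slope and intercept directly; a small sketch reusing the squeezed x and y from above:
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept  # the fitted line evaluated at the original x values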
Related
I am trying to create some code that gives weight to the most impactful features.
My dataframe contains both numerical and categorical data.
example data:
[Brand] [Model] [Car_price] [...] [Prime]
BMW X1 40,000 300
Y is the Prime column and X is all the other columns.
I tried using the following:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
#df = df.dropna(axis=1)
array = df.values
X = array[:,(6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,34,35,37,44,45,47,48,54,61,62)]
Y = array[:,51]
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
And I get the following error: ValueError: could not convert string to float
I know there is a way to transform strings into numerical data, but I was wondering whether that is necessary. What fixes can I apply to get weighted features?
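Scikit-learn estimators do need numeric input, so the string columns have to be encoded before fitting; whether you one-hot encode, ordinal-encode, or drop them is up to you. A minimal sketch of one option using one-hot encoding, assuming df is loaded as in the question and that 'Prime' (taken from the example header) is the target column:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# one-hot encode every object (string) column; numeric columns pass through unchanged
X = pd.get_dummies(df.drop(columns=['Prime']))
Y = df['Prime']
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
# feature_importances_ then gives one weight per encoded feature
print(sorted(zip(forest.feature_importances_, X.columns), reverse=True)[:10])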
I would like to scale an array of shape (192, 4000) so that each of its 192 rows is rescaled to a specific range, e.g. (-840, 840). I run a very simple piece of code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up my matrix range, even though the min and max of my original matrix are already exactly those values. I can't figure out what is wrong. Any ideas?
You can do this manually.
It is a linear transformation of the minmax normalized data.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat)) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min
MinMaxScaler supports a feature_range argument on initialization that produces output in a given range.
scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the (1,2) range
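For completeness, a runnable sketch of that. Note that MinMaxScaler scales each column (feature) independently; since the goal here is to rescale each of the 192 rows, one way (an assumption about the intent, not part of the original answer) is to fit on the transposed matrix and transpose back:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sample_mat = np.random.randint(-840, 840, size=(192, 4000))
# transpose so each original row becomes a column, i.e. a "feature" for the scaler
scaler = MinMaxScaler(feature_range=(-840, 840))
scaled_mat = scaler.fit_transform(sample_mat.T).T
print(scaled_mat.min(axis=1)[:3], scaled_mat.max(axis=1)[:3])  # each row now spans -840..840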
I seem to be unable to reproduce the 90th percentile of a distribution when using a multi-column groupby on a large data set:
data.loc[(data.x=='2008Q1')&(data.y==-90)]['var'].quantile(0.9)
out: 1.030292
groupby_var = data.groupby(['x','y'])['var'].quantile(0.9).reset_index().rename(columns={'var':'u_var'})
groupby_var.loc[(groupby_var.x=='2008Q1')&(groupby_var.y==-90)]['u_var']
out: 0.187166
DataFrame data consists of 68M rows. x is string/object, y is float, var is float.
What am I doing wrong here? Result is way off.
Update:
Problem is related to missing values of y. Reproducible example:
import pandas as pd
import random
import numpy as np
random.seed(0)
n=68*10**6
x_data = [str(i)+'Q'+str(j) for i in range(1950,2021) for j in range(1,5)]
y_data = [i for i in range(-90,91)]+[np.nan]
var_data = [random.randrange(0,10000)/10000 for i in range(n)]
data = pd.DataFrame(var_data,columns=['var'])
data['x'] = random.choices(x_data,k=n)
data['y'] = random.choices(y_data,k=n)
data['y'] = data['y'].astype(float)
data.loc[(data.x=='2008Q1')&(data.y==-90)]['var'].quantile(0.9)
out: 0.891
groupby_var = data.groupby(['x','y'])['var'].quantile(0.9).reset_index().rename(columns={'var':'u_var'})
groupby_var.loc[(groupby_var.x=='2008Q1')&(groupby_var.y==-90)]['u_var']
out: 0.8472
groupby_var_nan = data.loc[data['y'].notna()].groupby(['x','y'])['var'].quantile(0.9).reset_index().rename(columns={'var':'u_var'})
groupby_var_nan.loc[(groupby_var_nan.x=='2008Q1')&(groupby_var_nan.y==-90)]['u_var']
out: 0.891
Question: Why is the result of groupby_var.loc[(groupby_var.x=='2008Q1')&(groupby_var.y==-90)]['u_var'] not the same as data.loc[(data.x=='2008Q1')&(data.y==-90)]['var'].quantile(0.9) and groupby_var_nan.loc[(groupby_var_nan.x=='2008Q1')&(groupby_var_nan.y==-90)]['u_var'] ?
Is this expected behavior?
Isn't this a bug of some sort?
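A workaround consistent with the update above is to drop the rows whose grouping key is missing before the groupby; this is the same fix as groupby_var_nan, just written with dropna:
groupby_var = (data.dropna(subset=['y'])
                   .groupby(['x', 'y'])['var']
                   .quantile(0.9)
                   .reset_index()
                   .rename(columns={'var': 'u_var'}))
groupby_var.loc[(groupby_var.x == '2008Q1') & (groupby_var.y == -90), 'u_var']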
I want to replace the manual standardization of the monthly data with StandardScaler from sklearn. I tried the line of code below the commented-out line, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
    # df2[c] = df2_grouped[c].apply(lambda x: (x-x.mean()) / (x.std()))
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accepts a 2-D argument.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.
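Alternatively, if you want to keep StandardScaler itself, you can give each group the 2-D shape it expects and flatten the result afterwards; a sketch (not from the original answer) of the same loop:
for c in cols:
    df2[c] = df2_grouped[c].transform(
        lambda x: StandardScaler().fit_transform(x.values.reshape(-1, 1)).ravel()
    )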
I load images with scipy's misc.imread, which in my case returns a 2304x3 ndarray. Later, I append this array to a list and convert the list to a DataFrame. The purpose of doing so is to later apply an Isomap transform on the DataFrame. My DataFrame has 84 rows/samples (the images in the folder) and 2304 features, where each feature is an array/list of 3 elements. When I try to use the Isomap transform I get this error:
ValueError: setting an array element with a sequence.
I think the error occurs because the elements of my DataFrame are of object type. First I tried converting each column with to_numeric, but got an error; then I wrote a loop to convert each element to numeric. The results I get are still of object type. Here is my code:
import pandas as pd
from scipy import misc
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import matplotlib.pyplot as plt
import glob
from sklearn import manifold
samples = []
path = 'Datasets/ALOI/32/*.png'
files = glob.glob(path)
for name in files:
    img = misc.imread(name)
    img = img[::2, ::2]
    x = (img/255.0).reshape(-1,3)
    samples.append(x)
df = pd.DataFrame.from_records(samples, coerce_float = True)
for i in range(0,2304):
    for j in range(0,84):
        df[i][j] = pd.to_numeric(df[i][j], errors = 'coerce')
    df[i] = pd.to_numeric(df[i], errors = 'coerce')
print df[2303][83]
print df[2303].dtype
print df[2303][83].dtype
#iso = manifold.Isomap(n_neighbors=6, n_components=3)
#iso.fit(df)
#manifold = iso.transform(df)
#print manifold.shape
The last four lines are commented out because they give an error. The output I get is:
[ 0.05098039 0.05098039 0.05098039]
object
float64
As you can see, each element of the DataFrame is of type float64, but the whole column is an object.
Does anyone know how to convert the whole DataFrame to numeric?
Is there another way of applying Isomap?
Do you want to reshape your image to a new shape instead of the original one?
If that is not the case, then you should change the following line in your code
x = (img/255.0).reshape(-1,3)
to
x = (img/255.0).reshape(-1)
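For context, here is a minimal sketch of the loading loop with that change, so each image becomes one flat numeric row and Isomap receives a plain 2-D float matrix (same files, imports, and parameters as in the question):
samples = []
for name in files:
    img = misc.imread(name)
    img = img[::2, ::2]
    samples.append((img / 255.0).reshape(-1))  # one flat row per image
df = pd.DataFrame(samples)  # 84 rows x (2304 * 3) float64 columns
iso = manifold.Isomap(n_neighbors=6, n_components=3)
manifold_coords = iso.fit_transform(df)
print(manifold_coords.shape)  # (84, 3)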
Hope this will resolve your issue