How to get linear value from Dataframe - python

I have a simple question.
I want to get a linearly interpolated value from a DataFrame (like the n-D Lookup Table block in MATLAB Simulink).
import numpy as np
import scipy as sp
import pandas as pd
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
df = pd.DataFrame(W,index=Y,columns=X)
#     100  200  300  400  500
# 1     1    1    1    1    1
# 2     2    2    2    2    2
# 3     3    3    3    3    3
# Wanted: a lookup function, for example
# input x = 150, y = 1    -> result = 1
# input x = 100, y = 1.5  -> result = 1.5
Could somebody let me know whether there is a function or a library for this, how I can do it, or whether there is a better way?
I just want to build a filter or lookup function from some array data, using NumPy, pandas or SciPy if possible.

You are looking for scipy.interpolate.RectBivariateSpline:
#lib
import numpy as np
from scipy import interpolate
#input
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
#solution
spline = interpolate.RectBivariateSpline(X, Y, W.T, kx=1, ky=1)
#example
spline(150, 1) #returns 1
spline(100, 1.5) #returns 1.5
Let me know if it solves the issue.
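If you would rather query the DataFrame directly, here is a minimal sketch that swaps in scipy.interpolate.RegularGridInterpolator instead of the spline. The helper name lookup is hypothetical, and the sketch assumes the index and columns are numeric and strictly increasing.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import RegularGridInterpolator

X = np.array([100, 200, 300, 400, 500])
Y = np.array([1, 2, 3])
W = np.array([[1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3]])
df = pd.DataFrame(W, index=Y, columns=X)

def lookup(frame, x, y):
    """Hypothetical helper: linear lookup with x along columns, y along index."""
    interp = RegularGridInterpolator(
        (frame.index.to_numpy(dtype=float), frame.columns.to_numpy(dtype=float)),
        frame.to_numpy(dtype=float),
    )
    # Points are given in (index, column) order, i.e. (y, x).
    return float(interp([[y, x]])[0])

print(lookup(df, 150, 1))
print(lookup(df, 100, 1.5))
```

With the default settings, a query outside the grid raises a ValueError rather than extrapolating.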

Related

Add a new column to a pandas Series with a condition

I have a DataFrame called 'erm' like this:
(image not preserved)
I would like to add a new column 'typeRappel' with value 1 where erm['Calcul'] has value 4.
This is my code:
# IF ( calcul = 4 ) TypeRappel = 1.
# erm.loc[erm.Calcul = 4, "typeRappel"] = 1
#erm["typeRappel"] = np.where(erm['Calcul'] = 4.0, 1, 0)
# erm["Terminal"] = ["1" if c = "010" for c in erm['Code']]
# erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
import numpy as np
import pandas as pd
erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
But this code raises an error like this:
(image not preserved)
What can the problem be?
You can achieve what you want using lambda
import pandas as pd
df = pd.DataFrame(
data=[[1,2],[4,5],[7,8],[4,11]],
columns=['Calcul','other_col']
)
df['typeRappel'] = df['Calcul'].apply(lambda x: 1 if x == 4 else None)
This results in
Calcul  other_col  typeRappel
     1          2         NaN
     4          5         1.0
     7          8         NaN
     4         11         1.0
There are two ways to do this.
First way:
use .loc, since you have just one condition
df['new'] = None
df.loc[df.Calcul.eq(4), 'new'] = 1
Second way:
use numpy.select
import numpy as np
cond = [df.Calcul.eq(4)]
df['new'] = np.select(cond, [1], None)
import numpy as np
import pandas as pd
#erm['typeRappel']=None
erm.loc[erm.Calcul.eq(4), 'typeRappel'] = 1
import numpy as np
cond=[erm.Calcul.eq(4)]
erm['ok']= np.select(cond, [1], None)
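For reference, the list comprehension in the question is a SyntaxError because a conditional expression inside a comprehension needs an else branch, and the commented-out attempts use = where a comparison needs ==. A minimal sketch of two corrected variants follows. The small erm frame is a hypothetical stand-in for the real data, and using 0 as the fallback value is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 'erm' DataFrame.
erm = pd.DataFrame({"Calcul": [1, 4, 7, 4]})

# Vectorized: 1 where Calcul equals 4, otherwise 0.
erm["typeRappel"] = np.where(erm["Calcul"] == 4, 1, 0)

# List comprehension: the conditional expression must have an else branch.
erm["typeRappel_list"] = [1 if x == 4 else 0 for x in erm["Calcul"]]

print(erm)
```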

Converting pandas.core.series.Series to dataframe with multiple column names

My toy example is as follows:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
### prepare data
Xy = np.c_[load_iris(return_X_y=True)]
mycol = ['x1','x2','x3','x4','group']
df = pd.DataFrame(data=Xy, columns=mycol)
dat = df.iloc[:100,:].copy() #only consider two species; .copy() avoids SettingWithCopyWarning
dat['group'] = dat.group.apply(lambda x: 1 if x ==0 else 2) #two species means two groups
dat.shape
dat.head()
### Linear discriminant analysis procedure
G1 = dat.iloc[:50,:-1]; x1_bar = G1.mean(); S1 = G1.cov(); n1 = G1.shape[0]
G2 = dat.iloc[50:,:-1]; x2_bar = G2.mean(); S2 = G2.cov(); n2 = G2.shape[0]
Sp = (n1-1)/(n1+n2-2)*S1 + (n2-1)/(n1+n2-2)*S2
a = np.linalg.inv(Sp).dot(x1_bar-x2_bar); u_bar = (x1_bar + x2_bar)/2
m = a.T.dot(u_bar); print("Linear discriminant boundary is {} ".format(m))
def my_lda(x):
    y = a.T.dot(x)
    pred = 1 if y >= m else 2
    return y.round(4), pred
xx = dat.iloc[:,:-1]
xxa = xx.agg(my_lda, axis=1)
xxa.shape
type(xxa)
Here xxa is a pandas.core.series.Series with shape (100,). Note that each element of xxa is a tuple of two values. I want to convert xxa to a pd.DataFrame with 100 rows x 2 columns, so I tried
xxa_df1 = pd.DataFrame(data=xxa, columns=['y','pred'])
which gives ValueError: Shape of passed values is (100, 1), indices imply (100, 2).
Then I tried
xxa2 = xxa.to_frame()
# xxa2 = pd.DataFrame(xxa) #equals `xxa.to_frame()`
xxa_df2 = pd.DataFrame(data=xxa2, columns=['y','pred'])
and xxa_df2 is all NaN with 100 rows x 2 columns. What should I do next?
Let's try Series.tolist()
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y','pred'])
print(xxa_df1)
          y  pred
0   42.0080     1
1   32.3859     1
2   37.5566     1
3   31.0958     1
4   43.5050     1
..      ...   ...
95 -56.9613     2
96 -61.8481     2
97 -62.4983     2
98 -38.6006     2
99 -61.4737     2
[100 rows x 2 columns]
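As a self-contained illustration of the same idea, here is a minimal sketch on a tiny Series of tuples. The two sample rows and the index labels are made up for the example. Passing index=s.index is an optional extra that keeps the original row labels.

```python
import pandas as pd

# Tiny stand-in for xxa: each element is a (y, pred) tuple.
s = pd.Series([(42.008, 1), (-56.9613, 2)], index=[0, 95])

# tolist() yields a list of tuples, which DataFrame expands into columns.
out = pd.DataFrame(s.tolist(), index=s.index, columns=["y", "pred"])
print(out)
```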

How to quickly sum across columns for every permutation of rows in Python

Suppose I have an n x k matrix X, and I want the sum across the columns for every permutation of the rows, that is, for every way of picking one row per column. So if my matrix is [[1,2],[3,4]], my desired output would be [1+2, 1+4, 3+2, 3+4]. Below is an MWE with my first attempt at a solution. I'm hoping to get some help reducing the computation time.
My actual problem has n=160 and k=4, and it takes quite a while to run (as of writing this, it's still running).
import pandas as pd
import numpy as np
import itertools
n = 4
k = 3
X = np.random.randint(0, 10, (n, k))
df = pd.DataFrame(X)
df
0 1 2
0 2 9 2
1 7 6 4
2 3 7 0
3 5 0 0
ixi = df.index.tolist()
ixc = df.columns.tolist()
psum = np.array([df.lookup(i, ixc).sum() for i in
itertools.product(ixi, repeat=len(ixc))])
You can try functools.reduce:
from functools import reduce
reduce(np.add.outer, df.values.T).ravel()
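To see that this matches the brute-force enumeration, here is a small sketch on the [[1,2],[3,4]] example from the question. The names fast and slow are only for illustration.

```python
from functools import reduce
from itertools import product

import numpy as np

X = np.array([[1, 2], [3, 4]])
n, k = X.shape

# Repeated outer sums over the columns: one axis per column, then flatten.
fast = reduce(np.add.outer, X.T).ravel()

# Brute force: pick one row index per column and add the chosen entries.
slow = np.array([
    sum(X[r, c] for c, r in enumerate(rows))
    for rows in product(range(n), repeat=k)
])

print(fast)  # [3 5 5 7]
print(slow)  # [3 5 5 7]
```

Note that the result has n**k entries, so for n=160 and k=4 that is 655,360,000 values, roughly 5 GB as 64-bit integers.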

Performing math on a Python Pandas Group By DataFrame

I have a Pandas DataFrame with the following structure:
In [1]: df
Out[1]:
location_code month amount
0 1 1 10
1 1 2 11
2 1 3 12
3 1 4 13
4 1 5 14
5 1 6 15
6 2 1 23
7 2 2 25
8 2 3 27
9 2 4 29
10 2 5 31
11 2 6 33
I also have a DataFrame with the following:
In [2]: output_df
Out[2]:
location_code regression_coef
0 1 None
1 2 None
What I would like:
output_df = df.groupby('location_code')['amount'].apply(linear_regression_and_return_coefficient)
I would like to group by the location code and then perform a linear regression on the values of amount and store the coefficient. I have tried the following code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = []
for j in range(6): x.append(j+1)
for location_code, amount in gb:
    trans = amount.tolist()
    x = sm.add_constant(x)
    model = sm.OLS(trans, x)
    results = model.fit()
    output_df['regression_coef'][location_code] = results.params[1]/np.mean(trans)
This code works, but my data set is somewhat large (about 5 gb) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.
SOLUTION
After some tinkering around, I wrote a function that can be used with the apply method on a groupby.
def get_lin_reg_coef(series):
    x = sm.add_constant(range(1,7))
    result = sm.OLS(series, x).fit().params[1]
    return result/series.mean()
gb = df.groupby('location_code')['amount']
output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
Benchmarking this against the iterative solution I had before, with varying input sizes, gives:
DataFrame Rows   Iterative Solution (sec)   Vectorized Solution (sec)
370,000                             81.42                       84.46
1,850,000                          448.36                      365.66
3,700,000                         1282.83                      715.89
7,400,000                         5034.62                     1407.88
Clearly a lot faster as the dataset grows in size!
Without knowing more about the data, number of records, etc., this code should run faster:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')
x = sm.add_constant(range(1,7))
def fit(stuff):
    return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()
output = gb.apply(fit)
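If fitting one OLS model per group is still the bottleneck, a possible alternative is to compute the slope in closed form with grouped sums. This is only a sketch and not what either answer above does. It assumes the regressor is the month column (which equals range(1, 7) in the sample data) and uses the simple-regression identity slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "location_code": [1] * 6 + [2] * 6,
    "month": list(range(1, 7)) * 2,
    "amount": [10, 11, 12, 13, 14, 15, 23, 25, 27, 29, 31, 33],
})

g = df.groupby("location_code")
key = df["location_code"]

# Center x and y within each group.
xc = df["month"] - g["month"].transform("mean")
yc = df["amount"] - g["amount"].transform("mean")

# Per-group slope, then normalize by the group mean as in the question.
slope = (xc * yc).groupby(key).sum() / (xc ** 2).groupby(key).sum()
coef = slope / g["amount"].mean()
print(coef)
```

For the sample data the slopes are 1 and 2 and the group means are 12.5 and 28, so the normalized coefficients come out as 0.08 and about 0.0714.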

Equivalent of adding a value in a new row/column to numpy that works like R's data.frame

In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert, numpy throws an error by default:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in numpy that works similarly to R?
numpy is more of a replacement for R's matrices than for its data frames. You should consider using Python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
In numpy you need to initialize an array with the appropriate size:
z = numpy.empty((3, 3))
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = y
z[2, 2] = 6
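To get closer to R's grow-on-assignment behavior, the padding step can be wrapped in a small helper. This is a minimal sketch. The function name set_padded is hypothetical, and it returns a new float array rather than resizing in place, because NaN requires a floating-point dtype.

```python
import numpy as np

def set_padded(arr, row, col, value):
    """Hypothetical helper: copy arr into a NaN-filled array large enough
    to hold position (row, col), then assign value there."""
    arr = np.asarray(arr, dtype=float)
    rows = max(arr.shape[0], row + 1)
    cols = max(arr.shape[1], col + 1)
    out = np.full((rows, cols), np.nan)
    out[:arr.shape[0], :arr.shape[1]] = arr
    out[row, col] = value
    return out

x = [4, 5]
y = [2, 3]
# Columns x and y, then a value at zero-based position (2, 2).
z = set_padded(np.column_stack([x, y]), 2, 2, 6)
print(z)
```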
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for any numpy function, in this case
np.insert, by doing:
help(np.insert)
which gives you the function signature, explains each parameter, and provides
some examples.
Alternatively, you could do
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = [3,6]
new_array = np.vstack([array.T, z]).T # or, as below
# new_array = np.hstack([array, np.array(z)[:, np.newaxis]])
Also, take a look at the pandas module. It provides
an interface similar to what you asked for, implemented on top of numpy.
With pandas you could do something like:
import pandas as pd
data = {'y':[2,3], 'x':[4,5]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
   y  x  z
0  2  4  3
1  3  5  6
If you want a more R-like experience within python, I can highly recommend pandas, which is a higher-level numpy based library, which performs operations of this kind.
