I have a dataframe called 'erm' like this:
[screenshot of the dataframe]
I would like to add a new column 'typeRappel' with value = 1 wherever erm['Calcul'] has the value 4.
This is my code:
# IF ( calcul = 4 ) TypeRappel = 1.
# erm.loc[erm.Calcul = 4, "typeRappel"] = 1
#erm["typeRappel"] = np.where(erm['Calcul'] = 4.0, 1, 0)
# erm["Terminal"] = ["1" if c = "010" for c in erm['Code']]
# erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
import numpy as np
import pandas as pd
erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
But this code gives me an error like this:
[screenshot of the error]
What could be the problem?
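The usual culprit with that last line is the comprehension itself: a conditional expression inside a list comprehension requires an else branch, otherwise Python raises a SyntaxError. A minimal sketch of the corrected line, assuming 0 as the fallback value and a hypothetical stand-in for the real erm dataframe:

import pandas as pd

erm = pd.DataFrame({'Calcul': [4.0, 2.0, 4.0, 1.0]})  # hypothetical stand-in for the real dataframe
# the else branch is mandatory in a conditional expression
erm['typeRappel'] = [1 if x == 4 else 0 for x in erm['Calcul']]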
You can achieve what you want using a lambda with apply:
import pandas as pd
df = pd.DataFrame(
    data=[[1, 2], [4, 5], [7, 8], [4, 11]],
    columns=['Calcul', 'other_col']
)
df['typeRappel'] = df['Calcul'].apply(lambda x: 1 if x == 4 else None)
This results in

   Calcul  other_col  typeRappel
0       1          2         NaN
1       4          5         1.0
2       7          8         NaN
3       4         11         1.0
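The NaN values also explain why the matches print as 1.0: None becomes NaN, which forces the column to float. As a side note, the np.where attempt from the question also works once the comparison uses == instead of = (a sketch; 0 as the fallback is an assumption), and it keeps the column integer:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [4, 5], [7, 8], [4, 11]], columns=['Calcul', 'other_col'])
# vectorized: 1 where the condition holds, 0 everywhere else
df['typeRappel'] = np.where(df['Calcul'] == 4, 1, 0)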
You have two ways to do this.

First way: use the .loc method, since you have just one condition:

df['new'] = None
df.loc[df.Calcul.eq(4), 'new'] = 1
Second way: use the numpy.select method:

import numpy as np
cond = [df.Calcul.eq(4)]
df['new'] = np.select(cond, [1], None)
Applied to your dataframe:

import numpy as np
import pandas as pd

# first way
# erm['typeRappel'] = None
erm.loc[erm.Calcul.eq(4), 'typeRappel'] = 1

# second way
cond = [erm.Calcul.eq(4)]
erm['ok'] = np.select(cond, [1], None)
I have a simple question. I want to get a linearly interpolated value from a dataframe (like an N-D Lookup Table in Matlab Simulink).
import numpy as np
import scipy as sp
import pandas as pd
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
df = pd.DataFrame(W,index=Y,columns=X)
# 100 200 300 400 500
#1 1 1 1 1 1
#2 2 2 2 2 2
#3 3 3 3 3 3
# wanted function, e.g.:
#   input x = 150, y = 1    ->  result = 1
#   input x = 100, y = 1.5  ->  result = 1.5
Could somebody let me know if there is a function or a lib for this, how I can do it, or some better way? I just want to build a filter or lookup function from some array data, using NumPy, Pandas or SciPy if possible.
You are looking for scipy.interpolate.RectBivariateSpline:
#lib
import numpy as np
from scipy import interpolate
#input
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
#solution
spline = interpolate.RectBivariateSpline(X, Y, W.T, kx=1, ky=1)
#example (the spline returns a 2-D array)
spline(150, 1)    # array([[1.]])
spline(100, 1.5)  # array([[1.5]])
Let me know if it solves the issue.
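If you prefer something closer to a plain lookup table, scipy.interpolate.RegularGridInterpolator does the same bilinear interpolation; a sketch with the same inputs (note it returns an array rather than a scalar):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

X = np.array([100, 200, 300, 400, 500])
Y = np.array([1, 2, 3])
W = np.array([[1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3]])

# the grid is (X, Y), so the value array must have shape (len(X), len(Y))
interp = RegularGridInterpolator((X, Y), W.T)  # linear interpolation by default
print(interp([[150, 1], [100, 1.5]]))  # [1.  1.5]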
My toy example is as follows:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
### prepare data
Xy = np.c_[load_iris(return_X_y=True)]
mycol = ['x1','x2','x3','x4','group']
df = pd.DataFrame(data=Xy, columns=mycol)
dat = df.iloc[:100,:] #only consider two species
dat['group'] = dat.group.apply(lambda x: 1 if x ==0 else 2) #two species means two groups
dat.shape
dat.head()
### Linear discriminant analysis procedure
G1 = dat.iloc[:50,:-1]; x1_bar = G1.mean(); S1 = G1.cov(); n1 = G1.shape[0]
G2 = dat.iloc[50:,:-1]; x2_bar = G2.mean(); S2 = G2.cov(); n2 = G2.shape[0]
Sp = (n1-1)/(n1+n2-2)*S1 + (n2-1)/(n1+n2-2)*S2
a = np.linalg.inv(Sp).dot(x1_bar-x2_bar); u_bar = (x1_bar + x2_bar)/2
m = a.T.dot(u_bar); print("Linear discriminant boundary is {} ".format(m))
def my_lda(x):
    y = a.T.dot(x)
    pred = 1 if y >= m else 2
    return y.round(4), pred
xx = dat.iloc[:,:-1]
xxa = xx.agg(my_lda, axis=1)
xxa.shape
type(xxa)
Now xxa is a pandas.core.series.Series with shape (100,), and each element is a 2-tuple (the two values in parentheses). I want to convert xxa to a pd.DataFrame with 100 rows x 2 columns, so I try
xxa_df1 = pd.DataFrame(data=xxa, columns=['y','pred'])
which gives ValueError: Shape of passed values is (100, 1), indices imply (100, 2).
Then I continue to try
xxa2 = xxa.to_frame()
# xxa2 = pd.DataFrame(xxa) #equals `xxa.to_frame()`
xxa_df2 = pd.DataFrame(data=xxa2, columns=['y','pred'])
and xxa_df2 contains all NaN, with 100 rows x 2 columns. What should I do next?
Let's try Series.tolist()
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y','pred'])
print(xxa_df1)
          y  pred
0   42.0080     1
1   32.3859     1
2   37.5566     1
3   31.0958     1
4   43.5050     1
..      ...   ...
95 -56.9613     2
96 -61.8481     2
97 -62.4983     2
98 -38.6006     2
99 -61.4737     2

[100 rows x 2 columns]
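Two refinements worth knowing (both sketches, not required by the answer): passing index=xxa.index keeps the original row labels, and DataFrame.apply with result_type='expand' skips the intermediate Series of tuples entirely:

# keep the original index while expanding the tuples
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y', 'pred'], index=xxa.index)

# or let apply expand the returned tuple into columns directly
xxa_df2 = xx.apply(my_lda, axis=1, result_type='expand')
xxa_df2.columns = ['y', 'pred']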
I want to shuffle a pandas dataframe 'n' times, save each shuffled dataframe under a new name, and then export it to a 'csv' file. What I mean is:
import pandas as pd
import sklearn
import numpy as np
from sklearn.utils import shuffle
df = pd.read_csv('example.csv')
Then something like this-
for i in np.arange(n):
    df_%i = shuffle(df)
    df_%i.to_csv('example.csv')
I appreciate any help. Thanks!
You can use
for i in range(n):
    df.sample(frac=1).to_csv(f"example_{i}.csv")
If you need to create an arbitrary number of variables, you should store them in a dictionary and you can reference them later by their keys; in this case the integer you loop over.
d = {}
for i in range(n):
    d[i] = df.sample(frac=1)  # d[i] = shuffle(df) in your case
    d[i].to_csv(f'example_{i}.csv')
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (3, 3)))
d = {}
for i in range(5):
    d[i] = df.sample(frac=1)
d[1]
# 0 1 2
#0 6 3 2
#1 7 6 4
#2 2 6 9
d[2]
# 0 1 2
#2 2 6 9
#1 7 6 4
#0 6 3 2
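If the shuffles need to be reproducible, a per-iteration seed can be passed through random_state (the seed scheme here is an assumption):

d = {}
for i in range(5):
    # random_state=i makes shuffle number i reproducible across runs
    d[i] = df.sample(frac=1, random_state=i)
    d[i].to_csv(f'example_{i}.csv')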
Let's say I'm in the following situation:
import pandas as pd
import dask.dataframe as dd
import random
s = "abcd"
lst = 10*[0]+list(range(1,6))
n = int(1e2)
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
                   "col2": [random.choice(lst) for i in range(n)]})
df["idx"] = df.col1
df = df[["idx","col1","col2"]]
def fun(data):
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1
df.set_index("idx", inplace=True)
ddf1 = dd.from_pandas(df, npartitions=4)
gpb = ddf1.groupby("col1").apply(fun, meta=pd.Series(name='col3'))
ddf2 = ddf1.join(gpb.to_frame(), on="col1")
While ddf1.known_divisions is True, ddf2.known_divisions is False. I would like to preserve the same divisions in the ddf2 dataframe. In one random example I even got an empty partition:
for i in range(ddf1.npartitions):
    print(i, len(ddf1.get_partition(i)), len(ddf2.get_partition(i)))
0 27 50
1 29 0
2 23 21
3 21 29
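One workaround sketch (my own, not from the post, and it costs a shuffle): recompute the divisions after the join by setting the index again, which lets dask re-derive the partition boundaries:

# reset and re-set the index; dask recomputes divisions during set_index
ddf3 = ddf2.reset_index().set_index("idx")
print(ddf3.known_divisions)  # True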
Hello, I am trying to convert this R expression to pandas, as I am not familiar with R:
sum(data_file$finished_race_date >= 0, na.rm = TRUE)/sum(data_file$signup_race_date >= 0, na.rm = TRUE)
I am trying to figure out what percentage of runners finished the race.
If you need the ratio, divide the sums of True values of two boolean masks built with notnull:
100 * data_file.finished_race_date.notnull().sum()/data_file.signup_race_date.notnull().sum()
Sample:
import pandas as pd
import numpy as np
data_file = pd.DataFrame({'finished_race_date': ['2/5/16', np.nan, np.nan],
                          'signup_race_date': [np.nan, '2/5/16', '2/5/16']})
print (data_file)
  finished_race_date signup_race_date
0             2/5/16              NaN
1                NaN           2/5/16
2                NaN           2/5/16
print (data_file.finished_race_date.notnull())
0 True
1 False
2 False
Name: finished_race_date, dtype: bool
print (data_file.finished_race_date.notnull().sum())
1
finished_race_date = data_file.finished_race_date.notnull().sum()
signup_race_date = data_file.signup_race_date.notnull().sum()
print (100 * finished_race_date / signup_race_date)
50.0
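Series.count() gives the same non-NA totals a bit more tersely; this is equivalent to the notnull().sum() approach as long as missing values are the only NaNs in the columns:

finished = data_file.finished_race_date.count()   # number of non-NA entries
signed_up = data_file.signup_race_date.count()
print(100 * finished / signed_up)  # 50.0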