operation between columns according to the value it contains - python

I have a DataFrame that looks like this:
df_1:
Phase_1 Phase_2 Phase_3
0 8 4 2
1 4 6 3
2 8 8 3
3 10 5 8
...
I'd like to add a column called "Coeff" that computes (Phase_max - Phase_min) / Phase_max.
For the first row: Coeff = (Phase_1 - Phase_3) / Phase_1 = (8 - 2) / 8 = 0.75
Expected OUTPUT:
df_1
Phase_1 Phase_2 Phase_3 Coeff
0 8 4 2 0.75
1 4 6 3 0.5
2 8 8 3 0.625
3 10 5 8 0.5
What is the best way to compute this without using a loop? I want to apply it to a large dataset.

Here is one way to do it:
# list the columns, you like to use in calculations
cols=['Phase_1', 'Phase_2', 'Phase_3']
# using max and min across the axis to calculate, for the defined columns
df['coeff']=(df[cols].max(axis=1).sub(df[cols].min(axis=1))).div(df[cols].max(axis=1))
df
A small performance optimization (credit: Yevhen Kuzmovych):
df['coeff']= 1 - (df[cols].min(axis=1).div(df[cols].max(axis=1)))
df
Phase_1 Phase_2 Phase_3 coeff
0 8 4 2 0.750
1 4 6 3 0.500
2 8 8 3 0.625
3 10 5 8 0.500
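For reference, here is a minimal, self-contained sketch of the same approach; the frame below simply re-creates the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Phase_1': [8, 4, 8, 10],
                   'Phase_2': [4, 6, 8, 5],
                   'Phase_3': [2, 3, 3, 8]})
cols = ['Phase_1', 'Phase_2', 'Phase_3']
df['coeff'] = 1 - df[cols].min(axis=1).div(df[cols].max(axis=1))
print(df)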

As per the OP's specification:
I only want the max or the min between Phase_1, Phase_2 and Phase_3 and not other columns
The following will do the work:
columns = ['Phase_1', 'Phase_2', 'Phase_3']
max_phase = df[columns].max(axis = 1)
min_phase = df[columns].min(axis = 1)
df['Coeff'] = (max_phase - min_phase) / max_phase
# or
max_phase = df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis = 1)
min_phase = df[['Phase_1', 'Phase_2', 'Phase_3']].min(axis = 1)
df['Coeff'] = (max_phase - min_phase) / max_phase
# or
df['Coeff'] = (df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis = 1) - df[['Phase_1', 'Phase_2', 'Phase_3']].min(axis = 1)) / df[['Phase_1', 'Phase_2', 'Phase_3']].max(axis = 1)
[Out]:
Phase_1 Phase_2 Phase_3 Coeff
0 8 4 2 0.750
1 4 6 3 0.500
2 8 8 3 0.625
3 10 5 8 0.500
Another alternative would be to use NumPy's built-in functions, as follows:
import numpy as np

columns = ['Phase_1', 'Phase_2', 'Phase_3']
max_phase = np.max(df[columns], axis = 1)
min_phase = np.min(df[columns], axis = 1)
df['Coeff'] = (max_phase - min_phase) / max_phase
# or
max_phase = np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis = 1)
min_phase = np.min(df[['Phase_1', 'Phase_2', 'Phase_3']], axis = 1)
df['Coeff'] = (max_phase - min_phase) / max_phase
# or
df['Coeff'] = (np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis = 1) - np.min(df[['Phase_1', 'Phase_2', 'Phase_3']], axis = 1)) / np.max(df[['Phase_1', 'Phase_2', 'Phase_3']], axis = 1)
[Out]:
Phase_1 Phase_2 Phase_3 Coeff
0 8 4 2 0.750
1 4 6 3 0.500
2 8 8 3 0.625
3 10 5 8 0.500
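If the frame is very large, one further option (my own suggestion, not part of the answers above) is to drop to a plain NumPy array; np.ptp computes max minus min per row in a single pass. A sketch, assuming the df and the np import above:
vals = df[['Phase_1', 'Phase_2', 'Phase_3']].to_numpy()
df['Coeff'] = np.ptp(vals, axis=1) / vals.max(axis=1)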

Related

Regarding new column in dataframe and getting KeyError

This is my code:
for i in range(len(df)-1):
    df['ot'].iloc[i] = (df['Open'].iloc[i] - df['Open'].iloc[i+1]) / df['Open'].iloc[i+1]
print(df['ot'])
Here, ot is a new column I am creating, and Open already exists in the dataframe. When I try to print ot after assigning the formula to it, I get a KeyError.
Replace your loop with vectorization (the KeyError most likely comes from reading df['ot'] before that column exists):
df['ot'] = df['Open'].diff(-1) / df['Open'].shift(-1)
print(df)
# Output
Open ot
1 2 -0.50 # (2 - 4) / 4 = -0.5
2 4 1.00 # (4 - 2) / 2 = 1
3 2 -0.75 # (2 - 8) / 8 = -0.75
4 8 NaN
It looks like pct_change:
df['ot'] = df['Open'].pct_change(-1)
Using @Corralien's example:
Open ot
0 2 -0.50
1 4 1.00
2 2 -0.75
3 8 NaN
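For completeness, a minimal, self-contained sketch showing that both forms agree; the Open column below is just the toy data from the example above:
import pandas as pd

df = pd.DataFrame({'Open': [2, 4, 2, 8]}, index=[1, 2, 3, 4])
df['ot_diff'] = df['Open'].diff(-1) / df['Open'].shift(-1)  # (current - next) / next
df['ot_pct'] = df['Open'].pct_change(-1)                    # same result via pct_change
print(df)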

subtract value from next rows until condition then subtract new value

say I have:
import pandas as pd

df = {'animal': [1, 1, 1, 1, 1, 1, 1, 2, 2],
      'x': [76.551, 77.529, 78.336, 77, 78.02, 79.23, 77.733, 79.249, 76.077],
      'y': [151.933, 152.945, 153.970, 119.369, 120.615, 118.935, 119.115, 152.004, 153.027],
      'time': [0, 1, 2, 0, 3, 2, 5, 0, 1]}
df = pd.DataFrame(df)
# get distance travelled between points
def get_diff(df):
    dx = df['x'] - df.groupby('animal')['x'].shift(1)
    dy = df['y'] - df.groupby('animal')['y'].shift(1)
    df['distance'] = (dx**2 + dy**2)**0.5
    return df
# get the start/end coordinates
def get_startend(df):
    for i in range(len(df)):
        df.loc[df['distance'] > 5, 'start'] = 'start'       # if distance > 5, assign 'start'
        df.loc[df['distance'].isnull(), 'start'] = 'start'  # if distance is NaN, assign 'start'
        # every row directly before a 'start' row gets 'end'
        cond = df['start'].shift(-1).str.contains('start').fillna(False)
        df.loc[cond, 'start'] = 'end'
        df['start'].iloc[-1] = 'end'                        # assign 'end' to the last row
    return df
df = get_diff(df)
df = get_startend(df)
After some preprocessing, I end up with:
animal x y time distance start
0 1 76.551 151.933 0 NaN start
1 1 77.529 152.945 1 1.407348 NaN
2 1 78.336 153.970 2 1.304559 end
3 1 77.000 119.369 0 34.626783 start
4 1 76.020 120.615 3 1.585218 NaN
5 1 79.230 118.935 2 3.623051 NaN
6 1 77.733 119.115 5 1.507783 end
7 2 79.249 152.004 0 NaN start
8 2 76.077 153.027 1 3.332884 end
I want to artificially recenter the start coordinates at (0, 0). So if the start column has 'start', subtract that row's x, y values from all the following rows until the next start index is reached, then subtract the new start's x, y values, and so on.
The output should look something like this:
animal x y time distance start newX newY
0 1 76.551 151.933 0 NaN start 0 0 #(76.551-76.551, 151.993-151.933)
1 1 77.529 152.945 1 1.407348 NaN 0.978 1.012 #(77.529-76.551, 152.945-151.933)
2 1 78.336 153.970 2 1.304559 end 1.785 2.012 #(78.336-76.551, 153.970-151.933)
3 1 77.000 119.369 0 34.626783 start 0 0 #(77-77, 119.369-119.369)
4 1 76.020 120.615 3 1.610253 NaN -0.98 1.246 #(76.020-77, 120.615-119.363)
5 1 79.230 118.935 2 3.623051 NaN 2.23 -0.434 #(..., ...)
6 1 77.733 119.115 5 1.507783 end 0.733 -0.254
7 2 79.249 152.004 0 NaN start 0 0 #(79.249-79.249, 152.004-152.004)
8 2 76.077 153.027 1 3.332884 end -3.172 1.023 #(76.077-79.249,153.027-152.004)
You can create a boolean mask based on start, then use cumsum to turn it into a grouper. Group by it, take the first value of x and y for each group, and subtract those firsts from x and y to get your new columns:
df[['newX', 'newY']] = df[['x', 'y']] - df.groupby(df['start'].eq('start').cumsum())[['x', 'y']].transform('first')
Output:
animal x y time distance start newX newY
0 1 76.551 151.933 0 NaN start 0.000 0.000
1 1 77.529 152.945 1 1.407348 NaN 0.978 1.012
2 1 78.336 153.970 2 1.304559 NaN 1.785 2.037
3 1 77.000 119.369 0 34.626783 start 0.000 0.000
4 1 76.020 120.615 3 1.585218 NaN -0.980 1.246
5 1 79.230 118.935 2 3.623051 NaN 2.230 -0.434
6 1 77.733 119.115 5 1.507783 NaN 0.733 -0.254
7 2 79.249 152.004 0 NaN start 0.000 0.000
8 2 76.077 153.027 1 3.332884 NaN -3.172 1.023
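To make the grouping step explicit, here is a small sketch of the intermediate objects (assuming the df built in the question):
is_start = df['start'].eq('start')   # True on every row where a new segment begins
grouper = is_start.cumsum()          # group labels: 1 1 1 2 2 2 2 3 3
firsts = df.groupby(grouper)[['x', 'y']].transform('first')  # first x/y of each group, broadcast to its rows
df[['newX', 'newY']] = df[['x', 'y']] - firsts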
You can use diff to compute the difference from the previous row within each group.
df[['new_x', 'new_y']] = \
df.groupby(df['start'].notna().cumsum())[['x', 'y']].diff().fillna(0)
print(df)
# Output
animal x y time distance start new_x new_y
0 1 76.551 151.933 0 NaN start 0.000 0.000
1 1 77.529 152.945 1 1.407348 NaN 0.978 1.012
2 1 78.336 153.970 2 1.304559 NaN 0.807 1.025
3 1 77.000 119.369 0 34.626783 start 0.000 0.000
4 1 78.020 120.615 3 1.610253 NaN 1.020 1.246
5 1 79.230 118.935 2 2.070386 NaN 1.210 -1.680
6 1 77.733 119.115 5 1.507783 NaN -1.497 0.180
7 2 79.249 152.004 0 NaN start 0.000 0.000
8 2 76.077 153.027 1 3.332884 NaN -3.172 1.023

Is there faster way to get values based on the linear regression model and append it to a new column in a DataFrame?

I created the code below to make a new column in my dataframe to compare the actual values and the regressed values:
b = dfSemoga.loc[:, ['DoB','AA','logtime']]
y = dfSemoga.loc[:,'logCO2'].values.reshape(len(dfSemoga)+1,1)
lr = LinearRegression().fit(b,y)
z = lr.coef_[0,0]
j = lr.coef_[0,1]
k = lr.coef_[0,2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i, 'EF CO2 Predict'] = (c + dfSemoga.loc[i, 'DoB']*z +
                                         dfSemoga.loc[i, 'logtime']*k + dfSemoga.loc[i, 'AA']*j)
So I basically regress a column on three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed values into a new column called dfSemoga['EF CO2 Predict'], I apply the coefficients manually, as shown in the for loop.
Is there any fancy one-liner code that I can write to make my work more efficient?
Without sample data I can't confirm, but you should just be able to do:
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:
In [4]: df
Out[4]:
a b
0 0 0
1 0 8
2 7 6
3 3 1
4 3 8
5 6 6
6 4 8
7 2 7
8 3 8
9 8 1
In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]
In [6]: df
Out[6]:
a b c
0 0 0 3.0
1 0 8 -45.0
2 7 6 -29.5
3 3 1 -1.5
4 3 8 -43.5
5 6 6 -30.0
6 4 8 -43.0
7 2 7 -38.0
8 3 8 -43.5
9 8 1 1.0
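If you prefer not to pull the coefficients out at all, a sketch of an alternative (assuming lr and b fitted as in the question) is to ask the model for its predictions directly; predict returns a 2-D array here because y was reshaped to (n, 1), so it needs flattening:
dfSemoga['EF CO2 Predict'] = lr.predict(b).ravel()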

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
import numpy as np
import pandas as pd

decay = 0.5
test = pd.DataFrame(np.random.randint(1, 10, 12), columns=['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test.loc[0, 'val'] + (test['val']*decay).cumsum() - test.loc[0, 'val']*decay
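A quick sketch to check that this matches the loop result (assuming test, decay and the loop from the question, with numpy imported as np):
vectorized = test.loc[0, 'val'] + (test['val']*decay).cumsum() - test.loc[0, 'val']*decay
assert np.allclose(vectorized, test['loop_decay'])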
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row of the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
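The apply call itself can also be replaced with plain column arithmetic, which is usually faster; a sketch keeping the same shift-based formula:
test['loop_decay'] = test['val'] + test['val'].shift(fill_value=0) * 0.5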

Python: how to find values in a dataframe without loop?

I have two dataframes net and M.
net =
i j d
0 5 3 3
1 2 0 2
2 3 2 1
3 4 5 2
4 0 1 3
5 0 3 4
M =
0 1 2 3 4 5
0 0 3 2 4 1 5
1 3 0 2 0 3 3
2 2 2 0 1 1 4
3 4 0 1 0 3 3
4 1 3 1 3 0 2
5 5 3 4 3 2 0
I want to find in M the same values as in net['d'], randomly choose one matching cell in M, and create a new dataframe containing the coordinates of that cell. For instance,
net['d'][0] = 3
so in M I find:
M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
Finally, net1 would be something like this:
net1 =
i1 j1 d1
0 1 5 3
1 5 4 2
2 2 3 1
3 1 2 2
4 1 5 3
5 3 0 4
This is what I am doing:
import numpy as np
import pandas as pd
from numpy.random import randint

I1 = []
J1 = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where(M == tmp)
    size = len(ds[0])
    ind = randint(size)  # pick a random index among the cells whose value equals tmp
    h = ds[0][ind]
    w = ds[1][ind]
    I1.append(h)
    J1.append(w)
net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']
I am wondering what the best way is to avoid that loop.
You can stack the columns of M and then just sample it with replacement
net = pd.DataFrame({'i': [5, 2, 3, 4, 0, 0],
                    'j': [3, 0, 2, 5, 1, 3],
                    'd': [3, 2, 1, 2, 3, 4]})
M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})
def random_net(net, M):
    # make a long table, then rename the columns
    net1 = M.stack().reset_index()
    net1.columns = ['i1', 'j1', 'd1']
    # get size of each group for random mapping
    net1_id_length = net1.groupby('d1').size()
    # add id column to uniquely identify each row in net
    net_copy = net.copy()
    # first map gets the size of each group, second gets a random integer below it
    net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
    net1['id'] = net1.groupby('d1').cumcount()
    # make for easy lookup
    net_copy = net_copy.set_index(['d', 'id'])
    net1 = net1.set_index(['d1', 'id'])
    # choose from net1 only the rows matching the original net
    return net1.reindex(net_copy.index).reset_index('d').reset_index(drop=True).rename(columns={'d': 'd1'})
random_net(net, M)
output
d1 i1 j1
0 3 5 1
1 2 0 2
2 1 3 2
3 2 1 2
4 3 3 5
5 4 0 3
Timings on 6 million rows
n = 1000000
net = pd.DataFrame({'i': [5, 2, 3, 4, 0, 0] * n,
                    'j': [3, 0, 2, 5, 1, 3] * n,
                    'd': [3, 2, 1, 2, 3, 4] * n})
M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})
%timeit random_net(net, M)
1 loop, best of 3: 13.7 s per loop
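If the number of distinct distance values is small, a simpler sketch (my own variant, not the answer timed above) is to loop over distances rather than rows: find all matching cells of M once per value and draw as many random coordinates as there are rows of net with that distance.
m = M.to_numpy()
net1 = net.copy()
net1['i1'] = -1          # placeholders, filled in below
net1['j1'] = -1
for d, idx in net.groupby('d').groups.items():
    cells = np.argwhere(m == d)                                   # all (row, col) where M equals d
    chosen = cells[np.random.randint(len(cells), size=len(idx))]  # one random cell per matching net row
    net1.loc[idx, ['i1', 'j1']] = chosen
net1 = net1[['i1', 'j1', 'd']].rename(columns={'d': 'd1'})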
