Use if condition in Python Pandas

I have a dataframe and I'm trying to apply a conditional calculation: if column A is 'ON', then column E should be C + D; otherwise column E should be MAX(B, C) - C + D.
df1:
T_ID A B C D
1 ON 100 90 0
2 OFF 150 120 -20
3 OFF 200 150 0
4 ON 400 320 0
5 ON 100 60 -10
6 ON 250 200 0
Resulting Data frame
T_ID A B C D E
1 ON 100 90 0 90
2 OFF 150 120 -20 10
3 OFF 200 150 0 50
4 ON 400 320 0 320
5 ON 100 60 -10 50
6 ON 250 200 0 200
I am using the following code; any suggestions on how I can do it in a better way?
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition, df1['C'] + df1['D'], max(df1['B'],df1['C'])-df1['C']+df1['D'])

I think np.where is a good approach here. For me numpy.maximum works, while the built-in max raises an error:
condition = df1['A'].eq('ON')
df1['E'] = np.where(condition,
                    df1['C'] + df1['D'],
                    np.maximum(df1['B'], df1['C']) - df1['C'] + df1['D'])
print (df1)
T_ID A B C D E
0 1 ON 100 90 0 90
1 2 OFF 150 120 -20 10
2 3 OFF 200 150 0 50
3 4 ON 400 320 0 320
4 5 ON 100 60 -10 50
5 6 ON 250 200 0 200
df1['E'] = np.where(condition,
                    df1['C'] + df1['D'],
                    max(df1['B'], df1['C']) - df1['C'] + df1['D'])
print (df1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
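The error with the built-in max happens because max compares the two Series with a comparison operator, which returns a boolean Series, and Python then tries to reduce that Series to a single truth value. np.maximum works element-wise instead. A minimal illustration (an addition, using two small Series):
import numpy as np
import pandas as pd

b = pd.Series([150, 200])
c = pd.Series([120, 150])

# max(b, c) needs a single truth value from the Series comparison, which raises
# "The truth value of a Series is ambiguous"; np.maximum compares element-wise:
print(np.maximum(b, c))   # -> 150, 200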
Here apply is the worse solution, because it loops under the hood and is therefore slow:
# 6k rows -> for the sample data np.where is about 265 times faster than apply
df1 = pd.concat([df1] * 1000, ignore_index=True)
print (df1)
In [73]: %%timeit
...: condition = df1['A'].eq('ON')
...:
...: df1['E1'] = np.where(condition,
...: df1['C'] + df1['D'],
...: np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
...:
1.91 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %%timeit
...: df1['E2'] = df1.apply(createE, axis=1)
...:
507 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think the apply function would be a better solution.
The code may look like this:
def createE(row):
    if row.A == 'ON':
        return row.C + row.D
    else:
        return max(row.B, row.C) - row.C + row.D

df1['E'] = df1.apply(createE, axis=1)
See more about apply at https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/


How to standardise in place in pandas

I have a wide dataset:
id x0 x1 x2 x3 x4 x5 ... x10000 Type
1 40 31.05 25.5 25.5 25.5 25 ... 33 1
2 35 35.75 36.5 26.5 36.5 36.5 ... 29 0
3 35 35.70 36.5 36.5 36.5 36.5 ... 29 1
4 40 31.50 23.5 24.5 26.5 25 ... 33 1
...
900 40 31.05 25.5 25.5 25.5 25 ... 23 0
with each row being a time series. I would like to standardise in place all values except for the last column, treating each row/time series as an independent distribution. I am thinking about appending two columns, mean and std (standard deviation), to the rightmost of the dataframe and standardising using apply, but that sounds cumbersome and error-prone. How can I do this, and is there an easier way? Thanks
Method 1:
We can use sklearn.preprocessing.scale. Set axis=1 to scale the data in each row.
This kind of data cleaning can be done nicely with sklearn.preprocessing; see the official docs.
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A': np.random.randint(5, 15, 100), 'B': np.random.randint(1, 10, 100),
                     'C': np.random.randint(0, 10, 100), 'type': np.random.randint(0, 2, 100)})
data.head()
# filter columns and then standardise in place
data.loc[:, ~data.columns.isin(['type'])] = scale(data.loc[:, ~data.columns.isin(['type'])], axis=1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Method 2:
Just use a lambda function if your dataset is not huge.
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A': np.random.randint(5, 15, 100), 'B': np.random.randint(1, 10, 100),
                     'C': np.random.randint(0, 10, 100), 'type': np.random.randint(0, 2, 100)})
data.head()
# filter columns and then standardise in place
data.loc[:, ~data.columns.isin(['type'])] = data.loc[:, ~data.columns.isin(['type'])] \
    .apply(lambda x: (x - np.mean(x)) / np.std(x), axis=1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Speed comparison:
Method 1 is faster than Method 2.
Method 1: 2.03 ms ± 205 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:, ~data.columns.isin(['type'])] = scale(data.loc[:, ~data.columns.isin(['type'])], axis=1)
Method 2: 3.06 ms ± 153 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:, ~data.columns.isin(['type'])].apply(lambda x: (x - np.mean(x)) / np.std(x), axis=1)
You could compute mean and std manually:
stats = df.iloc[:, 1:-1].agg(['mean', 'std'], axis=1)   # axis=1: aggregate over each row
df.iloc[:, 1:-1] = (df.iloc[:, 1:-1]
                    .sub(stats['mean'], axis='rows')     # axis='rows': align on the row index
                    .div(stats['std'], axis='rows'))
output:
id x0 x1 x2 x3 x4 x5 x10000 Type
0 1 1.87515 0.297204 -0.681302 -0.681302 -0.681302 -0.769456 0.641003 1
1 2 0.31841 0.499129 0.679848 -1.72974 0.679848 0.679848 -1.12734 0
2 3 -0.0363456 0.218074 0.508839 0.508839 0.508839 0.508839 -2.21708 1
3 4 1.81012 0.392987 -0.940787 -0.774066 -0.440622 -0.690705 0.64307 1
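As a side note (an addition, not from the original answers): any small numeric differences against the sklearn output come from the std definition; sklearn.preprocessing.scale divides by the population std (ddof=0), while pandas' std() defaults to the sample std (ddof=1). A variant of the snippet above that matches sklearn's convention:
vals = df.iloc[:, 1:-1]
# ddof=0 matches sklearn.preprocessing.scale; pandas' default is ddof=1
df.iloc[:, 1:-1] = vals.sub(vals.mean(axis=1), axis=0).div(vals.std(axis=1, ddof=0), axis=0)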

Summing the different values in pandas data frame

I want to sum the distinct values in each column, for each value in column A. I think I should use a special aggregation with apply(), but I don't know the correct code.
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
# I want to have this result, for each value in column A:
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicate values per (group, column) position and then conveniently re-creating the original matrix with a sparse matrix.
You could accomplish the same thing by creating a zero array and adding into it, but this way you avoid the large memory requirement if your A column is very sparse.
import numpy as np
from scipy import sparse

def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)   # group id of every cell
    n = np.tile(r, x)       # column index of every cell
    # keep each (group, column, value) triple only once; the sparse constructor sums the rest
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A

a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
    return x.apply(lambda x: x.unique().sum())

df.groupby('A')[df.columns ^ {'A'}].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
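For example, a minimal sketch of that transpose (an addition to the answer above):
out = pd.concat([g.apply(lambda x: x.unique().sum()) for v, g in df.groupby('A')], axis=1)
out = out.T.reset_index(drop=True)
print(out)
#    A  B  C  D   E   F  G
# 0  1  5  3  4  13  13  7
# 1  2  3  3  5  15   6  1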

Updating several columns at once using iloc

How can I update several columns of a row in a more optimized way?
masks_df.iloc[mask_row.Index, shelf_number_idx] = tags_on_mask['shelf_number'].iloc[0]
masks_df.iloc[mask_row.Index, stacking_layer_idx] = tags_on_mask['stacking_layer'].iloc[0]
masks_df.iloc[mask_row.Index, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].iloc[0]
Thanks.
Use:
tags_on_mask = pd.DataFrame({
    'A': list('ab'),
    'facing_sequence_number': [30, 5],
    'stacking_layer': [70, 8],
    'col_2': [5, 7],
    'shelf_number': [50, 3],
})
print (tags_on_mask)
A facing_sequence_number stacking_layer col_2 shelf_number
0 a 30 70 5 50
1 b 5 8 7 3
np.random.seed(100)
masks_df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=tags_on_mask.columns)
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
shelf_number_idx = 1
stacking_layer_idx = 2
facing_sequence_number_idx = 3
pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
cols = ['shelf_number','stacking_layer','facing_sequence_number']
You can pass a list to iloc and select the first row of the columns as a NumPy array, but performance does not increase (only more readable code, in my opinion):
masks_df.iloc[3, pos] = tags_on_mask[cols].values[0, :]
To improve performance, it is possible to use DataFrame.iat:
masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
Or:
for i, c in zip(pos, cols):
    masks_df.iat[2, i] = tags_on_mask[c].values[0]
print (masks_df)
A facing_sequence_number stacking_layer col_2 shelf_number
0 8 8 3 7 7
1 0 4 2 5 2
2 2 50 70 30 8
3 4 50 70 30 2
4 4 1 5 3 4
In [97]: %%timeit
...: pos = [shelf_number_idx, stacking_layer_idx, facing_sequence_number_idx]
...: cols = ['shelf_number','stacking_layer','facing_sequence_number']
...: vals = tags_on_mask[cols].values[0, :]
...: masks_df.iloc[3, pos] = vals
...:
2.34 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [98]: %%timeit
...: masks_df.iat[2, shelf_number_idx] = tags_on_mask['shelf_number'].values[0]
...: masks_df.iat[2, stacking_layer_idx] = tags_on_mask['stacking_layer'].values[0]
...: masks_df.iat[2, facing_sequence_number_idx] = tags_on_mask['facing_sequence_number'].values[0]
...:
34.1 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [100]: %%timeit
...: for i, c in zip(pos, cols):
...: masks_df.iat[2, i] = tags_on_mask[c].values[0]
...:
33.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
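As a side note (an addition, not part of the original answer): if row labels rather than positions are convenient, the same multi-column update can be written with label-based loc. This sketch assumes row label 3 exists in masks_df and row label 0 in tags_on_mask:
cols = ['shelf_number', 'stacking_layer', 'facing_sequence_number']
# label-based counterpart of the iloc example above
masks_df.loc[3, cols] = tags_on_mask.loc[0, cols].to_numpy()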

Comparing values from Pandas data frame column using offset values from another column

I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted and the InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1, len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10), 1, 0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in terms of rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was greater than 10 in the previous row. Is it possible to do this without loops, using shift() etc., so that it runs much faster?
I don't see/know how to use internal vectorized Pandas/Numpy methods for shifting Series/Array using a non-scalar / vector step, but we can use Numba here:
from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i + step[i] - 1]
    return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for an 8,000-row DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
...: df["10mMark"] = 0
...: for i in range(1,len(df)):
...:     r = int(df.InvInstance.iloc[i])
...:     rprev = int(df.InvInstance.iloc[i-1])
...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
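As a side note (an addition, not part of the original answer): the variable-step lookup res[i] = s[i + step[i] - 1] is plain fancy indexing, so the same masks can also be built with NumPy alone, without Numba:
import numpy as np
import pandas as pd

t = df['Time'].to_numpy()
step = df['InvInstance'].to_numpy()

block_end = t[np.arange(len(df)) + step - 1]      # time at the end of each row's block
m = pd.Series(block_end - t, index=df.index)      # same as dyn_shift(...) - df.Time
df['10mMark'] = np.where((m < 10) & (m.shift() >= 10), 1, 0)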
Actually, your m is the time delta between the time of a row and the time at the end of the 'block', and mprev is the same thing but for the previous row (so it's actually a shift of m). My idea is to create a column containing the time at the end of the block, by first identifying each block and then merging with the last time obtained from a groupby on block. Then calculate the difference to create a column 'm' and use np.where together with shift to finally fill the column 10mMark.
# a column with an incremental value at each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so every row of a block carries the same block number
df['block'] = df['block'].bfill()
# merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(),
              on='block', suffixes=('', '_last'), how='left')
# create column m as the difference
df['m'] = df['Time_last'] - df['Time']
# now use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns
df = df.drop(['block', 'Time_last', 'm'], axis=1)
Your final result before dropping, to see what has been created, looks like:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result.
It is not as efficient as @MaxU's Numba solution, but with a df of 8000 rows as he used, I get a speed-up factor of about 350.

Divide rows of python pandas DataFrame

I have a pandas DataFrame df like this
mat time
0 101 20
1 102 7
2 103 15
I need to divide the rows so the column of time doesn't have any values higher than t=10 to have something like this
mat time
0 101 10
2 101 10
3 102 7
4 103 10
5 103 5
the index doesn't matter
If I used groupby('mat')['time'].sum() on this df I would get the original df back, but I need something like an inverse of the groupby function.
Is there any way to get the ungrouped DataFrame with the condition time <= t?
I'm trying to use a loop here, but it's kind of un-Pythonic; any ideas?
Use an apply function that loops until no value exceeds 10.
def split_max_time(df):
    new_df = df.copy()
    while new_df.iloc[-1, -1] > 10:
        temp = new_df.iloc[-1, -1]
        new_df.iloc[-1, -1] = 10
        new_df = pd.concat([new_df, new_df])
        new_df.iloc[-1, -1] = temp - 10
    return new_df

print(df.groupby('mat', group_keys=False).apply(split_max_time))
mat time
0 101 10
0 101 10
1 102 7
2 103 10
2 103 5
You could .groupby('mat') and .apply() a combination of integer division and modulo operation using the cutoff (10) to decompose each time value into the desired components:
cutoff = 10

def decompose(time):
    components = [cutoff for _ in range(int(time / cutoff))] + [time.iloc[0] % cutoff]
    return pd.Series([c for c in components if c > 0])

df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
to get:
mat
101 10
101 10
102 7
103 10
103 5
In case you care about performance:
%timeit df.groupby('mat', group_keys=False).apply(split_max_time)
100 loops, best of 3: 4.21 ms per loop
%timeit df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
1000 loops, best of 3: 1.83 ms per loop
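As a side note (an addition, assuming all time values are positive): the same 'inverse groupby' can be sketched without apply by repeating each row and filling the pieces directly, which tends to scale better on large frames:
import numpy as np
import pandas as pd

t = 10
n = np.ceil(df['time'] / t).astype(int).to_numpy()   # number of pieces per row

out = df.loc[df.index.repeat(n)].copy()              # repeat each row n times
pieces = np.full(n.sum(), t)                         # every piece is the cutoff...
pieces[np.cumsum(n) - 1] = df['time'].to_numpy() - t * (n - 1)   # ...except the last piece of each row
out['time'] = pieces
print(out.reset_index(drop=True))
#    mat  time
# 0  101    10
# 1  101    10
# 2  102     7
# 3  103    10
# 4  103     5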
