I have a dataframe (ev), and I would like to read it and whenever the value of the 'trig' column is 64, I need to update the value of the critical column that is 4 rows above, and change it to 999. I tried the code below but it does not change anything, though seems it should work.
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] == 999
Try this, you were close: Learn about single "=" vs double "=="
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] = 999
I think you can mask by mask with fillna False, because after shift you get NaN:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 5 4
5 7 6
6 8 8
7 9 9
8 0 64
9 7 6
10 6 7
11 4 8
12 3 6
13 5 64
mask = (ev.trig == 64).shift(-4).fillna(False)
print (mask)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 False
13 False
Name: trig, dtype: bool
ev['critical'] = ev.critical.mask(mask, 999)
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 999 4
5 7 6
6 8 8
7 9 9
8 0 64
9 999 6
10 6 7
11 4 8
12 3 6
13 5 64
EDIT:
Timings:
I think better is avoiding iteration in pandas, because in large dataframe it is very slow:
len(df)=1400:
In [66]: %timeit (jez(ev))
1000 loops, best of 3: 1.29 ms per loop
In [67]: %timeit (mer(ev1))
10 loops, best of 3: 49.9 ms per loop
len(df)=14k:
In [59]: %timeit (jez(ev))
100 loops, best of 3: 2.49 ms per loop
In [60]: %timeit (mer(ev1))
1 loop, best of 3: 501 ms per loop
len(df)=140k:
In [63]: %timeit (jez(ev))
100 loops, best of 3: 15.8 ms per loop
In [64]: %timeit (mer(ev1))
1 loop, best of 3: 6.32 s per loop
Code for timings:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
ev = pd.concat([ev]*100).reset_index(drop=True)
#ev = pd.concat([ev]*1000).reset_index(drop=True)
#ev = pd.concat([ev]*10000).reset_index(drop=True)
ev1 = ev.copy()
def jez(df):
ev['critical'] = ev.critical.mask((ev.trig == 64).shift(-4).fillna(False), 999)
return (ev)
def mer(df):
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] = 999
return (ev)
print (jez(ev))
print (mer(ev1))
Related
I have a wide dataset:
id x0 x1 x2 x3 x4 x5 ... x10000 Type
1 40 31.05 25.5 25.5 25.5 25 ... 33 1
2 35 35.75 36.5 26.5 36.5 36.5 ... 29 0
3 35 35.70 36.5 36.5 36.5 36.5 ... 29 1
4 40 31.50 23.5 24.5 26.5 25 ... 33 1
...
900 40 31.05 25.5 25.5 25.5 25 ... 23 0
with each row being a time series. I would like to standardise in place all values except for the last column, with each row/time series as an independent distribution. I am thinking about appending 2 columns mean and std(standard deviation) to the rightmost of the dataframe, and standardise using apply. But it sounds cumbersome and might make mistakes in the process. How can I do this and is there an easier way? Thanks
Method 1:
We can use sklearn.preprocessing.scale! Set axis = 1 to scale data on each row!
This kind of data cleaning can be done nicely using sklearn.preprocessing.Here is an official doc
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter columns and then standardlize inplace
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Method 2:
Just use lambda function if your dataset is not huge.
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter columns and than standardlize inplace
data.loc[:,~data.columns.isin(['type'])] = data.loc[:,~data.columns.isin(['type'])].\
apply(lambda x: (x - np.mean(x))/np.std(x), axis = 1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Speed compare:
Method 1 is faster then method 2.
Method 1: 2.03 ms ± 205 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
Method 2: 3.06 ms ± 153 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])].apply(lambda x: (x - np.mean(x))/np.std(x), axis = 0)
You could compute mean and std manually:
stats = df.iloc[:,1:-1].agg(['mean','std'], axis=1) # axis=1 apply on rows
df.iloc[:, 1:-1] = (df.iloc[:, 1:-1]
.sub(stats['mean'], axis='rows') # axis='rows' apply on rows
.div(stats['std'],axis='rows')
)
output:
id x0 x1 x2 x3 x4 x5 x10000 Type
0 1 1.87515 0.297204 -0.681302 -0.681302 -0.681302 -0.769456 0.641003 1
1 2 0.31841 0.499129 0.679848 -1.72974 0.679848 0.679848 -1.12734 0
2 3 -0.0363456 0.218074 0.508839 0.508839 0.508839 0.508839 -2.21708 1
3 4 1.81012 0.392987 -0.940787 -0.774066 -0.440622 -0.690705 0.64307 1
I want to sum the different values for each column. i think that i should use a special aggregation using apply() but i don't know the correct code
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
#i want to have this result
for each value in column A
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicates per index positions. You can then re-create the origin matrix conveniently using a sparse matrix.
You could accomplish the same thing create a zero array and adding, but this way you avoid the large memory requirement if your A column is very sparse.
from scipy import sparse
def non_dupe_sums_2D(ids, values):
v = np.unique(ids)
x, y = values.shape
r = np.arange(y)
m = np.repeat(a, y)
n = np.tile(r, x)
u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
return x.apply(lambda x: x.unique().sum())
df.groupby('A')[df.columns ^ {'A'}].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
Time variable is sorted and InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like that:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1,len(df)):
r = int(df.InvInstance.iloc[i])
rprev = int(df.InvInstance.iloc[i-1])
m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific; there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in terms of rows) to the end of each block by the value of InvInstance. The question is whether the time difference between a row and the end of the block is less than 10 minutes and it was greater than 10 in the previous row. Is it possible to do this without loops such as shift() etc, so that it runs much faster?
I don't see/know how to use internal vectorized Pandas/Numpy methods for shifting Series/Array using a non-scalar / vector step, but we can use Numba here:
from numba import jit
#jit
def dyn_shift(s, step):
assert len(s) == len(step), "[s] and [step] should have the same length"
assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
N = len(s)
res = np.empty(N, dtype=s.dtype)
for i in range(N):
res[i] = s[i+step[i]-1]
return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for 8.000 rows DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
...: df["10mMark"] = 0
...: for i in range(1,len(df)):
...: r = int(df.InvInstance.iloc[i])
...: rprev = int(df.InvInstance.iloc[i-1])
...: m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
...: mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
...: df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
Actually, your m is the time delta between the time of a row and the time at the end of the 'block' and the mprev is the same thing but with the time at the previous row (so it's actually shift of m). My idea is to create a column containing the time at the end of the block, by first identifying each block, then merge with the last time when using groupby on block . Then calculate the difference for creating a column 'm' and use the np.where and shift to finally fill the column 10mMark.
# a column with incremental value for each block end
df['block'] = df.InvInstance[df.InvInstance ==1].cumsum()
#to back fill the number to get all block with same value of block
df['block'] = df['block'].bfill() #to back fill the number
# now merge to create a column time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on = 'block', suffixes=('','_last'), how='left')
# create column m with just a difference
df['m'] = df['Time_last'] - df['Time']
# now you can use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10),1,0)
#just drop the useless column
df = df.drop(['block', 'Time_last','m'],1)
your final result before dropping, to see what as been created, looks like
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result
It is not as efficient as with the solution of #MaxU with Numba, but with a df of 8000 rows as he used, I get speed up factor of about 350.
I have a pandas DataFrame df like this
mat time
0 101 20
1 102 7
2 103 15
I need to divide the rows so the column of time doesn't have any values higher than t=10 to have something like this
mat time
0 101 10
2 101 10
3 102 7
4 103 10
5 103 5
the index doesn't matter
If I'd use groupby('mat')['time'].sum() on this df I would have the original df, but I need like an inverse of the groupby func.
Is there any way to get the ungrouped DataFrame with the condition of time <= t?
I'm trying to use a loop here but it's kind of 'unPythonic', any ideas?
Use an apply function that loops until all are less than 10.
def split_max_time(df):
new_df = df.copy()
while new_df.iloc[-1, -1] > 10:
temp = new_df.iloc[-1, -1]
new_df.iloc[-1, -1] = 10
new_df = pd.concat([new_df, new_df])
new_df.iloc[-1, -1] = temp - 10
return new_df
print df.groupby('mat', group_keys=False).apply(split_max_time)
mat time
0 101 10
0 101 10
1 102 7
2 103 10
2 103 5
You could .groupby('mat') and .apply() a combination of integer division and modulo operation using the cutoff (10) to decompose each time value into the desired components:
cutoff = 10
def decompose(time):
components = [cutoff for _ in range(int(time / cutoff))] + [time.iloc[0] % cutoff]
return pd.Series([c for c in components if c > 0])
df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
to get:
mat
101 10
101 10
102 7
103 10
103 5
In case you care about performance:
%timeit df.groupby('mat', group_keys=False).apply(split_max_time)
100 loops, best of 3: 4.21 ms per loop
%timeit df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
1000 loops, best of 3: 1.83 ms per loop
Is it possible in pandas to select the 5 rows before/after a specific row if they match a specific condition?
For instance, is it possible to start from row 19, and then select the five preceding rows for which b is True (thus selecting 16,16,10,7, and 4). I would call this 'relative' location. (Is there a better term for this? A place I can read about this type of lookup?)
a | b
=============
0 True
1 False
4 True
7 True
9 False
10 True
13 True
16 True
18 False
19 True
try this:
In [31]: df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
Out[31]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
Step by step:
index of the element where a==19:
In [32]: df[df.a == 19].index[0]
Out[32]: 9
now we can list all elements where b is True and which index is less than 9:
In [30]: df.ix[(df.b) & (df.index <9)].tail(5)
Out[30]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
now combine both of them:
In [33]: df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
Out[33]:
a b
2 4 True
3 7 True
5 10 True
6 13 True
7 16 True
Speed it up a little bit:
In [103]: idx19 = df[df.a == 19].index[0]
In [104]: idx19
Out[104]: 9
In [107]: %timeit df.ix[(df.b) & (df.index < df[df.a == 19].index[0])].tail(5)
1000 loops, best of 3: 973 µs per loop
In [108]: %timeit df.ix[(df.b) & (df.index < idx19)].tail(5)
1000 loops, best of 3: 564 µs per loop
PS so it got 42% faster