I have a pandas DataFrame df like this:
   mat  time
0  101    20
1  102     7
2  103    15
I need to split the rows so that the time column never has a value higher than t = 10, to get something like this:
   mat  time
0  101    10
2  101    10
3  102     7
4  103    10
5  103     5
(the index doesn't matter)
If I applied groupby('mat')['time'].sum() to that output I would get the original df back, so what I need is something like an inverse of groupby.
Is there any way to get the ungrouped DataFrame with the condition time <= t?
I'm trying to use a loop here, but that feels rather unpythonic. Any ideas?
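For reference, this is one way to construct the example frame shown above (my own reconstruction of the data):

import pandas as pd

df = pd.DataFrame({'mat': [101, 102, 103],
                   'time': [20, 7, 15]})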
Use an apply function that loops until no value exceeds 10.
def split_max_time(df):
    new_df = df.copy()
    while new_df.iloc[-1, -1] > 10:
        temp = new_df.iloc[-1, -1]
        new_df.iloc[-1, -1] = 10
        # append a copy of the last row and carry the remainder there
        new_df = pd.concat([new_df, new_df.iloc[[-1]]])
        new_df.iloc[-1, -1] = temp - 10
    return new_df

print(df.groupby('mat', group_keys=False).apply(split_max_time))
     mat  time
0    101    10
0    101    10
1    102     7
2    103    10
2    103     5
You could .groupby('mat') and .apply() a combination of integer division and modulo operation using the cutoff (10) to decompose each time value into the desired components:
cutoff = 10

def decompose(time):
    t = time.iloc[0]
    components = [cutoff] * (t // cutoff) + [t % cutoff]
    return pd.Series([c for c in components if c > 0])

df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
to get:
mat
101 10
101 10
102 7
103 10
103 5
In case you care about performance:
%timeit df.groupby('mat', group_keys=False).apply(split_max_time)
100 loops, best of 3: 4.21 ms per loop
%timeit df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
1000 loops, best of 3: 1.83 ms per loop
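As a further alternative (my own sketch, not taken from either answer above, assuming non-negative integer times and the cutoff t = 10), the expansion can also be done fully vectorized with np.repeat:

import numpy as np

t = 10
full, rem = np.divmod(df['time'].to_numpy(), t)
n_chunks = full + (rem > 0)                     # rows each original row expands into
out = df.loc[df.index.repeat(n_chunks)].copy()  # repeat each row n_chunks times
chunks = np.full(n_chunks.sum(), t)             # start with full chunks of size t
last = np.cumsum(n_chunks) - 1                  # position of each row's last chunk
chunks[last[rem > 0]] = rem[rem > 0]            # overwrite the last chunk with any remainder
out['time'] = chunks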
I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted and the InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
import pandas as pd
import numpy as np

df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1, len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10), 1, 0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are two sorted time blocks in the Time column, and going row by row we know the distance (in rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while in the previous row it was 10 or more. Is it possible to do this without loops, using vectorized operations such as shift(), so that it runs much faster?
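For reproducibility, here is one way to build the example frame above (my own reconstruction):

import pandas as pd

df = pd.DataFrame({'Time': [5, 8, 9, 19, 20, 3, 8, 13],
                   'InvInstance': [5, 4, 3, 2, 1, 3, 2, 1]})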
I don't know of a built-in vectorized Pandas/NumPy method for shifting a Series/array by a non-scalar (vector) step, but we can use Numba here:
import numpy as np
from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i + step[i] - 1]
    return res

mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2, 1, 0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for an 8,000-row DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
    ...: df["10mMark"] = 0
    ...: for i in range(1, len(df)):
    ...:     r = int(df.InvInstance.iloc[i])
    ...:     rprev = int(df.InvInstance.iloc[i-1])
    ...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    ...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    ...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10), 1, 0)
    ...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
Actually, your m is the time delta between the time of a row and the time at the end of its 'block', and mprev is the same thing for the previous row (so it is in fact a shift of m). My idea is to create a column containing the time at the end of each block: first identify each block, then merge in the last time of each block obtained with a groupby on block. Then compute the difference to create a column m, and finally use np.where and shift to fill the column 10mMark.
# a column with an incremental value at each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so every row of a block carries the same block number
df['block'] = df['block'].bfill()
# merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block',
              suffixes=('', '_last'), how='left')
# create column m as the difference
df['m'] = df['Time_last'] - df['Time']
# now use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns
df = df.drop(['block', 'Time_last', 'm'], axis=1)
Your final result before dropping the helper columns, to see what has been created, looks like:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result
It is not as efficient as @MaxU's solution with Numba, but with a df of 8000 rows as he used, I get a speed-up factor of about 350.
I have a pandas df which looks more or less like this:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am now trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each line I create a window of length x going back in the data and count the occurrences of the particular key in that window. I did an implementation, but by my estimate the calculation for 23 different windows would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. However, I have the uneasy feeling that iteration is probably not the smartest way to do this aggregation. Is there a way to implement it so that it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
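If the clipboard isn't convenient, the same frame can be built explicitly (a sketch of mine; the index is set to ID to match the expected output below):

import pandas as pd

df = pd.DataFrame({'ID': range(1, 11),
                   'key': [57, 22, 12, 45, 94, 36, 38, 94, 94, 38],
                   'dist': [1] * 10}).set_index('ID')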
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of times key appears in df prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gains will likely be somewhat diminished compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
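One caveat (my note, not part of the answer above): both versions count every earlier occurrence of a key, whereas the OP's loop only looks back 200 rows. A sketch that keeps that look-back window, assuming the rows are in time order and the default integer positions can stand in for the index:

import numpy as np
import pandas as pd

window = 200
pos = pd.Series(np.arange(len(df)), index=df.index)   # integer row positions
# per key: how many earlier occurrences sit at or before (position - window)?
too_old = pos.groupby(df['key']).transform(
    lambda s: np.searchsorted(s.to_numpy(), s.to_numpy() - window, side='right'))
# all earlier occurrences of the key, minus the ones that fell out of the window
df['window1'] = df.groupby('key').cumcount() - too_old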
I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.
Timestamp Boolean_condition
1 1
2 1
3 0
4 1
5 1
6 1
7 0
8 0
9 1
10 0
How to count the total number of time intervals within the whole series that meet this condition?
The desired output should look like this:
Timestamp Boolean_condition Event_number
1 1 1
2 1 1
3 0 NaN
4 1 2
5 1 2
6 1 2
7 0 NaN
8 0 NaN
9 1 3
10 0 NaN
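For reproducibility (my own construction of the frame shown above):

import pandas as pd

df = pd.DataFrame({'Timestamp': range(1, 11),
                   'Boolean_condition': [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})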
You can create a Series from the cumulative sum of two masks, and then set the NaN values with Series.mask:
mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
Name: Boolean_condition, dtype: int32
df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
Timestamp Boolean_condition Event_number
0 1 1 1.0
1 2 1 1.0
2 3 0 NaN
3 4 1 2.0
4 5 1 2.0
5 6 1 2.0
6 7 0 NaN
7 8 0 NaN
8 9 1 3.0
9 10 0 NaN
Timings:
# [100000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return df

def jez1(df):
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return df
In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop
In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop
In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
You could try the following:
1) Get all values where the condition is True (here, 1); call this isone.
2) Take its corresponding set of indices and convert them to a series, so that the new series has these indices as both its index and its values. Compute the difference between successive rows and check whether it equals 1. This becomes our boolean mask.
3) Compare isone with the obtained boolean mask and take the cumulative sum wherever they are not equal (an adjacency check between elements). This gives us the group labels.
4) Using loc with the indices of isone, assign the codes obtained from converting the grp array to Categorical to a newly created column, Event_number.
isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
Faster approach:
Using only NumPy:
1) Get the column's array representation.
2) Compute the indices of the non-zero (here, 1) entries.
3) Insert NaN at the beginning of this array to act as a starting point when taking differences of successive rows.
4) Initialize a new array filled with NaNs, of the same shape as the original array.
5) Whenever the difference between successive rows is not equal to 1, take the cumulative sum; otherwise the entries fall in the same group. These values are imputed at the indices where there were 1's before.
6) Assign these back to the new column.
def nick(df):
    b = df.Boolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df
Timings:
For a DF of 10,000 rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(10000),
    Boolean_condition=np.random.choice(np.array([0, 1]), 10000, p=[0.4, 0.6]))
)
df1.shape
# (10000, 2)
def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return df
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop
%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop
For a DF containing 1 million rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(1000000),
    Boolean_condition=np.random.choice(np.array([0, 1]), 1000000, p=[0.4, 0.6]))
)
df1.shape
# (1000000, 2)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop
%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
This should work but might be a bit slow for a very long df.
df = pd.concat([df, pd.Series([0] * len(df), name='2')], axis=1)

if df.iloc[0, 1] == 1:
    counter = 1
    df.iloc[0, 2] = counter
else:
    counter = 0
    df.iloc[0, 2] = 0

previous = df.iloc[0, 1]
for y, x in df.iloc[1:, ].iterrows():
    print(y)
    if x[1] == 1 and previous == 1:
        previous = x[1]
        df.iloc[y, 2] = counter
    if x[1] == 0:
        previous = x[1]
        df.iloc[y, 2] = 0
    if x[1] == 1 and previous == 0:
        counter += 1
        previous = x[1]
        df.iloc[y, 2] = counter
A custom function does the trick. Here is a solution in MATLAB code:
Boolean_condition = [1 1 0 1 1 1 0 0 1 0];
Event_number = NaN(1, 10);
loop_event_number = 1;
for timestamp = 1:10
    if Boolean_condition(timestamp) == 1
        Event_number(timestamp) = loop_event_number;
        last_event_number = loop_event_number;
    else
        loop_event_number = last_event_number + 1;
    end
end
% Event_number = 1 1 NaN 2 2 2 NaN NaN 3 NaN
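For comparison with the rest of this thread, here is a rough Python translation of the same loop (my own sketch; it reproduces the NaN-for-zero convention of the desired output):

import numpy as np

boolean_condition = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
event_number = np.full(len(boolean_condition), np.nan)
loop_event_number = 1
last_event_number = 1
for i, flag in enumerate(boolean_condition):
    if flag == 1:
        event_number[i] = loop_event_number
        last_event_number = loop_event_number
    else:
        loop_event_number = last_event_number + 1
# event_number -> [1, 1, nan, 2, 2, 2, nan, nan, 3, nan]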
I have a dataframe like this:
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
a b c
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
And I would like to create another column in this dataframe which stores, for every row, which other row gives the highest score when a dot product is performed against it.
For instance for the first row we'll compute the dot products against the other rows:
df1.drop(0).dot(df1.loc[0]).idxmax()
output: 3
Therefore I can create a function:
def get_highest(dataframe):
    lis = []
    for row in dataframe.index:
        temp = dataframe.drop(row).dot(dataframe.loc[row])
        lis.append(temp.idxmax())
    return lis
And I get what I want with:
df1['highest'] = get_highest(df1)
output:
a b c highest
0 1 5 9 3
1 2 6 10 3
2 3 7 11 3
3 4 8 12 2
OK, that's working, but the problem is that it doesn't scale AT ALL. Here are the timeit outputs for different numbers of rows:
4 rows: 2.87 ms
40 rows: 77.1 ms
400 rows: 700 ms
4000 rows: 10.4s
And I have to perform this on a dataframe which has roughly 240k rows and 3.3k columns. Therefore here is my question: Is there a way to optimize this calculation? (likely by addressing it in another way)
Thank you in advance.
Do a matrix multiplication with the transpose:
mat_mul = np.dot(df.values, df.values.T)
Fill diagonals with a small number so they cannot be the maximum (I assumed all positive, so filled with -1 but you can change this):
np.fill_diagonal(mat_mul, -1)
Now take the argmax of the array:
df['highest'] = mat_mul.argmax(axis=1)
Timings on a 10k by 4 df:
%%timeit
mat_mul = np.dot(df.values, df.values.T)
np.fill_diagonal(mat_mul, -1)
df['highest'] = mat_mul.argmax(axis=1)
1 loop, best of 3: 782 ms per loop
%timeit df['highest'] = get_highest(df)
1 loop, best of 3: 9.8 s per loop
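One practical note (mine, not part of the answer): for the 240k-row frame from the question, the full 240k x 240k product would need hundreds of gigabytes, so the same idea can be applied in row chunks; rowwise_best_match is a name I made up for this sketch:

import numpy as np

def rowwise_best_match(X, chunk=1000):
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    best = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        block = X[start:stop] @ X.T   # dot products of this chunk against all rows
        # exclude self-matches before taking the argmax
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        best[start:stop] = block.argmax(axis=1)
    return best

# df1['highest'] = rowwise_best_match(df1.values)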
Since the dot products are repeated for each pair in flipped order, the final dot-product array of each row against every other row is symmetric. So we can calculate just the upper (or lower) triangular dot products and then recover the full form with scipy's squareform. Thus, we would have an implementation like so -
import numpy as np
from scipy.spatial.distance import squareform

arr = df1.values
R, C = np.triu_indices(arr.shape[0], 1)
df1['highest'] = squareform(np.einsum('ij,ij->i', arr[R], arr[C])).argmax(1)
Output for sample case -
In [145]: df1
Out[145]:
a b c highest
0 1 5 9 3
1 2 6 10 3
2 3 7 11 3
3 4 8 12 2
Starting with this question as a base:
Python Pandas: remove entries based on the number of occurrences
import pandas

data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
# pid tag
# 0 1 23
# 1 1 45
# 2 1 62
# 3 2 24
# 4 2 45
# 5 3 34
# 6 3 25
# 7 3 62
g = data.groupby('tag')
g.filter(lambda x: len(x) > 1)  # keeps only groups with more than one row
# pid tag
# 1 1 45
# 2 1 62
# 4 2 45
# 7 3 62
#This would create a new object g:
g = g.filter(lambda x: len(x) > 1) #where g is now a dataframe.
I was wondering: is there a way to filter out 'groups' by deleting them from the original object g? And would it be faster than creating a new object from the filtered groupby?
There are only so many ways you can solve this problem. My answer includes 4 solutions. I am sure there are other ways; maybe other answers will present a better one.
Solution #1:
data = data.groupby('tag').filter(lambda x: len(x) > 1)
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
Solution #2:
data['count'] = data.groupby(['tag']).transform('count')
data.loc[data['count'] == 2]
pid tag count
1 1 45 2
2 1 62 2
4 2 45 2
7 3 62 2
Solution #3:
If you want to delete the rows instead, you can use .index.tolist() and then drop().
data['count'] = data.groupby(['tag']).transform('count')
data.drop(data[data['count'] != 2].index.tolist())
pid tag count
1 1 45 2
2 1 62 2
4 2 45 2
7 3 62 2
Solution #4:
data['count'] = data.groupby(['tag']).transform('count')
g = data.groupby('count')
data.loc[g.groups[2],('tag','pid')]
tag pid
1 45 1
2 62 1
4 45 2
7 62 3
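A small generalization (my addition, not part of the answer above): solutions #2-#4 hard-code the group size 2; the "more than one occurrence" condition can be written directly so it also covers larger groups:

data['count'] = data.groupby('tag')['pid'].transform('count')
data.loc[data['count'] > 1]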
A couple of options (yours is at the bottom):
This first one works in place and is as quick as I could make it. It's a bit quicker than your solution, but not by virtue of dropping rows in place. I can get even better performance with the second option, which does not modify the data in place.
%%timeit
data = pd.DataFrame(
{'pid' : [1,1,1,2,2,3,3,3],
'tag' : [23,45,62,24,45,34,25,62],
})
mask = ~data.duplicated(subset=['tag'], keep=False)
data.drop(mask[mask].index, inplace=True)
data
1000 loops, best of 3: 1.16 ms per loop
%%timeit
data = pd.DataFrame(
{'pid' : [1,1,1,2,2,3,3,3],
'tag' : [23,45,62,24,45,34,25,62],
})
data = data.loc[data.duplicated(subset=['tag'], keep=False)]
data
1000 loops, best of 3: 719 µs per loop
%%timeit
data = pd.DataFrame(
{'pid' : [1,1,1,2,2,3,3,3],
'tag' : [23,45,62,24,45,34,25,62],
})
g = data.groupby('tag')
g = g.filter(lambda x: len(x) > 1)
g
1000 loops, best of 3: 1.55 ms per loop