Get first occurrence after specific row in DataFrame - python

My questions regarding this topic are twofold:
I am using a Dataframe to manage a large amount of data in the following format:
Time  Data
1     Start
2     1
3     2
4     3
5     R
6     A
7     Start
8     3
9     R
The goal is to have a new column storing the time difference from the first message after a 'Start' message until the first 'R' message. This 'Start'-to-'Start' pattern repeats thousands of times in the dataframe. I am marking the rows that immediately follow a 'Start' message using the following code:
df['Shift'] = df['Data'].shift()
df['first_in_cycle'] = df['Shift'].str.contains('Start')
df.drop(columns='Shift', inplace=True)
Then I attempt to mark the first occurrence of an 'R' using the following code:
counter = (df['Data'].str.contains('Start')).cumsum()
first_r = df[df['Data'].str.contains('R')].groupby(counter).transform('idxmin')
df.loc[first_r.index, 'first_R'] = True
However, the result is a dataframe with all of the 'R' messages marked as True, which is incorrect. I am unsure how to fix this issue.
The plan was to get both flag columns correct, merge them using .any(), remove all rows marked False, and then use .diff() to calculate the difference between the 'R' and 'Start' times. Is this the best way to accomplish this?
The following is the desired output given the initial example:
Time  Data   Time Difference
1     Start
2     1
5     R      3
7     Start
8     3
9     R      1

Use:
# create a default index if necessary
df = df.reset_index(drop=True)
# flag the Start and R rows
m1 = df['Data'].str.contains('Start')
m2 = df['Data'].str.contains('R')
# group identifier: each Start begins a new group
g = m1.cumsum()
# mask for the first R in each group
mask1 = m2.groupby(g).transform('idxmax').eq(df.index)
# mask for the Start row and the row immediately after it
mask2 = m1.groupby(g).shift(fill_value=True)
# boolean indexing
df = df[mask1 | mask2]
# difference per group, blanked on the Start and first-after-Start rows
df['Time Difference'] = df['Time'].groupby(g).diff().mask(mask2)
print (df)
Time Data Time Difference
0 1 Start NaN
1 2 1 NaN
4 5 R 3.0
6 7 Start NaN
7 8 3 NaN
8 9 R 1.0
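For reference, a minimal construction of the example frame (values copied from the question's table), so the snippet above can be run end to end:
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Data': ['Start', '1', '2', '3', 'R', 'A', 'Start', '3', 'R'],
})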

Related

Define column values to be selected / disselected as default

I would like to automate selecting values in one column - Step_ID.
Instead of defining which Step_IDs I would like to filter on (shown in the code below), I would like to specify that the first Step_ID and the last Step_ID are to be excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example Step_1 and Step_25.
Or include all values except the first and the last value? In this example Step_2-Step_24.
The reason for this is that the files have different numbers of 'Step_ID' values.
So that I don't have to redefine the list all the time, I would like a solution that simplifies this filtering. It is necessary to exclude the first and last value in the column 'Step_ID', but the number of Step_IDs is always different.
Given Step_1 - Step_X, I need Step_2 - Step_(X-1).
Use:
df = pd.DataFrame({
'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
'Step_6','Step_6'],
'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all index values except the first and last ones, extracted by slicing df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]].tolist())]
print (df)
B
Step_ID
Step_2 2
Step_2 3
Step_3 4
Step_4 5
Step_5 6
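If you would rather not move Step_ID into the index, the same filter can be sketched directly against the column (assuming Step_ID is still an ordinary column of df):
# values of the first and last Step_ID in the column
first_last = df['Step_ID'].iloc[[0, -1]].tolist()
# keep every row except those carrying the first or last value
df = df[~df['Step_ID'].isin(first_last)]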

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage: Pandas get topmost n records within each group. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe only has 3 0 indices and 2 1 indices, in each case half the number in the original dataframe.
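One caveat, based on my reading of "top": the flags above keep the first n% of rows of each group in their existing order. If "top" should instead mean the largest values of some column, you could sort before building the flags; a minimal sketch ranking by column 1 of the same example frame:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))        # same example frame as above
df = df.sort_values(1, ascending=False, kind='mergesort')  # stable sort by the ranking column
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
df = df.loc[flags].set_index(0).sort_index()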
Here is another option which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of the rows of a dataframe 8 rows long, then we would try to take 2.4 rows, so we will need to round either up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only a single row, we would still keep that row. I kept this separate so that you can change the rounding as you wish.
def round_func(x, up=True):
    '''Round a float up or down to an integer'''
    if up:
        return int(x) + (x % 1 > 0)  # add 1 only when there is a fractional part
    else:
        return int(x)
Next I make a dataframe to work with and set a parameter p, the fraction of rows from each group that we should keep. Everything else follows, and I have commented it so that hopefully you can follow along.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30  # top fraction to keep, currently set to 30%
df_top = df.groupby('id').apply(                       # group by the ids
    lambda x: x.reset_index()['value'].nlargest(       # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))              # calculate how many rows to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)  # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
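For comparison, here is a sketch that combines both ideas (largest values per group, count rounded up) without groupby.apply; it assumes the same df and p defined above:
import numpy as np

order = df.sort_values('value', ascending=False, kind='mergesort')  # rank rows within each id
keep = order.groupby('id').cumcount() < np.ceil(order.groupby('id')['value'].transform('size') * p)
df_top_alt = order[keep].sort_index()  # selects the same underlying rows as df_top, in original order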

Pandas dataframe , using iloc to replace last row

I'm trying to replace the last row(s) of a Pandas dataframe using iloc, but I cannot get it to work. There are lots of solutions out there, but the simplest (and slowest) is here:
How to do a FIFO push-operation for rows on Pandas dataframe in Python?
Why doesn't this method work in the code below ?
def append_from_dataframe(self, timeframe, new_dataframe):
    new_dataframe.reset_index(inplace=True)
    temp_dataframe = self.timeframedict.get(timeframe)
    num_rows_existing = temp_dataframe.shape[0]
    num_rows_new = new_dataframe.shape[0]
    overlap = (num_rows_existing + num_rows_new) - 500
    # slow, replace with numpy array eventually
    if overlap >= 1:
        # number of rows to shift
        i = overlap * -1
        # shift the dataframe back in time
        temp_dataframe = temp_dataframe.shift(i)
        #self.timeframedict.get(timeframe) = self.timeframedict.get(timeframe).shift(overlap)
        # replace the last i rows with the new values
        temp_dataframe.iloc[i:] = new_dataframe
        self.timeframedict.update({timeframe: temp_dataframe})
    else:
        # TODO - see this https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe
        self.timeframedict.update({timeframe: self.timeframedict.get(timeframe).append(new_dataframe)})
Contents of the dataframe to replace one row in the other:
ipdb> new_dataframe
Timestamp Open High Low Close Volume localtime
0 1533174420000 423.43 423.44 423.43 423.44 0.73765 1533174423776
temp_dataframe.shift(i) shifts the values back and fills the vacated last rows with NaN:
ipdb> temp_dataframe.iloc[i:]
Timestamp Open High Low Close Volume localtime
499 NaN NaN NaN NaN NaN NaN NaN
However, temp_dataframe.iloc[i:] = new_dataframe does not replace anything.
Edit: I should add that after some more playing around, I can now replace one row with:
temp_dataframe.iloc[-1] = new_dataframe.iloc[0]
However, I cannot get the multiple-row version to work.
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': ['foo','bar','foobar','foobaz','food']})
Output:
df
Out[117]:
a b
0 1 foo
1 2 bar
2 3 foobar
3 4 foobaz
4 5 food
Replace the last two rows (foobaz and food) with the second and first rows, respectively:
df.iloc[-2:]=[df.iloc[1],df.iloc[0]]
df
Out[119]:
a b
0 1 foo
1 2 bar
2 3 foobar
3 2 bar
4 1 foo
You can also do this to achieve the same result:
df.iloc[-2:]=df.iloc[1::-1].values
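As a likely explanation for the original problem (my reading, based on how pandas assignment works): when a whole DataFrame is assigned into an .iloc slice, pandas aligns it by index labels, and new_dataframe carries a fresh 0-based index that does not match the labels of the last rows, so those rows are left as NaN. Assigning the underlying values sidesteps the alignment:
# hypothetical fix for the question's method: new_dataframe needs exactly abs(i)
# rows, and .values strips the index so no label alignment takes place
temp_dataframe.iloc[i:] = new_dataframe.values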

Avoiding looping through pandas dataframe for feature generation

I'm working on some code that generates features from a dataframe and adds these features as columns to the dataframe.
The trouble is I'm working with a time series so that for any given tuple, I need (let's say) 5 of the previous tuples to generate the corresponding feature for that tuple.
lookback_period = 5
df['feature1'] = np.zeros(len(df))  # preallocate
for index, row in df.iterrows():
    if index < lookback_period:
        continue
    slice = df[index - lookback_period:index]
    some_int = SomeFxn(slice)
    row['feature1'] = some_int
Is there a way to execute this code without explicitly looping through every row and then slicing?
One way is to create several lagged columns using df['column_name'].shift() such that all the necessary information is contained in each row, but this quickly gets intractable for my computer's memory since the dataset is large (millions of rows).
I don't have enough reputation to comment so will just post it here.
Can't you use apply for your dataframe? E.g.
df['feature1'] = df.apply(someRowFunction, axis=1)
where someRowFunction will accept the full row, and you can perform whatever row-based slicing and logic you want.
--- updated ---
As we do not have much information about the dataframe or the required/expected output, I based the answer on the information from the comments.
Let's define a function that takes a DataFrame slice (based on the current row index and the lookback) together with the row itself, and returns the sum of the slice's first column plus the current row's b value.
def someRowFunction(slice, row):
    if slice.shape[0] == 0:
        return 0
    return slice[slice.columns[0]].sum() + row.b

d = {'a': [1,2,3,4,5,6,7,8,9,0], 'b': [0,9,8,7,6,5,4,3,2,1]}
df = pd.DataFrame(data=d)
lookback = 5

df['c'] = df.apply(lambda current_row: someRowFunction(
    df[current_row.name - lookback:current_row.name], current_row), axis=1)
We can get the row index inside apply from its name attribute, and with it we can retrieve the required slice. The above results in the following:
print(df)
a b c
0 1 0 0
1 2 9 0
2 3 8 0
3 4 7 0
4 5 6 0
5 6 5 20
6 7 4 24
7 8 3 28
8 9 2 32
9 0 1 36
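Note that apply still iterates row by row internally. For the specific feature above (the sum of the previous lookback values of column a plus the current b), a vectorized sketch is possible; this is tailored to the toy data and is not a drop-in replacement for an arbitrary SomeFxn:
# rolling sum of the previous `lookback` rows of 'a' plus the current 'b';
# rows without a full window fall back to 0, matching column c above
df['c_vec'] = (df['a'].rolling(lookback).sum().shift(1) + df['b']).fillna(0).astype(int)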

Split dataframe into chunks and add them to a multiindex

I have an indexed dataframe with 77,000 rows.
I want to group every 7,000 rows under a higher-level index, making 11 groups in the new index level.
I know that I could loop through all the indexes, build tuples, and assign them with the pandas MultiIndex.from_tuples method.
Is there a more elegant way to do this simple thing?
You could use the pd.qcut function to create a new column that you can add to the index.
Here is an example that creates five groups/chunks:
import pandas as pd

df = pd.DataFrame({'data': range(1, 10)})
df['chunk'] = pd.qcut(df.data, 5, labels=range(1, 6))
df.set_index('chunk', append=True, inplace=True)
df
         data
  chunk
0 1         1
1 1         2
2 2         3
3 2         4
4 3         5
5 4         6
6 4         7
7 5         8
8 5         9
You would do df['chunk'] = pd.qcut(df.index, 11) to get your chunks assigned to your dataframe.
The code below creates an ordered column in the range 0-10, which is tiled up to the length of your DataFrame. Since you want to group based on your old index plus your new folds, you first need to reset the index before performing a groupby.
groups = 11
folds = list(range(groups)) * (len(df) // groups + 1)  # Python 3: convert range to a list before repeating
df['folds'] = folds[:len(df)]
gb = df.reset_index().groupby(['old_index', 'folds'])
Where old_index is obviously the name of your index.
If you prefer to have sequential groups (e.g. the first 7k rows, the next 7k rows, etc.), then you can do the following:
df['fold'] = [i // (len(df) // groups) for i in range(len(df))]
Note: The // operator is for floor division to truncate any remainder.
Another way is to use the integer division // assuming that your dataframe has the default integer index:
import pandas as pd
import numpy as np
# data
# ===============================================
df = pd.DataFrame(np.random.randn(10), columns=['col'])
df
# processing
# ===============================================
df['chunk'] = df.index // 5
df.set_index('chunk', append=True)
            col
  chunk
0 0      2.0955
1 0     -1.2891
2 0     -0.3313
3 0      0.1508
4 0     -1.0215
5 1      0.6051
6 1     -0.3227
7 1     -0.6394
8 1     -0.7355
9 1      0.5949
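Applying the same integer-division idea to the sizes in the question (77,000 rows in chunks of 7,000, i.e. 11 groups), a minimal sketch assuming the rows should be chunked by position:
import numpy as np

df['chunk'] = np.arange(len(df)) // 7000             # 0 .. 10 for 77,000 rows
df = df.set_index('chunk', append=True).swaplevel()  # make 'chunk' the outer index level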
