Get nth row after applying lambda on groupby in python

So I need to group a dataframe by its SessionId, then sort each group by the created time, and finally retrieve only the nth row of each group.
But I found that after applying the lambda the result is a DataFrame instead of a groupby object, so I cannot use the .nth property:
grouped = df.groupby(['SessionId'])
sorted = grouped.apply(lambda x: x.sort_values(["Created"], ascending=True))
sorted.nth  # ---> error: apply() returned a plain DataFrame, which has no .nth

Changing the order of operations will help here: if you first sort and then use groupby, you get the desired output and can use the groupby.nth function.
Here is a code snippet to demonstrate the idea:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'var1': [3, 2, 1, 8, 7, 6],
                   'var2': ['g', 'h', 'i', 'j', 'k', 'l']})
n = 2  # replace with the required row from each group
df.sort_values(['id', 'var1']).groupby('id').nth(n).reset_index()
Assuming id is your SessionId and var1 is the timestamp, this sorts your dataframe by id and then var1, then picks the nth row from each of the sorted groups. The reset_index() is there just to avoid ending up with the group key as the index.
If you want to get the last n rows of each group, you can use .tail(n) instead of .nth(n).
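For instance, continuing with the toy frame above and n = 2, a quick sketch:
# .nth(2) returns one row per group; .tail(2) returns the last two rows per group
df.sort_values(['id', 'var1']).groupby('id').tail(2)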

I have created a small dataset -
n = 2
grouped = df.groupby('SessionId')
pd.concat([grouped.get_group(x).sort_values(by='SortVar').reset_index().loc[[n]]
           for x in grouped.groups], axis=0)
This will return -
Please note that in Python indexes start from zero, so for n=2 it will give you the 3rd row of the sorted data.

Related

Python: Lambda function with multiple conditions based on multiple previous rows

I am trying to define a lambda function that assigns True or False to a row based on various conditions.
There is a column with a Timestamp, and what I want is that if some specific values occurred in other columns of the dataset within the last 10 seconds (based on the timestamp of the current row x), the current row x gets the True or False tag.
So basically I have to check whether value a occurred in column A and value b occurred in column B in the previous n rows, i.e. between Timestamp(x) - 10 seconds and Timestamp(x).
I already looked at the shift() function with freq = 10 seconds, and another attempt looked like this:
data['Timestamp'][(data['Timestamp']-pd.Timedelta(seconds=10)):data['Timestamp']]
But I wasn't able to proceed with either of the two options.
Is it possible to start an additional select within a lambda function? If yes, what could that look like?
P.S.: Working with regular for-loops instead of the lambda function is not an option due to the overall setup of the application/code.
Thanks for your help and input!
Perhaps you're looking for something like this, if I understood correctly:
def create_tag(current_timestamp, df, cols_vals):
    # Before the current timestamp
    mask = (df['Timestamp'] <= current_timestamp)
    # After the current timestamp - 10s
    mask = mask & (df['Timestamp'] >= current_timestamp - pd.to_timedelta('10s'))
    # Filter the dataframe with the mask
    filtered = df[mask]
    # Check whether each value is present in its column within the window
    present = all(value in filtered[column_name].values
                  for column_name, value in cols_vals.items())
    return present

data['Tag'] = data['Timestamp'].apply(
    lambda x: create_tag(x, data, {'column A': 'a', 'column B': 'b'}))
The idea behind this code is that, for each timestamp you have, we apply the create_tag function. It takes the current timestamp, the whole dataframe, and a dictionary mapping column names to the respective values you're looking for.
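To make the idea concrete, here is a small made-up frame run through the sketch above; the timestamps and cell values are invented for illustration:
import pandas as pd

data = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:04',
                                 '2021-01-01 00:00:08', '2021-01-01 00:00:30']),
    'column A': ['a', 'x', 'x', 'a'],
    'column B': ['x', 'b', 'x', 'x'],
})
data['Tag'] = data['Timestamp'].apply(
    lambda x: create_tag(x, data, {'column A': 'a', 'column B': 'b'}))
# The rows at 00:00:04 and 00:00:08 come out True: both 'a' (00:00:00) and
# 'b' (00:00:04) fall inside their 10-second windows; the other rows are False.
print(data)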

Find the next row within a group based on substring condition on a column - Pandas

I'm trying to get the next row within a group based on a substring condition.
The code below ignores the grouping:
df_shifted_rows = df[df.groupby(['id'])['url']
                     .apply(lambda x: x.str.contains("confirmation"))
                     .shift(1).fillna(False)]
If the matching value is in the last row of the current group, the shift should give me null. But it moves into the next group and flags the first row of that group.
Split it into two steps:
m = df['url'].str.contains("confirmation")
df = df[m.groupby(df['id']).shift(1).fillna(False)]  # fillna: shift leaves NaN at each group start
Or, tweaking the existing code, bring the shift (and the fillna) inside the apply:
df_shifted_rows = df[df.groupby(['id'])['url']
                     .apply(lambda x: x.str.contains("confirmation")
                                       .shift(1).fillna(False))]
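No sample data was posted, so here is a made-up toy frame showing the difference between the plain and the grouped shift:
import pandas as pd

df = pd.DataFrame({
    'id':  [1, 1, 2, 2],
    'url': ['/home', '/confirmation', '/next', '/done'],
})
m = df['url'].str.contains("confirmation")

# Ungrouped shift leaks across groups: it selects '/next', the first row of id 2
print(df[m.shift(1).fillna(False)])

# Grouped shift keeps the match inside its own group: nothing is selected here,
# because the match sits in the last row of id 1
print(df[m.groupby(df['id']).shift(1).fillna(False)])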

Compare values in a row and write result in new column

My dataset looks like this:
Paste_Values AB_IDs AC_IDs AD_IDs
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5
I need to compare the value in the Paste_Values column with the other three values in each row.
For example:
AE-1001-4 is split into two parts, AE and 1001-4, and we need to check whether 1001-4 is present in the other columns.
If it is not present, we create a new column and put the same AE-1001-4 in it.
If 1001-4 matches one of the other columns, we need to change it, e.g. into 'AE-1001-5', and put that in the new column.
After:
If there is no match, I need to write the value of Paste_Values as-is into a newly created column named new_paste_value.
If there is a match (the same value) in another column of the same row, then I need to change the last digit of the Paste_Values value so that it no longer equals any other value in the row, and write that newly generated value into the new_paste_value column.
I need to do this for every row in the data frame.
So the result should look like:
Paste_Values AB_IDs AC_IDs AD_IDs new_paste_value
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2 AE-1001-4
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1 AE-1964-3
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2 AE-2211-4
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5 AE-2182-1
How can I do it?
Start by defining a function to be applied to each row of your DataFrame:
def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')  # First value
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1                # No match
    v1a = v1[3:-1]               # Central part of v1, e.g. '1001-'
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch
    return '????'                # No candidate found
A bit of explanation:
The row argument is actually a Series, with index values taken from column names. So rr.pop('Paste_Values') removes the first value, which is saved in v1, while the rest remains in rr. Then v1[3:] extracts the "rest" of v1 (without "AE-"), and str.contains checks whether each element of rr contains this string at its end (the $ anchors the match to the end of the string).
With this explanation, the rest of the function should be quite understandable. If not, execute each instruction individually and print its result.
Then the only thing left to do is to apply this function to your DataFrame, assigning the result to a new column:
df['new_paste_value'] = df.apply(fn, axis=1)
To run a test, I created the following DataFrame:
df = pd.DataFrame(data=[
    ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2'],
    ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1'],
    ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2'],
    ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5']],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs'])
I received no error on this data, so run a test on the data above first. Maybe the source of your error is somewhere else? For example, your DataFrame may also contain other (float) columns which you didn't include in your question. If that is the case, run my function on a copy of your DataFrame with these "other" columns removed.
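For instance, a sketch assuming the ID columns are named as in the question:
id_cols = ['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs']
df['new_paste_value'] = df[id_cols].apply(fn, axis=1)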

Replace all column values based on another dataframe's index

I am trying to map the values of one df column with values in another one.
First df contains football match results:
Date|HomeTeam|AwayTeam
2009-08-15|0|2
2009-08-15|18|15
2009-08-15|20|10
Second df contains teams and has only one column:
TeamName
Arsenal
Bournetmouth
Chelsea
The desired end result is the first df (the matches) with team names instead of numbers in "HomeTeam" and "AwayTeam". The numbers in the first df are row indexes into the second one.
I've tried ".replace":
for item in matches.HomeTeam:
    matches = matches.replace(to_replace=matches.HomeTeam[item], value=teams.TeamName[item])
It did replace the values for some items (~80% of them), but ignored the other ones. I could not find a way to replace the other values.
Please let me know what I did wrong and how this can be fixed. Thanks!
Maybe try using applymap:
df[['HomeTeam', 'AwayTeam']] = df[['HomeTeam', 'AwayTeam']].applymap(lambda x: teams['TeamName'].tolist()[x])
Now print(df) shows team names in both columns instead of the numbers. (On pandas 2.1+, DataFrame.map is the non-deprecated spelling of applymap.)
I assume that teams is also a DataFrame, something like:
teams = pd.DataFrame(data=[['Team_0'], ['Team_1'], ['Team_2'], ['Team_3'],
                           ['Team_4'], ['Team_5'], ['Team_6'], ['Team_7'],
                           ['Team_8'], ['Team_9']], columns=['TeamName'])
but you failed to include the index in the provided sample (actually, in both samples).
Then my proposition is:
matches.set_index('Date')\
       .applymap(lambda id: teams.loc[id, 'TeamName'])\
       .reset_index()
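As a side note: since the team numbers are positional indexes into teams, plain Series.map (which maps through the index of teams['TeamName']) gives the same result without applymap — a sketch assuming the frames above:
matches['HomeTeam'] = matches['HomeTeam'].map(teams['TeamName'])
matches['AwayTeam'] = matches['AwayTeam'].map(teams['TeamName'])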

Pandas groupby shifting and count at the same time

Basically I am trying to take the previous row for the combination ['dealer','State','city']. If there are multiple rows for a combination, I get the shifted value of that combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, as the group keys are all null values. Any suggestions?
I am trying to get NewColumn values like below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C', 'State', 'city'])
                         ['ShiftBY_D_S_C'].transform("count")) + 1
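Alternatively, on pandas 1.1+ you can keep the NaN keys in the groupby itself with dropna=False; all-NaN groups then count 0 non-null values, so the +1 yields the desired 1 without the special case (a sketch, not tested against the asker's data):
df['NewColumn'] = df.groupby(['ShiftBY_D_S_C', 'State', 'city'],
                             dropna=False)['ShiftBY_D_S_C'].transform("count") + 1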
