For each user, I want to compare the scores of the first and last row (there are only two rows per user) and see whether the score increases or decreases between the two sessions.
I have a dataset that looks like this:
User Session Score
1    1       4
1    2       5
2    1       5
2    2       3
3    1       4
3    2       5
4    1       3
4    2       3
I have no idea how to refer to a row by index under a specific condition like this.
Here is an approach using df.groupby().nth() and GroupBy.agg():
first_last = df.groupby('User').nth([0, -1])
diff = first_last.groupby('User')['Score'].agg(lambda x: x.iloc[-1] - x.iloc[0])
print(diff)
User
1 1
2 -2
3 1
4 0
Name: Score, dtype: int64
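Since there are exactly two rows per user, the nth() step can arguably be skipped and the difference computed in one pass; a minimal sketch, assuming the sample frame above:

import pandas as pd

df = pd.DataFrame({'User': [1, 1, 2, 2, 3, 3, 4, 4],
                   'Session': [1, 2, 1, 2, 1, 2, 1, 2],
                   'Score': [4, 5, 5, 3, 4, 5, 3, 3]})

# Last score minus first score per user: positive = increase, negative = decrease
diff = df.groupby('User')['Score'].agg(lambda s: s.iloc[-1] - s.iloc[0])

# Optionally map the sign to a label
trend = diff.apply(lambda d: 'increase' if d > 0 else 'decrease' if d < 0 else 'no change')
print(trend)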
I have the following data frame:
Group Code
1 2
1 2
1 4
1 1
2 4
2 1
2 2
2 3
2 1
2 1
2 3
Within each group there are consecutive, overlapping pairs. In Group 1, for example, the pairs are (2,2), (2,4), (4,1).
I want to filter these pairs based on code number 2 OR 4 being present at the END of the pair. In Group 1, for example, only (2,2) and (2,4) should be kept, while (4,1) is filtered out.
The code I am using to check for the code number being present at the beginning of the pair is:
df[df.groupby("Group")['Code'].shift().isin([2,4])|df['Code'].isin([2,4])]
Expected output:
Group Code
1 2
1 2
1 4
2 1
2 2
Using your own suggested code, you can modify it to achieve your goal:
idx = df.groupby("Group")['Code'].shift(-1).isin([2,4])
df[idx | idx.shift()]
First you group by 'Group' and shift one step up, then check for the values 2 or 4; idx is therefore True at the first element of every pair whose second element is 2 or 4 (the grouped shift(-1) yields NaN at the last row of each group, so pairs never cross a group boundary). Finally, you keep both elements of each matching pair: the beginning (idx) and the end (idx.shift()).
Output:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
Assuming the data is sorted by Group, you can also do it without groupby() to save some processing and speed things up, as follows (here m flags the second element of a pair, i.e. a code of 2 or 4 in the same Group as the previous row, and m.shift(-1) flags the first element):
m = df['Code'].isin([2,4]) & df['Group'].eq(df['Group'].shift())
df[m | m.shift(-1)]
Result:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
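For reference, a self-contained sketch (building the sample frame above) confirming that both approaches select the same rows:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'Code': [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3]})

# Approach 1: grouped shift(-1) marks the first element of a matching pair
idx = df.groupby('Group')['Code'].shift(-1).isin([2, 4])
out1 = df[idx | idx.shift()]

# Approach 2: no groupby, relying on the data being sorted by Group
m = df['Code'].isin([2, 4]) & df['Group'].eq(df['Group'].shift())
out2 = df[m | m.shift(-1)]

print(out1.equals(out2))  # True on this sample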
I have the following data frame:
order_id amount records
1 2 1
2 5 10
3 20 5
4 1 3
I want to remove rows where the amount is greater than the records, the output should be:
order_id amount records
2 5 10
4 1 3
Here is what I've attempted:
df = df.drop(
df[df.amount > df.records].index, inplace=True)
this is removing all rows, any suggestions are welcome.
Simply filter by:
df = df[df['amount'] <= df['records']]
(note <= rather than <, so that rows where amount equals records are kept, since only rows with amount greater than records should be removed) and you get the desired results:
order_id amount records
1 2 5 10
3 4 1 3
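As a side note on why the original attempt removed everything: DataFrame.drop with inplace=True returns None, so df = df.drop(..., inplace=True) rebinds df to None. Either drop in place without reassigning, or reassign without inplace; a quick sketch:

# Option 1: modify in place, do not reassign
df.drop(df[df.amount > df.records].index, inplace=True)

# Option 2: reassign the returned frame, no inplace
df = df.drop(df[df.amount > df.records].index)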
df.loc[~df.amount.gt(df.records)]
order_id amount records
1 2 5 10
3 4 1 3
Explanation: comparisons return a boolean Series:
~df.amount.gt(df.records)
0 False
1 True
2 False
3 True
dtype: bool
This returns values where amount is not greater than records.
You can use this boolean to index into the dataframe to get your desired values.
Alternatively, you could use the code below as well, without needing the negation (~):
df.loc[df.amount.le(df.records)]
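If you prefer, the same filter can also be written with DataFrame.query; a brief sketch:

# Keep rows where amount is less than or equal to records
df.query('amount <= records')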
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated rows validated both as positive and negative. So, for example, the first 2 rows are duplicated but inconsistent, hence I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of the records. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and, as you can see, the index has been reset. Thanks.
First filter the rows using GroupBy.transform with SeriesGroupBy.nunique to keep, via boolean indexing, only the groups with a single unique value, and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print(df)
a b c
0 2 4 1
1 3 5 0
Detail:
print(df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
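An alternative, likely slower on large frames but arguably more readable, is GroupBy.filter; a sketch on the same sample data:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# Keep only groups where 'c' has a single unique value, then deduplicate
out = (df.groupby(['a', 'b'])
         .filter(lambda g: g['c'].nunique() == 1)
         .drop_duplicates(['a', 'b'])
         .reset_index(drop=True))
print(out)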
I'm trying to create a new dataframe column that acts as a running variable that resets to zero or "passes" under certain conditions. Below is a simplified example of what I'm looking to accomplish. Let's say I'm trying to quit drinking coffee and I'm tracking the number of days in a row I've gone without drinking any. On days where I forgot to note whether I drank coffee, I put "forgot", and my tally is unaffected.
Below is how I'm currently accomplishing this, though I suspect there's a much more efficient way of going about it.
Thanks in advance!
import pandas as pd

Day = [1,2,3,4,5,6,7,8,9,10,11]
DrankCoffee = ['no','no','forgot','yes','no','no','no','no','no','yes','no']
df = pd.DataFrame(list(zip(Day, DrankCoffee)), columns=['Day','DrankCoffee'])

df['Streak'] = 0
s = 0
for index, row in df.iterrows():
    if row['DrankCoffee'] == 'no':
        s += 1
    if row['DrankCoffee'] == 'yes':
        s = 0
    else:
        pass
    df.loc[index, 'Streak'] = s
You can use groupby.transform.
For each streak, what you're looking for is something like this:
def my_func(group):
    return (group == 'no').cumsum()
You can split the data into the different streaks with a simple comparison and cumsum:
streak = (df['DrankCoffee'] == 'yes').cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
Then apply the transform:
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
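Putting the pieces together on the question's df, the whole answer is a short, runnable sketch:

def my_func(group):
    # cumulative count of 'no' days within one streak group
    return (group == 'no').cumsum()

streak = (df['DrankCoffee'] == 'yes').cumsum()  # each 'yes' opens a new group
df['Streak'] = df.groupby(streak)['DrankCoffee'].transform(my_func)
# Streak: 1, 2, 2, 0, 1, 2, 3, 4, 5, 0, 1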
You first need to map DrankCoffee to [0,1] (based on my understanding, 'yes' and 'forgot' should be 0 and 'no' should be 1), then do a groupby cumsum, creating the group key with (df.DrankCoffee=='yes').cumsum(): each 'yes' starts a new round of counting.
df.DrankCoffee.replace({'no':1,'forgot':0,'yes':0}).groupby((df.DrankCoffee=='yes').cumsum()).cumsum()
Out[111]:
0 1
1 2
2 2
3 0
4 1
5 2
6 3
7 4
8 5
9 0
10 1
Name: DrankCoffee, dtype: int64
Use:
df['Streak'] = df.assign(streak=df['DrankCoffee'].eq('no'))\
.groupby(df['DrankCoffee'].eq('yes').cumsum())['streak'].cumsum().astype(int)
Output:
Day DrankCoffee Streak
0 1 no 1
1 2 no 2
2 3 forgot 2
3 4 yes 0
4 5 no 1
5 6 no 2
6 7 no 3
7 8 no 4
8 9 no 5
9 10 yes 0
10 11 no 1
First, create the streak increment: True when DrankCoffee is 'no'.
Next, create the streak groups: each 'yes' starts a new streak, via cumsum().
Lastly, count the increments within each streak, again with cumsum().
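The assign step is arguably avoidable, since one Series can be grouped by another directly; a minimal sketch:

no = df['DrankCoffee'].eq('no').astype(int)        # 1 on a coffee-free day
new_streak = df['DrankCoffee'].eq('yes').cumsum()  # each 'yes' opens a new group
df['Streak'] = no.groupby(new_streak).cumsum()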
OK, I have a dataset of game outcomes that is incomplete, and I want to generate a plot with either the data present or zero values for the players that have no data in that game. Furthermore, I want to categorize the players via lists: some players are attackers and some defenders.
My raw data looks like this:
Game Player Goal Assits Fouls
1 Alpha 1 1 0
1 Beta 2 0 1
2 Alpha 0 1 1
2 Gamma 2 0 0
3 Beta 3 0 1
4 Alpha 1 1 1
4 Beta 2 0 1
5 Alpha 0 1 1
5 Beta 1 0 0
5 Gamma 0 1 1
Desired result, with Points = Goals + Assists, and Attackers = ['Alpha','Beta'], Defenders = ['Gamma']:
Game Attackers Defenders
1 4 0
2 1 2
3 3 0
4 4 0
5 2 1
I have all the raw data in a pandas dataframe and I have tried using the isin function to get the data out. This leaves me with results of different lengths, i.e. if a player is "not in" a game then no data is added; I would like zeros there instead (as shown).
==> i.e. in Game 1, Gamma is not mentioned, so he has zero points.
Thank you for your help.
This is a bit messy, but certainly doable.
First of all, you'll need to reset_index() on df, to make grouping easier. Groupby doesn't handle grouping on an index and a column at the same time gracefully (GH issue).
In [64]: df = df.reset_index()
Define a mapping from player to position (attacker or defender):
In [65]: kind = {'Alpha': 'Attackers', 'Beta': 'Attackers', 'Gamma': 'Defenders'}
Ideally you'd be able to do the next 3 steps in one line, but I was having trouble with the aggregation. First, get the grouping by game and position:
In [66]: grouped = df.groupby(['Game', df.Player.map(kind)]).sum()
In [67]: grouped
Out[67]:
Goal Assits Fouls
Game Player
1 Attackers 3 1 1
2 Attackers 0 1 1
Defenders 2 0 0
3 Attackers 3 0 1
4 Attackers 3 1 2
5 Attackers 1 1 1
Defenders 0 1 1
[7 rows x 3 columns]
Then calculate the points, which gives a Series:
In [68]: points = grouped['Goal'] + grouped['Assits']
In [69]: points
Out[69]:
Game Player
1 Attackers 4
2 Attackers 1
Defenders 2
3 Attackers 3
4 Attackers 4
5 Attackers 2
Defenders 1
dtype: int64
Finally unstack(). This creates NaNs where there aren't any values (e.g. Game 1, Defenders), which we'll fill with 0.
In [70]: points.unstack('Player').fillna(0)
Out[70]:
Player Attackers Defenders
Game
1 4 0
2 1 2
3 3 0
4 4 0
5 2 1
[5 rows x 2 columns]
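On more recent pandas versions you could arguably collapse the whole thing into one step with pivot_table; a sketch, assuming Game is a regular column (after the reset_index above) and kind is the mapping defined earlier:

out = (df.assign(Points=df['Goal'] + df['Assits'],
                 Position=df['Player'].map(kind))
         .pivot_table(index='Game', columns='Position',
                      values='Points', aggfunc='sum', fill_value=0))
print(out)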