How to delete rows in a Pandas DataFrame using 2 columns as conditions? - python

Basically, I got a table like the following:
Name Sport Frequency
Jonas Soccer 3
Jonas Tennis 5
Jonas Boxing 4
Mathew Soccer 2
Mathew Tennis 1
John Boxing 2
John Boxing 3
John Soccer 1
Let's say this is a standard table that I will transform into a Pandas DataFrame, using the groupby function just like this:
table = df.groupby(['Name'])
After the dataframe is created, I want to delete all the rows where the frequency of any sport other than Soccer is greater than that name's Soccer frequency.
So I need to apply the following conditions:
Identify where Soccer is present; and then
If so, identify if there is any other sport present; and finally
Delete rows where the sport is anything other than Soccer and its frequency is greater than the Soccer frequency associated with that name (used in the groupby function).
So, the output would be something like:
Name Sport Frequency
Jonas Soccer 3
Mathew Soccer 2
Mathew Tennis 1
John Soccer 1
Thank you for your support

This is one way to go about it, by iterating through the groups:
pd.concat(
    [
        value.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
        .bfill()
        .ffill()
        .query("Frequency <= temp")
        .drop("temp", axis=1)
        for key, value in df.groupby("Name")
    ]
)
Name Sport Frequency
7 John Soccer 1
0 Jonas Soccer 3
3 Mathew Soccer 2
4 Mathew Tennis 1
You could also create a categorical type for the Sport column, sort the dataframe, then forward-fill:
sport_dtype = pd.api.types.CategoricalDtype(categories=df.Sport.unique(), ordered=True)
df = df.astype({"Sport": sport_dtype})
(
    df.sort_values(["Name", "Sport"], ascending=[False, True])
    .assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
    .ffill()
    .query("Frequency <= temp")
    .drop("temp", axis=1)
)
Name Sport Frequency
3 Mathew Soccer 2
4 Mathew Tennis 1
0 Jonas Soccer 3
7 John Soccer 1
Note that this works because Soccer is the first entry in the Sport column; if it is not, you have to reorder the categories so that Soccer comes first, as sketched below.
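For example, a minimal sketch (assuming "Soccer" appears in the column) that moves Soccer to the front of the categories:
cats = ["Soccer"] + [s for s in df["Sport"].unique() if s != "Soccer"]
sport_dtype = pd.api.types.CategoricalDtype(categories=cats, ordered=True)
df = df.astype({"Sport": sport_dtype})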
Another option is to get the index of rows that meet our criteria and filter the dataframe:
index = (
    df.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
    .groupby("Name")
    .pipe(lambda x: x.ffill().bfill())
    .query("Frequency <= temp")
    .index
)
df.loc[index]
Name Sport Frequency
0 Jonas Soccer 3
3 Mathew Soccer 2
4 Mathew Tennis 1
7 John Soccer 1
A bit surprised that I lost the grouping index though.
UPDATE: Gave this some thought; this may be a simpler solution: keep the rows where the sport is Soccer, or where the group's average of the Soccer indicator is greater than or equal to 0.5; the average ensures that Soccer rows are not outnumbered by the other sports within a name.
(df.assign(temp=df.Sport == "Soccer",
           temp2=lambda x: x.groupby("Name").temp.transform("mean"))
 .query('Sport == "Soccer" or temp2 >= 0.5')
 .iloc[:, :3]
)
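A compact alternative sketch (my addition, not from the answers above): broadcast each name's Soccer frequency across the group with a group-wise transform, then filter against it. This assumes at most one Soccer row per name.
soccer_freq = (df["Frequency"]
               .where(df["Sport"] == "Soccer")   # keep only Soccer frequencies, NaN elsewhere
               .groupby(df["Name"])
               .transform("max"))                # broadcast each name's Soccer frequency
df[df["Frequency"] <= soccer_freq]
Names with no Soccer row get a NaN threshold, so all of their rows drop out, matching the "identify where Soccer is present" condition.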

Related

How to Aggregate data based on multiple columns in Python

I'm trying to aggregate text fields based on date and category columns. Below is what the initial dataset looks like:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
The expected output that I want to get is the below:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics
As per the example, I actually want to aggregate the tweet text based on date and category. Below is how I used to aggregate without considering category, whereas I need the aggregation shown in the output above. It would be very helpful if anyone could answer this.
import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df

# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv', index=True, header=True)
You can use:
out = (df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet']
         .apply(lambda x: ' '.join(x))[df.columns])
print(out)
Output:
>>> out
created_at tweet category
0 7/29/2021 Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan sports
1 7/29/2021 President Made a clear statement An election is to be kept next year politics
2 7/29/2021 A terrorist attack have killed 10 people crime
3 8/29/2021 Srilanka have lost the T20 series Australia have won the series sports
4 8/29/2021 Minister have given up his role last monday President is challenging the opposite leader politics
Assuming df is your example: first turn the tweet column into lists with groupby, then join each list with apply.
df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
df["tweet"] = df1["tweet"].apply(lambda x:" ".join(x))
df = df.reindex(columns=["created_at", "tweet", "category"])
df
output:
created_at tweet category
0 7/29/2021 A terrorist attack have killed 10 people crime
1 7/29/2021 President Made a clear statement An election i... politics
2 7/29/2021 Great Sunny day for Cricket at London Great Sc... sports
3 8/29/2021 Minister have given up his role last monday Pr... politics
4 8/29/2021 Srilanka have lost the T20 series Australia ha... sports
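The two steps can also be collapsed into a single aggregation; a sketch of the same idea in one chain:
df = (df.groupby(["created_at", "category"], as_index=False)["tweet"]
        .agg(" ".join)
        .reindex(columns=["created_at", "tweet", "category"]))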

How to group by category and then count the frequency of words using Pandas

my df looks like this:
category text
-------- ----
soccer soccer game is good
soccer soccer game
basketball game basketball
basketball game
volleyball sport volleyball sport
What I want to do is group by category and then list the words by their frequency:
category text frequency
-------- ---- ---------
soccer soccer 2
game 2
is 1
good 1
basketball game 2
basketball 1
volleyball sport 2
volleyball 1
What did I do?
I grouped all the text together:
df.groupby(['category'])['text'].sum()
Now all the text is on the same row per category since I grouped it, but I do not know how to build a frequency table from the word counts.
Could someone please help me?
#Method 1:
You can use Series.str.split with explode and then groupby.value_counts:
(df.assign(text=df['text'].str.split())
   .explode("text")
   .groupby("category", sort=False)['text']
   .value_counts())
category text
soccer game 2
soccer 2
good 1
is 1
basketball game 2
basketball 1
volleyball sport 2
volleyball 1
Name: text, dtype: int64
#Method 2:
For older versions of pandas, use np.concatenate and index.repeat with df.join (there are other methods listed here):
import numpy as np

s = df['text'].str.split()
(df[['category']].join(pd.Series(np.concatenate(s),
                                 index=df.index.repeat(s.str.len()), name='text'))
 .groupby("category", sort=False)['text'].value_counts())
#Method 3: using MultiLabelBinarizer from sklearn
from sklearn.preprocessing import MultiLabelBinarizer

s = df['text'].str.split()
mlb = MultiLabelBinarizer()
mlb.fit(s)
out = pd.DataFrame(mlb.transform(s), columns=mlb.classes_).groupby(df['category']).sum()
out.replace(0, np.nan).stack().astype(int)
category
basketball basketball 1
game 2
soccer game 2
good 1
is 1
soccer 2
volleyball sport 1
volleyball 1
dtype: int32
value_counts is the right way; it is usable inside a groupby too, after splitting into words. A sketch follows.
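A sketch of that suggestion (essentially the same chain as Method 1 above; the intermediate column name word is only illustrative):
(df.assign(word=df['text'].str.split())
   .explode("word")
   .groupby("category", sort=False)['word']
   .value_counts())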

Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

I have a dataframe as below:
import pandas as pd
df = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                   'Region': ['Americas', 'NaN', 'NaN', 'Asia', 'Europe', 'NaN', 'NaN'],
                   'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                   'Animal': ['Bison', 'NaN', 'Golden Eagle', 'Tiger', 'Lion', 'Lion', 'NaN'],
                   'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
I want to group by Country and Flower and forward fill or backward fill the columns Region and Animal where there are missing values. However, the column Game should remain intact.
I have tried this but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
also :
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about with this.
While this works, it removes the Game column:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Animal', 'Region'].bfill().ffill()
And if I do a transform, there is a mismatch in length. Also, please note that this is a sample dataframe where I added "NaN" as a string; in the original frame the values are np.nan.
If you change your dataframe code to actually include np.nan, then the code you provided works. Although NaNs print as the text 'NaN', you can't create a missing value by writing that text by hand, because it will be interpreted as a string, not an actual missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                   'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
                   'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                   'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
                   'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe
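The same pattern covers the Animal column as well; a sketch (my addition, not from the original answer) that chains a bfill for NaNs at the start of a group:
df['Animal'] = df.groupby(['Country', 'Flower'])['Animal'].transform(lambda x: x.ffill().bfill())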
First you need to know that the string 'NaN' is not NaN:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2 NaN  # this group has only a single row, which is why it stays NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second, if you need to chain two fill functions in pandas, you need apply:
df.update(df.groupby(['Country', 'Flower'])[['Animal', 'Region']].apply(lambda x: x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe
As MEX/Lily is the only row in its group and, moreover, its Region value is NaN, the fillna function is not able to find an appropriate group value.
If we catch the exception while filling with the group mode, the values that have no group to fill from are left as they are. Then apply ffill and bfill to cover the values that don't have an appropriate group:
import pandas as pd
import numpy as np

df_stack = pd.DataFrame({'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
                         'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
                         'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
                         'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
                         'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket']})
print("-------Before imputation------")
print(df_stack)

def fillna_Region(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except BaseException as e:
        print('Error as no corresponding group: ' + str(e))

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Region'].transform(lambda grp: fillna_Region(grp)))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Animal'].transform(lambda grp: fillna_Region(grp)))
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)
print("-------After imputation------")
print(df_stack)

Sort text in second column based on values in first column

In Python I would like to collect the text into different rows based on the value of the number in the first column. So:
Harry went to School 100
Mary sold goods 50
Sick man
using the provided information below:
number text
1 Harry
1 Went
1 to
1 School
1 100
2 Mary
2 sold
2 goods
2 50
3 Sick
3 Man
for i in xrange(0, len(df['number']) - 1):
    if df['number'][i+1] == df['number'][i]:
        # append text (e.g Harry went to school 100)
    else:
        # new row (Mary sold goods 50)
You can use groupby:
for name, group in df.groupby(df['number']):
    print(' '.join(group['text']))
Result
Harry Went to School 100
Mary sold goods 50
Sick Man
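If you want the joined lines back as a Series rather than printed output, the same groupby idea works as an aggregation (a sketch, not part of the original answer):
joined = df.groupby(df['number'])['text'].agg(' '.join)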

How to quickly convert two column CSV to a table/excel spreadsheet?

I have a two column CSV:
Name,Sport
Abraham,Soccer
Adam,Basketball
Adam,Soccer
John,Soccer
Jacob,Tennis
Jacob,Soccer
What is the simplest way to convert this into something openable in Excel, either XLS or CSV, such that when opened in MS Excel it looks something like:
Basketball, Soccer, Tennis
Abraham X
Adam X X X
John X
Jacob X X
I would consider pandas to be a suitable package for this kind of application. The centerpiece of pandas is the dataframe object (df), which is in essence a table of your data. CSV files can be read into pandas using read_csv:
import pandas as pd
df = pd.read_csv('filename.csv')
In [3]:df
Out[3]:
Name Sport
0 Abraham Soccer
1 Adam Basketball
2 Adam Soccer
3 John Soccer
4 Jacob Tennis
5 Jacob Soccer
There is a pandas method, crosstab, that does what you want as simply as:
table = pd.crosstab(df['Name'], df['Sport'])
In [4]:table
Out[4]:
Sport Basketball Soccer Tennis
Name
Abraham 0 1 0
Adam 1 1 0
Jacob 0 1 1
John 0 1 0
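If you would rather see 'X' marks than counts, as in the layout from the question, one optional cosmetic step (my addition, not from the original answer):
table = table.astype(bool).replace({True: 'X', False: ''})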
Then you can convert back to a csv file with
table.to_csv('filename.csv')
