I want to extract a dataframe that meets certain conditions using Python and pandas

I load Excel data with the columns Time, Name, Good, Bad using Python and pandas.
I want to reprocess this dataframe into another dataframe that meets certain conditions.
In detail, I would like to print a dataframe that stores the sum of the Good and Bad data for each Name over the entire time.
Please help me, anybody who knows Python and pandas well.

First aggregate the sums with DataFrame.groupby, prefix the column names with DataFrame.add_prefix, add a new column with DataFrame.assign, and finally convert the index back to a column with DataFrame.reset_index:
import pandas as pd

df = pd.DataFrame({
    'Name': list('aaabbb'),
    'Bad': [1, 3, 5, 7, 1, 0],
    'Good': [5, 3, 6, 9, 2, 4]
})

df1 = (df.groupby('Name')[['Good', 'Bad']]
         .sum()
         .add_prefix('Total_')
         .assign(Total_Count=lambda x: x.sum(axis=1))
         .reset_index())
print(df1)
  Name  Total_Good  Total_Bad  Total_Count
0    a          14          9           23
1    b          15          8           23

Use pandas NamedAgg with eval:
df.groupby('Name')[['Good', 'Bad']]\
  .agg(Total_Good=('Good', 'sum'),
       Total_Bad=('Bad', 'sum'))\
  .eval('Total_Count = Total_Good + Total_Bad')
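Run against the sample df above, this should produce the same totals, with Name left as the index since there is no reset_index here:

      Total_Good  Total_Bad  Total_Count
Name
a             14          9           23
b             15          8           23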

Related

Pandas groupby.sum for all columns

I have a dataset with a set of columns I want to sum for each row. The columns in question all follow a specific naming pattern that I have been able to group in the past via the .sum() function:
pd.DataFrame.sum(data.filter(regex=r'_name$'),axis=1)
Now I need to do the same, but grouped by the values of another column:
data.groupby('group').sum(data.filter(regex=r'_name$'),axis=1)
However, this does not appear to work, as .sum() no longer accepts the filtered columns. Is there another way to approach this while keeping my data.filter() code?
Example toy dataset (the real dataset contains over 500 columns, and the columns are not cleanly ordered):
toy_data = {'id': [1, 2, 3, 4, 5, 6],
            'group': ["a", "a", "b", "b", "c", "c"],
            'a_name': [1, 6, 7, 3, 7, 3],
            'b_name': [4, 9, 2, 4, 0, 2],
            'c_not': [5, 7, 8, 4, 2, 5],
            'q_name': [4, 6, 8, 2, 1, 4]}
df = pd.DataFrame(toy_data, columns=['id', 'group', 'a_name', 'b_name', 'c_not', 'q_name'])
Edit: I missed this in the original post. My objective is to get a variable "sum" holding the summation of all the selected columns, as shown below:
You can filter first and then pass df['group'] instead of 'group' to groupby; finally, add the sum column with DataFrame.assign:
df1 = (df.filter(regex=r'_name$')
         .groupby(df['group']).sum()
         .assign(sum=lambda x: x.sum(axis=1)))
An alternative is to filter the column names and pass them after groupby:
cols = df.filter(regex=r'_name$').columns
df1 = df.groupby('group')[cols].sum()
Or:
cols = df.columns[df.columns.str.contains(r'_name$')]
df1 = df.groupby('group')[cols].sum().assign(sum = lambda x: x.sum(axis=1))
print(df1)
       a_name  b_name  q_name  sum
group
a           7      13      10   30
b          10       6      10   26
c          10       2       5   17

Pandas: groupby by 2 non-numeric columns

I have a dataframe with several columns, but I only need to use 2 non-numeric columns: one is 'hashed_id', the other is 'event_name' with 10 unique names.
I'm trying to group by the 2 non-numeric columns, so aggregation functions would not work here.
My solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame(df_events, columns=["subscription_hash", 'event_name'])
I'm trying to get a format like:
                       subscription_hash  event_name
0  (0000379144f24717a8d124d798008a0e672)  AddToQueue
1  (0000379144f24717a8d124d798008a0e672)   page_view
but instead I am getting:
                              subscription_hash  event_name
0  (0000379144f24717a8d124d798008a0e672) 832433  AddToQueue
1  (0000379144f24717a8d124d798008a0e672) 245400   page_view
Please advise.
Is your data clean? Where are those undesired numbers coming from?
From the docs, I see groupby being used by providing the column names as a list, followed by an aggregate function:
df.groupby(['col1', 'col2']).mean()
Since your values are not numeric, maybe try the pivot method:
df.pivot(columns=['col1', 'col2'])
So I'd first try putting [] around your column names, then try the pivot.
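As a minimal sketch of the bracketed-list suggestion (assuming the goal is one row per unique subscription_hash/event_name pair):

df_events = (df.groupby(['subscription_hash', 'event_name'])
               .size()
               .reset_index(name='count'))
# or, if no count is needed, just the unique pairs:
df_events = df[['subscription_hash', 'event_name']].drop_duplicates().reset_index(drop=True)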

Comparing a dataframe column with another dataframe

I have two datasets
df1 = pd.DataFrame({"skuid": ("45", "22", "32", "33"), "country": ("A", "B", "C", "A")})
df2 = pd.DataFrame({"skuid": ("45", "32", "40", "21"), "salesprice": (10, 0, 0, 30), "regularprice": (9, 10, 0, 2)})
I want to find how many rows of df2 are also in df1 when country is A (just the sum).
I want the output to be 1, because skuid 45 is in both datasets and its country is A.
I did it by subsetting by country and using isin(), like:
df3 = df1.loc[df1['country']=='A']
df3['skuid'].isin(df2['skuid']).value_counts()
but I want to know whether I can do it in a single line.
Here is what I tried as a one-liner:
df1.loc['skuid'].isin(df2.skuid[df1.country.eq('A')].unique().sum()):,])
I know my mistake: I'm comparing df1 with df2 filtered by a country that doesn't exist there.
So, is there any way I can do it in one or two lines, without subsetting each country?
Thanks in advance.
Let's try:
df1.loc[df1['country']=='A', 'skuid'].isin(df2['skuid']).sum()
# out: 1
Or
(df1['skuid'].isin(df2['skuid']) & df1['country'].eq('A')).sum()
You can also do it for all countries with groupby():
df1['skuid'].isin(df2['skuid']).groupby(df1['country']).sum()
Output:
country
A    1
B    0
C    1
Name: skuid, dtype: int64
If I understood correctly, you need this:
df3 = df1[lambda x: x['skuid'].isin(df2['skuid']) & (x['country'] == 'A')].count()
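Note that .count() returns a count per column rather than a single number; a sketch of a scalar version, assuming the same condition:

len(df1[df1['skuid'].isin(df2['skuid']) & (df1['country'] == 'A')])
# 1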

How to add mean value of groupby function based on dataframe A to another dataframe in Pandas?

In my notebook I have 3 dataframes.
I would like to calculate the mean Age based on Pclass and Sex. I achieved this with a groupby; its result is used to fill the NaN fields:
avg = traindf_cln.groupby(["Pclass", "Sex"])["Age"].transform('mean')
traindf_cln["Age"].fillna(avg, inplace=True)
validationdf_cln["Age"].fillna(avg, inplace=True)
testdf_cln["Age"].fillna(avg, inplace=True)
The problem is that the code above only works on the traindf_cln dataframe, not on the other two.
I think the issue is that you can't use a (groupby) value from one specific dataframe on another dataframe.
How can I fix this?
Edit:
New code:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
lookup_keys_val = pd.Series(tuple(zip(validationdf_cln["Pclass"], validationdf_cln["Sex"])))
validationdf_cln["Age"].fillna(lookup_keys_val.map(group), inplace=True)
A few samples of traindf_cln where Age is still NaN; some did change, but not all of them.
You don't need to use transform, just a groupby object that can then be mapped onto the Pclass and Sex columns of the test/validation DataFrames. Here we create a Series with tuples of Pclass and Sex that can be used to map the groupby values into the missing Age data:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
Then just repeat the final 2 lines using the same group object on the test/validation sets.
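For instance, a minimal sketch for testdf_cln, assuming it has the same Pclass and Sex columns (the validation version in the question's edit has the same shape):

lookup_keys_test = pd.Series(tuple(zip(testdf_cln["Pclass"], testdf_cln["Sex"])))
testdf_cln["Age"].fillna(lookup_keys_test.map(group), inplace=True)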

Run functions for each row in dataframe

I have a dataframe df1, like this:
        date      sentence
  29/03/2019    I like you
  30/03/2019  You eat cake
and I run the functions getVerb and getObj on df1, so the output looks like this:
        date      sentence  verb  object
  29/03/2019    I like you  like     you
  30/03/2019  You eat cake   eat    cake
I want those functions (getVerb and getObj) to run for each row of df1. Could someone help me solve this problem in an efficient way?
Thank you so much.
Each column of a pandas DataFrame is a Series. You can use the Series.apply or Series.map functions to get the result you want.
df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
# OR
df1['verb'] = df1['sentence'].map(getVerb)
df1['object'] = df1['sentence'].map(getObj)
See the pandas documentation for more details on Series.apply or Series.map.
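For instance, a self-contained sketch; getVerb and getObj are not shown in the question, so the hypothetical placeholders below just assume a naive subject-verb-object word order:

import pandas as pd

def getVerb(sentence):
    # placeholder: assumes "subject verb object" sentences
    return sentence.split()[1]

def getObj(sentence):
    # placeholder: assumes the object is the third word
    return sentence.split()[2]

df1 = pd.DataFrame({'date': ['29/03/2019', '30/03/2019'],
                    'sentence': ['I like you', 'You eat cake']})
df1['verb'] = df1['sentence'].apply(getVerb)
df1['object'] = df1['sentence'].apply(getObj)
print(df1)
#          date      sentence  verb object
# 0  29/03/2019    I like you  like    you
# 1  30/03/2019  You eat cake   eat   cake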
Assume you have a pandas dataframe such as:
import pandas as pd
import numpy as np

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
Let's say we want the sum of columns A and B, row-wise and column-wise. To accomplish this, we write:
df.apply(np.sum, axis=1)  # for row-wise sum
Output:
0    13
1    13
2    13

df.apply(np.sum, axis=0)  # for column-wise sum
Output:
A    12
B    27
Now, if you want to apply a function to a specific set of columns, select a subset of the dataframe first.
For example, to compute the sum over column A only:
df[['A']].apply(np.sum)  # equivalent to df['A'].sum()
You may also refer to the DataFrame.apply documentation. Other than that, Series.map and Series.apply can be handy as well, as mentioned in the answer above.
Cheers!
Using a simple loop (assuming the 'verb' and 'object' columns already exist in the dataframe; .at is used because chained assignment like df1['verb'].iloc[index] = ... is not guaranteed to write back to df1):
for index, row in df1.iterrows():
    df1.at[index, 'verb'] = getVerb(row['sentence'])
    df1.at[index, 'object'] = getObj(row['sentence'])
Note that a Python-level loop is usually slower than the apply/map approaches above.
