Combining dummies and count for pandas dataframe - python

I have a pandas dataframe like this (total_stuff relates to id and sub_id):

id   sub_id  value  total_stuff
aaa  1       cat    10
aaa  1       cat    10
aaa  1       dog    10
aaa  2       cat    7
aaa  2       dog    7
aaa  3       cat    5
bbb  1       panda  20
bbb  1       cat    20
bbb  2       panda  12
The desired output I want is this (one row per id/sub_id, with one count column per value and total_stuff kept):

id   sub_id  nb_cat  nb_dog  nb_panda  total_stuff
aaa  1       2       1       0         10
aaa  2       1       1       0         7
aaa  3       1       0       0         5
bbb  1       1       0       1         20
bbb  2       0       0       1         12

Note that there are many different "values" possible, so I would need to automate the creation of the dummy variables (nb_animals).
But these dummy variables must contain the number of occurrences by id and sub_id.
The total_stuff is always the same value for a given id/sub_id combination.
I've tried using get_dummies(df, columns=['value']), which gave me this table:

id   sub_id  value_cat  value_dog  value_panda  total_stuff
aaa  1       2          1          0            10
aaa  1       2          1          0            10
aaa  1       2          1          0            10
aaa  2       1          1          0            7
aaa  2       1          1          0            7
aaa  3       1          0          0            5
bbb  1       1          0          1            20
bbb  1       1          0          1            20
bbb  2       0          0          1            12
I'd love to use some kind of df.groupby(['id','sub_id']).agg({'value_cat':'sum', 'value_dog':'sum', ... , 'total_stuff':'mean'}), but writing all of the possible animal values would be too tedious.
So how can I get a proper aggregated count/sum for the values, and an average for total_stuff (since total_stuff is unique per id/sub_id combination)?
Thanks
EDIT: Thanks chikich for the neat answer. The agg_dict is what I needed.

Use pd.get_dummies to transform the categorical data:
df = pd.get_dummies(df, prefix='nb', columns=['value'])
Then group by id and sub_id, summing the dummy columns and averaging total_stuff:
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
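For reference, here is a minimal self-contained sketch of that approach applied to the sample data above (the DataFrame construction lines are mine, added for the demo, not part of the original question):

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'id': ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb'],
    'sub_id': [1, 1, 1, 2, 2, 3, 1, 1, 2],
    'value': ['cat', 'cat', 'dog', 'cat', 'dog', 'cat', 'panda', 'cat', 'panda'],
    'total_stuff': [10, 10, 10, 7, 7, 5, 20, 20, 12],
})

# One dummy column per animal, prefixed with nb_
df = pd.get_dummies(df, prefix='nb', columns=['value'])

# Sum the dummies (occurrence counts) and average total_stuff per id/sub_id
agg_dict = {col: 'sum' for col in df.columns if col.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
out = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
print(out)

Recent pandas versions return boolean dummy columns from get_dummies; summing them still gives integer counts, and passing dtype=int to get_dummies makes them integers from the start.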

Related

Exclude values in DF column

I have a problem: I want to drop from my DF all rows whose value in a given column ends with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the concerned values, but how do I apply it to my DF to drop those rows?
I tried a few things but nothing works.
Lately I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
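As a small illustrative sketch (the column name 'XX' comes from the question; the sample values are made up):

import pandas as pd

df = pd.DataFrame({'XX': ['ab199', 'ab200', 'cd299', 'cd300']})  # hypothetical values

# Keep only the rows whose 'XX' value does not end with '99'
df = df[~df['XX'].str.endswith('99')]
print(df)  # 'ab200' and 'cd300' remain

If you have already built filteredvalues as in the question, df[~df['XX'].isin(filteredvalues)] gives the same result.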

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find out the top-3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by="url", ascending=False).head(3))['url']
             .reset_index()
             .rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after grouping by it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the newly created column level_0 signify? I tried the code with as_index=True and it created an index and a column pertaining to rest_type, so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is present in the index, which is recognized by groupby.
level_0 comes from the reset_index command because your index is unnamed.
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
Edit: here is an alternative to format the result as a dataframe with informative column names:
(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index()
   .rename(columns={'name': 'counts', 'level_1': 'name'})
)
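Another way to get the same top-3 table, sketched under the assumption that df is the small random frame built above:

counts = (df.groupby('rest_type')['name']
            .value_counts()      # sorted descending within each rest_type
            .rename('count')     # name the values to avoid a clash with the 'name' index level
            .reset_index())
top3 = counts.groupby('rest_type').head(3)
print(top3)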
I have a similar case where the above query only partially works: the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (common_top25_cooccurance1_df.groupby('family')
                                  .apply(lambda x: x['related_family'].value_counts().nlargest(5))
                                  .reset_index()
                                  .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'}))
I am getting a result in which the cooccurrence is always 1.

Pandas, groupby by 2 non numeric columns

I have a dataframe with several columns, of which I only need to use 2 non-numeric columns:
one is 'hashed_id', the other is 'event name' with 10 unique names.
I'm trying to do a groupby on 2 non-numeric columns, so aggregation functions would not work here.
My solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame(df_events, columns=["subscription_hash", 'event_name'])
I'm trying to get a format like:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) AddToQueue
1 (0000379144f24717a8d124d798008a0e672) page_view
but instead getting:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) 832433 AddToQueue
1 (0000379144f24717a8d124d798008a0e672) 245400 page_view
Please advise
Is your data clean? Where are those undesired numbers coming from?
From the docs, I see groupby being used by providing the column names as a list, together with an aggregate function:
df.groupby(['col1','col2']).mean()
Since your values are not numeric, maybe try the pivot method:
df.pivot(columns=['col1','col2'])
So I'd first try putting [] around your column names, then try the pivot.
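For what it's worth, a sketch of how I would produce the desired output (column names are taken from the question's code; the sample values are hypothetical):

import pandas as pd

df = pd.DataFrame({'subscription_hash': ['hash_1', 'hash_1', 'hash_1', 'hash_2'],
                   'event_name': ['AddToQueue', 'page_view', 'page_view', 'page_view']})

# Unique (subscription_hash, event_name) pairs, as in the desired format
df_events = df[['subscription_hash', 'event_name']].drop_duplicates().reset_index(drop=True)

# Or, if a count per pair is wanted, group by both columns and take the size
counts = df.groupby(['subscription_hash', 'event_name']).size().reset_index(name='count')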

How to add mean value of groupby function based on dataframe A to another dataframe in Pandas?

In my notebook I have 3 dataframes.
I would like to calculate the mean Age based on Pclass and Sex. I achieved this by using a groupby function, and the result is used to fill the NaN fields:
avg = traindf_cln.groupby(["Pclass", "Sex"])["Age"].transform('mean')
traindf_cln["Age"].fillna(avg, inplace=True)
validationdf_cln["Age"].fillna(avg, inplace=True)
testdf_cln["Age"].fillna(avg, inplace=True)
The problem is that the code above is only working on the traindf_cln dataframe and not on the other two.
I think the issue is that you can't use a value (from a groupby) computed on one dataframe to fill another dataframe.
How can I fix this?
Dataframe traindf_cln:
Edit:
New code:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
lookup_keys_val = pd.Series(tuple(zip(validationdf_cln["Pclass"], validationdf_cln["Sex"])))
validationdf_cln["Age"].fillna(lookup_keys_val.map(group), inplace=True)
Here are a few samples of traindf_cln where Age is still NaN. Some rows did change, but not all of them.
You don't need to use transform, just a groupby object that can then be mapped onto the Pclass and Sex columns of the test/validation DataFrames. Here we create a Series with tuples of Pclass and Sex that can be used to map the groupby values into the missing Age data:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
Then just repeat the final 2 lines using the same group object on the test/validation sets.
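Spelled out for the other two frames, a sketch (frame and column names come from the question; the explicit index= argument is my addition so the mapped values align even when a frame's index is not a default 0..n-1 range, which is one possible reason some NaNs were left unfilled in the edit above):

# Group means computed once on the training data
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()

# Validation set: build (Pclass, Sex) keys from its own rows, then map the training means
val_keys = pd.Series(tuple(zip(validationdf_cln["Pclass"], validationdf_cln["Sex"])),
                     index=validationdf_cln.index)
validationdf_cln["Age"] = validationdf_cln["Age"].fillna(val_keys.map(group))

# Test set: same pattern
test_keys = pd.Series(tuple(zip(testdf_cln["Pclass"], testdf_cln["Sex"])),
                      index=testdf_cln.index)
testdf_cln["Age"] = testdf_cln["Age"].fillna(test_keys.map(group))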

How to filter out rows of pandas df that contain values in 'set' type which contain certain strings?

I have a dataframe that contains a column whose values are of type 'set'.
I also have a list of words I wish to search for in these sets, and I want to drop the rows that contain a hit from the list.
e.g. df structure:
id types
123 {'Editorial', "Research Support, Non-U.S. Gov't", 'Comment'}
234 {'Comparative Study', 'Journal Article', 'Research Support, N.I.H., Extramural'}
And this is my list of values to drop:
list_to_drop = ['Editorial', 'Comment']
In this example I wish to drop the first row.
Thanks!
Use isdisjoint with map, then filter by boolean indexing:
df = df[df['types'].map(set(list_to_drop).isdisjoint)]
print (df)
id types
1 234 {Comparative Study, Research Support, N.I.H., ...
Use the code below with apply and difference (note that this removes the unwanted values from each set instead of dropping the rows):
df['types'] = df['types'].apply(lambda x: x.difference(list_to_drop))
You can instead use map with issubset (note this drops a row only when all of the listed values are present in its set):
df[~df.types.map(set(list_to_drop).issubset)]
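A self-contained sketch of the isdisjoint approach, with the example frame rebuilt by hand:

import pandas as pd

df = pd.DataFrame({
    'id': [123, 234],
    'types': [{'Editorial', "Research Support, Non-U.S. Gov't", 'Comment'},
              {'Comparative Study', 'Journal Article', 'Research Support, N.I.H., Extramural'}],
})
list_to_drop = ['Editorial', 'Comment']

# Keep only the rows whose set shares no element with list_to_drop
kept = df[df['types'].map(set(list_to_drop).isdisjoint)]
print(kept)  # only id 234 remains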
