Pandas DataFrame grouping - python

I have a DataFrame (shown as an image in the original post) that counts the number of questions according to their state:
question_count_data.columns = ['date', 'curriculum_name_en', 'concept', 'language',
                               'concept_name_en', 'concept_name_tc', 'state', 'question_count']
question_count_data['state'] = question_count_data['state'] \
    .map({10: 'DRAFT', 20: 'REVIEW', 30: 'PUBLISHED', 40: 'ERROR', 50: 'DISABLED'})
I have used the following method to create this dataframe:
question_count_data = df_question.groupby(['date', 'concept__curriculum__name_en', 'concept',
                                            'language', 'concept_name_en', 'concept_name_tc', 'state'],
                                           as_index=False)['question_count'].sum()
I now want to create a separate column for each state (DRAFT, REVIEW, PUBLISHED, etc.) and put the question counts in the rows, so the result looks like the desired output (shown as an image in the original post).
What is the best way to do this using my question_count_data DataFrame? I don't want to change the groupby call already in place, because that is what gives me the question count.
I do not think another groupby would be the solution; what I ultimately want is to take the row values of the State column, turn them into separate columns like DRAFT, REVIEW, PUBLISHED, etc., and then provide the count for each date.
A detailed explanation would be helpful please.

You are really close: remove as_index=False so the groupby returns a Series with a MultiIndex, then reshape with Series.unstack:
cols = ['date', 'concept__curriculum__name_en', 'concept',
        'language', 'concept_name_en', 'concept_name_tc', 'state']
question_count_data = (df_question.groupby(cols)['question_count']
                                  .sum()
                                  .unstack(fill_value=0))
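Because 'state' is the last grouping key, unstack pivots it into one column per state by default; you can also pass the level by name and call reset_index to get the other keys back as ordinary columns. A minimal sketch on toy data (the short column names here are placeholders, not the asker's real schema):

import pandas as pd

# toy stand-in for df_question
df_question = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'concept': ['c1', 'c1', 'c2'],
    'state': ['DRAFT', 'PUBLISHED', 'DRAFT'],
    'question_count': [3, 5, 2],
})

wide = (df_question.groupby(['date', 'concept', 'state'])['question_count']
                   .sum()
                   .unstack('state', fill_value=0)  # one column per state
                   .reset_index())                  # 'date' and 'concept' back as columns
print(wide)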

Related

Combining dummies and count for pandas dataframe

I have a pandas dataframe like this:
as a plain text:
id;sub_id;value;total_stuff (total_stuff relates to id and sub_id)
aaa;1;cat;10
aaa;1;cat;10
aaa;1;dog;10
aaa;2;cat;7
aaa;2;dog;7
aaa;3;cat;5
bbb;1;panda;20
bbb;1;cat;20
bbb;2;panda;12
The desired output I want is shown as an image in the original post.
Note that there are many different "value"s possible, so I would need to automate the creation of the dummy variables (nb_animals).
But these dummy variables must contain the number of occurrences by id and sub_id.
The total_stuff is always the same value for a given id/sub_id combination.
I've tried using get_dummies(df, columns=['value']), which gave me this table (as plain text):
id;sub_id;value_cat;value_dog;value_panda;total_stuff
aaa;1;2;1;0;10
aaa;1;2;1;0;10
aaa;1;2;1;0;10
aaa;2;1;1;0;7
aaa;2;1;1;0;7
aaa;3;1;0;0;5
bbb;1;1;0;1;20
bbb;1;1;0;1;20
bbb;2;0;0;1;12
I'd love to use something like df.groupby(['id','sub_id']).agg({'value_cat':'sum', 'value_dog':'sum', ... , 'total_stuff':'mean'}), but writing out every possible animal value would be too tedious.
So how do I get a proper aggregated count/sum for the values, and the average for total_stuff (since total_stuff is unique per id/sub_id combination)?
Thanks
EDIT: Thanks chikich for the neat answer. The agg_dict is what I needed.
Use pd.get_dummies to transform the categorical data:
df = pd.get_dummies(df, prefix='nb', columns=['value'])
Then group by id and sub_id:
agg_dict = {key: 'sum' for key in df.columns if key[:3] == 'nb_'}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
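For reference, here is a minimal end-to-end sketch of the same approach using the sample rows from the question (the DataFrame construction below is an assumption based on the pasted text):

import pandas as pd

df = pd.DataFrame({
    'id':          ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb'],
    'sub_id':      [1, 1, 1, 2, 2, 3, 1, 1, 2],
    'value':       ['cat', 'cat', 'dog', 'cat', 'dog', 'cat', 'panda', 'cat', 'panda'],
    'total_stuff': [10, 10, 10, 7, 7, 5, 20, 20, 12],
})

df = pd.get_dummies(df, prefix='nb', columns=['value'])             # nb_cat, nb_dog, nb_panda
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'                                    # constant per id/sub_id
out = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
print(out)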

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find out the top-3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by='url', ascending=False).head(3))['url']
             .reset_index()
             .rename(columns={'url': 'count'}))
The final output was as follows (shown as an image in the original post):
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after already grouping by it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first generated column, level_0, signify? I tried the code with as_index=True, and it created both an index and a column for rest_type, so I couldn't reset the index. Output below (image in the original post):
Thank you
You can use groupby a second time because rest_type is present in the index, which groupby recognizes.
level_0 comes from the reset_index call because your index is unnamed.
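As a tiny sketch of where that unnamed level comes from (made-up data; exact behavior can differ slightly between pandas versions), the second groupby/apply adds an unnamed outer index level, which reset_index then has to name level_0:

import pandas as pd

df = pd.DataFrame({'rest_type': list('AAABBB'),
                   'name': list('xyzxyz'),
                   'url': range(6)})
g = df.groupby('rest_type', as_index=False).apply(lambda x: x.head(2))
print(g.index)                   # MultiIndex whose outer level has no name
print(g.reset_index().columns)   # the unnamed levels become level_0, level_1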
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index()
   .rename(columns={'name': 'counts', 'level_1': 'name'})
)
I have a similar case where the above query only partially works: the cooccurrence value always comes out as 1.
Here is my input data frame (shown as an image in the original post), and my query is below:
top_five_family_cooccurence_df = (common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'}))
I am getting the result shown (image in the original post), where the cooccurrence is always 1.

Joining dataframes based on values, pandas

I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], where 'Mobile_number' and 'Cell_number' share common values. I want to join only the 'Location' column onto A, based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame A would have the columns ['Name', 'Age', 'Mobile_number', 'Location'].
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest. A colleague suggested using pandas' join, but I don't understand how.
Thank you for your time.
The way I see it, you want to merge a DataFrame with part of another DataFrame, based on a common column.
First, make sure the common column shares the same name in both frames:
B['Mobile_number'] = B['Cell_number']
Then create a DataFrame that contains only the relevant columns (the joining column and the data column you need):
B1 = B[['Mobile_number', 'Location']]
And finally you can merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only rows whose Mobile_number value exists in both DataFrames (an inner join).
You can look at the documentation of pd.merge to change how exactly the merge is done, what to include, etc.
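As a possible shortcut (a sketch, not part of the answer above): pd.merge can also join on differently named columns directly via left_on/right_on, which avoids copying the column first, and how='left' keeps every row of A even when there is no matching Cell_number:

import pandas as pd

A = pd.DataFrame({'Name': ['Jake', 'Paul', 'Logan', 'King'],
                  'Age': [33, 43, 22, 45],
                  'Mobile_number': [332, 554, 234, 832]})
B = pd.DataFrame({'Cell_number': [832, 554, 123, 333],
                  'Blood_group': ['O', 'A', 'AB', 'AB'],
                  'Location': ['TX', 'AZ', 'MO', 'MN']})

merged_df = (pd.merge(A, B[['Cell_number', 'Location']],
                      left_on='Mobile_number', right_on='Cell_number', how='left')
               .drop(columns='Cell_number'))
print(merged_df)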

how to fix the issue of CategoricalIndex column in pandas?

I am working with Chicago crime data and want an aggregated count of the top 5 crimes for each region/community area. My code works, but I get an unwanted index and a CategoricalIndex-type column index in the resulting DataFrame, which stops me from accessing particular columns for further data manipulation.
What I did:
crimes_2012 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', sep=',', error_bad_lines=False)
df=crimes_2012[['Primary Type', 'Location Description', 'Community Area']]
crime_catg = df.groupby(['Community Name', 'Primary Type'])['Primary Type'].count().unstack()
crime_catg = crime_catg[['THEFT','BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'ASSAULT']]
crime_catg = crime_catg.dropna()
Here is my current output that needs to be improved (shown as an image in the original post).
Here is my attempt: when I tried the code below, I still didn't get a new index, and the index name displayed strangely in the output DataFrame. Why, and how can I fix this?
Even when I tried to reindex the DataFrame, it didn't get a new index:
crime_catg.reindex(inplace=True, drop=True)
Any idea how to fix this issue?
There are a couple of ways to handle this.
1) Keep the CategoricalIndex type and use the .add_categories method to update the valid categories, e.g. to fix your .reindex problem:
crime_catg.columns = crime_catg.columns.add_categories(['Community Name'])
2) Cast as pandas.Index:
crime_catg.columns = pd.Index(list(crime_catg.columns))
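For illustration, a minimal sketch of how the CategoricalIndex arises and how both options apply (toy data with an explicitly categorical column, not the Chicago dataset):

import pandas as pd

df = pd.DataFrame({'area': ['N', 'N', 'S'],
                   'crime': pd.Categorical(['THEFT', 'BATTERY', 'THEFT'])})
counts = df.groupby(['area', 'crime'])['crime'].count().unstack()
print(type(counts.columns))  # CategoricalIndex

# Option 1: keep the CategoricalIndex but register any new label first
counts.columns = counts.columns.add_categories(['Community Name'])

# Option 2: cast the columns to a plain Index
counts.columns = pd.Index(list(counts.columns))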

Adding Subtotals to Pandas Groupby

I am looking for a cleaner way to add subtotals to Pandas groupby.
Here is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 50),
    'Sub-Category': np.random.choice(['X', 'Y'], 50),
    'Product': np.random.choice(['Product 1', 'Product 2'], 50),
    'Units_Sold': np.random.randint(1, 100, size=(50)),
    'Dollars_Sold': np.random.randint(100, 1000, size=50),
    'Date': np.random.choice(pd.date_range('1/1/2011', '03/31/2011',
                                           freq='D'), 50, replace=False)})
From there, I create a new Groupby Dataframe as such:
df1 = (df.groupby(['Category', 'Sub-Category', 'Product',
                   pd.TimeGrouper(key='Date', freq='M')])
         .agg({'Units_Sold': 'sum', 'Dollars_Sold': 'sum'})
         .unstack()
         .fillna(0))
I would like to provide sub-totals for both Category & Sub-Category. I can do this using this code:
df2 = df1.groupby(level=[0, 1]).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0),
                                       df2.index.get_level_values(1) + ' Total',
                                       len(df2) * ['']])
df3 = df1.groupby(level=[0]).sum()
df3.index = pd.MultiIndex.from_arrays([df3.index.get_level_values(0) + ' Total',
                                       len(df3) * [''],
                                       len(df3) * ['']])
pd.concat([df1, df2, df3]).sort_index()
This gives me the DataFrame I want (the final DataFrame is shown as an image in the original post).
My question: is there a more pythonic way to do this than creating a new DataFrame for each level and concatenating them together? I have researched this but cannot find a better way. I have to do this for many different MultiIndex DataFrames and am seeking a better solution.
Thanks in advance for your help!
EDIT WITH ADDITIONAL INFORMATION:
Thank you to both @Wen and @DaFanat for their replies. I attempted to apply the approach from the link @Wen provided (Python (Pandas) Add subtotal on each lvl of multiindex dataframe) to my data:
pd.concat([df.assign(\
**{x: 'Total' for x in "CategorySub-CategoryProduct"[i:]}\
).groupby(list('abc')).sum() for i in range(1,4)])\
.sort_index()
This sums the totals; however, it ignores the dates that make up the second level of the columns, leaving me with the outcome shown (image in the original post).
I've tried to add in a TimeGrouper with the groupby, but that returns an error. Any help would be greatly appreciated. Thanks!
I can get you a lot closer by aligning your attempt above with the example from @piRSquared.
The list must match the MultiIndex. Try this instead:
iList = ['Category', 'Sub-Category', 'Product']
pd.concat([
    df1.assign(
        **{x: '' for x in iList[i:]}
    ).groupby(iList).sum() for i in range(1, 4)
]).sort_index()
It doesn't present the word "Total" in the right place, nor are the totals at the bottom of each group, but at least it's more-or-less functionally correct. My totals won't match because the values in the DataFrame are random.
It took me a while to work through the original answer provided in Python (Pandas) Add subtotal on each lvl of multiindex dataframe. But the same logic applies here.
The assign() replaces the values in the columns with what is in the dict that is returned by the dict comprehension executed over the elements of the list of MultiIndex columns.
Then groupby() only finds unique values for those non-blanked-out columns and sums them accordingly.
These groupbys are enclosed in a list comprehension, so pd.concat() then just combines these sets of rows.
And sort_index() puts the index labels in ascending order.
(Yes, you still get a warning about "both a column name and an index level," but it still works.)
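To make the mechanism concrete, here is a small self-contained sketch of the same assign/groupby/concat pattern on made-up data (applied to the raw frame with a single column level for simplicity, which also avoids the column-vs-index-level warning):

import pandas as pd

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                   'Sub-Category': ['X', 'Y', 'X', 'Y'],
                   'Product': ['P1', 'P2', 'P1', 'P2'],
                   'Units_Sold': [10, 20, 30, 40]})

iList = ['Category', 'Sub-Category', 'Product']

# i = 3 keeps every level (detail rows); i = 2 blanks out Product
# (Sub-Category subtotals); i = 1 also blanks out Sub-Category (Category totals).
out = pd.concat([
    df.assign(**{x: '' for x in iList[i:]}).groupby(iList).sum()
    for i in range(1, 4)
]).sort_index()
print(out)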
