Pandas, groupby by 2 non-numeric columns - python

I have a dataframe with several columns, but I only need to use 2 non-numeric columns: one is 'hashed_id', the other is 'event_name' with 10 unique names.
I'm trying to group by the 2 non-numeric columns, so aggregation functions would not work here. My solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame(df_events, columns=['subscription_hash', 'event_name'])
I'm trying to get a format like:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) AddToQueue
1 (0000379144f24717a8d124d798008a0e672) page_view
but instead getting:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) 832433 AddToQueue
1 (0000379144f24717a8d124d798008a0e672) 245400 page_view
Please advise

Is your data clean? Where are those undesired numbers coming from?
From the docs, I see groupby being used with the column names provided as a list, plus an aggregate function:
df.groupby(['col1', 'col2']).mean()
Since your values are not numeric, maybe try the pivot method:
df.pivot(columns=['col1', 'col2'])
So I'd first try putting [] around your column names, then try the pivot.
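As a minimal sketch on made-up data (the hashes here are toy values, and I'm assuming the goal is one row per unique pair, as in the expected output above):
import pandas as pd

df = pd.DataFrame({
    'subscription_hash': ['hash_a', 'hash_a', 'hash_a', 'hash_b'],
    'event_name': ['AddToQueue', 'page_view', 'page_view', 'page_view'],
})
# Grouping columns go in as a list; size() counts rows per pair
counts = df.groupby(['subscription_hash', 'event_name']).size().reset_index(name='count')
# If only the unique pairs are wanted, drop_duplicates avoids groupby entirely
pairs = df[['subscription_hash', 'event_name']].drop_duplicates().reset_index(drop=True)
print(pairs)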

Related

Combining dummies and count for pandas dataframe

I have a pandas dataframe like this:
as plain text (semicolon-separated; total_stuff is related to id and sub_id):
id;sub_id;value;total_stuff
aaa;1;cat;10
aaa;1;cat;10
aaa;1;dog;10
aaa;2;cat;7
aaa;2;dog;7
aaa;3;cat;5
bbb;1;panda;20
bbb;1;cat;20
bbb;2;panda;12
The desired output I want is this.
Note that there are many different "values" possible, so I would need to automate the creation of the dummy variables (nb_animals). But these dummy variables must contain the number of occurrences by id and sub_id. The total_stuff is always the same value for a given id/sub_id combination.
I've tried using get_dummies(df, columns=['value']), which gave me this table.
using get_dummies
as plain text:
id;sub_id;value_cat;value_dog;value_panda;total_stuff
aaa;1;2;1;0;10
aaa;1;2;1;0;10
aaa;1;2;1;0;10
aaa;2;1;1;0;7
aaa;2;1;1;0;7
aaa;3;1;0;0;5
bbb;1;1;0;1;20
bbb;1;1;0;1;20
bbb;2;0;0;1;12
I'd love to use something like df.groupby(['id','sub_id']).agg({'value_cat':'sum', 'value_dog':'sum', ..., 'total_stuff':'mean'}), but writing out all of the possible animal values would be too tedious.
So how do I get a proper aggregated count/sum for the values, and the average for total_stuff (since total_stuff is unique per id/sub_id combination)?
Thanks
EDIT: Thanks chikich for the neat answer. The agg_dict is what I needed.
Use pd.get_dummies to transform the categorical data (columns must be passed as a list):
df = pd.get_dummies(df, prefix='nb', columns=['value'])
Then group by id and sub_id:
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
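For reference, a runnable end-to-end sketch of the same steps, with the sample data rebuilt from the plain-text dump in the question:
import pandas as pd

df = pd.DataFrame({'id': ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb'],
                   'sub_id': [1, 1, 1, 2, 2, 3, 1, 1, 2],
                   'value': ['cat', 'cat', 'dog', 'cat', 'dog', 'cat', 'panda', 'cat', 'panda'],
                   'total_stuff': [10, 10, 10, 7, 7, 5, 20, 20, 12]})

df = pd.get_dummies(df, prefix='nb', columns=['value'])  # creates nb_cat, nb_dog, nb_panda
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
print(df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index())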

I want to extract a dataframe that meets certain conditions using python, pandas

I load Excel data with the columns Time, Name, Good, Bad using python and pandas, and I want to reprocess the dataframe into another dataframe that meets certain conditions. In detail, I would like to print out a dataframe that stores the sum of the Good and Bad data for each Name over the entire time.
Please help me, anybody who knows python and pandas well.
First aggregate the sum by DataFrame.groupby, change the column names with DataFrame.add_prefix, add a new column with DataFrame.assign, and finally convert the index to a column with DataFrame.reset_index:
df = pd.DataFrame({
    'Name': list('aaabbb'),
    'Bad': [1, 3, 5, 7, 1, 0],
    'Good': [5, 3, 6, 9, 2, 4]
})

df1 = (df.groupby('Name')[['Good', 'Bad']]
         .sum()
         .add_prefix('Total_')
         .assign(Total_Count=lambda x: x.sum(axis=1))
         .reset_index())
print(df1)

  Name  Total_Good  Total_Bad  Total_Count
0    a          14          9           23
1    b          15          8           23
Use pandas NamedAgg with eval:
df.groupby('Name')[['Good', 'Bad']]\
  .agg(Total_Good=('Good', 'sum'),
       Total_Bad=('Bad', 'sum'))\
  .eval('Total_Count = Total_Good + Total_Bad')

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
                   'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
                   'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
                   'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
                j='grp').sort_index(level=0)
Though this works with the sample data, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID with values like DC0001, DC0002, etc. Does "i" always have to be numeric? Instead of reshaping, it adds the stub values as new columns in my dataset and returns zero rows.
This is how my real columns look.
My real data might contain NA's as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help as to what the issue might be? Any other approach that achieves the same result would also be helpful.
Try adding the additional argument to the function that allows string suffixes:
pd.wide_to_long(..., suffix=r'\w+')
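For instance, a sketch on the question's sample frame, with string IDs standing in for the real subject_ID values (i does not have to be numeric):
df2 = df.copy()
df2['person_id'] = ['DC0001', 'DC0002', 'DC0003']  # string keys work fine for i
print(pd.wide_to_long(df2, stubnames=['date', 'val'], i='person_id',
                      j='grp', suffix=r'\w+').sort_index(level=0))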
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of the three column groups
    old_cols = df.filter(regex=regex).columns
    # Create a list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stub name of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make the new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create a dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename the columns
    df.rename(columns=dd, inplace=True)
    return df

tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})

# Change date columns, then value columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)

   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
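From here the original call works once the stubnames match the renamed columns, e.g. (a sketch, same idea as the answer below):
df_long = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'],
                          i='person_id', j='grp').sort_index(level=0)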
This is quite late to answer this question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'],
                    't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'],
                    't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'],
                    't3val': [7, 9, 11]})

## You can use m13op22's solution to rename your columns with the numeric part
## at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})

## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake you were making was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'],
                     i='person_id', j='grp').sort_index(level=0)
print(df)

How to obtain the content of a pandas multilevel index entry?

I set up a pandas dataframe that, besides my data, stores the respective units with it, using a MultiIndex like this:

Name        Relative_Pressure Volume_STP
Unit                        -      ccm/g
Description              p/p0
0                    0.042691    29.3601
1                    0.078319    30.3071
2                    0.129529    31.1643
3                    0.183355    31.8513
4                    0.233435    32.3972
5                    0.280847    32.8724
Now I can, for example, extract only the Volume_STP data with df.Volume_STP:

Unit          ccm/g
Description
0           29.3601
1           30.3071
2           31.1643
3           31.8513
4           32.3972
5           32.8724
With .values I can obtain a numpy array of the data. But how can I get the stored unit? I can't figure out what I need to do to retrieve the stored ccm/g string.
EDIT: Added example how data frame is generated
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
from io import StringIO  # needed to feed the string to read_csv
import pandas as pd

def read_result(contents, columns, units, descr):
    df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,
                     index_col=False, header=None)
    df.drop(df.index[-1], inplace=True)
    index = pd.MultiIndex.from_arrays((columns, units, descr))
    df.columns = index
    df.columns.names = ['Name', 'Unit', 'Description']
    df = df.apply(pd.to_numeric)
    return df
like this
def isotherm(contents):
    columns = ['Relative_Pressure', 'Volume_STP']
    units = ['-', 'ccm/g']
    descr = ['p/p0', '']
    df = read_result(contents, columns, units, descr)
    return df
to generate the DataFrame at the beginning of my question.
As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0 because the dataframe contains only 1 Series.
So, you can extract the names that way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
At the end you extract the unit with .columns[0][0] and the description with .columns[0][1].
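Putting that together as a short sketch against the df built in the question:
vol = df.Volume_STP           # still a DataFrame; its columns are (Unit, Description) tuples
unit = vol.columns[0][0]      # 'ccm/g'
descr = vol.columns[0][1]     # ''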
You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
Slice the dataframe on 'Volume_STP' using xs, take the columns of that slice, remove the unused parts of the column headers, then get the values for the topmost remaining level, which is the Unit. Convert to a list and select the first value.
A generic way of accessing values on multi-index/columns is by using the index.get_level_values or columns.get_level_values functions of a data frame.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level columns, "Unit". If you have already selected a column, say "Volume_STP", then you have removed the top level, and in that case your units would be in the 0th level.
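For instance, a brief sketch against the question's frame:
df.columns.get_level_values('Unit')              # Index(['-', 'ccm/g'], ...)
df['Volume_STP'].columns.get_level_values(0)[0]  # 'ccm/g' (Unit is level 0 after selecting)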

Pandas: best way to add a total row that calculates the sum of specific (multiple) columns while preserving the data type

I am trying to create a row at the bottom of a dataframe to show the sum of certain columns. I am under the impression that this should be a really simple operation, but to my surprise none of the methods I found on SO works for me in one step.
The methods that I've found on SO:
df.loc['TOTAL'] = df.sum()
This doesn't work for me as long as there are non-numeric columns in the dataframe: I would need to select the numeric columns first and then concat the non-numeric columns back.
df.append(df.sum(numeric_only=True), ignore_index=True)
This won't preserve my data types. Integer columns will be converted to float.
df.loc['Total', 'ColumnA'] = df['ColumnA'].sum()
I can only use this to sum one column at a time.
I must have missed something in the process as this is not that hard an operation. Please let me know how I can add a sum row while preserving the data type of the dataframe.
Thanks.
Edit:
First off, sorry for the late update. I was on the road over the weekend.
Example:
df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'],
                         'Developable': [44490, 56261], 'Protected': [40355, 35943],
                         'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])
What I want to get would be this (please ignore the differences in the index).
It's a little tricky for me because I don't need the sum of the 'CountyID' column, since it's used for specific indexing. So the question is more about getting the sum of specific numeric columns.
Thanks again.
Here is some toy data to use as an example:
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [1, 2, 3], 'C': ['A', 'B', 'C']})
So that we can preserve the dtypes after the sum, we store them in d:
d = df.dtypes
Next, since we only want to sum numeric columns, pass numeric_only=True to sum(), following similar logic to your first attempt:
df.loc['Total'] = df.sum(numeric_only=True)
And finally, reset the dtypes of your DataFrame to their original values:
df.astype(d)

         A  B    C
0      1.0  1    A
1      2.0  2    B
2      3.0  3    C
Total  6.0  6  NaN
To select the numeric columns, you can do:
df_numeric = df.select_dtypes(include=['int64', 'float64'])
df_num_cols = df_numeric.columns
Then do what you did first (using what I found here):
df.loc['Total'] = pd.Series(df[df_num_cols].sum(), index=df_num_cols)
Note that this still upcasts integer columns to float, so you may want to restore the dtypes afterwards as in the other answer.
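Applied to the edited example, a sketch that sums only the acreage columns (CountyID and Acronym are deliberately left out, so they hold NaN in the Total row) and restores the dtypes of the summed columns:
d = df1.dtypes
sum_cols = ['Developable', 'Protected', 'Developed']
df1.loc['Total'] = df1[sum_cols].sum()
df1 = df1.astype({c: d[c] for c in sum_cols})  # summed columns back to int64
print(df1)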
