Sum up non-unique rows in DataFrame - python

I have a dataframe like this:
id = [1,1,2,3]
x1 = [0,1,1,2]
x2 = [2,3,1,1]
df = pd.DataFrame({'id':id, 'x1':x1, 'x2':x2})
df
id  x1  x2
 1   0   2
 1   1   3
 2   1   1
 3   2   1
Some rows have the same id. I want to sum up such rows (over x1 and x2) to obtain a new dataframe with unique ids:
df_new
id  x1  x2
 1   1   5
 2   1   1
 3   2   1
An important detail is that the real number of columns x1, x2,... is large, so I cannot apply a function that requires manual input of column names.

As discussed, you can use the pandas groupby function to sum rows that share an id:
df.groupby(df.id).sum()
# or
df.groupby('id').sum()
If you don't want id to become the index, you can do either of:
df.groupby('id').sum().reset_index()
# or
df.groupby('id', as_index=False).sum() # #John_Gait

With pivot_table:
In [31]: df.pivot_table(index='id', aggfunc=sum)
Out[31]:
    x1  x2
id
1    1   5
2    1   1
3    2   1
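Since the real data has many x-columns, note that neither approach requires naming them; a quick sketch (my own, with made-up data) on a wide frame:
import numpy as np
import pandas as pd

# 100 value columns plus an id column; no column names are typed out
wide = pd.DataFrame(np.random.randint(0, 5, size=(6, 100)),
                    columns=[f'x{i}' for i in range(100)])
wide['id'] = [1, 1, 2, 3, 3, 3]
df_new = wide.groupby('id', as_index=False).sum()  # sums every x-column at once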

Related

Iterate over columns of Pandas dataframe and create new variables

I am having trouble figuring out how to iterate over the variables in a pandas dataframe and perform the same arithmetic on each.
I have a dataframe df that contains three numeric variables x1, x2 and x3. I want to create three new variables by multiplying each by 2. Here's what I am doing:
existing = ['x1','x2','x3']
new = ['y1','y2','y3']
for i in existing:
    for j in new:
        df[j] = df[i]*2
The above code does create three new variables y1, y2 and y3 in the dataframe, but the values of y1 and y2 are overwritten by those of y3, so all three end up with the same values. I am not sure what I am missing.
I'd really appreciate any guidance or suggestions. Thanks.
You are looping 9 times here: once for every (existing, new) pair, with each assignment overwriting the previous one.
You may want something like:
for e, n in zip(existing, new):
    df[n] = df[e]*2
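For instance, a runnable sketch on a toy frame (my own data, not from the question):
import pandas as pd

df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'x3': [5, 6]})
existing = ['x1', 'x2', 'x3']
new = ['y1', 'y2', 'y3']
for e, n in zip(existing, new):
    df[n] = df[e] * 2   # pairs x1->y1, x2->y2, x3->y3: one assignment per pair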
I would do something more generic:
#existing = ['x1','x2','x3']
existing = list(df.columns)
new = [c.replace('x', 'y') for c in existing]
for e, n in zip(existing, new):
    df[n] = df[e]*2
#maybe there is a more elegant way using pandas' assign function
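For what it's worth, a sketch of that assign variant (assuming every column to double starts with 'x'):
import pandas as pd

df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'x3': [5, 6]})
# build all doubled columns in a single call; the y-names mirror the x-names
df = df.assign(**{c.replace('x', 'y'): df[c] * 2 for c in df.columns})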
You can concatenate the original DataFrame with the columns holding the doubled values:
cols_to_double = ['x0', 'x1', 'x2']
new_cols = list(df.columns) + [c.replace('x', 'y') for c in cols_to_double]
df = pd.concat([df, 2 * df[cols_to_double]], axis=1, copy=True)
df.columns = new_cols
So, if your input DataFrame df is:
   x0  x1  x2  other0  other1
0   0   1   2       3       4
1   0   1   2       3       4
2   0   1   2       3       4
3   0   1   2       3       4
4   0   1   2       3       4
after executing the previous lines, you get:
   x0  x1  x2  other0  other1  y0  y1  y2
0   0   1   2       3       4   0   2   4
1   0   1   2       3       4   0   2   4
2   0   1   2       3       4   0   2   4
3   0   1   2       3       4   0   2   4
4   0   1   2       3       4   0   2   4
Here is the code to create df:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=np.column_stack([np.full((5,), i) for i in range(5)]),
    columns=[f'x{i}' for i in range(3)] + [f'other{i}' for i in range(2)]
)

Return running count of values in a pandas df

I am trying to return a running count based on two columns in a pandas df.
For the df below, I'm trying to determine the count based on column 'Event' and column 'Who'.
import pandas as pd
import numpy as np
d = {
    'Event': ['A','B','E','','C','B','B','B','B','E','C','D'],
    'Space': ['X1','X1','X2','','X3','X3','X3','X4','X3','X2','X2','X1'],
    'Who':   ['Home','Home','Even','Out','Home','Away','Home','Away','Home','Even','Away','Home'],
}
d = pd.DataFrame(data=d)
I have tried the following.
df = d.groupby(['Event', 'Who'])['Space'].count().reset_index(name="count")
Which produces this:
  Event   Who  count
0          Out      1
1     A  Home      1
2     B  Away      2
3     B  Home      3
4     C  Away      1
5     C  Home      1
6     D  Home      1
7     E  Even      2
But I would like it to be a running count rather than a total count.
Can d.groupby(['Event', 'Who'])['Space'].count().reset_index(name="count") be amended to apply additional constraints, or will it have to be a mask function or similar?
So my intended Output is:
    A_Away  A_Home  B_Away  B_Home  C_Away  C_Home  D_Away  D_Home  E_Even Event Space   Who
0                1                                                            A     X1   Home
1                                                                             B     X1   Home
2                                                                             E     X2   Even
3                                                                                         Out
4                                                1                            C     X3   Home
5                        1                                                    B     X3   Away
6                                1                                            B     X3   Home
7                                                                             B     X4   Away
8                                2                                            B     X3   Home
9                                                                        2    E     X2   Even
10                                       1                                    C     X2   Away
11                                                               1            D     X1   Home
So the count gets added to the row as it occurs, not as a total count for the entire dataset.
Here are the steps needed to get to your result:
1. Prepare "Who" and "Event" as part of the index
2. Get a cumulative count for the groups using groupby and cumcount
3. Reshape your DataFrame to a tabular format using unstack
4. Fix the column headers
5. Concatenate this result with the original using pd.concat
df = d  # work on the DataFrame defined in the question
# set the index
v = df.set_index(['Who', 'Event'], append=True)['Space']
# assign `v` the values for the cumulative count
v[:] = df.groupby(['Event', 'Who']).cumcount().add(1)
# reshape `v`
v = v.unstack([1, 2], fill_value='')
# fix your headers
v.columns = v.columns.map('{0[1]}_{0[0]}'.format)
# drop the Out column and concatenate the result
pd.concat([v.loc[:, ~v.columns.str.contains('Out')], df], axis=1)
    A_Home B_Home E_Even C_Home B_Away C_Away D_Home Event Space   Who
0        1                                               A    X1  Home
1               1                                        B    X1  Home
2                      1                                 E    X2  Even
3                                                                  Out
4                             1                          C    X3  Home
5                                    1                   B    X3  Away
6               2                                        B    X3  Home
7                                    2                   B    X4  Away
8               3                                        B    X3  Home
9                      2                                 E    X2  Even
10                                          1            C    X2  Away
11                                                 1     D    X1  Home
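For comparison, a more compact sketch of the same idea (my own, not part of the original answer), using get_dummies plus cumsum on the d DataFrame from the question:
import pandas as pd

# label each row by its Event_Who pair; blank out the 'Out' rows
key = (d['Event'] + '_' + d['Who']).where(d['Who'] != 'Out')
dummies = pd.get_dummies(key)          # one indicator column per pair
counts = dummies.cumsum() * dummies    # running count at each occurrence
pd.concat([counts.replace(0, ''), d], axis=1)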

Filling missing values using means with group-by logic in Pandas

I have a dataframe in Python like this:
x1  x2    x3
 a   1  1000
 a   1  2390
 a   1     ?
 b   2   120
 b   2  2000
My goal is to fill in all the missing values in column x3. But if I use the standard approach (df.fillna(df.mean())) I won't get the desired result: I don't want the plain mean() of column x3, but the mean() of x3 over only the rows where x1 = a and x2 = 1. How can this be done in Pandas?
You can use groupby.transform() to fill missing values by group:
df['x3'] = df.groupby(["x1", "x2"])['x3'].transform(lambda x: x.fillna(x.mean()))
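Equivalently, a small variant (not from the original answer) that skips the lambda and fills only the missing entries:
df['x3'] = df['x3'].fillna(df.groupby(['x1', 'x2'])['x3'].transform('mean'))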
Using join and fillna:
c = ['x1', 'x2']
df.fillna(df[c].join(df.groupby(c).mean(), on=c))
  x1  x2      x3
0  a   1  1000.0
1  a   1  2390.0
2  a   1  1695.0
3  b   2   120.0
4  b   2  2000.0

Replace column values based on another dataframe python pandas - better way?

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values of another column.
df1:
Name  Nonprofit  Business  Education
X     1          1         0
Y     0          1         0   <- Y and Z have zero values for Nonprofit and Educ
Z     0          0         0
Y     0          1         0
df2:
Name  Nonprofit  Education
Y     1          1   <- this df has the correct values.
Z     1          1
pd.merge(df1, df2, on='Name', how='outer')
Name  Nonprofit_X  Business  Education_X  Nonprofit_Y  Education_Y
Y     1            1         1            1            1
Y     1            1         1            1            1
X     1            1         0            nan          nan
Z     1            1         1            1            1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name  Nonprofit  Business  Education
Y     1          1         1
Y     1          1         1
X     1          1         0
Z     1          0         1
(To clarify: the value in the 'Business' column where Name = Z should stay 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
Attention: in recent versions of pandas, neither of the answers below works any more.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
                    ["Y",0,1,0],
                    ["Z",0,0,0],
                    ["Y",0,0,0]], columns=["Name","Nonprofit","Business","Education"])
df2 = pd.DataFrame([["Y",1,1],
                    ["Z",1,1]], columns=["Name","Nonprofit","Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
  Name  Nonprofit  Business  Education
0    X        1.0         1        0.0
1    Y        1.0         1        1.0
2    Z        NaN         0        NaN
3    Y        NaN         1        NaN
It will work safely only if the values in column 'Name' are unique and sorted identically in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(["Nonprofit_x", "Education_x"], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
A note about update: the columns chosen as the index don't need to have the same name in both data frames before calling update; 'Name1' and 'Name2' work fine, as below. It also works even if df2 contains extra rows that aren't in df1; they simply don't update anything. In other words, df2 doesn't need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
                    ["Y",0,1,0],
                    ["Z",0,0,0],
                    ["Y",0,1,0]], columns=["Name1","Nonprofit","Business","Education"])
df2 = pd.DataFrame([["Y",1,1],
                    ["Z",1,1],
                    ["U",1,3]], columns=["Name2","Nonprofit","Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
       Nonprofit  Business  Education
Name1
X            1.0         1        0.0
Y            1.0         1        1.0
Z            1.0         0        1.0
Y            1.0         1        1.0
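One caveat: as the 1.0 values above show, update() upcasts the updated columns to float. If integer dtypes matter, a sketch of the fix (assuming no NaNs remain):
df1[['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].astype(int)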
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1
[4 rows x 4 columns]
This is the correct one:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df; in other words, df must be a superset of df1.
If some rows of df1 have no match in df (df is not a superset of df1), use the following instead:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
Or, aligning both frames on Name and preferring df2's values where they exist:
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()

Group by value of sum of columns with Pandas

I got lost in the Pandas docs and features trying to figure out a way to group the columns of a DataFrame by the value of each column's sum.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since their sums are all equal to 1. The resulting DataFrame's column labels would be the sums of the columns that were combined, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any ideas to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b and c) and 9 (column d). You then group the columns (axis=1) and take the sum of each group.
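Note that groupby(..., axis=1) is deprecated in recent pandas releases; a sketch of the same idea that avoids it (my adaptation, not the original answer) transposes instead:
result = df.T.groupby(df.sum()).sum().T  # group transposed rows by column sums, then transpose back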
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem that way:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(axis=1)
print(df.groupby('totals').sum().transpose())
#totals  1  9
#0       2  2
#1       1  3
#2       0  4
