I have two pandas DataFrames. I want to get the sum of items_bought for each ID in DF1, then add a column to DF2 containing the sum of items_bought calculated from DF1 for the matching ID, or 0 where there is no match. How can I do this in an elegant and efficient manner?
DF1
ID | items_bought
1 5
3 8
2 2
3 5
4 6
2 2
DF2
ID
1
2
8
3
2
Desired result: DF2 becomes
ID | items_bought
1 5
2 4
8 0
3 13
2 4
df1.groupby('ID').sum().reindex(df2.ID).fillna(0).astype(int)
Out[104]:
items_bought
ID
1 5
2 4
8 0
3 13
2 4
Work on df1 to calculate the sum for each ID.
The resulting dataframe is now indexed by ID, so you can select the rows for df2's IDs by calling reindex (in modern pandas, loc raises a KeyError when any requested label is missing from the index, so reindex is the safe choice here).
Fill the gaps with fillna.
NaNs force the column to float; now that they are filled, convert the column back to integer.
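For reference, a minimal self-contained version of these steps, using the frames from the question:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3, 2, 3, 4, 2],
                    'items_bought': [5, 8, 2, 5, 6, 2]})
df2 = pd.DataFrame({'ID': [1, 2, 8, 3, 2]})

# sum per ID, align on df2's IDs (missing ID 8 becomes NaN), fill with 0
result = df1.groupby('ID').sum().reindex(df2.ID).fillna(0).astype(int)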
A solution with groupby and sum, then reindex with fill_value=0, and finally reset_index:
df2 = df1.groupby('ID').items_bought.sum().reindex(df2.ID, fill_value=0).reset_index()
print (df2)
ID items_bought
0 1 5
1 2 4
2 8 0
3 3 13
4 2 4
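An equivalent idiom writes the sums straight into df2: Series.map accepts a Series and looks values up by index (a sketch using the same frames):

df2['items_bought'] = (df2['ID']
                       .map(df1.groupby('ID')['items_bought'].sum())
                       .fillna(0)
                       .astype(int))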
Related
I wish to update column A in dfA with column B from dfB where the indexes overlap.
dfA
A
1 1
2 1
3 1
4 1
dfB
B
3 2
4 2
5 2
6 2
Desired result:
A
1 1
2 1
3 2
4 2
What is the simplest way to do this?
Try with update:
dfA.update(dfB.rename(columns={'B': 'A'}))
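For reference, a self-contained sketch with the frames from the question. update aligns on the index and overwrites only where labels overlap, so rows 5 and 6 of dfB are ignored:

import pandas as pd

dfA = pd.DataFrame({'A': [1, 1, 1, 1]}, index=[1, 2, 3, 4])
dfB = pd.DataFrame({'B': [2, 2, 2, 2]}, index=[3, 4, 5, 6])

# update works in place; rename first so the column labels match
dfA.update(dfB.rename(columns={'B': 'A'}))
# note: update may upcast the column to float, since the alignment step
# introduces NaN for the non-overlapping index labels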
I have two dataframes (df1 and df2). They have similar columns, but one of them might have one or two columns missing, because they come from scraping a website that does not always return the same information.
Let's say:
df1

Index | Column A | Column B | Column C
0     | 1        | 3        | 5
1     | 2        | 4        | 6
df2

Index | Column A | Column C
0     | 7        | 9
1     | 8        | 10
My expected result would be:

Index | Column A | Column B | Column C
0     | 1        | 3        | 5
1     | 2        | 4        | 6
2     | 7        | NaN      | 9
3     | 8        | NaN      | 10
How can I do this?
>>> pd.concat([df1, df2], axis=0).reset_index(drop=True)
Column A Column B Column C
0 1 3.0 5
1 2 4.0 6
2 7 NaN 9
3 8 NaN 10
Edit: pd.concat can fail on data with duplicate indices. If your original dataframes contain duplicate indices, try this:
pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=0).reset_index(drop=True)
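For reference, a self-contained version with the frames from the question:

import pandas as pd

df1 = pd.DataFrame({'Column A': [1, 2], 'Column B': [3, 4], 'Column C': [5, 6]})
df2 = pd.DataFrame({'Column A': [7, 8], 'Column C': [9, 10]})

# concat aligns on the union of columns; 'Column B' is missing from df2,
# so its values become NaN (which also upcasts that column to float)
out = pd.concat([df1, df2], axis=0).reset_index(drop=True)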
My first question here, I hope this is understandable.
I have a pandas DataFrame:
   order_numbers  x_closest_autobahn
0             34                   3
1             11                   3
2              5                   3
3              8                  12
4              2                  12
I would like to get a new column with the order number per x_closest_autobahn, counted in ascending order of order_numbers:
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values on both columns, and for the counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2
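If you would rather keep the original row order than sort the frame, the same counter can be computed with a per-group rank (a sketch; GroupBy.rank with method='first' numbers tied rows in order of appearance):

# rank order_numbers ascending within each x_closest_autobahn group
df['order_number_autobahn_x'] = (df.groupby('x_closest_autobahn')['order_numbers']
                                   .rank(method='first')
                                   .astype(int))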
I have 3 DataFrames, all with over 100 rows and 1000 columns. I am trying to combine them into one DataFrame in such a way that common columns from each DataFrame are summed up. I know there is a method pd.DataFrame.sum(), but remember, I have over 1000 columns and I cannot add each common column manually. I am attaching sample DataFrames and the result I want. Help will be appreciated.
#Sample DataFrames.
df_1 = pd.DataFrame({'a':[1,2,3],'b':[2,1,0],'c':[1,3,5]})
df_2 = pd.DataFrame({'a':[1,1,0],'b':[2,1,4],'c':[1,0,2],'d':[2,2,2]})
df_3 = pd.DataFrame({'a':[1,2,3],'c':[1,3,5], 'x':[2,3,4]})
#Result.
df_total = pd.DataFrame({'a':[3,5,6],'b':[4,2,4],'c':[3,6,12],'d':[2,2,2], 'x':[2,3,4]})
df_total
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Let us do pd.concat, then sum:
out = pd.concat([df_1,df_2,df_3],axis=1).sum(level=0,axis=1)
Out[7]:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
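Note that DataFrame.sum(level=...) was removed in pandas 2.0; in current pandas the same label-wise sum can be written by grouping on the column level (a sketch of the equivalent call):

# transpose so the duplicate column labels become index labels,
# group and sum them, then transpose back
out = pd.concat([df_1, df_2, df_3], axis=1).T.groupby(level=0).sum().T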
You can add with fill_value=0:
df_1.add(df_2, fill_value=0).add(df_3, fill_value=0).astype(int)
Output:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Note: pandas intrinsically aligns most operations along indexes (index and column headers).
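If the number of frames is not fixed, the same add pattern extends with functools.reduce (a sketch, assuming the frames are gathered in a list):

from functools import reduce

frames = [df_1, df_2, df_3]
# fold add(fill_value=0) over the list; fill_value treats a column
# missing from one operand as 0 instead of producing NaN
df_total = reduce(lambda x, y: x.add(y, fill_value=0), frames).astype(int)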
I want to filter a dataframe based on two conditions on two different columns. In the example below, I want to filter df so that it keeps only the uids that have at least two rows with val greater than or equal to 4.
df = pd.DataFrame({'uid':[1,1,1,2,2,3,3,4,4,4],'iid':[11,12,13,12,13,13,14,14,11,12], 'val':[3,4,5,3,5,4,5,4,3,4]})
For this dataframe, my output should be
df
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
Here, I filtered out uid 2 because the number of rows with uid == 2 and val >= 4 is less than 2. I want to keep only the rows of uids for which the number of val values greater than or equal to 4 is at least 2.
You need groupby.transform with 'sum': first check where val is greater than or equal to (ge) 4, then sum those True values per uid and check that the result is ge 2, to use it as a boolean filter on df.
print (df[df['val'].ge(4).groupby(df['uid']).transform('sum').ge(2)])
uid iid val
0 1 11 3
1 1 12 4
2 1 13 5
5 3 13 4
6 3 14 5
7 4 14 4
8 4 11 3
9 4 12 4
EDIT: Another way that avoids groupby.transform is to loc the rows where val is ge 4, select the uid column, call value_counts on it, and get True where the count is ge 2. Then map the result back onto the uid column to create the boolean filter on df. Same result, and potentially faster.
df[df['uid'].map(df.loc[df['val'].ge(4), 'uid'].value_counts().ge(2))]
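A quick check that both filters select the same rows, using the sample frame from the question. One caveat: if some uid had no val >= 4 at all, map would produce NaN for its rows, and the mask would need .fillna(False) before indexing:

import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
                   'iid': [11, 12, 13, 12, 13, 13, 14, 14, 11, 12],
                   'val': [3, 4, 5, 3, 5, 4, 5, 4, 3, 4]})

mask_transform = df['val'].ge(4).groupby(df['uid']).transform('sum').ge(2)
mask_map = df['uid'].map(df.loc[df['val'].ge(4), 'uid'].value_counts().ge(2))
assert df[mask_transform].equals(df[mask_map])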