Sum column in one dataframe based on row value of another dataframe - python

Say, I have one data frame df:
a b c d e
0 1 2 dd 5 Col1
1 2 3 ee 9 Col2
2 3 4 ff 1 Col4
There's another dataframe df2:
Col1 Col2 Col3
0 1 2 4
1 2 3 5
2 3 4 6
I need to add a column sum in the first dataframe, wherein it sums values of columns in the second dataframe df2, based on values of column e in df1.
Expected output
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col4 0
The Sum value in the last row is 0 because Col4 doesn't exist in df2.
What I tried: writing some lambdas with the apply function, but I wasn't able to get it to work.
I'd greatly appreciate the help. Thank you.

Try
df['Sum']=df.e.map(df2.sum()).fillna(0)
df
Out[89]:
a b c d e Sum
0 1 2 dd 5 Col1 6.0
1 2 3 ee 9 Col2 9.0
2 3 4 ff 1 Col4 0.0
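To spell out why this works: df2.sum() returns a Series indexed by column name, so map looks up each value of e in it, and names that are not columns of df2 come back as NaN. A minimal self-contained sketch, where the astype(int) cast is my addition to get back the integer values shown in the expected output:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1], 'e': ['Col1', 'Col2', 'Col4']})
df2 = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [2, 3, 4], 'Col3': [4, 5, 6]})

# Series of per-column totals, indexed by column name
col_sums = df2.sum()

# Look up each name in e; unknown names become NaN, then 0
df['Sum'] = df['e'].map(col_sums).fillna(0).astype(int)
print(df)
```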

Try this. The following solution sums all values for a particular column if present in df2 using apply method and returns 0 if no such column exists in df2.
df1.loc[:,"sum"]=df1.loc[:,"e"].apply(lambda x: df2.loc[:,x].sum() if(x in df2.columns) else 0)

Use .iterrows() to iterate through a data frame, pulling out the index and the values for each row.
A nested for-loop style of iteration can be used to grab the needed values from the second dataframe and apply them to the first:
import pandas as pd
df1 = pd.DataFrame(data={'a': [1,2,3], 'b': [2,3,4], 'c': ['dd', 'ee', 'ff'], 'd': [5,9,1], 'e': ['Col1','Col2','Col3']})
df2 = pd.DataFrame(data={'Col1': [1,2,3], 'Col2': [2,3,4], 'Col3': [4,5,6]})
df1['Sum'] = 0
for index, row in df1.iterrows():
    total = 0  # named total to avoid shadowing the built-in sum()
    for _, row2 in df2.iterrows():
        total += row2[row['e']]
    df1.loc[index, 'Sum'] = total  # .loc avoids chained assignment
Output:
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col3 15
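Note that the loop raises a KeyError if e contains a name that is not a column of df2 (such as Col4 in the question). A loop-free sketch that tolerates missing names, assuming the column names of df2 are unique, is to reindex the column sums by e with a fill value:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'],
                    'd': [5, 9, 1], 'e': ['Col1', 'Col2', 'Col4']})
df2 = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [2, 3, 4], 'Col3': [4, 5, 6]})

# Column totals, re-ordered to follow e; names missing from df2 get 0
sums = df2.sum().reindex(df1['e'], fill_value=0)

# to_numpy() drops the name-based index so the values are assigned by position
df1['Sum'] = sums.to_numpy()
print(df1)
```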

Related

How to enter the value of one index and column into a new cell with +1 in the iteration?

I have the following DataFrame named df1:
col1 col2 col3
5 3 50
10 4 3
2 0 1
I would like to create a loop that adds a new column called "Total", which takes the value of col1 at index 0 (5) and enters that value under the column "Total" at index 0. The next iteration will take col2 at index 1 (4), and that value will go under column "Total" at index 1. This continues until all columns and rows are completed.
The ideal output will be the following:
df1
col1 col2 col3 Total
5 3 50 5
10 4 3 4
2 0 1 1
I have the following code but I would like to find a more efficient way of doing this as I have a large DataFrame:
df1.iloc[0,3] = df1.iloc[0,0]
df1.iloc[1,3] = df1.iloc[1,1]
df1.iloc[2,3] = df1.iloc[2,2]
Thank you!
NumPy has a built-in diagonal function:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [5, 10, 2], 'col2': [3, 4, 0], 'col3': [50, 3, 1]})
df['Total'] = np.diag(df)
print(df)
Output
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
You can try apply on rows
df['Total'] = df.apply(lambda row: row.iloc[row.name], axis=1)
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
Hope this logic will help
length = len(df1["col1"])
total = pd.Series([df1.iloc[i, i%3] for i in range(length)])
# in i%3, 3 is number of cols(col1, col2, col3)
# add this total Series to df1
df1["Total"] = total
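If the frame has more rows than columns, np.diag only returns as many values as the shorter dimension, so the wrap-around idea above is needed. A vectorized sketch of it using NumPy integer indexing, where the fourth row is made-up data added only to show the wrap-around:

```python
import numpy as np
import pandas as pd

# Fourth row (7, 8, 9) is hypothetical, added to show the wrap-around
df1 = pd.DataFrame({'col1': [5, 10, 2, 7],
                    'col2': [3, 4, 0, 8],
                    'col3': [50, 3, 1, 9]})

values = df1.to_numpy()          # snapshot taken before 'Total' is added
rows = np.arange(len(df1))       # 0, 1, 2, 3
cols = rows % values.shape[1]    # 0, 1, 2, 0 (wraps back to col1)

# Pick one element per row: (0, 0), (1, 1), (2, 2), (3, 0)
df1['Total'] = values[rows, cols]
print(df1)
```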

iloc[] by value columns

I want to use iloc with value in column.
df1 = pd.DataFrame({'col1': ['1' ,'1','1','2','2','2','2','2','3' ,'3','3'],
'col2': ['A' ,'B','C','D','E','F','G','H','I' ,'J','K']})
I want to select the row at position 2 within each group of col1 values, as a data frame, so that the result looks like
col1 col2
1 C
2 F
3 K
Thank you so much
Use GroupBy.nth:
df2 = df1.groupby('col1', as_index=False).nth(2)
Alternative with GroupBy.cumcount:
df2 = df1[df1.groupby('col1').cumcount().eq(2)]
print (df2)
col1 col2
2 1 C
5 2 F
10 3 K
Use GroupBy.nth with as_index=False:
df1.groupby('col1', as_index=False).nth(2)
output:
col1 col2
2 1 C
5 2 F
10 3 K
df1.groupby('col1').agg(lambda ss:ss.iloc[2])
col2
col1
1 C
2 F
3 K
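One difference between these approaches shows up when a group has fewer than three rows: positional iloc[2] is out of bounds for such a group, while the cumcount filter simply returns no row for it. A small sketch, where the extra group '4' with a single row is an assumption added for illustration:

```python
import pandas as pd

# Group '4' is hypothetical: it has only one row, so it has no position 2
df1 = pd.DataFrame({'col1': ['1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '4'],
                    'col2': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']})

# cumcount numbers the rows inside each group as 0, 1, 2, ...
position_in_group = df1.groupby('col1').cumcount()

# Keep the rows sitting at position 2; short groups contribute nothing
df2 = df1[position_in_group == 2]
print(df2)
```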

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group, grouping by "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me the rows I want to delete, but when I try df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this is to keep all single-row groups
]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1,:]).reset_index(drop=True)
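Another option that keeps the original index labels, as in the expected output, is to count rows from the end of each group and drop the ones counted as 0. This is only a sketch, and like the drop approach it also removes groups that have a single row:

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e", "f", "g", "h"],
                   'col2': [1, 1, 1, 2, 2, 3, 3, 3]})

# With ascending=False the last row of each group is numbered 0
from_end = df.groupby('col2').cumcount(ascending=False)

# Keep everything except the last row of each group
result = df[from_end > 0]
print(result)
```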

Create new Dataframe from matching two dataframe indexes

I'm looking to create a new dataframe from data in two separate dataframes, effectively matching each cell by its position and putting the pairs into a two-column dataframe. My real datasets have the exact same number of rows and columns, FWIW. Example below:
DF1:
Col1 Col2 Col3
1 2 3
3 8 7
DF2:
Col1 Col2 Col3
A B E
R S W
Desired Dataframe:
Col1 Col2
1 A
2 B
3 E
3 R
8 S
7 W
Thank you for your help!
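Here is one way. Note that ravel() must use its default row-major order to match the desired output; ravel('F') would read the values column by column:
df3 = pd.Series(df1.values.ravel())
df4 = pd.Series(df2.values.ravel())
df = pd.concat([df3, df4], axis=1, keys=['Col1', 'Col2'])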
Use DataFrame.to_numpy and .flatten:
df = pd.DataFrame(
{'Col1': df1.to_numpy().flatten(), 'Col2': df2.to_numpy().flatten()})
# print(df)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
You can do it easily like so:
list1 = df1.values.tolist()
list1 = [item for sublist in list1 for item in sublist]
list2 = df2.values.tolist()
list2 = [item for sublist in list2 for item in sublist]
df = {
    'Col1': list1,
    'Col2': list2
}
df = pd.DataFrame(df)
print(df)
Hope this helps :)
pd.concat(map(lambda x: x.unstack().sort_index(level=-1), (df1, df2)), axis=1).reset_index(drop=True).rename(columns=['Col1', 'Col2'].__getitem__)
Result:
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
Another way (alternative):
pd.concat((df1.stack(),df2.stack()),axis=1,keys=['Col1','Col2']).reset_index(drop=True)
or:
d = {'Col1':df1,'Col2':df2}
pd.concat((v.stack() for k,v in d.items()),axis=1,keys=d.keys()).reset_index(drop=True)
#or pd.concat((d.values()),keys=d.keys()).stack().unstack(0).reset_index(drop=True)
Col1 Col2
0 1 A
1 2 B
2 3 E
3 3 R
4 8 S
5 7 W
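All of these rely on reading both frames in the same row-by-row order, so that the cells pair up by position. A compact sketch that rebuilds the example frames and also shows what column-by-column order would give instead:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': [1, 3], 'Col2': [2, 8], 'Col3': [3, 7]})
df2 = pd.DataFrame({'Col1': ['A', 'R'], 'Col2': ['B', 'S'], 'Col3': ['E', 'W']})

# Default ravel() order is row-major: first row left to right, then the next row
out = pd.DataFrame({'Col1': df1.to_numpy().ravel(),
                    'Col2': df2.to_numpy().ravel()})
print(out)

# For comparison only: order='F' walks down each column first
column_major = df1.to_numpy().ravel(order='F').tolist()
print(column_major)
```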

Sort and align 2 dataframes by values in corresponding columns

I have 2 dataframes that I want to sort that are similar in structure to what I have shown below, but the rows of values when looking at only the first 3 columns are jumbled. How do I sort the dataframes such that the row indices match?
Also, there may not always be a matching row, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?:
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort by the columns that should determine order on both data frames & reset index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
'Col1': list('abfh'),
'Col2': list('bceg'),
'Col3': list('cdgi'),
'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join from df1 to add a blank row to df2, where each column is NaN at index 3.
If you have sorted both dataframes already, you can merge using the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
otherwise, merge on the columns that *should* determine the sort order, this will create a new dataframe with joined values, sorted in the same way df1 is sorted
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left data frame (the output below corresponds to the index-based merge, with df2 in its original order):
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
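For the key-based variant, a self-contained sketch may make the alignment easier to see. It merges only the key columns of df1 against df2, so no suffixes are needed; the indicator=True flag is my addition and just marks which rows of df1 found a match:

```python
import pandas as pd

cols = ['Col1', 'Col2', 'Col3']
df1 = pd.DataFrame({'Col1': list('abfh'), 'Col2': list('bceg'),
                    'Col3': list('cdgi'), 'Col4': [1, 4, 5, 7]})
df2 = pd.DataFrame({'Col1': list('fab'), 'Col2': list('ebc'),
                    'Col3': list('gcd'), 'Col4': [6, 5, 3]})

# Left join on the key columns: the result follows the row order of df1,
# and rows with no partner in df2 get NaN in Col4
aligned = df1[cols].merge(df2, how='left', on=cols, indicator=True)
print(aligned)
```

Here the keys of the unmatched row (h, g, i) are kept and only the value column is blank, which differs from the index-based merge where the whole row is NaN.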
