Let's say I have the following dataframes:
df1
name value
a 3
b 4
c 5
df2
name value
b 2
a 1
and I want to make a dataframe like this (there can be many value columns):
name value
a 4
b 6
c 5
Does anyone know how I would do this?
You can temporarily set "name" as the index:
df1.set_index('name').add(df2.set_index('name'), fill_value=0).reset_index()
Output:
name value
0 a 4.0
1 b 6.0
2 c 5.0
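Since add aligns on both the index and the columns, the same one-liner handles any number of value columns; a minimal sketch (the extra value2 column is an assumption for illustration):
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [3, 4, 5], 'value2': [1, 1, 1]})
df2 = pd.DataFrame({'name': ['b', 'a'], 'value': [2, 1], 'value2': [9, 9]})

# add aligns on the 'name' index and on every shared column;
# fill_value=0 treats names missing from one frame as zero
print(df1.set_index('name').add(df2.set_index('name'), fill_value=0).reset_index())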
Alternatively, reindex df2 against df1's names so that missing rows become 0 before adding:
df2.set_index("name").reindex(df1.name).fillna(0).astype(int) + df1.set_index("name")
Output:
value
name
a 4
b 6
c 5
I have a DataFrame that looks like this:
df = pd.DataFrame({'ID':['A','B','A','C','C'], 'value':[2,4,9,1,3.5]})
df
ID value
0 A 2.0
1 B 4.0
2 A 9.0
3 C 1.0
4 C 3.5
What I need to do is go through the ID column and, for each unique value, multiply the corresponding rows in the value column based on a reference that I have.
For example, if I have the following reference:
if A multiply by 10
if B multiply by 3
if C multiply by 2
Then the desired output would be:
df
ID value
0 A 2.0*10
1 B 4.0*3
2 A 9.0*10
3 C 1.0*2
4 C 3.5*2
Thanks in advance.
Use Series.map with the dictionary to build a Series of multipliers for the value column:
d = {'A':10, 'B':3,'C':2}
df['value'] = df['value'].mul(df['ID'].map(d))
print (df)
ID value
0 A 20.0
1 B 12.0
2 A 90.0
3 C 2.0
4 C 7.0
Detail:
print (df['ID'].map(d))
0 10
1 3
2 10
3 2
4 2
Name: ID, dtype: int64
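If an ID is missing from the dictionary, map returns NaN and the product becomes NaN. To leave such rows unchanged instead, one option is to fill the missing multipliers with 1 (a sketch under that assumption):
# IDs absent from d get multiplier 1, i.e. their value stays as-is
df['value'] = df['value'].mul(df['ID'].map(d).fillna(1))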
I have two pandas DataFrames of unequal sizes. For example:
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows that have the same id as in DF2. My first approach is to run two for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but for two files of 400,000 lines in one and 5,000 in the other, I need an efficient, Pythonic pandas way.
import pandas as pd

data1 = {'id': ['a', 'b', 'c', 'd'],
         'value': [2, 3, 22, 5]}
data2 = {'id': ['c', 'a'],
         'value': [22, 2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final output:
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, then keep only the rows whose id is duplicated (i.e. present in both frames), then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
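Since only membership in the id column matters here, the same filter can be written without set_index, using isin on the column directly (an equivalent sketch):
df = df1[df1['id'].isin(df2['id'])]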
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
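For comparison, the "in" logic the question set aside would give the same frame without apply (a sketch):
result = df2[df2['Name'].isin(df1['Name'])]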
I am having trouble analysing origin-destination values in a pandas dataframe that has origin and destination columns plus a count column with the frequency of each pair. I want to transform this into a dataframe with counts of how many are leaving and entering each place:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example, this simplified dataframe has 7 leaving from A to B and 1 from A to C, so overall leaving place A would be 8, while entering place A would be 4 (B→A is 1, C→A is 3), etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby() but have not yet produced my intended dataframe. How can I handle the repeated values in the origin/destination columns and aggregate just the entering and leaving counts into a new dataframe?
Thank you!
Use double groupby + concat:
a = df.groupby('Destination')['Count'].sum()
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a,b], axis=1, keys=('Entering','Leaving')).rename_axis('Place').reset_index()
print (df)
Place Entering Leaving
0 A 4 8
1 B 17 5
2 C 5 13
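If some place only appears as an origin (or only as a destination), the aligned concat leaves NaN in the other column; filling with 0 keeps the counts integral (a sketch under that assumption):
df = pd.concat([a, b], axis=1, keys=('Entering', 'Leaving')).fillna(0).astype(int).rename_axis('Place').reset_index()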
Use pivot_table, then sum along each axis; column sums give Entering and row sums give Leaving:
df = pd.pivot_table(df, index='Origin', columns='Destination', values='Count', aggfunc='sum')
pd.concat([df.sum(axis=0), df.sum(axis=1)], axis=1)
Output:
0 1
A 4.0 8.0
B 17.0 5.0
C 5.0 13.0
Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
You can use where which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
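The same result can also be written with mask, the inverse of where (an equivalent sketch):
df['C'] = df['B'].mask(df['B'].isnull(), df['A'])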
df['C'] = df['B'].fillna(df['A'])
.fillna fills the NaN values in a Series and accepts any value, including another Series. Here we pass df['A'], so the corresponding values of 'A' are put into the NaN slots of 'B', and the final answer ends up in 'C'.
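A minimal runnable version of this answer, rebuilding the frame from the question's values:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})
# each NaN in B is replaced by the value from A in the same row
df['C'] = df['B'].fillna(df['A'])
print(df)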