I am trying to do the following in pandas:
I have 2 DataFrames, both of which have a number of columns.
DataFrame 1 has a column A, that is of interest for my task;
DataFrame 2 has columns B and C, that are of interest.
What needs to be done: go through the values in column A and check whether the same value exists somewhere in column B. If it does, create a column D in DataFrame 1 and fill its respective cell with the value from C that sits on the same row as the matching value in B.
If the value from A does not exist in B, fill the cell in D with a zero.
for i in range(len(df1)):
    if df1['A'].iloc[i] in df2.B.values:
        df1['D'].iloc[i] = df2['C'].iloc[i]
    else:
        df1['D'].iloc[i] = 0
This gives me an error: KeyError: 'D'. If I create column D in advance and fill it, for example, with 0's, then I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. How can I solve this? Or is there a better way to accomplish what I'm trying to do?
Thank you so much for your help!
If I understand correctly:
Given these 2 dataframes:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame({'A': np.random.choice(list('abce'), 10)})
df2 = pd.DataFrame({'B': list('abcd'), 'C': np.random.randn(4)})
>>> df1
A
0 c
1 e
2 a
3 c
4 c
5 e
6 a
7 a
8 c
9 b
>>> df2
B C
0 a 0.279041
1 b 1.010515
2 c -0.580878
3 d -0.525170
You can achieve what you want using a merge:
new_df = df1.merge(df2, left_on='A', right_on='B', how='left').fillna(0)[['A','C']]
And then just rename the columns:
new_df.columns=['A', 'D']
>>> new_df
A D
0 c -0.580878
1 e 0.000000
2 a 0.279041
3 c -0.580878
4 c -0.580878
5 e 0.000000
6 a 0.279041
7 a 0.279041
8 c -0.580878
9 b 1.010515
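An equivalent one-liner, not in the original answer, is to build a B-to-C lookup from df2 and map it over column A; this is a sketch using the same frames as above:
# Index df2 by B so map() can look up each value of A,
# then fill the non-matches with 0.
df1['D'] = df1['A'].map(df2.set_index('B')['C']).fillna(0)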
I've had trouble finding a concise way to append a series to each row of a dataframe, with the series labels becoming new columns in the df. All the values will be the same in each of the dataframe's rows, which is desired.
I can get the effect by doing the following:
df["new_col_A"] = ser["new_col_A"]
.....
df["new_col_Z"] = ser["new_col_Z"]
But this is so tedious there must be a better way, right?
Given:
# df
A B
0 1 2
1 1 3
2 4 6
# ser
C a
D b
dtype: object
Doing:
df[ser.index] = ser
print(df)
Output:
A B C D
0 1 2 a b
1 1 3 a b
2 4 6 a b
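For a self-contained check, here is a minimal runnable sketch; the frame and series are made up to match the example above:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6]})
ser = pd.Series({'C': 'a', 'D': 'b'})

# Each series label becomes a column; each value is broadcast down the rows.
df[ser.index] = ser
print(df)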
Let's say dataframe 1, df1, looks like the following:
A B C
1 2 a
3 4 c
3 4 e
And I want to create a column D, but only if a value in df1's column C matches one of the comma-separated values in df2's column C, where df2 looks like the following:
A B C D E
1 2 a,d 4 5
2 3 d,c 3 6
3 4 f,e,j 7 2
If df1['C'] == df2['C'], return the corresponding value in df2['D']
So the result I would want in df1 new column D is
A B C D
1 2 a 4
3 4 c 3
3 4 e 7
As you can see, df2['C'] can hold multiple values in one cell; as long as df1['C'] matches one of them, the condition is fulfilled and the new column D should be populated.
I have tried df1['D'] = np.where(df1['C']==df2['C'], df2['D']), but it did not work.
Your assistance is much appreciated, thank you.
You can use df2.C.str.split(',') to turn each value in column C into a Python list, then use zip to pair up the rows of the two dataframes. The list comprehension uses s1 in s2 to build a list of True/False values, which can then be used on df2.D to populate the new D column in df1.
contain = [s1 in s2 for s1, s2 in zip(df1.C, df2.C.str.split(','))]
df1['D'] = df2.D[contain]
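Here is the same idea as a self-contained sketch, reproducing only the relevant columns of the sample frames and guarding the rows where nothing matches (NaN here; substitute any default you prefer). Note that zip pairs the frames row by row, so this assumes both have the same length and aligned row order:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 3], 'B': [2, 4, 4], 'C': ['a', 'c', 'e']})
df2 = pd.DataFrame({'C': ['a,d', 'd,c', 'f,e,j'], 'D': [4, 3, 7]})

# Row-wise membership test: is df1's value contained in df2's comma list?
contain = [s1 in s2 for s1, s2 in zip(df1.C, df2.C.str.split(','))]
df1['D'] = df2.D.where(contain)  # non-matching rows become NaN
print(df1)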
I have a sort of 'master' dataframe to which I'd like to append only the matching columns from another dataframe.
df:
A B C
1 2 3
df_to_append:
A B C D E
6 7 8 9 0
The problem is that when I use df.append(), it also appends the unmatched columns to df.
df = df.append(df_to_append, ignore_index=True)
Out:
A B C D E
1 2 3 NaN NaN
6 7 8 9 0
But I want to drop columns D and E, since they are not part of the original dataframe. Perhaps I need to use pd.concat? I don't think I can use pd.merge since I don't have anything unique to merge on.
Using concat with join='inner':
pd.concat([df,df_to_append],join='inner')
Out[162]:
A B C
0 1 2 3
0 6 7 8
Just select the columns common to both dfs:
df.append(df_to_append[df.columns], ignore_index=True)
The simplest way would be to get the list of columns common to both dataframes using df.columns, but if you don't know that all of the original columns are included in df_to_append, then you need to find the intersection of the two sets:
cols = list(set(df.columns) & set(df_to_append.columns))
df.append(df_to_append[cols], ignore_index=True)
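Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same column-intersection approach works with pd.concat:
# Keep only the columns the two frames share, then stack the rows.
cols = df.columns.intersection(df_to_append.columns)
df = pd.concat([df, df_to_append[cols]], ignore_index=True)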
I am having trouble analysing origin-destination values in a pandas dataframe that contains origin and destination columns plus a count column holding the frequency of each pair. I want to transform this into a dataframe with the count of how many are leaving and entering each place:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example, this simplified dataframe has 7 leaving from A to B and 1 from A to C, so overall leaving place A would be 8, and entering place A would be 4 (B→A is 1, C→A is 3), etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby(), but have not yet produced my intended dataframe. How can I handle the repeated values in the origin/destination columns and aggregate just the leaving and entering counts into a new dataframe?
Thank you!
Use double groupby + concat:
a = df.groupby('Destination')['Count'].sum()
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a,b], axis=1, keys=('Entering','Leaving')).rename_axis('Place').reset_index()
print (df)
Place Entering Leaving
0 A 4 8
1 B 17 5
2 C 5 13
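One caveat, not covered in the original answer: if a place appears only as an origin or only as a destination, the concat leaves NaN in one of the columns; a fillna step keeps the counts clean:
df = (pd.concat([a, b], axis=1, keys=('Entering', 'Leaving'))
        .fillna(0).astype(int)                 # places missing on one side get 0
        .rename_axis('Place').reset_index())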
Use pivot_table, then sum along each axis:
df = pd.pivot_table(df, index='Origin', columns='Destination', values='Count', aggfunc='sum')
pd.concat([df.sum(0), df.sum(1)], axis=1)
Out[428]:
0 1
A 4.0 8.0
B 17.0 5.0
C 5.0 13.0
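If you want the same labelled output as the groupby answer, pass keys to the concat and tidy up (a small variation on the code above, not in the original answer):
out = pd.concat([df.sum(0), df.sum(1)], axis=1, keys=['Entering', 'Leaving'])
out = out.fillna(0).astype(int).rename_axis('Place').reset_index()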
I have a table in a pandas df:
id_x id_y
a b
b c
c d
d a
b a
and so on, for around 1,000 rows.
I want to find the count of combinations for each id_x with id_y, treating a-b and b-a as the same combination.
i.e. a appears in the combinations a-b and d-a (2 combinations in total);
similarly, b has 2 in total: b-c, plus a-b, which also counts as a combination for b (a-b = b-a).
I then want to create a dataframe df2 which has:
id combinations
a 2
b 2
c 2 #(c-d and b-c)
d 1
and so on (one row per distinct id).
I tried this code:
df.groupby(['id_x']).size().reset_index()
but I get the wrong result:
id_x 0
0 a 1
1 b 1
2 c 1
3 d 1
What approach should I follow?
My Python skills are at a beginner level.
Thanks in advance.
You can first sort each row by applying sorted, drop duplicate pairs with drop_duplicates, then create a Series with stack and count occurrences with value_counts (result_type='expand' keeps apply's output a DataFrame, so drop_duplicates works on modern pandas):
df = df.apply(sorted, axis=1, result_type='expand').drop_duplicates().stack().value_counts()
print (df)
d 2
a 2
b 2
c 2
dtype: int64
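For a larger frame, you can avoid the row-wise apply by sorting the two columns with NumPy first. This is a sketch starting again from the original df, assuming both columns hold comparable values such as strings:
import numpy as np
import pandas as pd

# Sort each pair so (b, a) and (a, b) collapse to the same combination,
# then drop duplicate pairs and count how often each id appears.
pairs = pd.DataFrame(np.sort(df[['id_x', 'id_y']].to_numpy(), axis=1))
counts = pairs.drop_duplicates().stack().value_counts()
print(counts)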