Concatenate Pandas dataframes with different set of columns - python

df1 has columns A, B, C, D, E
df2 has columns A, B, D
How can I concatenate them so that the resulting dataframe has the rows of both df1 and df2, with columns A, B and D filled from both dataframes, and columns C and E filled with NaN for the rows that come from df2, since df2 has no data for them?

There is a function called concat
pd.concat([df1,df2])
The input must be an iterable, so put the dataframes into a list ;)
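To make the NaN filling concrete, here is a minimal sketch. The sample values are made up for illustration; only the column layout comes from the question. `ignore_index=True` is optional and simply renumbers the rows instead of repeating the original index labels.

```python
import pandas as pd

# Hypothetical sample frames matching the column layout in the question
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8], "E": [9, 10]})
df2 = pd.DataFrame({"A": [11], "B": [12], "D": [13]})

# Rows of df2 are appended below df1; df2 has no C or E, so those cells become NaN
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```

Because NaN is a float, integer columns such as C and E will typically be upcast to float in the result.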

Related

pandas merge by excluding certain columns from merge

I want to merge two dataframes like:
df1.columns = A, B, C, E, ..., D
df2.columns = A, B, C, F, ..., D
If I merge them, it merges on all columns. Also since the number of columns is high I don't want to specify them in on. I prefer to exclude the columns which I don't want to be merged. How can I do that?
mdf = pd.merge(df1, df2, exclude D)
I expect the result be like:
mdf.columns = A, B, C, E, F ..., D_x, D_y
You mentioned you don't want to use on "since the number of columns is high".
You can still use on even if there are a lot of columns, by building the list programmatically:
mdf = pd.merge(df1, df2, on=[i for i in df1.columns if i != 'D'])
Or, using pd.Index.difference:
mdf = pd.merge(df1, df2, on=df1.columns.difference(['D']).tolist())
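Another solution is Index.drop, which returns a new Index without D (calling list.remove inline would not work here, because it modifies the list in place and returns None):
mdf = pd.merge(df1, df2, on=df1.columns.drop('D').tolist())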
What about dropping the unwanted column after the merge?
You can use pandas.DataFrame.drop:
mdf = pd.merge(df1, df2).drop('D', axis=1)
or dropping before the merge:
mdf = pd.merge(df1.drop('D', axis=1), df2.drop('D', axis=1))
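One solution is to take the intersection of the df1 and df2 columns and then remove D with difference. This also works when each dataframe has columns the other lacks (such as E and F above):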
mdf = pd.merge(df1, df2, on=df1.columns.intersection(df2.columns).difference(['D']).tolist())
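A small sketch of the intersection approach, assuming made-up sample data in which A, B and C are shared, E and F each exist in only one frame, and D is the column to keep out of the join keys:

```python
import pandas as pd

# Hypothetical frames: A, B, C are shared keys, E and F are unique to one side,
# and D exists in both but must not be used as a join key
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "E": [7, 8], "D": [9, 10]})
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "F": [70, 80], "D": [90, 100]})

# Columns present in both frames, minus the one to exclude
keys = df1.columns.intersection(df2.columns).difference(["D"]).tolist()

# D is not a key, so merge keeps both copies and applies the default _x/_y suffixes
mdf = pd.merge(df1, df2, on=keys)
print(keys)
print(mdf.columns.tolist())
```

Since D appears in both inputs but is not part of `on`, the merged frame carries it twice as D_x and D_y, which matches the result the question asks for.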
The other solution could be renaming columns you want to exclude from merge:
df2.rename(columns={"D":"D_y"}, inplace=True)
mdf = pd.merge(df1, df2)

How to drop dataframe rows not in another dataframe?

I have a:
Dataframe df1 with columns A, B and C. A is the index.
Dataframe df2 with columns D, E and F. D is the index.
What’s an efficient way to drop from df1 all rows where B is not found in df2 (in D the index)?
Dropping the rows whose values do not exist in df2 is the same as selecting only the rows whose values do exist.
So you can filter df1.B against the index of df2 with Series.isin:
df3 = df1[df1.B.isin(df2.index)]
Or use DataFrame.merge with an inner join (a left join would keep every row of df1, so nothing would be dropped):
df3 = df1.merge(df2[[]], left_on='B', right_index=True, how='inner')
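The isin filter can be checked on a tiny example. The values below are invented for illustration; only the layout (A as the index of df1, D as the index of df2) follows the question.

```python
import pandas as pd

# Hypothetical data: A is the index of df1, D is the index of df2
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]}).set_index("A")
df2 = pd.DataFrame({"D": ["x", "z"], "E": [0.1, 0.2], "F": [True, False]}).set_index("D")

# Boolean mask: True where the B value appears in df2's index
mask = df1["B"].isin(df2.index)

# Keep only the matching rows; df1's own index (A) is preserved
df3 = df1[mask]
print(df3)
```

The row with B equal to "y" is dropped because "y" is not a label in df2's index.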

Is there a way to merge 2 rows of a df into 1?

I have a df that has plenty of row pairs that need to be condensed into 1. Column B identifies the pairs. All column values except one are identical. Is there a way to accomplish this in pandas?
Existing df:
A B C D E
x c v 2 w
x c v 2 r
Desired Output:
A B C D E
x c v 2 w,r
It's a little bit unintuitive to read but works:
df2 = (
    df.groupby('B', as_index=False)
      .agg({**dict.fromkeys(df.columns, 'first'), 'E': ','.join})
)
Here we group by column B and ask for the first value of every column within each group, then override the aggregation for column E so that the E values of rows sharing the same B are joined with a comma.
Hence you get:
A B C D E
0 x c v 2 w,r
This makes no assumptions about data types and leaves non-string columns alone, but it will of course raise an error if your E column contains non-string values (or types that cannot be joined).
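For reference, a runnable variation on the same idea, using the two rows from the question. This is a sketch, not the answer's exact code: it leaves the grouping column B out of the aggregation map and restores it with reset_index, and the name `agg_map` is made up for illustration.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame(
    {"A": ["x", "x"], "B": ["c", "c"], "C": ["v", "v"], "D": [2, 2], "E": ["w", "r"]}
)

# 'first' for every non-key column, then override E with a comma join
agg_map = {col: "first" for col in df.columns if col != "B"}
agg_map["E"] = ",".join

# reset_index brings B back as a column; the final selection restores the original column order
df2 = df.groupby("B").agg(agg_map).reset_index()[df.columns.tolist()]
print(df2)
```

Keeping B out of the map avoids relying on how pandas handles a grouping key that also appears in the aggregation dictionary.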
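Another option is to join the values of each column with apply (this requires every selected column to hold strings):
df = df.apply(lambda x: ','.join(x), axis=0)
To restrict this to specific columns:
df = df[['A','B']] ....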

pandas dataframe multiplication by column, with index matched

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(6, 3), index=list("ABCDEF"), columns=list("XYZ"))
df2 = pd.DataFrame(np.random.randn(6, 1), index=list("ABCDEF"))
I want to multiply each column of df1 with df2, and match by index label. That means:
df1["X"]*df2
df1["Y"]*df2
df1["Z"]*df2
The output would have the index and columns of df1.
How can I do this? Tried several ways, and it still didn't work...
Use the mul function to multiply the DataFrame by a Series, here the single column of df2 selected by position with iloc:
print(df1.mul(df2.iloc[:,0], axis=0))
X Y Z
A -0.577748 0.299258 -0.021782
B -0.952604 0.024046 -0.276979
C 0.175287 2.507922 0.597935
D -0.002698 0.043514 -0.012256
E -1.598639 0.635508 1.532068
F 0.196783 -0.234017 -0.111166
Detail:
print(df2.iloc[:, 0])
A -2.875274
B 1.881634
C 1.369197
D 1.358094
E -0.024610
F 0.443865
Name: 0, dtype: float64
You can use apply to multiply each column in df1 with df2.
df1.apply(lambda x: x * df2[0], axis=0)
X Y Z
A -0.437749 0.515611 -0.870987
B 0.105674 1.679020 -0.693983
C 0.055004 0.118673 -0.028035
D 0.704775 -1.786515 -0.982376
E 0.109218 -0.021522 -0.188369
F 1.491816 0.105558 -1.814437
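Because the outputs above come from random data, the index matching is hard to verify by eye. The following sketch uses small made-up integers, and deliberately puts df2's rows in a different order, to show that mul with axis=0 aligns on index labels rather than on row position.

```python
import numpy as np
import pandas as pd

# Small deterministic frames (hypothetical values) so the result can be checked by hand
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=list("ABC"), columns=list("XY"))
df2 = pd.DataFrame([10, 100, 1000], index=list("CBA"))  # rows deliberately in reverse order

# Select df2's only column as a Series, then multiply each column of df1 by it;
# axis=0 matches the Series against df1's row labels
result = df1.mul(df2.iloc[:, 0], axis=0)
print(result)
```

Row A of df1 is multiplied by 1000, row B by 100 and row C by 10, regardless of where those labels sit in df2.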

Merge 2 dataframes using <> condition

I have two DataFrame objects:
df1: columns = [a, b, c]
df2: columns = [d, e]
I want to merge df1 with df2 using the equivalent of sql in pandas:
select * from df1 inner join df2 on df1.b=df2.e and df1.b <> df2.d and df1.c = 0
The following sequence of steps should get you there:
df1 = df1[df1.c == 0]
merged = df1.merge(df2, left_on='b', right_on='e')
merged = merged[merged.b != merged.d]
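The three steps map onto the three conditions of the SQL query. Here is a self-contained sketch with invented sample values; the names `filtered` and `result` are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical data for the two frames in the question
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 10], "c": [0, 0, 1, 0]})
df2 = pd.DataFrame({"d": [10, 99, 30], "e": [10, 20, 30]})

# Condition df1.c = 0: filter before joining
filtered = df1[df1["c"] == 0]

# Condition df1.b = df2.e: equality join (merge defaults to an inner join)
merged = filtered.merge(df2, left_on="b", right_on="e")

# Condition df1.b <> df2.d: pandas has no inequality join, so filter the joined rows
result = merged[merged["b"] != merged["d"]]
print(result)
```

Filtering on c before the merge keeps the intermediate frame small, which can matter when the equality join would otherwise produce many rows.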
