Is it possible to merge two pandas dataframes based on column name? - python

I have two dataframes:
left = pd.DataFrame(
{
"Col": ["D", "C", "B", "A"],
},
)
right = pd.DataFrame(
{
"A": ["A0"],
"B": ["B0"],
"C": ["C0"],
"D": ["D0"],
},
)
Is it possible to merge them based on the values in Col of the left dataframe and the column names of the right one?
I need to get the following result:
result = pd.DataFrame(
{
"Col": ["D", "C", "B", "A"],
"Val": ["D0", "C0", "B0", "A0"],
},
)

You can do it with a pretty straightforward .map:
In [319]: left['Val'] = left['Col'].map(right.T[0])
In [320]: left
Out[320]:
Col Val
0 D D0
1 C C0
2 B B0
3 A A0
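A minimal sketch of the same lookup, assuming right always has exactly one row. right.T[0] relies on that row being labelled 0, whereas right.iloc[0] picks it by position. The row label "r" below is made up purely to illustrate that difference.

```python
import pandas as pd

left = pd.DataFrame({"Col": ["D", "C", "B", "A"]})
# Hypothetical variant of `right` whose single row is labelled "r" instead of 0
right = pd.DataFrame(
    {"A": ["A0"], "B": ["B0"], "C": ["C0"], "D": ["D0"]},
    index=["r"],
)

# The first (and only) row as a Series indexed by column name: A -> A0, B -> B0, ...
lookup = right.iloc[0]

# map() replaces each value in Col with the matching entry of the lookup Series;
# values of Col that are not column names of `right` would become NaN
left["Val"] = left["Col"].map(lookup)
print(left)
```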

Try join with the transposition (T or transpose()):
import pandas as pd
left = pd.DataFrame({
"Col": ["D", "C", "B", "A"],
})
right = pd.DataFrame({
"A": ["A0"],
"B": ["B0"],
"C": ["C0"],
"D": ["D0"],
})
new_df = left.join(right.T, on='Col').rename(columns={0: 'Val'})
print(new_df)
new_df:
Col Val
0 D D0
1 C C0
2 B B0
3 A A0

Related

pandas MultiIndex intersection on partial levels

Say I have two dataframes with MultiIndexes, where one index is deeper than the other. I want to select only those rows from the deeper dataframe whose partial index is present in the other dataframe.
Example input:
df = pandas.DataFrame(
{
"A": ["a1", "a1", "a1", "a2", "a2", "a2"],
"B": ["b1", "b1", "b2", "b1", "b2", "b2"],
"C": ["c1", "c2", "c1", "c1", "c1", "c2"],
"V": [1, 2, 3, 4, 5, 6],
}
).set_index(["A", "B", "C"])
df2 = pandas.DataFrame(
{
"A": ["a1", "a1", "a2", "a2"],
"B": ["b1", "b3", "b1", "b3"],
"X": [1, 2, 3, 4]
}
).set_index(["A", "B"])
Visual:
          V
A  B  C
a1 b1 c1  1
      c2  2
   b2 c1  3
a2 b1 c1  4
   b2 c1  5
      c2  6
       X
A  B
a1 b1  1
   b3  2
a2 b1  3
   b3  4
Desired output:
result = pandas.DataFrame(
{
"A": ["a1", "a1", "a2"],
"B": ["b1", "b1", "b1"],
"C": ["c1", "c2", "c1"],
"V": [1, 2, 4],
}
).set_index(["A", "B", "C"])
Visual:
          V
A  B  C
a1 b1 c1  1
      c2  2
a2 b1 c1  4
I tried
df.loc[df2.index] and df.loc[df.index.intersection(df2.index)] but that does not work.
I guess I could do df.join(df2, how="inner") and afterwards remove all the columns of df2 that were added, but that is cumbersome. Or is there a way to take away all the columns of df2?
I would appreciate any help.
One option is to use isin on the specific labels common to both, and use the resulting boolean to filter df:
df.loc[df.index.droplevel('C').isin(df2.index)]
          V
A  B  C
a1 b1 c1  1
      c2  2
a2 b1 c1  4
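The same idea can be sketched without hard-coding the level name 'C'. This assumes the shared levels appear in the same order in both indexes; the helper names extra_levels and partial_index are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["a1", "a1", "a1", "a2", "a2", "a2"],
        "B": ["b1", "b1", "b2", "b1", "b2", "b2"],
        "C": ["c1", "c2", "c1", "c1", "c1", "c2"],
        "V": [1, 2, 3, 4, 5, 6],
    }
).set_index(["A", "B", "C"])
df2 = pd.DataFrame(
    {
        "A": ["a1", "a1", "a2", "a2"],
        "B": ["b1", "b3", "b1", "b3"],
        "X": [1, 2, 3, 4],
    }
).set_index(["A", "B"])

# Levels that exist in df but not in df2 (here just "C")
extra_levels = [name for name in df.index.names if name not in df2.index.names]

# Drop them so the remaining (A, B) tuples can be compared with df2's index
partial_index = df.index.droplevel(extra_levels)

# Boolean mask: True where the (A, B) pair also occurs in df2
result = df.loc[partial_index.isin(df2.index)]
print(result)
```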

How to mark first entry per group satisfying some criterion?

Let's say I have a dataframe where one column has values occurring multiple times, forming groups (column A in the snippet). Now I'd like to create a new column with e.g. a 1 for the first x (column C) entry per group, and 0 in the other ones.
I managed to do the first part, but I did not find a good way to include the condition on the x values. Is there a good way of doing that?
import pandas as pd
df = pd.DataFrame(
{
"A": ["0", "0", "1", "2", "2", "2"], # data to group by
"B": ["a", "b", "c", "d", "e", "f"], # some other irrelevant data to be preserved
"C": ["y", "x", "y", "x", "y", "x"], # only consider the 'x'
}
)
target = pd.DataFrame(
{
"A": ["0", "0", "1", "2", "2", "2"],
"B": ["a", "b", "c", "d", "e", "f"],
"C": ["y", "x", "y", "x", "y", "x"],
"D": [ 0, 1, 0, 1, 0, 0] # first entry per group of 'A' that has an 'C' == 'x'
}
)
# following partial solution doesn't account for filtering by 'x' in 'C'
df['D'] = df.groupby('A')['C'].transform(lambda x: [1 if i == 0 else 0 for i in range(len(x))])
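In your case, slice the rows where C is 'x', drop duplicates on A, and assign back:
df['D'] = df.loc[df.C=='x'].drop_duplicates('A').assign(D=1)['D']
# assign the filled column back; fillna(inplace=True) on df['D'] may not update df in recent pandas
df['D'] = df['D'].fillna(0)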
df
Out[149]:
A B C D
0 0 a y 0.0
1 0 b x 1.0
2 1 c y 0.0
3 2 d x 1.0
4 2 e y 0.0
5 2 f x 0.0
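An alternative sketch that yields integers directly, using a per-group running count instead of drop_duplicates. The names is_x and x_count are made up for illustration, and this is only one possible way to express the condition.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["0", "0", "1", "2", "2", "2"],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["y", "x", "y", "x", "y", "x"],
    }
)

# True for the rows we care about
is_x = df["C"].eq("x")

# Running count of 'x' rows within each group of A
x_count = is_x.astype(int).groupby(df["A"]).cumsum()

# The first 'x' of a group is an 'x' row whose running count is exactly 1
df["D"] = (is_x & x_count.eq(1)).astype(int)
print(df)
```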

Pandas: How to concat or merge two incomplete dataframe into one more complete dataframe

I would like to combine two incomplete dataframes that (in theory) hold the same data, aligned on a shared index.
I tried pd.concat but I didn't manage to get what I need.
Here is a simple example of what I would like to do:
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B4"],
"C": ["C0", "C1", "C2", "B5"],
"D": [np.nan,np.nan,np.nan,np.nan,]
},
index=[0, 1, 2, 3],)
df2 = pd.DataFrame(
{
"A": ["A0", "A1", "A5", "A6"],
"B": ["B0", "B1", "B5", "B6"],
"C": [np.nan,np.nan,np.nan,np.nan,],
"D": ["D0", "D1", "D5", "D6"],
},
index=[0, 1, 5, 6]
)
res_expected = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3", "A5", "A6"],
"B": ["B0", "B1", "B2", "B3", "B5", "B6"],
"C": ["C0", "C1", "C2", "B5",np.nan,np.nan,],
"D": ["D0", "D1", np.nan,np.nan,"D5", "D6"],
},
index=[0, 1, 2, 3, 5, 6]
)
Does someone have an idea ?
Thanks !
You can use combine_first(), as follows:
df_result = df1.combine_first(df2)
combine_first() works as follows:
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two.
Result:
print(df_result)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 NaN
3 A3 B4 B5 NaN
5 A5 B5 NaN D5
6 A6 B6 NaN D6
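To illustrate which frame takes precedence, here is a small sketch with two made-up frames a and b (not from the question). Where both have a value, the caller's value is kept; the other frame only fills nulls and contributes rows that the caller lacks.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: `a` has a value at row 0 and a gap at row 1
a = pd.DataFrame({"x": [1.0, np.nan]}, index=[0, 1])
# `b` has values everywhere, plus an extra row 2
b = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[0, 1, 2])

combined = a.combine_first(b)
print(combined)
# row 0 keeps a's value, row 1 is filled from b, row 2 is added from b
```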
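res_expected=df1.append(df2,ignore_index=True)
This should work on pandas versions before 2.0, where DataFrame.append still exists (the equivalent now is pd.concat([df1, df2], ignore_index=True)). Note that it stacks the rows of both frames, so the overlapping rows 0 and 1 appear twice instead of being combined.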

Is it possible to merge two pandas dataframes based on indices and column names?

I have two dataframes:
left = pd.DataFrame(
{
"Col": ["D", "C", "B", "A"],
},
index=[0, 1, 2, 3],
)
right = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
Is it possible to merge them based on the indices, the values in Col of the left dataframe, and the column names of the right one?
I need to get the following result:
result = pd.DataFrame(
{
"Col": ["D", "C", "B", "A"],
"Val": ["D0", "C1", "B2", "A3"],
},
)
Try with
left['new'] = right.values[np.arange(len(left)), right.columns.get_indexer(left.Col)]
left
Out[129]:
Col new
0 D D0
1 C C1
2 B B2
3 A A3
Note: pandas used to have DataFrame.lookup for this, but it was deprecated (and removed in pandas 2.0); the NumPy indexing above is one of the alternatives.
The reason I use positions here rather than the index: NumPy arrays have no index, so we need positions to pick the correct values. Most of the time the index is the same as the position, but the two can differ.
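A sketch of the case where index labels and positions differ, assuming every label of left exists in right. The labels 10 to 13 and the reversed row order are made up for illustration; get_indexer translates labels into positions for both axes.

```python
import pandas as pd

# Hypothetical labels: left and right share labels 10..13, but right stores them in reverse order
left = pd.DataFrame({"Col": ["D", "C", "B", "A"]}, index=[10, 11, 12, 13])
right = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[13, 12, 11, 10],
)

# Position in `right` of each row label of `left`
# (get_indexer returns -1 for a missing label, which would silently select the last row)
row_pos = right.index.get_indexer(left.index)
# Position in `right` of each column named in left["Col"]
col_pos = right.columns.get_indexer(left["Col"])

# Pairwise (row, column) lookup on the underlying NumPy array
left["new"] = right.to_numpy()[row_pos, col_pos]
print(left)
```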
Another solution:
left["new"] = right.apply(lambda x: x[left.loc[x.name, "Col"]], axis=1)
print(left)
Prints:
Col new
0 D D0
1 C C1
2 B B2
3 A A3
Alternative approach (convert columns to index with melt and then merge):
left['id'] = left.index
m = right.melt(ignore_index=False, var_name="Col", value_name="Val")
m['id'] = m.index
result = pd.merge(left, m, on=["id", "Col"])[["Col", "Val"]]
It is faster than using apply but slower than the accepted answer.

Building new pandas DataFrame using dict of row selections (cannot reindex from a duplicate axis)

I have a pandas DataFrame and several lists of row indices. My goal is to use these row indices to create columns in a new dataset based on the corresponding values in the original DataFrame, and make boxplots from this. My lists of row indices are associated with names. I represent this as a dictionary of lists of row indices.
The following small example works as expected:
import pandas as pd
df = pd.DataFrame(
{
"col1" : [1, 2, 3, 4, 5, 6],
"col2" : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
"A" : ["a", "c", "d"],
"B" : ["b", "c", "f"],
"D" : ["a", "d"]}
new_df = pd.DataFrame(
{list_name : df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
However, with my real data, I end up with a ValueError: cannot reindex from a duplicate axis.
What can be the problem, and how do I fix it ?
As the error message suggests, some of the lists of indices may contain duplicates. Removing the duplicates from each list solves the issue.
Here is an example that reproduces the error:
import pandas as pd
df = pd.DataFrame(
{
"col1" : [1, 2, 3, 4, 5, 6],
"col2" : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
"A" : ["a", "c", "d"],
"B" : ["b", "c", "f", "c"], # Note the extra "c"
"D" : ["a", "d"]}
new_df = pd.DataFrame(
{list_name : df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
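And here is how to fix it (deduplicate each list; recent pandas versions reject a set as a .loc indexer, so build a list that keeps the original order):
new_df = pd.DataFrame(
{list_name : df.loc[list(dict.fromkeys(id_list))]["col1"] for (list_name, id_list) in lists_of_indices.items()})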
It might however be worthwhile to check why some of these lists of indices contain duplicates in the first place.
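One way to find the offending lists is to count the labels in each one. This is a minimal sketch using the example dictionary from above; the name duplicates is made up for illustration.

```python
from collections import Counter

lists_of_indices = {
    "A": ["a", "c", "d"],
    "B": ["b", "c", "f", "c"],  # contains "c" twice
    "D": ["a", "d"],
}

# For each named list, collect the labels that occur more than once
duplicates = {
    name: [label for label, count in Counter(ids).items() if count > 1]
    for name, ids in lists_of_indices.items()
}
# Keep only the lists that actually have duplicates
duplicates = {name: labels for name, labels in duplicates.items() if labels}
print(duplicates)
```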
