pandas join on index of a particular column - python

I have three lists which look like this:
l1 = ["a", "b" , "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]
I would like to get a dataframe which looks like this:
| l1 | l2 | l3 |
|----|------|------|
| a | a | None |
| b | None | b |
| c | None | None |
| d | d | None |
| e | None | None |
| f | f | None |
| g | None | g |
I have tried to use the join/merge operations but could not figure this out.
How could I accomplish this?

You can do this using list comprehensions:
import pandas as pd
import numpy as np
a = [i if i in l2 else np.nan for i in l1]
b = [i if i in l3 else np.nan for i in l1]
df = pd.DataFrame({'l1': l1, 'l2': a, 'l3': b})
print(df)
Output:
l1 l2 l3
0 a a NaN
1 b NaN b
2 c NaN NaN
3 d d NaN
4 e NaN NaN
5 f f NaN
6 g NaN g
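A vectorized variant of the same idea, as a minimal sketch: it assumes the three lists from the question and uses Series.isin together with Series.where in place of the list comprehensions. The loop over a dict of column names is just one illustrative way to organise it.

```python
import pandas as pd

l1 = ["a", "b", "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]

df = pd.DataFrame({"l1": l1})
for name, values in {"l2": l2, "l3": l3}.items():
    # Keep the l1 value where it also appears in the other list, NaN elsewhere.
    df[name] = df["l1"].where(df["l1"].isin(values))

print(df)
```

This avoids the repeated `i in l2` list scans, which may matter once the lists get long.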

pd.merge has a few arguments you can use for this: left_on, right_on and how.
left_on specifies which column of the left dataframe to join on.
right_on does the same for the right dataframe.
how specifies the type of join. In this case you probably want a left join.
Learn more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
You can do something like this:
import pandas as pd
l1 = ["a", "b" , "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]
df = pd.DataFrame({'l1': l1})
df_l2 = pd.DataFrame({'l2': l2})
df_l3 = pd.DataFrame({'l3': l3})
# Left joins keep every row of l1; unmatched rows get NaN
df = pd.merge(df, df_l2, left_on='l1', right_on='l2', how='left')
df = pd.merge(df, df_l3, left_on='l1', right_on='l3', how='left')
print(df)
Output:
l1 l2 l3
0 a a NaN
1 b NaN b
2 c NaN NaN
3 d d NaN
4 e NaN NaN
5 f f NaN
6 g NaN g
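If there are more than two extra lists, the repeated merges can be wrapped in a loop. The sketch below is one possible way to do that; the helper name align_lists and its keyword-argument interface are made up for illustration, and it assumes the extra lists contain no duplicates (a duplicate would repeat the matching l1 row).

```python
import pandas as pd

def align_lists(base, **others):
    """Left-join each list in `others` onto `base`, one column per list.

    Hypothetical helper, not part of pandas.
    """
    df = pd.DataFrame({"l1": base})
    for name, values in others.items():
        right = pd.DataFrame({name: values})
        # Same left join as above: keep all base rows, NaN where no match.
        df = df.merge(right, left_on="l1", right_on=name, how="left")
    return df

result = align_lists(list("abcdefg"), l2=["a", "d", "f"], l3=["b", "g"])
print(result)
```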

Related

How to replace values in column in one DataFrame by values from second DataFrame both have major key in Python Pandas?

I have 2 DataFrames in Python Pandas like below:
DF1
COL1 | ... | COLn
-----|------|-------
A | ... | ...
B | ... | ...
A | ... | ...
.... | ... | ...
DF2
G1 | G2
----|-----
A | 1
B | 2
C | 3
D | 4
I need to replace the values in DF1 COL1 with the corresponding values from DF2 G2.
So, as a result, I need DF1 in a format like below:
COL1 | ... | COLn
-----|------|-------
1 | ... | ...
2 | ... | ...
1 | ... | ...
.... | ... | ...
Of course my table is huge, so it would be good to do that automatically rather than by manually adjusting the values :)
How can I do that in Python Pandas?
import pandas as pd
df1 = pd.DataFrame({"COL1": ["A", "B", "A"]}) # Add more columns as required
df2 = pd.DataFrame({"G1": ["A", "B", "C", "D"], "G2": [1, 2, 3, 4]})
df1["COL1"] = df1["COL1"].map(df2.set_index("G1")["G2"])
output df1:
COL1
0 1
1 2
2 1
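One thing to keep in mind with map: keys that have no entry in the lookup come back as NaN. The sketch below illustrates this with an extra value "Z" that is not in the question's data (an assumption added for the example), and shows one way to fall back to the original value instead.

```python
import pandas as pd

df1 = pd.DataFrame({"COL1": ["A", "B", "Z"]})  # "Z" has no entry in df2
df2 = pd.DataFrame({"G1": ["A", "B", "C", "D"], "G2": [1, 2, 3, 4]})

# Plain dict lookup built from the two columns of df2.
lookup = dict(zip(df2["G1"], df2["G2"]))

# Strict: unmatched keys become NaN.
strict = df1["COL1"].map(lookup)

# Lenient: unmatched keys keep their original value.
lenient = df1["COL1"].map(lambda key: lookup.get(key, key))

print(strict.isna().tolist())
print(lenient.tolist())
```

Which behaviour is preferable depends on whether an unmatched key should be treated as an error or passed through.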
You could try the assign or update method of DataFrame:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8, 9]})
Try
df1 = df1.assign(B=df2['B'])  # assign returns a new DataFrame
or
df1.update(df2)  # update modifies df1 in place
Here are links to the docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
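To make the difference between the two concrete, here is a small sketch using the same toy frames. The variable names replaced, before and after are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"B": [7, 8, 9]})

# assign returns a new DataFrame; df1 itself is left untouched.
replaced = df1.assign(B=df2["B"])
before = df1["B"].tolist()

# update overwrites matching cells of df1 in place, aligned on the index.
df1.update(df2)
after = df1["B"].tolist()

print(before, after)
```

Note that both methods line rows up by index label, not by a key column, so for a lookup keyed on a column such as G1 a map-based approach is probably the closer fit.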

How to method-chain `ffill(axis=1)` in a dataframe

I would like to fill column b of a dataframe with values from a in case b is nan, and I would like to do it in a method chain, but I cannot figure out how to do this.
The following works
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df["b"] = df[["a", "b"]].ffill(axis=1)["b"]
print(df.to_markdown())
| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |
but is not method-chained. Thanks a lot for the help!
This replaces NA in column df.b with values from df.a using fillna instead of ffill:
import numpy as np
import pandas as pd
df = (
pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
.assign(b=lambda x: x.b.fillna(x.a))  # use the lambda argument; df is not defined yet at this point
)
display(df)
df.dtypes
The non-chained equivalent is:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
df['b'] = df.b.fillna(df.a)
Output:
| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |
One solution I have found is by using the pyjanitor library:
import numpy as np
import pandas as pd
import janitor  # the pyjanitor package is imported as janitor; this registers case_when
df = pd.DataFrame(
{"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df.case_when(
lambda x: x["b"].isna(), lambda x: x["a"], lambda x: x["b"], column_name="b"
)
Here, the case_when(...) can be integrated into a chain of manipulations and we still keep the whole dataframe in the chain.
I wonder how this could be accomplished without pyjanitor.
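For what it is worth, a chain without pyjanitor seems possible with assign and a lambda, since the lambda receives the dataframe as it exists at that point in the chain. The sketch below shows two variants under that assumption: one keeps the original ffill(axis=1) idea, the other uses Series.combine_first. The names raw, via_ffill and via_combine are illustrative.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)

# Variant 1: the original ffill(axis=1) logic, moved inside assign.
via_ffill = raw.assign(b=lambda x: x[["a", "b"]].ffill(axis=1)["b"])

# Variant 2: take b where present, otherwise fall back to a.
via_combine = raw.assign(b=lambda x: x["b"].combine_first(x["a"]))

print(via_ffill)
print(via_combine)
```

Both calls return a new dataframe, so further methods can be chained after them.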

How to compare 2 dataframes and then output an ID to inform if a row has changed?

I have 2 dataframes:
df1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"], "id3": ["33", "232", "343", "555"]})
df2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "2", "2", "1", "1", "2"], "id3": ["33", "11", "77", "99", "555","88"]})
I would like to get an output which tells me which rows of df2 have been modified or not (Y for yes and N for No) such as the following:
id1 Modified_ID
0 A N
1 B Y
2 F N
3 C Y
4 D N
5 E N
You can merge and build a boolean mask that is True when a value didn't change (or when the row is new in df2, so there is nothing to compare against), then map it to Y/N values:
df2 = df2.merge(df1, on='id1', how='left', suffixes=('_',''))
df2['Modified_ID'] = (df2['id2_'].eq(df2['id2']) | df2['id2'].isna()).map({True:'N', False:'Y'})
df2 = df2.drop(columns=['id2_','id2'])
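Output (only id2 is compared here, so B, whose id3 changed, is not flagged, and the id3 columns are still present):
id1 id3_ id3 Modified_ID
0 A 33 33 N
1 B 11 232 N
2 F 77 NaN N
3 C 99 343 Y
4 D 555 555 N
5 E 88 NaN N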
For more than one column, starting again from the original df2, use:
df2 = df2.merge(df1, on='id1', how='left', suffixes=('_',''))
cols = ['id2','id3']
df2['Modified_ID'] = (df2[cols].eq(df2[[f'{c}_' for c in cols]].to_numpy()).all(axis=1) | df2[cols].isna().all(axis=1)).map({True:'N', False:'Y'})
df2 = df2.drop(columns=['id2_','id2','id3_','id3'])
Output:
id1 Modified_ID
0 A N
1 B Y
2 F N
3 C Y
4 D N
5 E N
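An alternative without suffix bookkeeping is to index both frames by id1 and compare them directly. This is only a sketch and assumes id1 is unique in both dataframes; the intermediate names old, new, is_new and changed are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"], "id3": ["33", "232", "343", "555"]})
df2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "2", "2", "1", "1", "2"], "id3": ["33", "11", "77", "99", "555", "88"]})

cols = ["id2", "id3"]
new = df2.set_index("id1")[cols]
# Line df1 up with df2's ids; ids missing from df1 become all-NaN rows.
old = df1.set_index("id1")[cols].reindex(new.index)

is_new = old.isna().all(axis=1)                  # id not present in df1
changed = old.ne(new).any(axis=1) & ~is_new      # any compared column differs

result = changed.map({True: "Y", False: "N"}).rename("Modified_ID").reset_index()
print(result)
```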

How to split columns when content isn't aligned

I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10 |
----------------
| A; B; C |
| A; B; D |
| A; D |
| A; D; E |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
---------------------
| A | B | C | | |
| A | B | | D | |
| A | | | D | |
| A | | | D | E |
| | B | C | D | E |
Is there any way to do this in Excel or Python or some other way?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ; and the look up value. If not we can remove the space with substitute:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way though):
First split the column value, explode them and get the dummies. Next, group the dummy values together across the given 5 (or N) columns:
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
dummy_col_list = df.columns.tolist()
df['New'] = df.index
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
Q10_A Q10_B Q10_C Q10_D Q10_E
0 1 1 1 0 0
1 1 1 0 1 0
2 1 0 0 1 0
3 1 0 0 1 1
4 0 1 1 1 1
Now, if you want, you can rename the columns and replace each 1 with the column name:
import numpy as np
colName = new_df.columns.tolist()
newColList = []
for i in colName:
    # 'Q10_A' -> 'A'
    newColName = i.split('_', 1)[1]
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    # Show the option letter where it was selected, empty string otherwise
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
It's not the most efficient way, but it works ;)
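A shorter route may be Series.str.get_dummies, which splits on a separator and builds the indicator columns in one step. The sketch below recreates the Q10 column from the question inline (an assumption, in place of reading file.csv) and then swaps the flags for the option labels.

```python
import pandas as pd

df = pd.DataFrame({"Q10": ["A; B; C", "A; B; D", "A; D", "A; D; E", "B; C; D; E"]})

# One indicator column per distinct option, split on "; ".
flags = df["Q10"].str.get_dummies(sep="; ").astype(bool)

# Replace True with the column's own label and False with an empty string.
wide = flags.apply(lambda col: col.map({True: col.name, False: ""}))

print(wide)
```

Because it splits on the separator rather than searching for substrings, this should also cope with option labels that contain one another, which str.contains would confuse.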

Get all combinations of elements from two lists?

If I have two lists
l1 = ['A', 'B']
l2 = [1, 2]
what is the most elegant way to get a pandas data frame which looks like:
+-----+-----+-----+
| | l1 | l2 |
+-----+-----+-----+
| 0 | A | 1 |
+-----+-----+-----+
| 1 | A | 2 |
+-----+-----+-----+
| 2 | B | 1 |
+-----+-----+-----+
| 3 | B | 2 |
+-----+-----+-----+
Note, the first column is the index.
Use product from itertools:
>>> from itertools import product
>>> pd.DataFrame(list(product(l1, l2)), columns=['l1', 'l2'])
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
As an alternative you can use pandas' cartesian_product (may be more useful with large numpy arrays):
In [11]: lp1, lp2 = pd.core.reshape.util.cartesian_product([l1, l2])
In [12]: pd.DataFrame(dict(l1=lp1, l2=lp2))
Out[12]:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
This seems a little messy to read in to a DataFrame with the correct orient...
Note: previously cartesian_product was located at pd.tools.util.cartesian_product. It is an internal helper, so its location may change between pandas versions.
You can also use the sklearn library, which uses a NumPy-based approach:
from sklearn.utils.extmath import cartesian
df = pd.DataFrame(cartesian((l1, l2)))
For more verbose but possibly more efficient variants see Numpy: cartesian product of x and y array points into single array of 2D points.
You can use the function merge:
df1 = pd.DataFrame(l1, columns=['l1'])
df2 = pd.DataFrame(l2, columns=['l2'])
df1.merge(df2, how='cross')
Output:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
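Another option that stays within the public pandas API is pd.MultiIndex.from_product followed by to_frame. A minimal sketch, assuming the two lists from the question:

```python
import pandas as pd

l1 = ["A", "B"]
l2 = [1, 2]

# Build every (l1, l2) pair as a MultiIndex, then turn its levels into columns.
idx = pd.MultiIndex.from_product([l1, l2], names=["l1", "l2"])
df = idx.to_frame(index=False)

print(df)
```

Passing index=False gives a fresh 0..n-1 index, matching the layout asked for in the question.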
