I have a list:
my_list = ['a', 'b']
and a pandas dataframe:
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
How can I remove the columns of df that appear in my_list (in this case, columns a and b)?
This is very simple:
df = df.drop(columns=my_list)
drop removes the columns whose names appear in the list passed to the columns argument.
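For completeness, a self-contained version of the snippet above; the errors='ignore' flag is optional and only needed if some names in the list might be absent from the frame:

```python
import pandas as pd

my_list = ['a', 'b']
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]})

# Drop by name; errors='ignore' skips names that are not present
df = df.drop(columns=my_list, errors='ignore')
print(df.columns.tolist())  # ['c', 'd']
```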
A concise alternative uses df.pop in a list comprehension (note that pop removes the columns from df in place): [df.pop(x) for x in my_list]
my_list = ['a', 'b']
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
print(df.to_markdown())
| | a | b | c | d |
|---:|----:|----:|----:|----:|
| 0 | 1 | 3 | 1 | 3 |
| 1 | 2 | 4 | 2 | 4 |
[df.pop(x) for x in my_list]
print(df.to_markdown())
| | c | d |
|---:|----:|----:|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
You can select required columns as well:
cols_of_interest = ['c', 'd']
df = df[cols_of_interest]
If you have a positional range of columns to drop, for example positions 2 to 8, you can use:
df = df.drop(columns=df.columns[2:8])
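A minimal sketch of the positional-range drop, using ten hypothetical single-letter columns:

```python
import pandas as pd

# ten hypothetical columns 'a' .. 'j'
df = pd.DataFrame([[0] * 10], columns=list('abcdefghij'))

# drop the columns at positions 2..7 (i.e. 'c' through 'h')
trimmed = df.drop(columns=df.columns[2:8])
print(trimmed.columns.tolist())  # ['a', 'b', 'i', 'j']
```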
Related
I have two dataframes and I want to join on a column where one of the columns is a list;
the join should match if any value in the list matches.
df1 =
| index | col_1 |
| ----- | ----- |
| 1 | 'a' |
| 2 | 'b' |
df2 =
| index_2 | col_1 |
| ------- | ----- |
| A | ['a', 'c'] |
| B | ['a', 'd', 'e'] |
I am looking for something like
df1.join(df2, on='col_1', type_=any, type='left')
| index | col_1_x | index_2 | col_1_y         |
| ----- | ------- | ------- | --------------- |
| 1     | 'a'     | A       | ['a', 'c']      |
| 1     | 'a'     | B       | ['a', 'd', 'e'] |
You can use explode and then use merge like so:
import pandas as pd
# Create the input dataframes
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
# Explode the list column in df2 to multiple rows
df2_exploded = df2.explode('col_1')
# Join on the exploded column
result = df1.merge(df2_exploded, on='col_1', how='left')
# Re-attach the original list column from the un-exploded df2,
# and drop rows that found no match
result = result.merge(df2, on='index_2', how='left').dropna()
df2_exploded looks like this:
index_2 col_1
0 A a
0 A c
1 B a
1 B d
1 B e
The final result looks like this:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]
You can do the following :
import pandas as pd
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
# check whether the scalar value from df1 appears in the list from df2
def any_match(value, lst):
    if value is None or lst is None:
        return False
    return value in lst

# cross join, then keep only the rows where the value matches the list
result = pd.merge(df1, df2, how='cross')
result = result[result.apply(lambda x: any_match(x['col_1_x'], x['col_1_y']), axis=1)]
print(result[['index', 'col_1_x', 'index_2', 'col_1_y']])
which returns:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]
I have 2 DataFrames in Python Pandas like below:
DF1
COL1 | ... | COLn
-----|------|-------
A | ... | ...
B | ... | ...
A | ... | ...
.... | ... | ...
DF2
G1 | G2
----|-----
A | 1
B | 2
C | 3
D | 4
And I need to replace the values in DF1 COL1 with the corresponding values from DF2 G2.
So as a result I need DF1 in a format like below:
COL1 | ... | COLn
-----|------|-------
1 | ... | ...
2 | ... | ...
1 | ... | ...
.... | ... | ...
Of course my table is huge, so it would be good to do this automatically rather than by manually adjusting the values :)
How can I do that in Python Pandas?
import pandas as pd
df1 = pd.DataFrame({"COL1": ["A", "B", "A"]}) # Add more columns as required
df2 = pd.DataFrame({"G1": ["A", "B", "C", "D"], "G2": [1, 2, 3, 4]})
df1["COL1"] = df1["COL1"].map(df2.set_index("G1")["G2"])
output df1:
COL1
0 1
1 2
2 1
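One caveat with map: values in COL1 that have no match in G1 become NaN. If you would rather keep the original value in that case, you can chain fillna; a sketch with a hypothetical unmatched value "E":

```python
import pandas as pd

df1 = pd.DataFrame({"COL1": ["A", "B", "E"]})  # "E" has no mapping in df2
df2 = pd.DataFrame({"G1": ["A", "B", "C", "D"], "G2": [1, 2, 3, 4]})

# Map known values, then fall back to the original for unmatched ones
mapping = df2.set_index("G1")["G2"]
df1["COL1"] = df1["COL1"].map(mapping).fillna(df1["COL1"])
print(df1["COL1"].tolist())  # [1.0, 2.0, 'E']
```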
You could try using the assign or update method of DataFrame:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8, 9]})
try
df1 = df1.assign(B=df2['B'])  # assign returns a new DataFrame
or
df1.update(df2)  # update modifies df1 in place
here are links to the docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
I would like to fill column b of a dataframe with values from a in case b is nan, and I would like to do it in a method chain, but I cannot figure out how to do this.
The following works
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df["b"] = df[["a", "b"]].ffill(axis=1)["b"]
print(df.to_markdown())
| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |
but is not method-chained. Thanks a lot for the help!
This replaces NA in column b with values from column a, using fillna inside assign so it stays in the method chain (note the lambda must reference x, not df, because df does not exist yet inside the chain):
import numpy as np
import pandas as pd
df = (
    pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
    .assign(b=lambda x: x.b.fillna(x.a))
)
print(df)
print(df.dtypes)
The non-chained equivalent is:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
df['b'] = df.b.fillna(df.a)
Output:
| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |
One solution I have found is by using the pyjanitor library:
import numpy as np
import pandas as pd
import janitor  # the pyjanitor package is imported as "janitor"
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df.case_when(
lambda x: x["b"].isna(), lambda x: x["a"], lambda x: x["b"], column_name="b"
)
Here, the case_when(...) can be integrated into a chain of manipulations and we still keep the whole dataframe in the chain.
I wonder how this could be accomplished without pyjanitor.
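Without pyjanitor, the same chained fill can be expressed with assign plus Series.where (keep b where it is not NaN, otherwise take a). Recent pandas versions (2.2+) also ship a built-in Series.case_when, but where works everywhere:

```python
import numpy as np
import pandas as pd

df = (
    pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
    # keep b where it is not NaN, otherwise fall back to a
    .assign(b=lambda d: d["b"].where(d["b"].notna(), d["a"]))
)
print(df["b"].tolist())  # [10.0, 2.0, 3.0, 40.0]
```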
I have a simple data frame with a few columns
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
A B
0 1 2
1 1 3
2 4 6
what I am trying to achieve is writing to Excel with a custom header, which can be multi-line.
So the output in Excel would be:
| App Input |      |
| --------- | ---- |
| A         | B    |
| data      | data |
| 1         | 2    |
| 1         | 3    |
| 4         | 6    |
Any ideas how I can achieve this? I was thinking of a MultiIndex, but I don't think that will work since this is not a true multi-index.
Since headers in Excel are just cells containing string values, you can precede the real data rows with textual rows that, together with the dataframe's column names, form the multi-line header you want.
For example, you could use the following values to get the desired result:
df = pd.DataFrame([['', ''], ['A', 'B'], ['data', 'data'], [1, 2], [1, 3], [4, 6]], columns=['App Input', ''])
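Putting that together: build the extra header lines as ordinary data rows, rename the columns to form the top line, and write the result without pandas' own header. A sketch (the filename output.xlsx is hypothetical, and to_excel requires openpyxl to be installed):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Turn the original column names and the extra 'data' line into data rows
header_rows = pd.DataFrame([df.columns.tolist(), ['data', 'data']], columns=df.columns)
out = pd.concat([header_rows, df], ignore_index=True)
out.columns = ['App Input', '']  # top header line

# out.to_excel('output.xlsx', index=False)  # uncomment; requires openpyxl
print(out)
```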
I've got excel/pandas dataframe/file looking like this:
+------+--------+
| ID | 2nd ID |
+------+--------+
| ID_1 | R_1 |
| ID_1 | R_2 |
| ID_2 | R_3 |
| ID_3 | |
| ID_4 | R_4 |
| ID_5 | |
+------+--------+
How can I transform it to python dictionary? I want my result to be like:
{'ID_1':['R_1','R_2'],'ID_2':['R_3'],'ID_3':[],'ID_4':['R_4'],'ID_5':[]}
What should I do, to obtain it?
If you need to remove the missing values (for IDs with no second ID), use Series.dropna in a lambda function inside GroupBy.apply:
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print (d)
{'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
Or use the fact that np.nan == np.nan returns False in a list comprehension to filter out missing values (see also the warning about NaN comparisons in the docs for more explanation):
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y]).to_dict()
If need remove empty strings:
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y != '']).to_dict()
Apply a function over the rows of the dataframe that appends each value to your dict. apply is not in-place, so the Series it returns can be discarded; the dictionary is filled as a side effect. Note that dict.fromkeys(df.ID.unique(), []) would make every key share the same list object, so build the dict with a comprehension instead, and skip NaN placeholders:
d = {k: [] for k in df.ID.unique()}
def func(x):
    if pd.notna(x["2nd ID"]):
        d[x.ID].append(x["2nd ID"])
# will return a series of Nones
df.apply(func, axis=1)
Edit:
I asked this on Gitter and @gurukiran07 gave me an answer: what you are trying to do is the reverse of the explode function.
s = pd.Series([[1, 2, 3], [4, 5]])
0 [1, 2, 3]
1 [4, 5]
dtype: object
exploded = s.explode()
0 1
0 2
0 3
1 4
1 5
dtype: object
exploded.groupby(level=0).agg(list)
0 [1, 2, 3]
1 [4, 5]
dtype: object
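Applying that agg(list) idea back to the question's frame (dropping the NaN placeholders first) gives exactly the requested dict:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["ID_1", "ID_1", "ID_2", "ID_3", "ID_4", "ID_5"],
    "2nd ID": ["R_1", "R_2", "R_3", np.nan, "R_4", np.nan],
})

# group by ID, drop the NaN placeholders, collect the rest into lists
d = df.groupby("ID")["2nd ID"].agg(lambda s: s.dropna().tolist()).to_dict()
print(d)  # {'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
```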