How to method-chain `ffill(axis=1)` in a dataframe - python

I would like to fill column b of a dataframe with values from column a wherever b is NaN, and I would like to do it in a method chain, but I cannot figure out how to do this.
The following works:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df["b"] = df[["a", "b"]].ffill(axis=1)["b"]
print(df.to_markdown())
| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |
but it is not method-chained. Thanks a lot for the help!

This replaces NA in column df.b with values from df.a using fillna instead of ffill:
import numpy as np
import pandas as pd
df = (
pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
.assign(b=lambda x: x.b.fillna(x.a))
)
display(df)
df.dtypes
Output:

| | a | b | c |
|---:|----:|----:|:----|
| 0 | 1 | 10 | a |
| 1 | 2 | 2 | b |
| 2 | 3 | 3 | c |
| 3 | 4 | 40 | d |

The same fill without a method chain:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
df['b'] = df.b.fillna(df.a)

One solution I have found is by using the pyjanitor library:
import numpy as np
import pandas as pd
import janitor  # the pyjanitor package is imported as "janitor"
df = pd.DataFrame(
{"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]}
)
df.case_when(
lambda x: x["b"].isna(), lambda x: x["a"], lambda x: x["b"], column_name="b"
)
Here, the case_when(...) can be integrated into a chain of manipulations and we still keep the whole dataframe in the chain.
I wonder how this could be accomplished without pyjanitor.
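For completeness, here is a minimal pure-pandas sketch that keeps the original ffill(axis=1) idea inside a method chain by doing the fill inside assign (same column names as above):
import numpy as np
import pandas as pd
df = (
    pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, np.nan, np.nan, 40], "c": ["a", "b", "c", "d"]})
    # row-wise forward fill over a and b, then keep only the filled b column
    .assign(b=lambda d: d[["a", "b"]].ffill(axis=1)["b"])
)
print(df)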

Related

How to replace values in a column in one DataFrame with values from a second DataFrame when both have a common key in Python Pandas?

I have 2 DataFrames in Python Pandas like below:
DF1
COL1 | ... | COLn
-----|------|-------
A | ... | ...
B | ... | ...
A | ... | ...
.... | ... | ...
DF2
G1 | G2
----|-----
A | 1
B | 2
C | 3
D | 4
And I need to replace the values in DF1 COL1 with the corresponding values from DF2 G2 (matched on G1).
So, as a result I need DF1 in a format like below:
COL1 | ... | COLn
-----|------|-------
1 | ... | ...
2 | ... | ...
1 | ... | ...
.... | ... | ...
Of course my table is huge and it would be good to do that automatically, not by manually adjusting the values :)
How can I do that in Python Pandas?
import pandas as pd
df1 = pd.DataFrame({"COL1": ["A", "B", "A"]}) # Add more columns as required
df2 = pd.DataFrame({"G1": ["A", "B", "C", "D"], "G2": [1, 2, 3, 4]})
df1["COL1"] = df1["COL1"].map(df2.set_index("G1")["G2"])
output df1:
COL1
0 1
1 2
2 1
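Note that map leaves NaN for any COL1 value with no match in G1. If you would rather keep the original value in that case, one possible sketch combines map with fillna:
df1["COL1"] = df1["COL1"].map(df2.set_index("G1")["G2"]).fillna(df1["COL1"])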
You could try using the assign or update method of DataFrame:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8, 9]})
Try
df1 = df1.assign(B=df2['B'])  # assign returns a new DataFrame
or
df1.update(df2)  # update makes an in-place modification
Here are links to the docs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
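To illustrate the practical difference (a small sketch with made-up values): update aligns on the index and only overwrites positions where the other frame has non-NA values, while assign replaces the whole column:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, np.nan, 9]})

# update: B becomes [7, 5, 9] because the NaN in df2 is ignored
updated = df1.copy()
updated.update(df2)

# assign: B becomes [7, NaN, 9] because the column is replaced wholesale
assigned = df1.assign(B=df2['B'])

print(updated)
print(assigned)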

Consecutively subtract columns and get the resulting cumsum

Having a dataframe like this:
| one | two | three |
|-----|-----|-------|
| 1   | 2   | 4     |
| 4   | 6   | 3     |
| 2   | 4   | 9     |
How can I subtract the values of column two from those of column one, and so on for every pair of columns, and then get the sum of the obtained values? Like
| one | two | three | one-two | one-three | two-three | SUM |
|-----|-----|-------|---------|-----------|-----------|-----|
| 1   | 2   | 4     | -1      | -3        | -2        | -6  |
| 4   | 6   | 3     |         |           |           |     |
| 2   | 4   | 9     |         |           |           |     |
As a result I need a df with only the difference columns (one-two, one-three, two-three) and SUM.
You can try this:
from itertools import combinations
import pandas as pd
df = pd.DataFrame({'one': {0: 1, 1: 4, 2: 2},
'two': {0: 2, 1: 6, 2: 4},
'three': {0: 4, 1: 3, 2: 9}})
Create the column combinations using itertools.combinations:
## create column combinations
column_combinations = list(combinations(list(df.columns), 2))
Subtract each column pair, store the result in a new column, and sum the differences:
column_names = []
for column_comb in column_combinations:
    name = f"{column_comb[0]}_{column_comb[1]}"
    df[name] = df[column_comb[0]] - df[column_comb[1]]
    column_names.append(name)
df["SUM"] = df[column_names].sum(axis=1)
print(df)
output:
   one  two  three  one_two  one_three  two_three  SUM
0    1    2      4       -1         -3         -2   -6
1    4    6      3       -2          1          3    2
2    2    4      9       -2         -7         -5  -14
Another option applies np.diff across each row and sums the consecutive differences:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
column_differences = df.apply(np.diff, axis=1)
total_column_differences = np.sum(column_differences.to_list(), axis=1)
df['SUM'] = total_column_differences
print(df)
Gives the following:
a b c SUM
0 1 4 7 6
1 2 5 8 6
2 3 6 9 6
DataFrames make operations like this very easy
df['new_column'] = df['colA'] - df['colB']
PyData has a great resource to learn more.
In your example:
import pandas as pd
df = pd.DataFrame(data=[[1,2,4],[4,6,3], [2,4,9]], columns=['one', 'two', 'three'])
df['one-two'] = df['one'] - df['two']
df['one-three'] = df['one'] - df['three']
df['two-three'] = df['two'] - df['three']
df['sum'] = df['one-two'] + df['one-three'] + df['two-three']
df.drop(columns=['one', 'two', 'three'], inplace=True)
print(df)
one-two one-three two-three sum
0 -1 -3 -2 -6
1 -2 1 3 2
2 -2 -7 -5 -14
Assuming you have only 3 columns and only want pairs of 2 features:
from itertools import combinations
from pandas import DataFrame
# Creating the DataFrame
df = DataFrame({'one': [1,4,2], 'two': [2,6,4], 'three': [4,3,9]})
# Getting the possible feature combinations
combs = combinations(df.columns, 2)
# Calculating the differences for the column pairs
for comb in combs:
    df['-'.join(comb)] = df[comb[0]] - df[comb[1]]
# Adding the totals to the DataFrame
df['SUM'] = df[df.columns[3:]].sum(axis=1)
one two three one-two one-three two-three SUM
0 1 2 4 -1 -3 -2 -6
1 4 6 3 -2 1 3 2
2 2 4 9 -2 -7 -5 -14

Remove pandas columns based on list

I have a list:
my_list = ['a', 'b']
and a pandas dataframe:
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
What can I do to remove the columns in df based on the list my_list, in this case columns a and b?
This is very simple:
df = df.drop(columns=my_list)
drop removes columns by specifying a list of column names
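If some names in my_list might not actually be present in the frame, drop also accepts errors="ignore" so missing labels are skipped instead of raising a KeyError (an optional variant):
df = df.drop(columns=my_list, errors="ignore")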
A concise alternative uses a list comprehension with pop: [df.pop(x) for x in my_list]
import pandas as pd
my_list = ['a', 'b']
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
print(df.to_markdown())
| | a | b | c | d |
|---:|----:|----:|----:|----:|
| 0 | 1 | 3 | 1 | 3 |
| 1 | 2 | 4 | 2 | 4 |
[df.pop(x) for x in my_list]
print(df.to_markdown())
| | c | d |
|---:|----:|----:|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
You can select required columns as well:
cols_of_interest = ['c', 'd']
df = df[cols_of_interest]
If you have a range of columns to drop, for example columns 2 to 8, you can use:
df.drop(df.iloc[:,2:8].head(0).columns, axis=1)
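A shorter equivalent, assuming the same positional range is intended, slices df.columns directly:
df = df.drop(columns=df.columns[2:8])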

pandas join on index of a particular column

I have three lists which look like this:
l1 = ["a", "b" , "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]
I would like to get a dataframe which looks like this:
| l1 | l2 | l3 |
|----|------|------|
| a | a | None |
| b | None | b |
| c | None | None |
| d | d | None |
| e | None | None |
| f | f | None |
| g | None | g |
I have tried to use the join/merge operations but could not figure this out.
How could I accomplish this?
You can do this using list comprehensions:
import pandas as pd
import numpy as np
a = [i if i in l2 else np.nan for i in l1]
b = [i if i in l3 else np.nan for i in l1]
df = pd.DataFrame({'l1': l1, 'l2': a, 'l3': b})
print(df)
Output:
l1 l2 l3
0 a a NaN
1 b NaN b
2 c NaN NaN
3 d d NaN
4 e NaN NaN
5 f f NaN
6 g NaN g
There are a few args in pd.merge that you can use for this purpose: left_on, right_on and how.
left_on allows you to specify which column in the left dataframe you would like pandas to join on.
right_on is similar to left_on but for the right dataframe.
how allows you to specify which type of join you would like to perform. In this case you probably want a left join.
Learn more on this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
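For example, a minimal sketch along those lines (the frame and column names here are just illustrative):
import pandas as pd

left = pd.DataFrame({'l1': ['a', 'b', 'c']})
right = pd.DataFrame({'l2': ['a', 'c'], 'value': [1, 2]})

# keep every row of the left frame; keys without a match get NaN
merged = pd.merge(left, right, left_on='l1', right_on='l2', how='left')
print(merged)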
You can do something like this:
import pandas as pd
l1 = ["a", "b", "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]
df = pd.DataFrame({'l1': l1})
df_l2 = pd.DataFrame({'l2': l2})
df_l3 = pd.DataFrame({'l3': l3})
df = pd.merge(df, df_l2, left_on='l1', right_on='l2', how='left')
df = pd.merge(df, df_l3, left_on='l1', right_on='l3', how='left')
Output:
l1 l2 l3
0 a a NaN
1 b NaN b
2 c NaN NaN
3 d d NaN
4 e NaN NaN
5 f f NaN
6 g NaN g
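A compact variant that avoids the intermediate frames (a sketch using isin and where, which blanks out values of l1 that are not in the other lists):
import pandas as pd

l1 = ["a", "b", "c", "d", "e", "f", "g"]
l2 = ["a", "d", "f"]
l3 = ["b", "g"]

df = pd.DataFrame({'l1': l1})
# keep the value where it appears in the other list, otherwise NaN
df['l2'] = df['l1'].where(df['l1'].isin(l2))
df['l3'] = df['l1'].where(df['l1'].isin(l3))
print(df)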

Plot in python after crosstab merge

I'd like to plot my DataFrame. I had this DF first:
id|project|categories|rating
1 | a | A | 1
1 | a | B | 1
1 | a | C | 2
1 | b | A | 1
1 | b | B | 1
2 | c | A | 1
2 | c | B | 2
I used this code:
import pandas as pd
df = pd.DataFrame(...)
(df.groupby('id').project.nunique().reset_index()
.merge(pd.crosstab(df.id, df.categories).reset_index()))
and now got this DataFrame:
id | project | A | B | C |
1 | 2 | 2 | 2 | 1 |
2 | 1 | 1 | 1 | 0 |
Now I'd like to plot the DF. I want to show whether the number of projects depends on how many categories are affected, or on which categories are affected. I know how to visualize dataframes, but after the crosstab and merge it is not working as usual.
I reproduced your data using the code below:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                   'project': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'categories': ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
                   'rating': [1, 1, 2, 1, 1, 1, 2]})
Now the data looks like this:
categories id project rating
0 A 1 a 1
1 B 1 a 1
2 C 1 a 2
3 A 1 b 1
4 B 1 b 1
5 A 2 c 1
6 B 2 c 2
If you want to plot 'category count' as a function of 'project count', it looks like this:
import matplotlib.pyplot as plt
# this line is your code
df2 = df.groupby('id').project.nunique().reset_index().merge(pd.crosstab(df.id, df.categories).reset_index())
plt.scatter(df2.project, df2.A, label='A', alpha=0.5)
plt.scatter(df2.project, df2.B, label='B', alpha=0.5)
plt.scatter(df2.project, df2.C, label='C', alpha=0.5)
plt.xlabel('project count')
plt.ylabel('category count')
plt.legend()
plt.show()
And you will get a scatter plot of each category's count against the project count.
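If you prefer pandas' built-in plotting (an optional sketch, assuming the same merged df2 as above), the category columns can be plotted against the project count in one call:
import matplotlib.pyplot as plt

# one marker series per category column, no connecting lines
ax = df2.plot(x='project', y=['A', 'B', 'C'], marker='o', linestyle='', alpha=0.5)
ax.set_xlabel('project count')
ax.set_ylabel('category count')
plt.show()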
