conditions inside conditions pandas - python

below is my DF in which I want to create a column based on other columns
test = pd.DataFrame({"Year_2017" : [np.nan, np.nan, np.nan, 4], "Year_2018" : [np.nan, np.nan, 3, np.nan], "Year_2019" : [np.nan, 2, np.nan, np.nan], "Year_2020" : [1, np.nan, np.nan, np.nan]})
Year_2017 Year_2018 Year_2019 Year_2020
0 NaN NaN NaN 1
1 NaN NaN 2 NaN
2 NaN 3 NaN NaN
3 4 NaN NaN NaN
The aim is to create a new column that takes, for each row, the value from the column that is notna().
Below is what I tried, without success:
test['Final'] = np.where(test.Year_2017.isna(), test.Year_2018,
                np.where(test.Year_2018.isna(), test.Year_2019,
                np.where(test.Year_2019.isna(), test.Year_2020, test.Year_2019)))
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 NaN
1 NaN NaN 2 NaN NaN
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN NaN
The expected output:
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 1
1 NaN NaN 2 NaN 2
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN 4

You can forward fill or back fill missing values and then select the last or first column:
test['Final'] = test.ffill(axis=1).iloc[:, -1]
test['Final'] = test.bfill(axis=1).iloc[:, 0]
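To see why this works, here is the intermediate frame produced by the row-wise forward fill (a quick sketch using the test frame from the question): each row's single value propagates to the right, so the last column always holds it.
print(test.ffill(axis=1))
#    Year_2017  Year_2018  Year_2019  Year_2020
# 0        NaN        NaN        NaN        1.0
# 1        NaN        NaN        2.0        2.0
# 2        NaN        3.0        3.0        3.0
# 3        4.0        4.0        4.0        4.0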
If there is only one non-missing value per row and the values are numeric, use:
test['Final'] = test.min(axis=1)
test['Final'] = test.max(axis=1)
test['Final'] = test.mean(axis=1)
test['Final'] = test.sum(axis=1, min_count=1)
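Note that sum needs min_count=1 here: without it, an all-NaN row sums to 0.0 instead of staying NaN. A minimal check, independent of the frame above:
row = pd.DataFrame({"a": [np.nan], "b": [np.nan]})
print(row.sum(axis=1))               # 0.0 -- all-NaN row collapses to 0
print(row.sum(axis=1, min_count=1))  # NaN -- preserved with min_count=1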

If you only have a single non-NA value per row, you can use:
test['Final'] = test.max(axis=1)
(or other aggregators)

Related

Combine multiple categorical columns into one, when each row has only one non-NaN value, in Pandas

I have
import pandas as pd
import numpy as np
df = pd.DataFrame({"x": ["red", "blue", np.nan, np.nan, np.nan, np.nan, np.nan, ],
"y": [np.nan, np.nan, np.nan, 'cold', 'warm', np.nan, np.nan, ],
"z": [np.nan, np.nan, np.nan, np.nan, np.nan, 'charm', 'strange'],
}).astype("category")
giving
x y z
0 red NaN NaN
1 blue NaN NaN
2 NaN NaN NaN
3 NaN cold NaN
4 NaN warm NaN
5 NaN NaN charm
6 NaN NaN strange
I would like to add a new categorical column with the unordered values red, blue, cold, warm, charm, strange, filled in appropriately. I have many such columns, not just three.
Some possibilities:
astype(str), concatenating, and then re-creating a categorical
creating a new categorical type using union_categoricals, then casting each column to that type and serially fillna()-ing them?
I can't make those, or anything else, work.
Notes:
using .astype(pd.CategoricalDtype(ordered=True)) in place of .astype("category") in defining df also works with the answer below.
New Solution
For large datasets, the following solution may be more efficient:
def my_fun(x):
    # boolean mask of the non-null entries in the row
    m = ~pd.isnull(x)
    if m.any():
        # return the first (and only) non-null value as a scalar
        return x[m][0]
    else:
        return np.nan
df['new'] = np.apply_along_axis(my_fun, 1, df.to_numpy())
x y z new
0 red NaN NaN red
1 blue NaN NaN blue
2 NaN NaN NaN NaN
3 NaN cold NaN cold
4 NaN warm NaN warm
5 NaN NaN charm charm
6 NaN NaN strange strange
Edited answer
As specified by the OP, in case there are rows where all values are np.nan, we could try the following solution:
df['new_col'] = df.dropna(how='all').apply(lambda x: x.loc[x.first_valid_index()], axis=1)
df['new_col'] = pd.Categorical(df.new_col)
df
x y z new_col
0 red NaN NaN red
1 blue NaN NaN blue
2 NaN NaN NaN NaN
3 NaN cold NaN cold
4 NaN warm NaN warm
5 NaN NaN charm charm
6 NaN NaN strange strange
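The dropna(how='all') is what protects the all-NaN row: without it, first_valid_index() returns None for row 2 and x.loc[None] raises an error. A sketch of the same guard written out explicitly (assuming the original three-column df):
df['new_col'] = df.apply(
    lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else np.nan,
    axis=1)
df['new_col'] = pd.Categorical(df.new_col)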
Try ffill()
df['col'] = df.ffill(axis=1).iloc[:,-1].astype('category')
or stack() with groupby()
df['col'] = df.stack().groupby(level=0).first().astype('category')
Output:
x y z col
0 red NaN NaN red
1 blue NaN NaN blue
2 NaN NaN NaN NaN
3 NaN cold NaN cold
4 NaN warm NaN warm
5 NaN NaN charm charm
6 NaN NaN strange strange
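Why the stack/groupby version works (a sketch, assuming the original three-column df): stack() drops the NaN cells, leaving one (row, column) -> value pair per surviving cell, and groupby(level=0).first() then keeps the first value for each original row label; assigning back to df['col'] realigns on the index, so the all-NaN row 2 ends up NaN.
print(df.stack())
# 0  x        red
# 1  x       blue
# 3  y       cold
# 4  y       warm
# 5  z      charm
# 6  z    strange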

Combining two dataframes of dates and values

I have two dataframes loaded from a .csv file. One contains numeric values, the other the dates (month-year) when these numeric values occurred. The dates and values are mapped to each other one-to-one. I would like to combine/merge these dataframes to have the dates as the columns and the values as the rows. However, as you can see below, although the dates are ordered from left to right, they don't all start in the same month.
import pandas as pd
df1 = pd.DataFrame(
    [
        [1, 2, pd.NA, pd.NA, pd.NA],
        [2, 3, 4, pd.NA, pd.NA],
        [4, 5, 6, pd.NA, pd.NA],
        [5, 6, 12, 14, 15],
    ]
)
df2 = pd.DataFrame(
    [
        ["2021-01", "2021-02", pd.NA, pd.NA, pd.NA],
        ["2021-02", "2021-03", "2021-04", pd.NA, pd.NA],
        ["2022-03", "2022-04", "2022-05", pd.NA, pd.NA],
        ["2021-04", "2021-05", "2021-06", "2021-07", "2021-08"],
    ]
)
df1
df2
Although I managed to create the combined dataframe, the dataframes df1 and df2 contain ~300k rows, and the approach I thought of is rather slow. Is there a more efficient way of achieving the same result?
q = {z: {x: y for x, y in zip(df2.values[z], df1.values[z]) if not pd.isna(y)} for z in range(len(df2))}
df = pd.DataFrame.from_dict(q, orient='index')
idx = pd.to_datetime(df.columns, errors='coerce', format='%Y-%m').argsort()
df.iloc[:, idx]
df3 (result)
You can stack, concat and pivot:
(pd.concat([df1.stack(), df2.stack()], axis=1)
   .reset_index(level=0)
   .pivot(index='level_0', columns=1, values=0)
   .rename_axis(index=None, columns=None)
)
Alternative with unstack:
(pd.concat([df1.stack(), df2.stack()], axis=1)
   .droplevel(1).set_index(1, append=True)
   [0].unstack(1)
   .rename_axis(columns=None)
)
output:
2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 2022-03 2022-04 2022-05
0 1 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
3 NaN NaN NaN 5 6 12 14 15 NaN NaN NaN
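For reference, the intermediate frame that both variants build (a sketch): column 0 holds the values from df1 and column 1 the matching dates from df2, with the NA cells already dropped by stack(), so only real (value, date) pairs remain.
print(pd.concat([df1.stack(), df2.stack()], axis=1).head())
#      0        1
# 0 0  1  2021-01
#   1  2  2021-02
# 1 0  2  2021-02
#   1  3  2021-03
#   2  4  2021-04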
Use concat with the keys parameter, which makes it possible, after DataFrame.stack, to convert the MultiIndex to columns and use DataFrame.pivot:
df = (pd.concat([df1, df2], axis=1, keys=['a','b'])
        .stack()
        .reset_index()
        .pivot(index='level_0', columns='b', values='a'))
print (df)
b 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 \
level_0
0 1 2 NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN 5 6 12 14 15
b 2022-03 2022-04 2022-05
level_0
0 NaN NaN NaN
1 NaN NaN NaN
2 4 5 6
3 NaN NaN NaN

Set Column Value Based on Calculate Condition from Each Row

I have an empty dataframe:
columns_name = list(str(i) for i in range(10))
dfa = pd.DataFrame(columns=columns_name, index=['A', 'B', 'C', 'D'])
dfa['Count'] = [10, 6, 9, 4]
   0    1    2    3    4    5    6    7    8    9  Count
A  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   10
B  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN    6
C  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN    9
D  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN    4
I want to replace the NaN values with a symbol ('-') in the last max(Count) - Count columns of each row.
So the final result will look like:
   0    1    2    3    4    5    6    7    8    9  Count
A  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   10
B  NaN  NaN  NaN  NaN  NaN  NaN  -    -    -    -      6
C  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  -      9
D  NaN  NaN  NaN  NaN  -    -    -    -    -    -      4
I am stuck at
dfa.at[dfa.index, [str(col) for col in list(range(dfa['Count'].max() - dfa['Count']))]] = '-'
and getting KeyError: 'Count'
Actually, this part of your code, dfa.at[dfa.index, [str(col) for col in list(range(dfa['Count'].max() - dfa['Count']))]] = '-', has an issue.
Just try to create the list that you are trying to use inside the comprehension:
list(range(dfa['Count'].max() - dfa['Count']))
It'll throw a TypeError.
If you look closely, you'll see that (dfa['Count'].max() - dfa['Count']) gives the following series:
A 0
B 4
C 1
D 6
And since you're trying to pass a series to python's range function, it will throw the error.
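A minimal reproduction of that error (the exact message can vary by pandas version):
import pandas as pd
s = pd.Series([0, 4, 1, 6], index=list('ABCD'))
range(s)  # raises TypeError: 'Series' object cannot be interpreted as an integer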
One possible solution might be:
for index, cols in zip(dfa.index,
                       [list(map(str, col)) for col in dfa.apply(
                           lambda x: list(range(x['Count'], dfa['Count'].max())),
                           axis=1).values]):
    dfa.loc[index, cols] = '-'
OUTPUT:
Out[315]:
0 1 2 3 4 5 6 7 8 9 Count
A NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10
B NaN NaN NaN NaN NaN NaN - - - - 6
C NaN NaN NaN NaN NaN NaN NaN NaN NaN - 9
D NaN NaN NaN NaN - - - - - - 4
Broadcasting is also an option:
import pandas as pd
import numpy as np
columns_name = list(str(i) for i in range(10))
dfa = pd.DataFrame(columns=columns_name, index=['A', 'B', 'C', 'D'])
dfa['Count'] = [10, 6, 9, 4]
# Broadcast based on column index (Excluding Count)
m = (
    dfa['Count'].to_numpy()[:, None] == np.arange(0, dfa.shape[1] - 1)
).cumsum(axis=1).astype(bool)
# Grab Columns To Update
non_count_columns = dfa.columns[dfa.columns != 'Count']
# Update based on mask
dfa[non_count_columns] = dfa[non_count_columns].mask(m, '-')
print(dfa)
Output:
0 1 2 3 4 5 6 7 8 9 Count
A NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10
B NaN NaN NaN NaN NaN NaN - - - - 6
C NaN NaN NaN NaN NaN NaN NaN NaN NaN - 9
D NaN NaN NaN NaN - - - - - - 4
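For reference, the intermediate mask m for Count = [10, 6, 9, 4] (shown as ints, a sketch): each row's run of True values starts at its Count position, marking the trailing columns to overwrite with '-'.
print(m.astype(int))
# [[0 0 0 0 0 0 0 0 0 0]
#  [0 0 0 0 0 0 1 1 1 1]
#  [0 0 0 0 0 0 0 0 0 1]
#  [0 0 0 0 1 1 1 1 1 1]]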

Fill missing value by averaging previous row value

I want to fill missing values with the average of the previous N row values; an example is shown below:
N=2
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, np.nan]],
                  columns=list('ABCD'))
DataFrame is like:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN NaN
Result should be:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN (4+2)/2 NaN 5
3 NaN 3.0 NaN (1+5)/2
I am wondering if there is elegant and fast way to achieve this without for loop.
rolling + mean + shift
You will need to modify the logic below to decide how to treat the mean of NaN and another value, for the case where one of the previous two values is null; see the sketch after the output.
df = df.fillna(df.rolling(2).mean().shift())
print(df)
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 3.0 NaN 5.0
3 NaN 3.0 NaN 3.0
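One way to handle that caveat (a sketch, not the only option): min_periods=1 makes a window containing a single non-null value yield that value as its mean, so a window like (4.0, NaN) produces 4.0 instead of NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, np.nan]],
                  columns=list('ABCD'))
# a window with one non-null value contributes that value as its mean
df = df.fillna(df.rolling(2, min_periods=1).mean().shift())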

pandas - checking a condition for each group in a dataframe

I have got a dataframe:
df = pd.DataFrame({'index': range(8),
                   'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                   'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                   'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                   'result': [1, 0, 0, 1, 1, 0, 0, 1]})
df2 = df.pivot_table(values='result', index='index', columns=['variable1', 'variable2', 'variable3'])
df2.loc[4, ('A', 'a', 'x')] = 1
df2.loc[3, ('B', 'a', 'x')] = 1
variable1 A B
variable2 a b a b
variable3 x y x y x y
index
0 1 NaN NaN NaN NaN NaN
1 NaN NaN 0 NaN NaN NaN
2 NaN NaN NaN NaN 0 NaN
3 NaN NaN NaN NaN 1 1
4 1 1 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN 0
6 NaN NaN NaN NaN 0 NaN
7 NaN NaN NaN 1 NaN NaN
Now I want to check for simultaneous occurrences of x == 1 and y == 1, but only within each subgroup, defined by variable1 and variable2. So, for the dataframe shown above, the condition is met for index == 4 (group A-a), but not for index == 3 (groups B-a and B-b).
I suppose some groupby() magic would be needed, but I cannot find the right way. I have also tried experimenting with a stacked dataframe (using df.stack()), but this did not get me any closer...
You can use groupby on the first 2 levels, variable1 and variable2, to get the sum of the x and y columns at that level:
r = df2.groupby(level=[0,1], axis=1).sum()
r
Out[50]:
variable1 A B
variable2 a b a b
index
0 1 NaN NaN NaN
1 NaN 0 NaN NaN
2 NaN NaN 0 NaN
3 NaN NaN 1 1
4 2 NaN NaN NaN
5 NaN NaN NaN 0
6 NaN NaN 0 NaN
7 NaN 1 NaN NaN
Consequently, the rows you are searching for are the ones that contain the value 2:
r[r==2].dropna(how='all')
Out[53]:
variable1 A B
variable2 a b a b
index
4 2 NaN NaN NaN
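If you just want the matching index labels rather than the filtered frame, a small sketch:
hits = r.index[(r == 2).any(axis=1)]
print(hits)  # e.g. Index([4], dtype='int64', name='index')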
