Pandas backward fill from another column - python

I have a df like this:
val1 val2
9 3
2 .
9 4
1 .
5 1
How can I use bfill on val2 but referencing val1, such that the dataframe results in:
val1 val2
9 3
2 9
9 4
1 9
5 1
So the missing values in val2 are filled with the previous row's value, BUT taken from val1.

You can fill NA values in the second column with the first column shifted down one row:
>>> import pandas as pd
>>> df = pd.DataFrame({"val1": [9, 2, 9, 1, 5], "val2": [3, None, 4, None, 1]})
>>> df
val1 val2
0 9 3.0
1 2 NaN
2 9 4.0
3 1 NaN
4 5 1.0
>>> df["val2"].fillna(df["val1"].shift(), inplace=True)
>>> df
val1 val2
0 9 3.0
1 2 9.0
2 9 4.0
3 1 9.0
4 5 1.0

I guess you want ffill, not bfill:
STEPS:
Use mask to set the values in val1 to NaN wherever val2 is missing.
ffill the masked val1 column and save the result in the variable m.
Fill the NaN values in val2 with m.
m = df.val1.mask(df.val2.isna()).ffill()
df.val2 = df.val2.fillna(m)
OUTPUT:
val1 val2
0 9 3
1 2 9
2 9 4
3 1 9
4 5 1
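For the example frame, m works out to [9, 9, 9, 9, 5]. Both answers agree on this data, but they diverge once val2 has consecutive gaps: shift(1) takes val1 from the row directly above, whose val2 may itself be missing, while mask-and-ffill carries forward the last val1 that sat next to a valid val2. A minimal sketch of the difference, on a hypothetical variant of the question's data:

import pandas as pd

df = pd.DataFrame({"val1": [9, 2, 9, 1, 5],
                   "val2": [3, None, None, None, 1]})

# shift-based fill: each gap takes val1 from the row directly above
df["val2"].fillna(df["val1"].shift())            # -> [3, 9, 2, 9, 1]

# mask-and-ffill: gaps take the last val1 seen next to a valid val2
m = df["val1"].mask(df["val2"].isna()).ffill()
df["val2"].fillna(m)                             # -> [3, 9, 9, 9, 1]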

Related

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index(), it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
col1 col2
0 A 9
1 B 8
Your solution works fine for me, provided the NaNs are real missing values.
Here is alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to real missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
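A minimal runnable sketch of both approaches, on a hypothetical frame with real NaN values in col2:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A'],
                   'col2': [3, 4, np.nan, 8, np.nan]})

# 'last' skips NaN, so each group yields its last valid value
df.groupby(['col1'], sort=False).agg({'col2': 'last'}).reset_index()
#   col1  col2
# 0    A   3.0
# 1    B   8.0

# alternative: drop the NaNs, then keep the last remaining row per group
df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')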

Pandas: Remove values that meet condition

Let's say I have data like this:
df = pd.DataFrame({'category': ["blue","red","blue", "blue","green"], 'val1': [5, 3, 2, 2, 5], 'val2':[1, 3, 2, 2, 5], 'val3': [2, 1, 1, 4, 3]})
print(df)
category val1 val2 val3
0 blue 5 1 2
1 red 3 3 1
2 blue 2 2 1
3 blue 2 2 4
4 green 5 5 3
How do I remove (or replace with, for example, NaN) values that meet a certain condition, without removing the entire row or shifting the column?
Let's say my condition is that I want to remove all values below 3 from the above data, the result would have to look like this:
category val1 val2 val3
0 blue 5
1 red 3 3
2 blue
3 blue 4
4 green 5 5 3
Use mask:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
If you want to set a particular value, for example 0, do:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3, 0)
print(df)
Output
category val1 val2 val3
0 blue 5 0 0
1 red 3 3 0
2 blue 0 0 0
3 blue 0 0 4
4 green 5 5 3
If you just need a few columns, you could do:
df[['val1', 'val2', 'val3']] = df[['val1', 'val2', 'val3']].mask(df[['val1', 'val2', 'val3']] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
One approach is to create a mask of the values that don't meet the removal criteria.
mask = df[['val1','val2','val3']] > 3
You can then create a new df, that is just the non-removed vals.
updated_df = df[['val1','val2','val3']][mask]
You need to add back in the unaffected columns.
updated_df['category'] = df['category']
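Putting those pieces together, a minimal end-to-end sketch (keeping values >= 3, per the example above):

import pandas as pd

df = pd.DataFrame({'category': ["blue", "red", "blue", "blue", "green"],
                   'val1': [5, 3, 2, 2, 5],
                   'val2': [1, 3, 2, 2, 5],
                   'val3': [2, 1, 1, 4, 3]})

# values failing the condition become NaN; rows and columns stay in place
mask = df[['val1', 'val2', 'val3']] >= 3
updated_df = df[['val1', 'val2', 'val3']][mask]

# add back the unaffected column
updated_df['category'] = df['category']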
You can also build the boolean mask with transform on the integer columns. Note that indexing df with a mask that lacks the category column blanks that column out, so restore it afterwards:
out = df[df.iloc[:, 1:].transform(lambda x: x >= 3)].fillna('')
out['category'] = df['category']

Creating child pandas dataframe from a mother dataframe based on condition

I have a dataframe as follows. Actually this dataframe is made by an outer join of two tables.
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
Objective: I want a resultant dataframe based on column conditions. The conditions are given below
if ishod == 1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 3 17
if ishod!=1 and isdesignatedhod==1 the resultant df will be:
IndentID IndentNo role_name role_id user_id
100 1234 xyz 100 100
I am really clueless on how to proceed on this. Any clue will be appreciated!!
To select rows based on a value in a certain column, you can use the following notation:
df[ df["column_name"] == value_to_keep ]
Here is an example of this in action:
import pandas as pd
d = {'col1': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1],
'col2': [3,4,5,3,4,5,3,4,5,3,4,5,3,4,5],
'col3': [6,7,8,9,6,7,8,9,6,7,8,9,6,7,8]}
# create a dataframe
df = pd.DataFrame(d)
This is what df looks like:
In [17]: df
Out[17]:
col1 col2 col3
0 1 3 6
1 2 4 7
2 1 5 8
3 2 3 9
4 1 4 6
5 2 5 7
6 1 3 8
7 2 4 9
8 1 5 6
9 2 3 7
10 1 4 8
11 2 5 9
12 1 3 6
13 2 4 7
14 1 5 8
Now to select all rows for which the value is '2' in the first column:
df_1 = df[df["col1"] == 2]
In [19]: df_1
Out [19]:
col1 col2 col3
1 2 4 7
3 2 3 9
5 2 5 7
7 2 4 9
9 2 3 7
11 2 5 9
13 2 4 7
You can also combine multiple conditions this way:
df_2 = df[(df["col2"] >= 4) & (df["col3"] != 7)]
In [22]: df_2
Out [22]:
col1 col2 col3
2 1 5 8
4 1 4 6
7 2 4 9
8 1 5 6
10 1 4 8
11 2 5 9
14 1 5 8
Hope this example helps!
Andre gives the right answer. Also keep in mind the dtype of the ishod and isdesignatedhod columns. They are "object" dtype, in this specific case strings.
So you have to use quotes when comparing these object columns with numbers:
df[df["ishod"] == "1"]
This should do approximately what you want:
import io

import pandas as pd

nan = float("nan")

def func(row):
    if row["ishod"] == "1":
        return pd.Series([100, 1234, "xyz", 3, 17, nan, nan, nan], index=row.index)
    elif row["isdesignatedhod"] == "1":
        return pd.Series([100, 1234, "xyz", 100, 100, nan, nan, nan], index=row.index)
    else:
        return row
pd.read_csv(io.StringIO(
"""IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
100 1234 xyz 3 17 1 nan right_only
nan nan nan -1 -1 None None right_only
nan nan nan 1 15 None None right_only
nan nan nan 100 100 None 1 right_only
"""), sep=" +", engine='python')\
    .apply(func, axis=1)
Output:
IndentID IndentNo role_name role_id user_id ishod isdesignatedhod Flag
0 100.0 1234.0 xyz 3 17 NaN NaN NaN
1 NaN NaN NaN -1 -1 None None right_only
2 NaN NaN NaN 1 15 None None right_only
3 100.0 1234.0 xyz 100 100 NaN NaN NaN
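One way to read the question is: forward-fill the identifying columns from the row where they are known, then select rows by flag. A sketch, assuming the flag columns hold strings as noted in the dtype answer above:

id_cols = ["IndentID", "IndentNo", "role_name"]
keep = ["IndentID", "IndentNo", "role_name", "role_id", "user_id"]

filled = df.copy()
filled[id_cols] = filled[id_cols].ffill()

df_hod = filled[filled["ishod"] == "1"][keep]
df_designated = filled[(filled["ishod"] != "1")
                       & (filled["isdesignatedhod"] == "1")][keep]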

Add series to dataframe which is an intersection of two others

Suppose I have a dataframe like this
column1 column2
1 8
2 9
20 1
4 2
56
6
2
I want a result like this :
column1 column2 column3
1 8 1
2 9 2
20 1
4 2
56
6
2
So I want column3 to contain the values that appear in both column1 and column2.
Using set.intersection with pd.DataFrame.loc:
import numpy as np

L = list(set(df['column1']) & set(df['column2']))
df.loc[np.arange(len(L)), 'column3'] = L
print(df)
column1 column2 column3
0 1 8.0 1.0
1 2 9.0 2.0
2 20 1.0 NaN
3 4 2.0 NaN
4 56 NaN NaN
5 6 NaN NaN
6 2 NaN NaN
You should be aware this isn't vectorised and goes somewhat against the grain of Pandas / NumPy, hence a solution which uses regular Python objects:
column1 = [1, 2, 20, 4, 56, 6, 2]
column2 = [8, 9, 1, 2]
list_1 = []
for item1 in column1:
    for item2 in column2:
        if item1 == item2:
            list_1.append(item1)
z = list(set(list_1))
print(z)
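A more pandas-native sketch of the same idea on the question's df, using isin (result order follows column1; duplicates are dropped):

common = df.loc[df['column1'].isin(df['column2']), 'column1'].unique()
df.loc[range(len(common)), 'column3'] = common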

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving :
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Expanding on the answer above: frame[["a", "b"]].sum(axis=1) returns 0 for rows in which every value is NaN (note the extra all-NaN row appended to the frame below):
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want an all-NaN row to sum to NaN instead, pass min_count=1, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
