Pandas: Remove values that meet condition

Pandas: Remove values that meet condition - python

Let's say I have data like this:
df = pd.DataFrame({'category': ["blue","red","blue", "blue","green"], 'val1': [5, 3, 2, 2, 5], 'val2':[1, 3, 2, 2, 5], 'val3': [2, 1, 1, 4, 3]})
print(df)
category val1 val2 val3
0 blue 5 1 2
1 red 3 3 1
2 blue 2 2 1
3 blue 2 2 4
4 green 5 5 3
How do I remove (or replace with for example NaN) values that meet a certain condition without removing the entire row or shift the column?
Let's say my condition is that I want to remove all values below 3 from the above data, the result would have to look like this:
category val1 val2 val3
0 blue 5
1 red 3 3
2 blue
3 blue 4
4 green 5 5 3

Use mask:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
If you want to set particular value, for example 0, do:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3, 0)
print(df)
Output
category val1 val2 val3
0 blue 5 0 0
1 red 3 3 0
2 blue 0 0 0
3 blue 0 0 4
4 green 5 5 3
If you just need a few columns, you could do:
df[['val1', 'val2', 'val3']] = df[['val1', 'val2', 'val3']].mask(df[['val1', 'val2', 'val3']] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0

One approach is to create a mask of the values that don't meet the removal criteria.
mask = df[['val1','val2','val3']] > 3
You can then create a new df, that is just the non-removed vals.
updated_df = df[['val1','val2','val3']][mask]
You need to add back in the unaffected columns.
updated_df['category'] = df['category']

You can use applymap or transform to columns containing integers.
df[df.iloc[:,1:].transform(lambda x: x>=3)].fillna('')

Related

Cumsum with nan values - pandas

I want to pass a cumulative sum of unique values to a separate column. However, I want to disregard nan values so it essentially skips these rows and continues the count with the next viable row.
d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"],
}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
intended out:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4

Try:
df['count'] =(df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
OR
as suggested by #SeanBean:
df['count'] =df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0

Here's one way:
NOTE: (you just need to add the where condition):
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0

Pandas pandas backward fill from another column

I have a df like this:
val1 val2
9 3
2 .
9 4
1 .
5 1
How can I use bfill con val2 but referencing val1, such that the dataframe results in:
val1 val2
9 3
2 9
9 4
1 3
5 1
So the missing values con val2 are the previous value BUT from val1

You can fill NA values in the second column with the first column shifted down one row:
>>> import pandas as pd
>>> df = pd.DataFrame({"val1": [9, 2, 9, 1, 5], "val2": [3, None, 4, None, 1]})
>>> df
val1 val2
0 9 3.0
1 2 NaN
2 9 4.0
3 1 NaN
4 5 1.0
>>> df["val2"].fillna(df["val1"].shift(), inplace=True)
>>> df
val1 val2
0 9 3.0
1 2 9.0
2 9 4.0
3 1 9.0
4 5 1.0

I guess you want ffill not bfill:
STEPS:
Use mask to make values in val1 column NaN.
ffill the val1 column and save the result in the variable m.
fill the NaN values in val2 with m.
m = df.val1.mask(df.val2.isna()).fillna(method ='ffill')
df.val2 = df.val2.fillna(m)
OUTPUT:
val1 val2
0 9 3
1 2 9
2 9 4
3 1 9
4 5 1

How to sum or count groups of multiple columns in pandas

I'm trying to group several group of columns to count or sum the rows in a pandas dataframe
I've checked many questions already and the most similar I found is this one > Groupby sum and count on multiple columns in python, but, by what I understand I have to do many steps to reach my goal. and was also looking at this link
As an example, I have the dataframe below:
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
this is what I have achieved so far, so now i will have to sum and then create a dataframe with the results, but if I have 100 groups,this would be very time consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
ALL 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But here is also counting the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
Not sure how to alter the function to get my results, and I've already spend hours, hopefully you can help.
NOTE:
I only use colours in this example to simplify the case, but I could have around many columns and they are called col001 till col300, etc...
So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....

You can use pd.wide_to_long:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update wide_to_long is just a convenient shortcut for merge and rename. So if you have a dictionary {cat:[col_list]}, you could resolve to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary for mapping
inv_group = {v:k for k,v in groups.items()}
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()

The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a longer format, which then reduces the problem to a single groupby
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15

Use:
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
Output:
blue 15
grey 22
red 34
dtype: int64
this works regardless of the number of digits contained in the name of the columns:
df=df.add_suffix('22')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
blue 15
grey 22
red 34
dtype: int64

You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns

Pandas force NaN to bottom of each column at each index

I have a DataFrame where multiple rows span each index. Taking the first index, for example, has such a structure:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.NaN],
[np.NaN, np.NaN, 2.0, np.NaN, 2.0],
[np.NaN, np.NaN, np.NaN, 3.0, 3.0]],
columns=["ID", "Name", "val1", "val2", "val3"],
index=[0, 0, 0])
Out[4]:
ID Name val1 val2 val3
0 A first 1 1 NaN
0 NaN NaN 2 NaN 2
0 NaN NaN NaN 3 3
I would like to sort/order each column such that the NaNs are at the bottom of each column at that given index - a result which looks like this:
ID Name val1 val2 val3
0 A first 1 1 2
0 NaN NaN 2 3 3
0 NaN NaN NaN NaN NaN
A more explicit example might look like this:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.NaN],
[np.NaN, np.NaN, 2.0, np.NaN, 2.0],
[np.NaN, np.NaN, np.NaN, 3.0, 3.0],
["B", "second", 4.0, 4.0, np.NaN],
[np.NaN, np.NaN, 5.0, np.NaN, 5.0],
[np.NaN, np.NaN, np.NaN, 6.0, 6.0]],
columns=[ "ID", "Name", "val1", "val2", "val3"],
index=[0, 0, 0, 1, 1, 1])
Out[5]:
ID Name val1 val2 val3
0 A first 1 1 NaN
0 NaN NaN 2 NaN 2
0 NaN NaN NaN 3 3
1 B second 4 4 NaN
1 NaN NaN 5 NaN 5
1 NaN NaN NaN 6 6
with the desired result to look like this:
ID Name val1 val2 val3
0 A first 1 1 2
0 NaN NaN 2 3 3
0 NaN NaN NaN NaN NaN
1 B second 4 4 5
1 NaN NaN 5 6 6
1 NaN NaN NaN NaN NaN
I have many thousands of rows in this dataframe, with each index containing up to a few hundred rows. My desired result will be very helpful when I to_csv the dataframe.
I have attempted to use sort_values(['val1','val2','val3']) on the whole data frame, but this results in the indices becoming disordered. I have tried to iterate through each index and sort in place, but this too does not restrict the NaN to the bottom of each indices' column. I have also tried to fillna to another value, such as 0, but I have not been successful here, either.
While I am certainly using it wrong, the na_position parameter in sort_values does not produce the desired outcome, though it seems this is likely what want.
Edit:
The final df's index is not required to be in numerical order as in my second example.
By changing ignore_index to False in the single line of #Leb's third code block,
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
to
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=False)
and by creating a temp df for all rows in a given index, I was able to make this work - not pretty, but it orders things how I need them. If someone (certainly) has a better way, please let me know.
new_df = df.ix[0]
new_df = pd.concat([new_df[col].sort_values().reset_index(drop=True) for col in new_df], axis=1, ignore_index=False)
max_index = df.index[-1]
for i in range(1, max_index + 1):
tmp = df.ix[i]
tmp = pd.concat([tmp[col].sort_values().reset_index(drop=True) for col in tmp], axis=1, ignore_index=False)
new_df = pd.concat([new_df,tmp])
In [10]: new_df
Out[10]:
ID Name val1 val2 val3
0 A first 1 1 2
1 NaN NaN 2 3 3
2 NaN NaN NaN NaN NaN
0 B second 4 4 5
1 NaN NaN 5 6 6
2 NaN NaN NaN NaN NaN

I know the issue of pushing nans to an edge has been discussed on github. For your particular frame, I'd probably do it manually at the Python level, and not worry about performance much. Something like
>>> df.groupby(level=0, sort=False).transform(lambda x: sorted(x,key=pd.isnull))
ID Name val1 val2 val3
0 A first 1 1 2
0 NaN NaN 2 3 3
0 NaN NaN NaN NaN NaN
1 B second 4 4 5
1 NaN NaN 5 6 6
1 NaN NaN NaN NaN NaN
should work. Note that since sorted is a stable sort, and we're using pd.isnull as the key (where False < True), we push the NaNs to the end while preserving the order of the remaining objects. Also note that here I'm grouping just on the index; we could alternatively have grouped on whatever we wanted.

Given df:
pd.DataFrame([["A","first",1.0,1.0,np.NaN],
[np.NaN,np.NaN,2.0,np.NaN,2.0],
[np.NaN,np.NaN,np.NaN,3.0,3.0]],
columns=[ "ID", "Name", "val1", "val2", "val3"],index=[0,1,2])
I changed index to make sure order stays.
df
Out[127]:
ID Name val1 val2 val3
0 A first 1 1 NaN
1 NaN NaN 2 NaN 2
2 NaN NaN NaN 3 3
Using:
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Will give:
Out[130]:
0 1 2 3 4
0 A first 1 1 2
1 NaN NaN 2 3 3
2 NaN NaN NaN NaN NaN
Same for:
df = pd.DataFrame([["A","first",1.0,1.0,np.NaN],
[np.NaN,np.NaN,2.0,np.NaN,2.0],
[np.NaN,np.NaN,np.NaN,3.0,3.0],
["B","second",4.0,4.0,np.NaN],
[np.NaN,np.NaN,5.0,np.NaN,5.0],
[np.NaN,np.NaN,np.NaN,6.0,6.0]],
columns=[ "ID", "Name", "val1", "val2", "val3"],index=[0,0,0,1,1,1])
df
Out[132]:
ID Name val1 val2 val3
0 A first 1 1 NaN
0 NaN NaN 2 NaN 2
0 NaN NaN NaN 3 3
1 B second 4 4 NaN
1 NaN NaN 5 NaN 5
1 NaN NaN NaN 6 6
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Out[133]:
0 1 2 3 4
0 A first 1 1 2
1 B second 2 3 3
2 NaN NaN 4 4 5
3 NaN NaN 5 6 6
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
After additional comments
new = pd.concat([df[col].sort_values().reset_index(drop=True) for col in df.iloc[:,2:]], axis=1, ignore_index=True)
new.index = df.index
cols = df.iloc[:,2:].columns
new.columns = cols
df.drop(cols,inplace=True,axis=1)
df = pd.concat([df,new],axis=1)
df
Out[37]:
ID Name val1 val2 val3
0 A first 1 1 2
0 NaN NaN 2 3 3
0 NaN NaN 4 4 5
1 B second 5 6 6
1 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN

In [219]:
df.groupby(level=0).transform(lambda x : x.sort(na_position = 'last' , inplace = False))
Out[219]:
ID Name val1 val2 val3
0 A first 1 1 2
0 NaN NaN 2 3 3
0 NaN NaN NaN NaN NaN
1 B second 4 4 5
1 NaN NaN 5 6 6
1 NaN NaN NaN NaN NaN

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...

with fillna()
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
or as suggested :
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving :
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4

Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4

As an expansion to the answer above, doing frame[["a", "b"]].sum(axis=1) will fill sum of all NaNs as 0
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want the sum of all NaNs to be NaN, you can add the min_count flag as referenced in the docs
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Remove values that meet condition - python

You can use applymap or transform to columns containing integers. df[df.iloc[:,1:].transform(lambda x: x>=3)].fillna('')

Related

Cumsum with nan values - pandas

Pandas pandas backward fill from another column

How to sum or count groups of multiple columns in pandas

Pandas force NaN to bottom of each column at each index

Pandas sum two columns, skipping NaN

Categories

Resources