I want to check whether any value of column 'c' is smaller than any of the previous values in that column.
In my current approach I am using pandas diff(), but it only lets me compare to the previous value.
import pandas as pd
df = pd.DataFrame({'c': [1, 4, 9, 7, 8, 36]})
df['diff'] = df['c'].diff() < 0
print(df)
Current result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 False
5 36 False
Wanted result:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
So row 4 should also result in a True, as 8 is smaller than 9.
Thanks
This should work:
df['diff'] = df['c'] < df['c'].cummax()
Output is just as you mentioned:
c diff
0 1 False
1 4 False
2 9 False
3 7 True
4 8 True
5 36 False
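To see why this works, it can help to look at the running maximum as its own column. The sketch below is only an illustration on the question's data; the column name running_max is my own and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 4, 9, 7, 8, 36]})

# Running maximum of 'c' up to and including the current row.
df['running_max'] = df['c'].cummax()

# The running maximum includes the current value, so a value that is
# strictly below it must be below at least one earlier value.
df['diff'] = df['c'] < df['running_max']

print(df)
```

Note that this flags values smaller than at least one earlier value, which is what the wanted result in the question shows.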
Related
I am using Python and have the following Pandas Dataframe:
idx result grouping
1 False
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
What I would like is to do the following logic...
if the result is False then I want grouping to be the idx value.
if the result is True then I want the grouping to be the previous grouping value
So the end result will be:
idx result grouping
1 False 1
2 True 1
3 True 1
4 False 4
5 True 4
6 True 4
7 True 4
8 False 8
9 True 8
10 True 8
11 True 8
12 True 8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
Examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
Mask the True values, then forward fill:
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
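As far as I know, newer pandas releases deprecate the downcast argument of ffill, so here is a sketch of the same mask-then-forward-fill idea with an explicit cast instead. It rebuilds the question's data and assumes the first row has result == False, otherwise a NaN would remain and the cast would fail.

```python
import pandas as pd

df = pd.DataFrame({
    'idx': range(1, 13),
    'result': [False, True, True, False, True, True, True,
               False, True, True, True, True],
})

# Hide idx wherever result is True, leaving NaN in those rows.
masked = df['idx'].mask(df['result'])

# Carry the last visible idx forward, then restore the integer dtype.
# Assumption: the first row has result == False, so no NaN is left over.
df['grouping'] = masked.ffill().astype(int)

print(df)
```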
I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating conditions,
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
After evaluating the conditions, I want to aggregate the results per id.
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies and aggregation per group (here I am assuming the min as your example showed that 1/1/0 -> 0 and 1/1/1 -> 1, but you can use other logics, e.g. last if you want to get the last row per group after sorting by date):
out = (pd
.get_dummies(df[['color', 'size']])
.groupby(df['id'])
.min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
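For the A/B/C frame shown in the question, a minimal sketch along the same lines might look like this. It assumes that a flag should be True for an id only when it is True in every row of that id (a logical AND, the boolean equivalent of taking the min), which is what the wanted output suggests but the question does not state outright.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'A': [1, 5, 6, 1, 4],
    'B': [1, 7, 9, 5, 6],
    'C': [1, 2, 3, 4, 2],
})

# Row-wise conditions, one boolean column per comparison.
flags = pd.DataFrame({
    'a_greater_than_b': df['A'] > df['B'],
    'b_greater_than_c': df['B'] > df['C'],
    'c_greater_than_a': df['C'] > df['A'],
})

# Assumption: a flag holds for an id only if it holds in every row of that id.
out = flags.groupby(df['id']).all()

print(out)
```

If "True in at least one row" is what you want instead, swapping all() for any() should do it.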
I have a Pandas dataframe that includes a grouping variable. An example can be produced using:
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
...which looks like:
grp data
0 a 4
1 a 5
2 b 3
3 b 6
4 b 7
5 c 8
6 d 9
7 d 8
8 d 7
9 d 3
I can retrieve the last two rows of each group using:
dfgrp = df.groupby('grp').tail(2)
However, I would like to produce a mask that identifies the last two rows (or 1 row if only 1 exists), ideally producing an output that looks like:
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
I thought this would be relatively straight-forward but I haven't been able to find the solution. Suggestions would be greatly appreciated.
If your index is unique, you could do this by using isin.
import pandas as pd
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
'data':[4,5,3,6,7,8,9,8,7,3]})
df['mask'] = df.index.isin(df.groupby('grp').tail(2).index)
df
grp data mask
0 a 4 True
1 a 5 True
2 b 3 False
3 b 6 True
4 b 7 True
5 c 8 True
6 d 9 False
7 d 8 False
8 d 7 True
9 d 3 True
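If the index is not unique, one alternative that does not rely on index labels is to count the rows of each group from the end with cumcount. This is a sketch of that idea on the same data; the name pos_from_end is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd', 'd'],
                   'data': [4, 5, 3, 6, 7, 8, 9, 8, 7, 3]})

# Number the rows of each group from the end:
# the last row gets 0, the one before it gets 1, and so on.
pos_from_end = df.groupby('grp').cumcount(ascending=False)

# The last two rows of each group are those with position 0 or 1.
df['mask'] = pos_from_end < 2

print(df)
```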
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code, creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change in the code so that it assigns True to only a given number n of the matching rows.
The actual data frame has more than 3 million rows. Sometimes, I would want
df.loc[filter,'selected']= True to apply to only 100 rows [the actual number of rows could be more or less than 100].
I believe you need to filter by the values defined in a list first with isin, and then use GroupBy.head for the top 2 values per city:
cities= [2,3]
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
Define a list of the elements to check and pass it to isin on the city column, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
As a matter of fact, if you only need the first rows to be read (let's say the first 5 values):
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
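Note that head(5) limits the result to the first 5 rows of the frame, not to the first 5 matches. If the goal is to flag only the first n rows that satisfy the filter, one possible sketch is to combine the mask with its own running count. The value n = 2 and the names mask and limited are placeholders of mine.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2],
                   'city': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2]})

n = 2  # hypothetical limit; 100 in the question's real use case
mask = df['city'] == 2

# cumsum() counts the matches seen so far, so this keeps only the first n of them.
limited = mask & (mask.cumsum() <= n)

# As in the question, only the selected rows get True; the rest stay NaN.
df.loc[limited, 'selected'] = True

print(df)
```

This stays vectorized, so it should scale reasonably to a frame with millions of rows, though I have not benchmarked it.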
I would like to get a subset of a pandas dataframe with boolean indexing.
The condition I want to test is like (df[var_0] == value_0) & ... & (df[var_n] == value_n) where the number n of variables involved can change. As a result I am not able to write :
df = df[(df[var_0] == value_0) & ... & (df[var_n] == value_n)]
I could do something like :
for k in range(0,n+1) :
df = df[df[var_k] == value_k]
(with some try/except to make sure it works if the dataframe becomes empty), but that does not seem very efficient.
Does anyone have an idea how to write that in a clean pandas formulation?
The isin method should work for you here.
In [7]: df
Out[7]:
a b c d e
0 6 3 1 9 6
1 8 9 5 7 2
2 6 4 7 4 3
3 4 8 0 0 5
4 4 4 2 3 4
5 2 5 9 0 9
6 4 8 2 9 1
7 3 0 8 9 7
8 0 5 9 9 6
9 0 7 8 4 8
[10 rows x 5 columns]
In [8]: vals = {'a': [3], 'b': [0], 'c': [8], 'd': [9], 'e': [7]}
In [9]: df.isin(vals)
Out[9]:
a b c d e
0 False False False True False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
5 False False False False False
6 False False False True False
7 True True True True True
8 False False False True False
9 False False True False False
[10 rows x 5 columns]
In [10]: df[df.isin(vals).all(1)]
Out[10]:
a b c d e
7 3 0 8 9 7
[1 rows x 5 columns]
The values in the vals dict need to be a collection, so I put them into length 1 lists. It's possible that query can do this too.
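If only some of the columns are constrained, another option is to build the mask up one condition at a time instead of filtering the frame repeatedly. This is a rough sketch with made-up data; the conditions dict stands in for the question's var_k / value_k pairs.

```python
import pandas as pd

df = pd.DataFrame({'a': [6, 8, 3, 3],
                   'b': [3, 9, 0, 0],
                   'c': [1, 5, 8, 2]})

# Hypothetical conditions: column name -> required value; any number of entries.
conditions = {'a': 3, 'b': 0, 'c': 8}

# Start with an all-True mask and AND in one equality test per condition.
mask = pd.Series(True, index=df.index)
for col, val in conditions.items():
    mask = mask & (df[col] == val)

subset = df[mask]
print(subset)
```

The frame is indexed only once at the end, so an empty intermediate result needs no special handling.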