I have a dataframe that looks like this:
>>> data = {'Count':[15, 21, 1, 7, 6, 1, 25, 8, 56, 0, 5, 9, 0, 12, 12, 8, 7, 12, 0, 8]}
>>> df = pd.DataFrame(data)
>>> df
Count
0 15
1 21
2 1
3 7
4 6
5 1
6 25
7 8
8 56
9 0
10 5
11 9
12 0
13 12
14 12
15 8
16 7
17 12
18 0
19 8
I need to add two columns to this df to detect "floods". A "flood" starts at the row where 'Count' goes above 10 and lasts until 'Count' drops below 5.
So, in this case, I want this as a result:
Count Flood FloodNumber
0 15 True 1
1 21 True 1
2 1 False 0
3 7 False 0
4 6 False 0
5 1 False 0
6 25 True 2
7 8 True 2
8 56 True 2
9 0 False 0
10 5 False 0
11 9 False 0
12 0 False 0
13 12 True 3
14 12 True 3
15 8 True 3
16 7 True 3
17 12 True 3
18 0 False 0
19 8 False 0
I managed to add my 'Flood' column with a simple loop like this:
df.loc[0, 'Flood'] = (df.loc[0, 'Count'] > 10)
for index in range(1, len(df)):
    df.loc[index, 'Flood'] = ((df.loc[index, 'Count'] > 10) | ((df.loc[index-1, 'Flood']) & (df.loc[index, 'Count'] > 5)))
But this seems like an extremely slow and clumsy way of doing it. Is there any "proper" way of doing this using pandas functions rather than loops?
To find Flood flags, we can play with masks and ffill().
df['Flood'] = ((df.Count > 10).where(df.Count > 10)
               .fillna((df.Count > 5).where(df.Count < 5))
               .ffill()
               .astype(bool))
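As an aside, the same persisting flag can also be built with np.select plus ffill; this is only a minimal sketch of an equivalent approach, assuming df is the frame above (the fillna(0) additionally guards against a series that starts between 5 and 10):
import numpy as np

# 1.0 where a flood starts (>10), 0.0 where it ends (<5), NaN in between;
# ffill makes the flag persist until the next end marker
state = np.select([df.Count > 10, df.Count < 5], [1.0, 0.0], default=np.nan)
df['Flood'] = pd.Series(state, index=df.index).ffill().fillna(0).astype(bool)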
To get the FloodNumber, let's ignore all rows which are False in the Flood column, then label consecutive runs with shift + cumsum and number them with groupby + ngroup:
s = df.Flood.where(df.Flood)
df.loc[:, 'FloodNumber'] = s.dropna().groupby((s != s.shift(1)).cumsum()).ngroup().add(1)
Outputs
Count Flood FloodNumber
0 15 True 1.0
1 21 True 1.0
2 1 False NaN
3 7 False NaN
4 6 False NaN
5 1 False NaN
6 25 True 2.0
7 8 True 2.0
8 56 True 2.0
9 0 False NaN
10 5 False NaN
11 9 False NaN
12 0 False NaN
13 12 True 3.0
14 12 True 3.0
15 8 True 3.0
16 7 True 3.0
17 12 True 3.0
18 0 False NaN
19 8 False NaN
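This leaves FloodNumber as floats with NaN for non-flood rows; to match the desired output exactly (integers with 0), one extra step should do it, assuming the code above has already run:
df['FloodNumber'] = df['FloodNumber'].fillna(0).astype(int)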
Related
I have some dataframes which contain a lot of NaN.
I want to make a mask from the first dataframe, then only keep those columns which contain no np.nan in the first dataframe.
Let me give an example:
In [69]: df = pd.DataFrame(np.reshape(range(25), (5,5)))
In [70]: df
Out[70]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [71]: df[5] = np.nan
In [72]: df
Out[72]:
0 1 2 3 4 5
0 0 1 2 3 4 NaN
1 5 6 7 8 9 NaN
2 10 11 12 13 14 NaN
3 15 16 17 18 19 NaN
4 20 21 22 23 24 NaN
### the following is the mask
In [73]: np.isnan(df)
Out[73]:
0 1 2 3 4 5
0 False False False False False True
1 False False False False False True
2 False False False False False True
3 False False False False False True
4 False False False False False True
In [74]: df[~np.isnan(df)]
Out[74]:
0 1 2 3 4 5
0 0 1 2 3 4 NaN
1 5 6 7 8 9 NaN
2 10 11 12 13 14 NaN
3 15 16 17 18 19 NaN
4 20 21 22 23 24 NaN
As you can see, I use np.isnan to create a mask, then use df[mask] to filter.
But it seems to fail: the output still contains column 5. Did I use something incorrectly?
EDIT:
If none of the solutions below work, it means there are no real missing values, only the string 'nan' rather than np.nan.
So a possible solution is to replace them:
df = df.replace('nan', np.nan)
You can create the mask, but you cannot filter by a 2D mask like that; you need a Series (a 1D mask). Add DataFrame.all to test whether every value in a row is missing, and ~ to invert the mask.
So, to keep the rows that are not entirely NaN, use:
df[~np.isnan(df).all(axis=1)]
By the way, in pandas it is simpler - remove all rows with at least one NaN per row:
df = df.dropna()
If you need to filter the rows with at least one NaN:
df[np.isnan(df).any(axis=1)]
Because you cannot filter a DataFrame element-wise with a 2D mask, you can only remove entire rows or columns:
df[~np.isnan(df).all(axis=1)]
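Since the original question is about keeping only the columns without any np.nan, the same idea works column-wise with a 1D mask over axis=0; a sketch, assuming numpy is imported as np and df is the example frame above:
# keep only columns that contain no NaN
df = df.loc[:, ~np.isnan(df).any(axis=0)]
# or, equivalently, in pandas:
df = df.dropna(axis=1)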
I want to get all the rows in a dataset that are between two rows where a certain value is met. Is it possible to do that? I cannot sort the dataset because then all the crucial information will be lost.
Edit:
The dataset contains data as such:
Index | game_clock | quarter | event_type
0     | 711        | 1       | 1
1     | 710        | 1       | 3
2     | 709        | 2       | 4
3     | 708        | 3       | 2
4     | 707        | 4       | 4
5     | 706        | 4       | 1
I want to slice the dataset so that I get subsets of all the rows that are between event_type (1 or 2) and (1 or 2).
Edit 2:
Suppose the dataset is as follows:
A B
0 1 0.278179
1 2 0.069914
2 2 0.633110
3 4 0.584766
4 3 0.581232
5 3 0.677205
6 3 0.687155
7 1 0.438927
8 4 0.320927
9 3 0.570552
10 3 0.479849
11 1 0.861074
12 3 0.834805
13 4 0.105766
14 1 0.060408
15 4 0.596882
16 1 0.792395
17 3 0.226356
18 4 0.535201
19 1 0.136066
20 1 0.372244
21 1 0.151977
22 4 0.429822
23 1 0.792706
24 2 0.406957
25 1 0.177850
26 1 0.909252
27 1 0.545331
28 4 0.100497
29 2 0.718721
The subsets I would like to get are indexed as:
[0], [1], [2], [3:8], [8:12],
[12:15], [15:20], [20], [21], [22:24], [24], [25], [26], [27], [28: ]
I believe you need:
a = pd.factorize(df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index())[0]
print (a)
[ 0 1 2 3 3 3 3 3 4 4 4 4 5 5 5 6 6 7 7 7 8 9 10 10 11
12 13 14 15 15]
dfs = dict(tuple(df.groupby(a)))
print (dfs[0])
A B
0 1 0.278179
print (dfs[1])
A B
1 2 0.069914
print (dfs[2])
A B
2 2 0.63311
print (dfs[3])
A B
3 4 0.584766
4 3 0.581232
5 3 0.677205
6 3 0.687155
7 1 0.438927
print (dfs[4])
A B
8 4 0.320927
9 3 0.570552
10 3 0.479849
11 1 0.861074
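If you prefer a plain list of sub-DataFrames in their original order rather than a dict keyed by group number, a small variation of the same groupby works (sketch):
dfs_list = [g for _, g in df.groupby(a)]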
Explanation:
#compare values to create a boolean mask
a = df['A'].isin([1,2])
#reverse Series
b = df['A'].isin([1,2]).iloc[::-1]
#cumulative sum
c = df['A'].isin([1,2]).iloc[::-1].cumsum()
#get original order
d = df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index()
#factorize for keys in dictionary of DataFrames
e = pd.factorize(df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index())[0]
df = pd.concat([a, pd.Series(b.values), pd.Series(c.values), d, pd.Series(e)],
               axis=1, keys=list('abcde'))
print (df)
a b c d e
0 True True 1 16 0
1 True False 1 15 1
2 True True 2 14 2
3 False True 3 13 3
4 False True 4 13 3
5 False True 5 13 3
6 False True 6 13 3
7 True False 6 13 3
8 False True 7 12 4
9 False True 8 12 4
10 False True 9 12 4
11 True False 9 12 4
12 False False 9 11 5
13 False True 10 11 5
14 True False 10 11 5
15 False True 11 10 6
16 True False 11 10 6
17 False False 11 9 7
18 False True 12 9 7
19 True False 12 9 7
20 True False 12 8 8
21 True False 12 7 9
22 False True 13 6 10
23 True False 13 6 10
24 True False 13 5 11
25 True False 13 4 12
26 True False 13 3 13
27 True True 14 2 14
28 False True 15 1 15
29 True True 16 1 15
That list still doesn't make sense. Sometimes you include the first occurrence, sometimes not. Try this:
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame({'A': np.random.choice([1,2,3,4], 30), 'B':np.random.rand(30)})
ar = np.where(df.A.isin((1,2)))[0]
ids = list(zip(ar,ar[1:]))
for item in ids:
    print(df.iloc[item[0]:item[1],:])
ids are now:
[(0, 1), (1, 2), (2, 7), (7, 11), (11, 14), (14, 16), (16, 19), (19, 20),
(20, 21), (21, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 29)]
This will include the 1 or 2 at the start of each slice and stop just before the next 1 or 2 (the end index is exclusive).
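Note that zip(ar, ar[1:]) stops at the last boundary, so the trailing slice from the last 1 or 2 to the end of the frame is never printed; if you also want it, one option (a sketch) is to append an artificial end boundary:
ar = np.where(df.A.isin((1, 2)))[0]
ar = np.append(ar, len(df))      # artificial end boundary
ids = list(zip(ar, ar[1:]))      # now also yields the trailing slice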
If one has the following column:
df = pd.DataFrame({"numbers":[1,2,3,4,4,5,1,2,2,3,4,5,6,7,7,8,1,1,2,2,3,4,5,6,6,7]})
How can one "iron" it out so that the duplicates become part of the series of numbers:
numbers new_numbers
1 1
2 2
3 3
4 4
4 5
5 6
1 1
2 2
2 3
3 4
4 5
5 6
6 7
7 8
7 9
8 10
1 1
1 2
2 3
2 4
3 5
4 6
5 7
6 8
6 9
7 10
(I put spaces into the df for clarification)
It seems you need cumcount over groups: compare diff with lt (<) to find the start of each group, then build the group labels with cumsum:
#helper df1 to show the intermediate steps
df1 = pd.DataFrame(index=df.index)
df1['dif'] = df.numbers.diff()
df1['compare'] = df.numbers.diff().lt(0)
df1['groups'] = df.numbers.diff().lt(0).cumsum()
print (df1)
dif compare groups
0 NaN False 0
1 1.0 False 0
2 1.0 False 0
3 1.0 False 0
4 0.0 False 0
5 1.0 False 0
6 -4.0 True 1
7 1.0 False 1
8 0.0 False 1
9 1.0 False 1
10 1.0 False 1
11 1.0 False 1
12 1.0 False 1
13 1.0 False 1
14 0.0 False 1
15 1.0 False 1
16 -7.0 True 2
17 0.0 False 2
18 1.0 False 2
19 0.0 False 2
20 1.0 False 2
21 1.0 False 2
22 1.0 False 2
23 1.0 False 2
24 0.0 False 2
25 1.0 False 2
df['new_numbers'] = df.groupby(df.numbers.diff().lt(0).cumsum()).cumcount() + 1
print (df)
numbers new_numbers
0 1 1
1 2 2
2 3 3
3 4 4
4 4 5
5 5 6
6 1 1
7 2 2
8 2 3
9 3 4
10 4 5
11 5 6
12 6 7
13 7 8
14 7 9
15 8 10
16 1 1
17 1 2
18 2 3
19 2 4
20 3 5
21 4 6
22 5 7
23 6 8
24 6 9
25 7 10
OK, I have this pandas dataframe:
import pandas
dfp=pandas.DataFrame([5,10,1,7,13,4,5,7,8,10,11,3])
And I want to create a second dataframe with the rows that have a value greater than 5, so I do:
dfp2=dfp[dfp>5]
My problem is that I obtain this result:
0
0 NaN
1 10
2 NaN
3 7
4 13
5 NaN
6 NaN
7 7
8 8
9 10
10 11
11 NaN
And what I want is this other result:
0
0 10
1 7
2 13
3 7
4 8
5 10
6 11
What is wrong with my code?
Thanks a lot
You're indexing with the mask generated from the comparison, so where it's False you get NaN; to get rid of those rows, call dropna:
In [32]:
dfp[dfp > 5].dropna()
Out[32]:
0
1 10
3 7
4 13
7 7
8 8
9 10
10 11
The mask here:
In [33]:
dfp > 5
Out[33]:
0
0 False
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 True
10 True
11 False
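Alternatively, since dfp has only the single column 0, you can build a 1-D boolean mask from that column and index with it directly, which never introduces NaNs in the first place (a sketch; the reset_index is only needed if you want the 0..6 index shown in the desired output):
dfp2 = dfp[dfp[0] > 5]                 # Series mask: keeps only the matching rows
dfp2 = dfp2.reset_index(drop=True)     # optional: renumber the index 0..6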
I'm trying to take a slice view from a Series (logically indexed by a conditional), process it, then assign the result back to that logically-indexed slice.
The LHS and RHS in the assignment are Series with matching indices, but the assignment ends up being scalar for some unknown reason (see bottom). How do I get the desired assignment? (I checked SO and the pandas 0.11.0 docs for anything related.)
import numpy as np
import pandas as pd
# A dataframe with sample data and some boolean conditional
df = pd.DataFrame(data={'x': range(1,20)})
df['cond'] = df.x.apply(lambda xx: ((xx%3)==1) )
# Create a new col and selectively assign to it... elsewhere being NaN...
df['newcol'] = np.nan
# This attempted assign to a view of the df doesn't work (in reality the RHS expression would actually be a return value from somefunc)
df.ix[df.cond, df.columns.get_loc('newcol')] = 2* df.ix[df.cond, df.columns.get_loc('x')]
# yet a scalar assign does...
df.ix[df.cond, df.columns.get_loc('newcol')] = 99.
# Likewise bad trying to use -df.cond as the logical index:
df.ix[-df.cond, df.columns.get_loc('newcol')] = 2* df.ix[-df.cond, df.columns.get_loc('x')]
Currently I just get a stupid scalar assign:
>>> df.ix[-df.cond, df.columns.get_loc('newcol')] = 2* df.ix[-df.cond, df.columns.get_loc('x')]
>>> df
x cond newcol
0 1 True NaN
1 2 False 4
2 3 False 4
3 4 True NaN
4 5 False 4
5 6 False 4
6 7 True NaN
7 8 False 4
8 9 False 4
9 10 True NaN
10 11 False 4
11 12 False 4
12 13 True NaN
13 14 False 4
14 15 False 4
15 16 True NaN
16 17 False 4
17 18 False 4
18 19 True NaN
In [21]: df = pd.DataFrame(data={'x': range(1,20)})
In [22]: df['cond'] = df.x.apply(lambda xx: ((xx%3)==1) )
In [23]: df
Out[23]:
x cond
0 1 True
1 2 False
2 3 False
3 4 True
4 5 False
5 6 False
6 7 True
7 8 False
8 9 False
9 10 True
10 11 False
11 12 False
12 13 True
13 14 False
14 15 False
15 16 True
16 17 False
17 18 False
18 19 True
In [24]: df['newcol'] = 2*df.loc[df.cond, 'x']
In [25]: df
Out[25]:
x cond newcol
0 1 True 2
1 2 False NaN
2 3 False NaN
3 4 True 8
4 5 False NaN
5 6 False NaN
6 7 True 14
7 8 False NaN
8 9 False NaN
9 10 True 20
10 11 False NaN
11 12 False NaN
12 13 True 26
13 14 False NaN
14 15 False NaN
15 16 True 32
16 17 False NaN
17 18 False NaN
18 19 True 38
In [10]: def myfunc(df_):
....: return 2 * df_
....:
In [26]: df['newcol'] = myfunc(df.ix[df.cond, df.columns.get_loc('newcol')])
In [27]: df
Out[27]:
x cond newcol
0 1 True 4
1 2 False NaN
2 3 False NaN
3 4 True 16
4 5 False NaN
5 6 False NaN
6 7 True 28
7 8 False NaN
8 9 False NaN
9 10 True 40
10 11 False NaN
11 12 False NaN
12 13 True 52
13 14 False NaN
14 15 False NaN
15 16 True 64
16 17 False NaN
17 18 False NaN
18 19 True 76
I found this workaround:
tmp = pd.Series(np.repeat(np.nan, len(df)))
tmp[df.cond] = 2 * df.loc[df.cond, 'x']
df['newcol'] = tmp
Strangely, the following sometimes works (assigning the slice to the entire Series),
but it fails with a more complex RHS, raising AssertionError: Length of values does not match length of index.
(According to the pandas docs, the RHS Series index is supposed to be aligned to the LHS, at least when the LHS is a DataFrame - but not when it's a Series? Is this a bug?)
>>> df['newcol'] = 2* df.loc[df.cond, 'x']
>>> df
x cond newcol
0 1 True 2
1 2 False NaN
2 3 False NaN
3 4 True 8
4 5 False NaN
5 6 False NaN
6 7 True 14
7 8 False NaN
8 9 False NaN
9 10 True 20
10 11 False NaN
11 12 False NaN
12 13 True 26
13 14 False NaN
14 15 False NaN
15 16 True 32
16 17 False NaN
17 18 False NaN
18 19 True 38
Jeff, what's weird is that we can assign to df['newcol'] (which is supposed to be a copy, not a view, right?)
when we do:
df['newcol'] = 2* df.loc[df.cond, 'x']
but not when we do the same with the RHS coming from a function:
def myfunc(df_):
"""Some func transforming and returning said Series slice"""
return 2* df_
df['newcol'] = myfunc( df.ix[df.cond, df.columns.get_loc('newcol')] )
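For what it's worth, on current pandas (where .ix no longer exists) the masked assignment works directly with .loc, because the RHS Series index is aligned to the selected rows; this is only a minimal sketch of the same example, not the original .ix code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(1, 20)})
df['cond'] = df.x % 3 == 1
df['newcol'] = np.nan

# assign only where cond is True; unselected rows keep their NaN
df.loc[df.cond, 'newcol'] = 2 * df.loc[df.cond, 'x']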