I want to get all the rows in a dataset that lie between two rows where a certain condition is met. Is it possible to do that? I cannot sort the dataset, because then all the crucial information would be lost.
Edit:
The dataset contains data as such:
Index| game_clock| quarter | event_type
0 | 711 | 1 | 1
1 | 710 | 1 | 3
2 | 709 | 2 | 4
3 | 708 | 3 | 2
4 | 707 | 4 | 4
5 | 706 | 4 | 1
I want to slice the dataset so that I get subsets of all the rows that lie between one row where event_type is 1 or 2 and the next row where event_type is 1 or 2.
Edit 2:
Suppose the dataset is as follows:
A B
0 1 0.278179
1 2 0.069914
2 2 0.633110
3 4 0.584766
4 3 0.581232
5 3 0.677205
6 3 0.687155
7 1 0.438927
8 4 0.320927
9 3 0.570552
10 3 0.479849
11 1 0.861074
12 3 0.834805
13 4 0.105766
14 1 0.060408
15 4 0.596882
16 1 0.792395
17 3 0.226356
18 4 0.535201
19 1 0.136066
20 1 0.372244
21 1 0.151977
22 4 0.429822
23 1 0.792706
24 2 0.406957
25 1 0.177850
26 1 0.909252
27 1 0.545331
28 4 0.100497
29 2 0.718721
The subsets I would like to get are indexed as:
[0], [1], [2], [3:8], [8:12],
[12:15], [15:20], [20], [21], [22:24], [24], [25], [26], [27], [28: ]
I believe you need:
a = pd.factorize(df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index())[0]
print (a)
[ 0 1 2 3 3 3 3 3 4 4 4 4 5 5 5 6 6 7 7 7 8 9 10 10 11
12 13 14 15 15]
dfs = dict(tuple(df.groupby(a)))
print (dfs[0])
A B
0 1 0.278179
print (dfs[1])
A B
1 2 0.069914
print (dfs[2])
A B
2 2 0.63311
print (dfs[3])
A B
3 4 0.584766
4 3 0.581232
5 3 0.677205
6 3 0.687155
7 1 0.438927
print (dfs[4])
A B
8 4 0.320927
9 3 0.570552
10 3 0.479849
11 1 0.861074
Explanation:
#check values to boolean mask
a = df['A'].isin([1,2])
#reverse Series
b = df['A'].isin([1,2]).iloc[::-1]
#cumulative sum
c = df['A'].isin([1,2]).iloc[::-1].cumsum()
#get original order
d = df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index()
#factorize for keys in dictionary of DataFrames
e = pd.factorize(df['A'].isin([1,2]).iloc[::-1].cumsum().sort_index())[0]
df = pd.concat([a,pd.Series(b.values),pd.Series(c.values),d,pd.Series(e)],
axis=1, keys=list('abcde'))
print (df)
a b c d e
0 True True 1 16 0
1 True False 1 15 1
2 True True 2 14 2
3 False True 3 13 3
4 False True 4 13 3
5 False True 5 13 3
6 False True 6 13 3
7 True False 6 13 3
8 False True 7 12 4
9 False True 8 12 4
10 False True 9 12 4
11 True False 9 12 4
12 False False 9 11 5
13 False True 10 11 5
14 True False 10 11 5
15 False True 11 10 6
16 True False 11 10 6
17 False False 11 9 7
18 False True 12 9 7
19 True False 12 9 7
20 True False 12 8 8
21 True False 12 7 9
22 False True 13 6 10
23 True False 13 6 10
24 True False 13 5 11
25 True False 13 4 12
26 True False 13 3 13
27 True True 14 2 14
28 False True 15 1 15
29 True True 16 1 15
That list still doesn't make sense. Sometimes you include the first occurrence, sometimes not. Try this:
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame({'A': np.random.choice([1,2,3,4], 30), 'B':np.random.rand(30)})
ar = np.where(df.A.isin((1,2)))[0]
ids = list(zip(ar,ar[1:]))
for item in ids:
    print(df.iloc[item[0]:item[1], :])
ids are now:
[(0, 1), (1, 2), (2, 7), (7, 11), (11, 14), (14, 16), (16, 19), (19, 20),
(20, 21), (21, 23), (23, 24), (24, 25), (25, 26), (26, 27), (27, 29)]
This will include 1 or 2 in the start and stop at 1,2 in the end.
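To turn those (start, stop) pairs into actual sub-frames, including the tail after the last marker, here is a sketch along the same lines (the names marks, bounds, subsets are mine, not from the answer above):

```python
import numpy as np
import pandas as pd

np.random.seed(314)
df = pd.DataFrame({'A': np.random.choice([1, 2, 3, 4], 30),
                   'B': np.random.rand(30)})

# positions of the marker rows (A is 1 or 2)
marks = np.where(df['A'].isin((1, 2)))[0]

# half-open slices [start, next_start), plus the tail after the last marker
bounds = list(zip(marks, np.append(marks[1:], len(df))))
subsets = [df.iloc[lo:hi] for lo, hi in bounds]
```

Each subset starts at a marker row and runs up to (but not including) the next one, so together they cover every row from the first marker onward.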
I have a dataframe that looks like this:
>>> data = {'Count':[15, 21, 1, 7, 6, 1, 25, 8, 56, 0, 5, 9, 0, 12, 12, 8, 7, 12, 0, 8]}
>>> df = pd.DataFrame(data)
>>> df
Count
0 15
1 21
2 1
3 7
4 6
5 1
6 25
7 8
8 56
9 0
10 5
11 9
12 0
13 12
14 12
15 8
16 7
17 12
18 0
19 8
I need to add two columns to this df to detect "floods". "Flood" is defined as from the row where 'Count' goes above 10 and until 'Count' drops below 5.
So, in this case, I want this as a result:
Count Flood FloodNumber
0 15 True 1
1 21 True 1
2 1 False 0
3 7 False 0
4 6 False 0
5 1 False 0
6 25 True 2
7 8 True 2
8 56 True 2
9 0 False 0
10 5 False 0
11 9 False 0
12 0 False 0
13 12 True 3
14 12 True 3
15 8 True 3
16 7 True 3
17 12 True 3
18 0 False 0
19 8 False 0
I managed to add my 'Flood' column with a simple loop like this:
df.loc[0, 'Flood'] = (df.loc[0, 'Count'] > 10)
for index in range(1, len(df)):
    df.loc[index, 'Flood'] = ((df.loc[index, 'Count'] > 10) | ((df.loc[index-1, 'Flood']) & (df.loc[index, 'Count'] > 5)))
, but this seems like an extremely slow and stupid way of doing it. Is there any "proper" way of doing this using pandas functions rather than loops?
To find Flood flags, we can play with masks and ffill().
df['Flood'] = ((df.Count > 10).where(df.Count > 10)
.fillna((df.Count > 5)
.where(df.Count < 5))
.ffill()
.astype(bool))
To get the FloodNumber, let's ignore all rows which are False in the Flood column and use groupby + cumsum:
s = df.Flood.where(df.Flood)
df.loc[:, 'FloodNumber'] = s.dropna().groupby((s != s.shift(1)).cumsum()).ngroup().add(1)
Outputs
Count Flood FloodNumber
0 15 True 1.0
1 21 True 1.0
2 1 False NaN
3 7 False NaN
4 6 False NaN
5 1 False NaN
6 25 True 2.0
7 8 True 2.0
8 56 True 2.0
9 0 False NaN
10 5 False NaN
11 9 False NaN
12 0 False NaN
13 12 True 3.0
14 12 True 3.0
15 8 True 3.0
16 7 True 3.0
17 12 True 3.0
18 0 False NaN
19 8 False NaN
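As a variant, the same Flood/FloodNumber columns can be built by marking the definite flood starts and ends with np.select and letting ffill carry the state in between. This is a sketch, and the intermediate names (state, starts) are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Count': [15, 21, 1, 7, 6, 1, 25, 8, 56, 0,
                             5, 9, 0, 12, 12, 8, 7, 12, 0, 8]})

# 1.0 where a flood definitely starts (> 10), 0.0 where it definitely
# ends (< 5), NaN where the previous state should simply persist
state = pd.Series(np.select([df.Count > 10, df.Count < 5],
                            [True, False], np.nan),
                  index=df.index)
df['Flood'] = state.ffill().fillna(0).astype(bool)

# number the floods: each False -> True transition starts a new group
starts = df.Flood & ~df.Flood.shift(fill_value=False)
df['FloodNumber'] = np.where(df.Flood, starts.cumsum(), 0)
```

Unlike the where/fillna chain above, this yields 0 instead of NaN for non-flood rows, which matches the desired output in the question.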
I have a TAB-delimited .txt file that looks like this.
Gene_name A B C D E F
Gene1 1 0 5 2 0 0
Gene2 4 45 0 0 32 1
Gene3 0 23 0 4 0 54
Gene4 12 0 6 8 7 4
Gene5 4 0 0 6 0 7
Gene6 0 6 8 0 0 5
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
I want to extract the rows in which at least 3 columns have values > 0.
How do I do this using pandas? I am clueless about how to apply conditions to a .txt file like this.
Thanks very much!
Update: adding on to this question, how do I apply this condition to specific columns only? Let's say I look at columns A, C, E and F, and extract the rows in which at least 3 of these columns have values > 5.
Cheers!
df = pd.read_csv(filename, delim_whitespace=True)
In [22]: df[df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)]
Out[22]:
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
some explanation:
In [25]: df.select_dtypes(['number']).gt(0)
Out[25]:
A B C D E F
0 True False True True False False
1 True True False False True True
2 False True False True False True
3 True False True True True True
4 True False False True False True
5 False True True False False True
6 True True True True False True
7 True True False True True True
8 True False True True False True
9 True True True True True False
In [26]: df.select_dtypes(['number']).gt(0).sum(axis=1)
Out[26]:
0 3
1 4
2 3
3 5
4 3
5 3
6 5
7 5
8 4
9 5
dtype: int64
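A self-contained version of the whole pipeline, using io.StringIO in place of the real file. The data is a shortened copy of the sample above, plus a made-up row GeneX with only 2 positive values that should be dropped:

```python
import io
import pandas as pd

raw = """Gene_name\tA\tB\tC\tD\tE\tF
Gene1\t1\t0\t5\t2\t0\t0
Gene2\t4\t45\t0\t0\t32\t1
Gene5\t4\t0\t0\t6\t0\t7
GeneX\t0\t0\t3\t0\t0\t1
"""

# sep='\t' matches the TAB-delimited layout of the original file
df = pd.read_csv(io.StringIO(raw), sep='\t')

# keep rows where at least 3 numeric columns are > 0
kept = df[df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)]
```

GeneX fails the test (only C and F are positive), so only Gene1, Gene2 and Gene5 survive.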
Using operators (as a complement to Max's answer):
mask = (df.iloc[:, 1:] > 0).sum(1) >= 3
mask
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df[mask]
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
Similarly, querying all rows with 5 or more positive values:
df[(df.iloc[:, 1:] > 0).sum(1) >= 5]
Gene_name A B C D E F
3 Gene4 12 0 6 8 7 4
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
9 Gene10 23 4 6 7 89 0
Piggybacking off of @MaxU's solution, I like to go ahead and put 'Gene_name' into the index so I don't have to worry about all that index slicing:
df = pd.read_csv(tfile, delim_whitespace=True, index_col=0)
df[df.gt(0).sum(1).ge(3)]
Edit for question update:
df[df[['A','C','E','F']].gt(5).sum(1).ge(3)]
Output:
A B C D E F
Gene_name
Gene4 12 0 6 8 7 4
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
I want to create categorical variables from my data with this method:
cat.var condition
1 x > 10
2 x == 10
3 x < 10
I tried using the C() method from patsy, but it doesn't work. I know that in Stata I would use the code below, but after searching I didn't find any clean way to do this in Python:
generate mpg3 = .
(74 missing values generated)
replace mpg3 = 1 if (mpg <= 18)
(27 real changes made)
replace mpg3 = 2 if (mpg >= 19) & (mpg <=23)
(24 real changes made)
replace mpg3 = 3 if (mpg >= 24) & (mpg <.)
(23 real changes made)
You can do it this way (we will do it just for column a):
In [36]: df
Out[36]:
a b c
0 10 12 6
1 12 8 8
2 10 5 8
3 14 7 7
4 7 12 11
5 14 11 8
6 7 7 14
7 11 9 11
8 5 14 9
9 9 12 9
10 7 8 8
11 13 9 8
12 13 14 6
13 9 7 13
14 12 7 5
15 6 9 8
16 6 12 12
17 7 12 13
18 7 7 6
19 8 13 9
df.loc[df.a < 10, 'a'] = 3
df.loc[df.a == 10, 'a'] = 2
df.loc[df.a > 10, 'a'] = 1
In [40]: df
Out[40]:
a b c
0 2 12 6
1 1 8 8
2 2 5 8
3 1 7 7
4 3 12 11
5 1 11 8
6 3 7 14
7 1 9 11
8 3 14 9
9 3 12 9
10 3 8 8
11 1 9 8
12 1 14 6
13 3 7 13
14 1 7 5
15 3 9 8
16 3 12 12
17 3 12 13
18 3 7 6
19 3 13 9
In [41]: df.a = df.a.astype('category')
In [42]: df.dtypes
Out[42]:
a category
b int32
c int32
dtype: object
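The same three-way split can be written in one call with pd.cut; a sketch on a small frame (the labels are passed in the order matching the bins, so <10 maps to 3, ==10 to 2, >10 to 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 12, 10, 14, 7, 14, 7, 11, 5, 9]})

# integer data: (-inf, 9] -> 3, (9, 10] -> 2, (10, inf) -> 1
df['a_cat'] = pd.cut(df['a'], bins=[-np.inf, 9, 10, np.inf], labels=[3, 2, 1])
```

pd.cut also returns a category dtype directly, so the separate astype('category') step is not needed.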
I'm using this df as a sample.
>>> df
A
0 3
1 13
2 10
3 31
You could use .loc like this (.ix is deprecated):
df['CAT'] = np.nan
df.loc[df.A > 10, 'CAT'] = 1
df.loc[df.A == 10, 'CAT'] = 2
df.loc[df.A < 10, 'CAT'] = 3
Or define a function to do the job, like this:
def do_the_job(x):
    ret = 3
    if x > 10:
        ret = 1
    elif x == 10:
        ret = 2
    return ret
and finally run this over the right Series in your df, like this:
>> df['CAT'] = df.A.apply(do_the_job)
>> df
A CAT
0 3 3
1 13 1
2 10 2
3 31 1
I hope this helps!
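apply calls the function once per row; for larger frames, np.select does the same mapping vectorized. A sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 13, 10, 31]})

# conditions are checked in order; rows matching none fall through to the default
df['CAT'] = np.select([df.A > 10, df.A == 10], [1, 2], default=3)
```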
Ok I have this pandas dataframe
import pandas
dfp=pandas.DataFrame([5,10,1,7,13,4,5,7,8,10,11,3])
And I want to create a second DataFrame with the rows that have a value greater than 5, like this:
dfp2=dfp[dfp>5]
My problem is that I obtain this result:
0
0 NaN
1 10
2 NaN
3 7
4 13
5 NaN
6 NaN
7 7
8 8
9 10
10 11
11 NaN
And what I want is this other result:
0
0 10
1 7
2 13
3 7
4 8
5 10
6 11
What is wrong with my code?
Thanks a lot
You're using the mask generated from the comparison, so where it's False it returns NaN. To get rid of those rows, call dropna:
In [32]:
dfp[dfp > 5].dropna()
Out[32]:
0
1 10
3 7
4 13
7 7
8 8
9 10
10 11
The mask here:
In [33]:
dfp > 5
Out[33]:
0
0 False
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 True
10 True
11 False
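If you also want the rows renumbered 0..n-1 as in the desired output, you can boolean-index on column 0 directly and reset the index:

```python
import pandas as pd

dfp = pd.DataFrame([5, 10, 1, 7, 13, 4, 5, 7, 8, 10, 11, 3])

# filter on the column (not the whole frame), then renumber the surviving rows
dfp2 = dfp[dfp[0] > 5].reset_index(drop=True)
```

Indexing with a boolean Series (rather than a boolean DataFrame) drops the non-matching rows outright instead of masking them with NaN.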
How do I get the index of a dataframe set as the columns and vice versa? I tried unstacking it but in vain.
I want to turn this dataframe
Type1 Type2 Type3
Hour
0 5 0 13
1 3 5 5
2 3 2 11
3 9 3 8
4 1 3 2
5 0 0 2
6 1 5 0
7 0 1 0
8 2 0 0
9 1 0 1
10 0 0 2
11 6 2 2
12 5 3 1
13 3 4 2
14 4 2 4
15 10 3 6
16 7 1 6
17 18 1 5
18 6 2 6
19 2 4 27
20 10 8 16
21 19 12 36
22 5 9 11
23 8 8 23
to the following:
0 1 2 3 4 5 6 7 8 9 10 ...
Type 1 5 3 3 9 1 ....
Type 2 0 5 2 3 3 ....
Type 3 13 5 11 8 2 ....
EDIT:
I actually have a multi index in the original df which looks like [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]. How do I handle that?
Transpose the dataframe:
df.T
Does this do the trick?
Call unstack twice:
In [47]:
df.unstack().unstack()
Out[47]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 \
Type1 5 3 3 9 1 0 1 0 2 1 0 6 5 3 4 10 7 18
Type2 0 5 2 3 3 0 5 1 0 0 0 2 3 4 2 3 1 1
Type3 13 5 11 8 2 2 0 0 0 1 2 2 1 2 4 6 6 5
18 19
Type1 6 2 ...
Type2 2 4 ...
Type3 6 27 ...
[3 rows x 24 columns]
Also .T would work:
In [48]:
df.T
Out[48]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 \
Type1 5 3 3 9 1 0 1 0 2 1 0 6 5 3 4 10 7 18
Type2 0 5 2 3 3 0 5 1 0 0 0 2 3 4 2 3 1 1
Type3 13 5 11 8 2 2 0 0 0 1 2 2 1 2 4 6 6 5
18 19
Type1 6 2 ...
Type2 2 4 ...
Type3 6 27 ...
[3 rows x 24 columns]
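For the multi-index mentioned in the EDIT, where the second level is always 1, one option is to drop that constant level before transposing; a sketch on a small frame with the same index shape:

```python
import pandas as pd

# same index shape as in the edit: second level is the constant 1
idx = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 1)], names=['Hour', None])
df = pd.DataFrame({'Type1': [5, 3, 3], 'Type2': [0, 5, 2]}, index=idx)

# drop the constant second level, then transpose as in the answers above
out = df.droplevel(1).T
```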