Get row numbers based on column values from numpy array - python

I am new to numpy and need some help in solving my problem.
I read records from a binary file using dtypes, then select three columns:
df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),(128,91,5),(129,91,5),(130,91,5),(131,91,0)]), columns = ['atype','btype','ctype'] )
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers where
(x, 90, 5) appears in the 2nd and 3rd columns,
(x, 90, 0) appears in the 2nd and 3rd columns,
(x, 91, 5) appears in the 2nd and 3rd columns,
(x, 91, 0) appears in the 2nd and 3rd columns,
and so on.
There are 7 possible values in the 2nd column (90, 91, 92, 93, 94, 95, 96), and the 3rd column correspondingly holds either 5 or 0.
There are 1 million entries, so is there any way to find these row numbers without a for loop?

Using pandas you could try the following.
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example, if some of the values are changed so that df is
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0
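If what you need is the row numbers themselves rather than the filtered rows, one sketch (using the question's data and standard pandas calls) is to take `.index` under the mask, or to group by the (btype, ctype) pair:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([(124, 90, 5), (125, 90, 5), (126, 90, 5), (127, 90, 0),
              (128, 91, 5), (129, 91, 5), (130, 91, 5), (131, 91, 0)]),
    columns=['atype', 'btype', 'ctype'])

# Row numbers where the mask holds
mask = df['btype'].between(90, 96) & df['ctype'].isin([0, 5])
rows = df.index[mask].to_numpy()

# Row numbers per (btype, ctype) combination, without an explicit loop
groups = df.groupby(['btype', 'ctype']).groups
print(groups[(90, 5)])  # rows where btype == 90 and ctype == 5
```

`groups` maps each (btype, ctype) pair to the Index of matching row numbers, so no per-row loop is needed.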

Related

Generate count column for IDs in a Pandas DataFrame

Here is how you can generate a dummy version of my Pandas DataFrame:
import pandas as pd
usr_id = [121,121,121,121,135,135,135,135,135,135,135,135,135]
ses_id = [95,95,95,108,97,97,97,97,98,98,98,101,101]
que_id = [1,8,15,23,1,42,9,5,7,9,10,17,20]
df = pd.DataFrame(list(zip(usr_id, ses_id, que_id)),
columns =['usr_id', 'ses_id', 'que_id'])
usr_id  ses_id  que_id
   121      95       1
   121      95       8
   121      95      15
   121     108      23
   135      97       1
   135      97      42
   135      97       9
   135      97       5
   135      98       7
   135      98       9
   135      98      10
   135     101      17
   135     101      20
A user can attempt multiple sessions, and each session can have a varying number of questions. I need to create two columns that number the sessions and questions (session number or question number 1, 2, 3, ...) for each individual user. Something like this:
usr_id  ses_id  que_id  ses_no  que_no
   121      95       1       1       1
   121      95       8       1       2
   121      95      15       1       3
   121     108      23       2       1
   135      97       1       1       1
   135      97      42       1       2
   135      97       9       1       3
   135      97       5       1       4
   135      98       7       2       1
   135      98       9       2       2
   135      98      10       2       3
   135     101      17       3       1
   135     101      20       3       2
So ses_id 95 was the first session usr_id 121 attempted, within which he attempted three questions: que_id 1, 8, and 15. The next session attempted by the same user is ses_id 108, with only one question, que_id 23. Another user, usr_id 135, attempted his first session, recorded as ses_id 97, in which he attempted four questions: que_id 1, 42, 9, and 5. The second session from the same user is ses_id 98, and so on.
I managed to generate the 'que_no' using the following:
df['que_no'] = df.groupby('ses_id').cumcount()+1
But I couldn't find a way to do the same for ses_no.
I also have the idea of using .shift() to check whether there is a change in 'usr_id' and/or 'ses_id', and somehow apply a count logic to the output. Something like this:
i = df.usr_id
j = df.ses_id
i_shift_ne = i.ne(i.shift())
j_shift_ne = j.ne(j.shift())
Not sure whether this idea will work, and I am pretty sure there has to be a smarter way of doing this. It would be great if we could make it happen using the pandas library itself.
IIUC, use a custom lambda function per usr_id with factorize:
df['ses_no'] = df.groupby('usr_id')['ses_id'].transform(lambda x: pd.factorize(x)[0]) + 1
#if values are sorted
#df['ses_no'] = df.groupby('usr_id')['ses_id'].rank(method='dense').astype(int)
df['que_no'] = df.groupby(['usr_id','ses_no']).cumcount()+1
print (df)
usr_id ses_id que_id ses_no que_no
0 121 95 1 1 1
1 121 95 8 1 2
2 121 95 15 1 3
3 121 108 23 2 1
4 135 97 1 1 1
5 135 97 42 1 2
6 135 97 9 1 3
7 135 97 5 1 4
8 135 98 7 2 1
9 135 98 9 2 2
10 135 98 10 2 3
11 135 101 17 3 1
12 135 101 20 3 2
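For context, pd.factorize numbers values in order of first appearance, which is why the lambda above works even when a user's ses_id values are not sorted; a small sketch:

```python
import pandas as pd

# Hypothetical sessions for one user, in order of appearance
s = pd.Series([97, 97, 101, 98, 98, 101])
codes, uniques = pd.factorize(s)

print(codes + 1)          # -> [1 1 2 3 3 2]  (first-appearance numbering)
print(uniques.tolist())   # -> [97, 101, 98]
```

By contrast, rank(method='dense') numbers sessions by sorted value (97→1, 98→2, 101→3), which matches factorize only when each user's sessions appear in ascending order; that is why the commented rank line is marked "if values are sorted".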

Loop through dataframe in python to select specific row

I have timeseries data of 5864 ICU patients, and my dataframe looks like this. Each row is the ICU stay of the respective patient at a particular hour.
 HR  SBP  DBP  ICULOS  Sepsis  P_ID
 92  120   80       1       0     0
 98  115   85       2       0     0
 93  125   75       3       1     0
 95  130   90       4       1     0
102  120   80       1       0     1
109  115   75       2       0     1
 94  135  100       3       0     1
 97  100   70       4       1     1
 85  120   80       5       1     1
 88  115   75       6       1     1
 93  125   85       1       0     2
 78  130   90       2       0     2
115  140  110       3       0     2
102  120   80       4       0     2
 98  140  110       5       1     2
I want to select the ICULOS where Sepsis = 1 (first hour only) based on patient ID. Like in P_ID = 0, Sepsis = 1 at ICULOS = 3. I did this on a single patient (the dataframe having data of only a single patient) using the code:
x = df[df['Sepsis'] == 1]["ICULOS"].values[0]
print("ICULOS at which Sepsis Label = 1 is:", x)
# Output
ICULOS at which Sepsis Label = 1 is: 46
If I want to check this for each P_ID, I would have to do it 5864 times. Can someone help me with the code using a loop? The loop should go to each P_ID and return the ICULOS where Sepsis = 1. Looking forward to your help.
for x in df['P_ID'].unique():
    print(df.query('P_ID == @x and Sepsis == 1')['ICULOS'].iloc[0])
First, filter the rows which have Sepsis = 1. This automatically drops the P_IDs which never have Sepsis equal to 1, so you have fewer patients to iterate over.
df1 = df[df.Sepsis == 1]
for pid in df.P_ID.unique():
    if pid not in df1.P_ID.values:
        print(f"P_ID: {pid} - it has no ICULOS at Sepsis Label = 1")
    else:
        iculos = df1[df1.P_ID == pid].ICULOS.values[0]
        print(f"P_ID: {pid} - ICULOS at which Sepsis Label = 1 is: {iculos}")
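A loop-free alternative (a sketch, not from the answers above, using a miniature of the question's data) takes the first matching hour per patient with groupby:

```python
import pandas as pd

# Miniature of the question's dataframe (ICULOS sorted within each patient)
df = pd.DataFrame({
    'ICULOS': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
    'Sepsis': [0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
    'P_ID':   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

# First ICULOS with Sepsis == 1, per patient; patients that never
# reach Sepsis == 1 simply do not appear in the result.
first_sepsis = df[df.Sepsis == 1].groupby('P_ID')['ICULOS'].first()
print(first_sepsis)
```

This produces one Series indexed by P_ID, regardless of whether there are 3 patients or 5864.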

selecting a column from pandas pivot table

I have the below pivot table which I created from a dataframe using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see, the pivot table has 8 columns, numbered 0-7, and I want to plot some specific columns instead of all of them, but I could not manage to select them. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select column 0 and column 2?
plt.plot(x=table.index, y=??)
I tried with y = table.value['0', '2'] and y=table['0','2'] but nothing works.
You cannot pass an ndarray of two columns for y. If you need those two column values in a single plot, you can use:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
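Alternatively (a sketch with a hypothetical miniature of the pivot, assuming integer column labels), you can select both columns at once with a list of labels and let pandas handle the plotting:

```python
import pandas as pd

# Hypothetical miniature of the pivot table, with integer column labels
table = pd.DataFrame({0: [2777, 6279, 5609], 2: [2, 7, 32]},
                     index=pd.Index([0, 1, 2], name='days'))

sub = table[[0, 2]]  # select both columns with a list of labels
print(sub)
# sub.plot() would then draw one line per selected column against the index
```

If the pivot's column labels are strings, the same selection is table[['0', '2']].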

Python Pandas add rows based on missing sequential values in a timeseries

I'm new to Python and struggling to manipulate data with the pandas library. I have a pandas DataFrame like this:
Year Value
0 91 1
1 93 4
2 94 7
3 95 10
4 98 13
And want to complete the missing years creating rows with empty values, like this:
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
How do I do that in Python?
(I want to do this so I can plot the values without skipping years.)
I would create a new dataframe that has Year as an index and covers the entire year range you need. Then you can simply assign the values across the two dataframes, and the index will make sure that the correct rows are matched (I've had to use fillna to set the missing years to zero; by default they would be set to NaN):
df = pd.DataFrame({'Year':[91,93,94,95,98],'Value':[1,4,7,10,13]})
df.index = df.Year
df2 = pd.DataFrame({'Year':range(91,99), 'Value':0})
df2.index = df2.Year
df2.Value = df.Value
df2= df2.fillna(0)
df2
Value Year
Year
91 1 91
92 0 92
93 4 93
94 7 94
95 10 95
96 0 96
97 0 97
98 13 98
Finally you can use reset_index if you don't want Year as your index:
df2.drop(columns='Year').reset_index()
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
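A more compact alternative (a sketch using reindex, which fills the missing years directly):

```python
import pandas as pd

df = pd.DataFrame({'Year': [91, 93, 94, 95, 98], 'Value': [1, 4, 7, 10, 13]})

# Reindex on the full year range, filling the gaps with 0
full = (df.set_index('Year')
          .reindex(range(91, 99), fill_value=0)
          .rename_axis('Year')
          .reset_index())
print(full)
```

This avoids building a second dataframe by hand: reindex inserts a row for every year in the range, and fill_value=0 sets Value to 0 for the missing ones.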

Split output of loop by columns used as input

Hi, I'm relatively new to Python and am currently trying to measure the width of features in an image. The resolution of my image is 1 m, so measuring the width should be easy. I've managed to select certain columns or rows of the image and extract the necessary data using loops. My code is below:
subset = imarray[:, ::500]
subset[(subset > 0) & (subset <= 17)] = 1
subset[subset > 17] = 0

width = []
count = 0
for i in np.arange(subset.shape[1]):
    column = subset[:, i]
    for value in column:
        if value == 1:
            count += 1
            width.append(count)
            width_arr = np.array(width).astype('uint8')
        else:
            count = 0

final = np.split(width_arr, np.argwhere(width_arr == 1).flatten())
final2 = [x for x in final if x.size > 0]
width2 = []
for array in final2:
    width2.append(max(array))
width2 = np.array(width2).astype('uint8')
print(width2)
I can't figure out how to split the output up so it shows the results for each column or row individually. Instead, all I've been able to do is append the data to a single list; here's the output for that:
[ 70 35 4 2 5 36 4 5 2 51 97 4 228 3 21 47 7 21
23 58 126 4 111 2 2 5 3 2 18 15 6 19 3 3 12 15
6 8 2 4 6 88 122 24 14 49 73 57 74 6 179 8 3 2
6 3 184 9 3 19 24 3 2 2 3 255 30 8 191 33 127 5
3 27 112 2 24 2 5 2 10 30 10 6 37 2 38 6 12 17
44 67 23 5 101 10 9 4 6 4 255 136 5 255 255 255 255 26
255 235 148 4 255 199 3 2 114 87 255 109 69 12 41 20 30 57
72 89 32]
So these are the widths of the features in all the columns, appended together. How do I use my loop, or another method, to split these up into individual numpy arrays representing each column I've sliced out of the original?
It seems like I'm almost there, but I can't figure that last step out and it's driving me nuts.
Thanks in advance for your help!
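One way to keep the results separate (a sketch, with a hypothetical run_lengths helper and a small made-up subset array) is to compute the run lengths per column and collect one array per column:

```python
import numpy as np

def run_lengths(column):
    """Lengths of consecutive runs of 1s in a 1-D 0/1 array."""
    lengths, count = [], 0
    for value in column:
        if value == 1:
            count += 1
        else:
            if count:
                lengths.append(count)
            count = 0
    if count:  # a run touching the end of the column
        lengths.append(count)
    return np.array(lengths)

# Hypothetical binary subset: each column holds runs of 1s (features)
subset = np.array([[1, 0],
                   [1, 1],
                   [0, 1],
                   [1, 1],
                   [1, 0],
                   [1, 1]])

widths_per_column = [run_lengths(subset[:, i]) for i in range(subset.shape[1])]
# widths_per_column[0] -> array([2, 3]); widths_per_column[1] -> array([3, 1])
```

Because the widths are collected inside the per-column loop, the result is a list with one array per sliced column instead of one flat array.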
