In Pandas, how to operate between columns with max performance - python

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid', look for the 'usersidid' with the highest 'LoginDaysSum'; then I check whether there is a usersidid that has the highest LoginDaysSum in two or more different clienthostid values (for instance, usersidid = 9 has the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In this case, I want to choose the row with the highest LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, the rows with index 4 and 7).
If the ratio is below 0.8, then I want to drop the row:
index 4 - LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7 - LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition is also applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not get it to work right, it also ran way too slowly :(
My code so far (forgive me if it's written terribly wrong, I only started working with SQL, Pandas and Python for the first time last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you

I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
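For comparison, here is a hedged sketch of the same filtering written with idxmax and a join, assuming the question's dataframe is called df; it is not the posted answer, just an equivalent formulation that skips the reset_index/set_index round trip:
import pandas as pd
# row with the highest LoginDaysSum within each clienthostid
top_per_client = df.loc[df.groupby('clienthostid')['LoginDaysSum'].idxmax()]
# among those, the single best row per usersidid (the "maxRT" row)
maxrt = (top_per_client.sort_values('LoginDaysSum', ascending=False)
                       .groupby('usersidid').first())
# compare every per-client top row against its user's maxRT row
compared = top_per_client.join(maxrt, on='usersidid', rsuffix='_max')
ratio7 = compared['LoginDaysSumLast7Days_max'] / compared['LoginDaysSumLast7Days']
ratio30 = compared['LoginDaysSumLastMonth_max'] / compared['LoginDaysSumLastMonth']
result = df.drop(compared.index[(ratio7 < 0.8) | (ratio30 < 0.8)])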

Related

Parsing a string line by line into arrays

I'm quite a novice in Python, but I need to do some parsing for a research project, and I see it as the most difficult part I have to overcome before I can do the actual science. One of the basic things I need is to convert a string with the data into NumPy arrays. An example of the data:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
This needs to be read for N arbitrary atoms, so it can be called in a for loop, and the first line may be omitted. The parser has to read the letter (S, L, P, D, F), the number Nc to the right of it, start a for loop over Nc lines, and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class. That would form a contracted Gaussian-type orbital, and I would do some math with it. If the letter is L, I would need to use a different class because a 4th column appears. If Nc == 1, there is just one line to read, and yet another class. After the reading of all N strings is done, the data should look something like this:
C
1 S 1 6665.000000 0.363803 ( 0.000692)
1 S 2 1000.000000 0.675392 ( 0.005329)
1 S 3 228.000000 1.132301 ( 0.027077)
1 S 4 64.710000 1.654004 ( 0.101718)
1 S 5 21.060000 1.924978 ( 0.274740)
1 S 6 7.495000 1.448149 ( 0.448564)
1 S 7 2.797000 0.439427 ( 0.285074)
1 S 8 0.521500 0.006650 ( 0.015204)
1 S 9 0.159600 -0.000574 ( -0.003191)
2 S 10 6665.000000 -0.076756 ( -0.000146)
2 S 11 1000.000000 -0.146257 ( -0.001154)
2 S 12 228.000000 -0.239407 ( -0.005725)
2 S 13 64.710000 -0.379069 ( -0.023312)
2 S 14 21.060000 -0.448104 ( -0.063955)
2 S 15 7.495000 -0.484201 ( -0.149981)
2 S 16 2.797000 -0.196168 ( -0.127262)
2 S 17 0.521500 0.238162 ( 0.544529)
2 S 18 0.159600 0.104468 ( 0.580496)
3 S 19 0.159600 0.179964 ( 1.000000)
4 P 20 9.439000 0.898722 ( 0.038109)
4 P 21 2.002000 0.711071 ( 0.209480)
4 P 22 0.545600 0.339917 ( 0.508557)
4 P 23 0.151700 0.063270 ( 0.468842)
5 P 24 0.151700 0.134950 ( 1.000000)
6 D 25 0.550000 0.578155 ( 1.000000)
C
7 S 26 6665.000000 0.363803 ( 0.000692)
7 S 27 1000.000000 0.675392 ( 0.005329)
7 S 28 228.000000 1.132301 ( 0.027077)
7 S 29 64.710000 1.654004 ( 0.101718)
7 S 30 21.060000 1.924978 ( 0.274740)
7 S 31 7.495000 1.448149 ( 0.448564)
7 S 32 2.797000 0.439427 ( 0.285074)
7 S 33 0.521500 0.006650 ( 0.015204)
7 S 34 0.159600 -0.000574 ( -0.003191)
8 S 35 6665.000000 -0.076756 ( -0.000146)
8 S 36 1000.000000 -0.146257 ( -0.001154)
8 S 37 228.000000 -0.239407 ( -0.005725)
8 S 38 64.710000 -0.379069 ( -0.023312)
8 S 39 21.060000 -0.448104 ( -0.063955)
8 S 40 7.495000 -0.484201 ( -0.149981)
8 S 41 2.797000 -0.196168 ( -0.127262)
8 S 42 0.521500 0.238162 ( 0.544529)
8 S 43 0.159600 0.104468 ( 0.580496)
9 S 44 0.159600 0.179964 ( 1.000000)
10 P 45 9.439000 0.898722 ( 0.038109)
10 P 46 2.002000 0.711071 ( 0.209480)
10 P 47 0.545600 0.339917 ( 0.508557)
10 P 48 0.151700 0.063270 ( 0.468842)
11 P 49 0.151700 0.134950 ( 1.000000)
12 D 50 0.550000 0.578155 ( 1.000000)
This is an example of a full basis set of a molecule, made of individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are the values read by the parser. How would you recommend I do this, so that I get ordered data like the above? And how exactly can strings be read into arrays line by line? Python's functionality is overwhelming. I tried to use Pandas to convert a string into some array to "filter" it, but I couldn't get it to work.
Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.
import pandas as pd
rows = []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:
        print(parts[0])
        counter1 = 0
        counter2 = 0
    elif len(parts) == 2:
        counter1 += 1
        shell = (counter1, parts[0])
    else:
        counter2 += 1
        rows.append( shell + (counter2, float(parts[1]), float(parts[2])) )
df = pd.DataFrame( rows, columns=['basisnum','basistype','primitive','energy','delta'])
print(df)
Output:
CARBON
basisnum basistype primitive energy delta
0 1 S 1 6665.0000 0.000692
1 1 S 2 1000.0000 0.005329
2 1 S 3 228.0000 0.027077
3 1 S 4 64.7100 0.101718
4 1 S 5 21.0600 0.274740
5 1 S 6 7.4950 0.448564
6 1 S 7 2.7970 0.285074
7 1 S 8 0.5215 0.015204
8 1 S 9 0.1596 -0.003191
9 2 S 10 6665.0000 -0.000146
10 2 S 11 1000.0000 -0.001154
11 2 S 12 228.0000 -0.005725
12 2 S 13 64.7100 -0.023312
13 2 S 14 21.0600 -0.063955
14 2 S 15 7.4950 -0.149981
15 2 S 16 2.7970 -0.127262
16 2 S 17 0.5215 0.544529
17 2 S 18 0.1596 0.580496
18 3 S 19 0.1596 1.000000
19 4 P 20 9.4390 0.038109
20 4 P 21 2.0020 0.209480
21 4 P 22 0.5456 0.508557
22 4 P 23 0.1517 0.468842
23 5 P 24 0.1517 1.000000
24 6 D 25 0.5500 1.000000
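The answer also mentions that one dataframe per molecule is probably what is wanted. A hedged sketch of that variant, under the same assumptions about the x.txt layout (an atom-name line starts each block):
import pandas as pd
cols = ['basisnum', 'basistype', 'primitive', 'energy', 'delta']
molecules = []                      # list of (name, dataframe) pairs
rows, name = [], None
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:             # atom/molecule name line starts a new block
        if name is not None:
            molecules.append((name, pd.DataFrame(rows, columns=cols)))
        name, rows = parts[0], []
        counter1 = counter2 = 0
    elif len(parts) == 2:           # shell header: type and number of primitives
        counter1 += 1
        shell = (counter1, parts[0])
    else:                           # primitive line: index, exponent, coefficient
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))
if name is not None:                # flush the last block
    molecules.append((name, pd.DataFrame(rows, columns=cols)))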
Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's incomplete (another elif is needed to read SP/L basis functions), but it works.
import basis_set_exchange as bse
import pandas as pd
basis = bse.get_basis('cc-pVDZ', fmt = 'gamess_us', elements = 'C', header = False)
basis = basis[5:-4]
print(basis, '\n')
buf = basis.split('\n')
buf.pop(2)
shellNumber = 0
shellType = ''
rows = []
for line in buf:
    parts = line.strip().split()
    if (len(parts) == 2):
        shellType = parts[0]
        shellNumber += 1
    elif (len(parts) == 3):
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))
df = pd.DataFrame( rows, columns = ['SHELL TYPE','SHELL NO','EXPONENT','CONTR COEF'])
print(df)
Output:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
SHELL TYPE SHELL NO EXPONENT CONTR COEF
0 S 1 6665.0000 0.000692
1 S 1 1000.0000 0.005329
2 S 1 228.0000 0.027077
3 S 1 64.7100 0.101718
4 S 1 21.0600 0.274740
5 S 1 7.4950 0.448564
6 S 1 2.7970 0.285074
7 S 1 0.5215 0.015204
8 S 1 0.1596 -0.003191
9 S 2 6665.0000 -0.000146
10 S 2 1000.0000 -0.001154
11 S 2 228.0000 -0.005725
12 S 2 64.7100 -0.023312
13 S 2 21.0600 -0.063955
14 S 2 7.4950 -0.149981
15 S 2 2.7970 -0.127262
16 S 2 0.5215 0.544529
17 S 2 0.1596 0.580496
18 S 3 0.1596 1.000000
19 P 4 9.4390 0.038109
20 P 4 2.0020 0.209480
21 P 4 0.5456 0.508557
22 P 4 0.1517 0.468842
23 P 5 0.1517 1.000000
24 D 6 0.5500 1.000000
After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
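As a hedged illustration of that step (using the df built by the snippet above, with made-up dict keys), each shell's exponents and contraction coefficients can be pulled out as NumPy arrays, which is what the question originally asked for:
shells = []
for (shell_no, shell_type), grp in df.groupby(['SHELL NO', 'SHELL TYPE']):
    shells.append({
        'shell': shell_no,
        'type': shell_type,
        'exponents': grp['EXPONENT'].to_numpy(),
        'coefficients': grp['CONTR COEF'].to_numpy(),
    })
print(shells[0]['exponents'])   # exponents of the first S shell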

Compare preceding two rows with subsequent two rows of each group till last record

I had a question earlier, which is deleted and now modified to this less verbose form so it is easier to read.
I have a dataframe as given below
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''
I would like to fill values in column fake_flag based on the below rules
1) if preceding two rows are constant (ex:5,5) or decreasing (7,5), then pick the highest of the two rows. In this case, it is 7 from (7,5) and 5 from (5,5)
2) Check whether the current row is greater than the output from rule 1 by 3 or more points (>=3) and whether it repeats in the next row (2 occurrences of the same value). It can be 8 or greater than 8 (if the rule 1 output is 5), e.g. 8 in row n and 8 in row n+1, or 10 in row n and 10 in row n+1. If yes, then write fake VAC in the fake_flag column.
This is what I tried
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if (t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]): # rule 1 check
            r1_output = t1[i-2] # we get the max of these two values (t1[i-2]); it doesn't matter when it's constant, t1[i-2] and t1[i-1] will have the same value anyway
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if (t1[i] == t1[i+1]): # rule 2 check
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
This check should happen for all records (one by one) for each subject_id. I have a dataset with a million records, so any efficient and elegant solution is helpful; I can't run a loop over a million records.
I expect my output to be as shown below
subject_id = 1
subject_id = 2
import pandas as pd
import numpy as np
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['shift1']=df['PEEP'].shift(1)
df['shift2']=df['PEEP'].shift(2)
df['fake_flag'] = np.where((df['shift1'] ==df['shift2']) | (df['shift1'] < df['shift2']), 'fake VAC', '')
df.drop(['shift1','shift2'],axis=1)
Output
0 1 1 7
1 1 2 5
2 1 3 10 fake VAC
3 1 4 10
4 1 5 11 fake VAC
5 1 6 11
6 1 7 14 fake VAC
7 1 8 14
8 1 9 17 fake VAC
9 1 10 17
10 1 11 21 fake VAC
11 1 12 21
12 1 13 23 fake VAC
13 1 14 23
14 1 15 25 fake VAC
15 1 16 25
16 1 17 22 fake VAC
17 1 18 20 fake VAC
18 1 19 26 fake VAC
19 1 20 26
20 2 1 5 fake VAC
21 2 2 7 fake VAC
22 2 3 8
23 2 4 8
24 2 5 9 fake VAC
25 2 6 9
26 2 7 13 fake VAC
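The snippet above flags rows based only on the rule 1 condition and does not respect subject_id boundaries. A hedged sketch of both rules, vectorized per subject with groupby/shift (this is one interpretation of the stated rules, not the original poster's code):
import numpy as np
g = df.groupby('subject_id')['PEEP']
prev1 = g.shift(1)                       # preceding row within the subject
prev2 = g.shift(2)                       # row before that
nxt = g.shift(-1)                        # following row within the subject
rule1 = prev2.where(prev2 >= prev1)      # highest of two constant/decreasing rows, else NaN
rule2 = (df['PEEP'] >= rule1 + 3) & (df['PEEP'] == nxt)
df['fake_flag'] = np.where(rule2, 'fake_vac', '')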

Group by one column compare another column and add values to a new column in Python?

I have these columns:
index, area, key0
I have to group by index (it is a normal column called index) in order to take the rows that have the same value.
#all the ones, all the twos, etc
Some of them (rows) are unique though.
Now, about the ones that are not unique:
I have to check, with a group by, which row in each group has the largest area, and give its respective key0 value to the others in its group, in a new column called key1.
The unique rows will keep the value they had in key0 in the new key1 column.
What I have done so far:
First I checked which of those occur more than once in order to know which are going to form groups.
df['index'].value_counts()[df['index'].value_counts()>1]
359 9
391 8
376 7
374 6
354 5
446 4
403 4
348 4
422 4
424 4
451 4
364 3
315 3
100 3
245 3
345 3
247 3
346 3
347 3
351 3
which worked fine. The question now is how to do the rest.
the dataset:
df = pd.DataFrame({"index": [1,2,3,5,1,2,3,3,3], "area":
[50,60,70,80,90,100,10,20,70], "key0": ["1f",2,"3d",4,5,6,7,8,9]})
print df
# INPUT
area index key0
50 1 1f
60 2 2
70 3 3d
80 5 4
90 1 5
100 2 6
10 3 7
20 3 8
70 3 9
dataset
import geopandas as gpd
inte=gpd.read_file('in.shp')
inte["rank_gr"] = inte.groupby("index")["area_of_poly"].rank(ascending = False, method =
"first")
inte["key1_temp"] = inte.apply(lambda row: str(row[""]) if row["rank_gr"] == 1.0
else "", axis = 1)
inte["CAD_ADMIN_FINAL"] = inte.groupby("index")["key1_temp"].transform("sum")
print (inte[["area_of_poly", "index", "CAD_ADMIN", "CAD_ADMIN_FINAL"]])
Checked with the data you provided, and it works. I didn't find any "key0" column, so I assumed it is "CAD_ADMIN". "AREA" has only one value, so I took "AREA_2".
import geopandas as gpd
# set your path
path = r"p\in.shp"
p = gpd.read_file(path)
p["rank_gr"] = p.groupby("index")["AREA_2"].rank(ascending = False, method =
"first")
p["key1_temp"] = p.apply(lambda row: str(row["CAD_ADMIN"]) if row["rank_gr"] == 1.0
else "", axis = 1)
p["key1"] = p.groupby("index")["key1_temp"].transform("sum")
p = p[["AREA_2", "index", "CAD_ADMIN", "key1"]]
print(p.sort_values(by = ["index"]))
AREA_2 index CAD_ADMIN key1
1.866706e+06 0 0113924 0113924
1.559865e+06 1 0113927 0113926
1.593623e+06 1 0113926 0113926
1.927774e+06 2 0113922 0113922
1.927774e+06 3 0113922 0113922
Do you mean something like this?
import pandas as pd
df = pd.DataFrame({"index": [1,2,3,5,1,2,3,3,3], "area":
[50,60,70,80,90,100,10,20,70], "key0": ["1f",2,"3d",4,5,6,7,8,9]})
print df
# INPUT
area index key0
50 1 1f
60 2 2
70 3 3d
80 5 4
90 1 5
100 2 6
10 3 7
20 3 8
70 3 9
df["rank_gr"] = df.groupby("index")["area"].rank(ascending = False, method =
"first")
df["key1_temp"] = df.apply(lambda row: str(row["key0"]) if row["rank_gr"] == 1.0
else "", axis = 1)
df["key1"] = df.groupby("index")["key1_temp"].transform("sum")
print df[["area", "index", "key0", "key1"]]
# OUTPUT
area index key0 key1
50 1 1f 5
60 2 2 6
70 3 3d 3d
80 5 4 4
90 1 5 5
100 2 6 6
10 3 7 3d
20 3 8 3d
70 3 9 3d
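A hedged alternative sketch of the same idea, using idxmax and map instead of rank plus the string "sum" trick; it picks key0 from the largest-area row of each group directly:
idx_of_max = df.groupby("index")["area"].idxmax()            # row label of the largest area per group
key_of_max = df.loc[idx_of_max].set_index("index")["key0"]   # that row's key0, keyed by the group value
df["key1"] = df["index"].map(key_of_max)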

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am now trying to generate some descriptors to incorporate the temporal nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of the particular key in the window. I did an implementation, but according to my estimate, for 23 different windows the calculation will run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple different windows of different lengths. I have the uneasy feeling, however, that the iteration is probably not the smartest way to go for this data aggregation. Is there a way to implement it so that it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.ix[:cut_idx,'key']==key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
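Note also that the cumsum approach counts every earlier occurrence of the key, whereas the question asks for a bounded look-back window (the previous 200 rows). A hedged, loop-free sketch of that bounded count, assuming the keys form a modest set of values (it is memory-hungry, since it builds one indicator column per key):
window_len = 200
onehot = pd.get_dummies(df['key'], dtype=float)            # one indicator column per key value
rolled = onehot.rolling(window_len, min_periods=1).sum()   # occurrences in the last 200 rows, incl. the current one
rolled = rolled - onehot                                   # exclude the current row itself
col_pos = onehot.columns.get_indexer(df['key'])            # position of each row's own key column
df['window200'] = rolled.to_numpy()[np.arange(len(df)), col_pos]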

How to count row-by-row with a condition

I want to count STATE and ID columns row-by-row in my DataFrame, but I am getting a KeyError. My final code is below. For each ID number (1 to 12) I want to count the state changes. Here is my data; I have thousands of rows like this.
#this code works for state column
chars = "TGC-"
nums = {}
for char in chars:
    s = df["STATE"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num
ATnum = nums["T"]
AGnum = nums["G"]
ACnum= nums["C"]
A_num= nums["-"]
ATnum
Out[26]:
False 51919
True 1248
dtype: int64
# and this one works for id column
pt = df.sort("ID")["ID"]
pt_num=pt.value_counts()
pt_values= pt_num.order()
pt_index= pt_num.sort_index()
#these are the total numbers of each id datas
pt_num
Out[27]:
10 5241
6 5144
11 4561
2 4439
3 4346
5 4284
9 4244
12 4218
7 4217
1 4210
4 4199
8 4064
dtype: int64
# i combine both ID and STATE columns and try to read row-by-row
draft
Out[21]:
ID STATE
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
11 2 chr1:100671861:T:-
12 4 chr1:1021390:C:A
13 5 chr1:10228220:G:C
14 3 chr1:1026913:C:T
15 4 chr1:1026913:C:T
... ... ...
53152 6 chrY:21618583:G:C
53153 5 chrY:24443836:T:G
53154 6 chrY:24443836:T:G
53155 8 chrY:24443836:T:G
53156 10 chrY:24443836:T:G
53157 12 chrY:24443836:T:G
53158 3 chrY:5605924:C:T
53159 2 chrY:6932151:G:A
53160 10 chrY:7224175:G:T
53161 2 chrY:9197998:C:T
53162 3 chrY:9197998:C:T
53163 4 chrY:9197998:C:T
53164 11 chrY:9197998:C:T
53165 12 chrY:9197998:C:T
53166 11 chrY:9304866:G:A
[53167 rows x 2 columns]
draft = df[["ID", "STATE"]]
chars = "TGC-"
number = {}
d = draft
for i in d["ID"]:
    if i == 1:
        for item in chars:
            At = d["STATE"].str.contains("A:" + item)
            num = At.value_counts(sort=True)
            number[item] = num
ATn = number["T"]
AGn = number["G"]
ACn = number["C"]
A_n = number["-"]
KeyError: 'G'
In total, what I want to do is this: ID 1, for example, has 4210 rows, and I want to determine how many of those have a state of A:T, A:G, A:C and A:-.
Where am I going wrong?
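For reference, a hedged sketch of the per-ID counting described above (not part of the original post): extract the "A:<char>" part of STATE and cross-tabulate it against ID, assuming df is the dataframe shown earlier:
import pandas as pd
changes = df["STATE"].str.extract(r"(A:[TGC-])", expand=False)   # "A:T", "A:G", "A:C", "A:-" or NaN
counts = pd.crosstab(df["ID"], changes)                          # one row per ID, one column per observed state
print(counts)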
