I'm quite a novice in Python, but I need to do some parsing for a research project, and right now it is the biggest obstacle between me and the actual science. One of the basic things I need is to convert a string containing the data into NumPy arrays. An example of the data:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
This needs to be read for N arbitrary atoms, so the parser may be called in a for loop, and the first line may be omitted. The parser has to read the letter (S, L, P, D, F) and the number Nc to the right of it, loop over the next Nc lines, and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class. That would form a contracted Gaussian-type orbital that I would then do some math with. If the letter is L, I would need a different class, because a 4th column appears. If Nc == 1, there is just one line to read, and yet another class. After all N strings have been read, the data should look something like this:
C
1 S 1 6665.000000 0.363803 ( 0.000692)
1 S 2 1000.000000 0.675392 ( 0.005329)
1 S 3 228.000000 1.132301 ( 0.027077)
1 S 4 64.710000 1.654004 ( 0.101718)
1 S 5 21.060000 1.924978 ( 0.274740)
1 S 6 7.495000 1.448149 ( 0.448564)
1 S 7 2.797000 0.439427 ( 0.285074)
1 S 8 0.521500 0.006650 ( 0.015204)
1 S 9 0.159600 -0.000574 ( -0.003191)
2 S 10 6665.000000 -0.076756 ( -0.000146)
2 S 11 1000.000000 -0.146257 ( -0.001154)
2 S 12 228.000000 -0.239407 ( -0.005725)
2 S 13 64.710000 -0.379069 ( -0.023312)
2 S 14 21.060000 -0.448104 ( -0.063955)
2 S 15 7.495000 -0.484201 ( -0.149981)
2 S 16 2.797000 -0.196168 ( -0.127262)
2 S 17 0.521500 0.238162 ( 0.544529)
2 S 18 0.159600 0.104468 ( 0.580496)
3 S 19 0.159600 0.179964 ( 1.000000)
4 P 20 9.439000 0.898722 ( 0.038109)
4 P 21 2.002000 0.711071 ( 0.209480)
4 P 22 0.545600 0.339917 ( 0.508557)
4 P 23 0.151700 0.063270 ( 0.468842)
5 P 24 0.151700 0.134950 ( 1.000000)
6 D 25 0.550000 0.578155 ( 1.000000)
C
7 S 26 6665.000000 0.363803 ( 0.000692)
7 S 27 1000.000000 0.675392 ( 0.005329)
7 S 28 228.000000 1.132301 ( 0.027077)
7 S 29 64.710000 1.654004 ( 0.101718)
7 S 30 21.060000 1.924978 ( 0.274740)
7 S 31 7.495000 1.448149 ( 0.448564)
7 S 32 2.797000 0.439427 ( 0.285074)
7 S 33 0.521500 0.006650 ( 0.015204)
7 S 34 0.159600 -0.000574 ( -0.003191)
8 S 35 6665.000000 -0.076756 ( -0.000146)
8 S 36 1000.000000 -0.146257 ( -0.001154)
8 S 37 228.000000 -0.239407 ( -0.005725)
8 S 38 64.710000 -0.379069 ( -0.023312)
8 S 39 21.060000 -0.448104 ( -0.063955)
8 S 40 7.495000 -0.484201 ( -0.149981)
8 S 41 2.797000 -0.196168 ( -0.127262)
8 S 42 0.521500 0.238162 ( 0.544529)
8 S 43 0.159600 0.104468 ( 0.580496)
9 S 44 0.159600 0.179964 ( 1.000000)
10 P 45 9.439000 0.898722 ( 0.038109)
10 P 46 2.002000 0.711071 ( 0.209480)
10 P 47 0.545600 0.339917 ( 0.508557)
10 P 48 0.151700 0.063270 ( 0.468842)
11 P 49 0.151700 0.134950 ( 1.000000)
12 D 50 0.550000 0.578155 ( 1.000000)
This is an example of a full basis set of a molecule assembled from individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are the values read by the parser. How would you recommend I do this so that I end up with ordered data like the above? And how exactly can strings be read into arrays line by line? Python's functionality is overwhelming. I tried to use Pandas to convert a string into some kind of array to "filter" it, but I couldn't get it to work.
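For context, the kind of container I have in mind is roughly the following (just a sketch; the class and field names are placeholders):

import numpy as np
from dataclasses import dataclass

@dataclass
class ContractedGaussian:            # S, P, D, F shells: two data columns
    shell_type: str                  # 'S', 'P', 'D', ...
    exponents: np.ndarray            # 2nd column of the block
    coefficients: np.ndarray         # 3rd column of the block

@dataclass
class ContractedGaussianL(ContractedGaussian):   # L (SP) shells carry a 4th column
    p_coefficients: np.ndarray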
Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.
import pandas as pd

rows = []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:                 # element/molecule name line
        print(parts[0])
        counter1 = 0                    # basis function number
        counter2 = 0                    # primitive number
    elif len(parts) == 2:               # shell header: letter and primitive count
        counter1 += 1
        shell = (counter1, parts[0])
    else:                               # primitive line: index, exponent, coefficient
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['basisnum', 'basistype', 'primitive', 'energy', 'delta'])
print(df)
Output:
CARBON
basisnum basistype primitive energy delta
0 1 S 1 6665.0000 0.000692
1 1 S 2 1000.0000 0.005329
2 1 S 3 228.0000 0.027077
3 1 S 4 64.7100 0.101718
4 1 S 5 21.0600 0.274740
5 1 S 6 7.4950 0.448564
6 1 S 7 2.7970 0.285074
7 1 S 8 0.5215 0.015204
8 1 S 9 0.1596 -0.003191
9 2 S 10 6665.0000 -0.000146
10 2 S 11 1000.0000 -0.001154
11 2 S 12 228.0000 -0.005725
12 2 S 13 64.7100 -0.023312
13 2 S 14 21.0600 -0.063955
14 2 S 15 7.4950 -0.149981
15 2 S 16 2.7970 -0.127262
16 2 S 17 0.5215 0.544529
17 2 S 18 0.1596 0.580496
18 3 S 19 0.1596 1.000000
19 4 P 20 9.4390 0.038109
20 4 P 21 2.0020 0.209480
21 4 P 22 0.5456 0.508557
22 4 P 23 0.1517 0.468842
23 5 P 24 0.1517 1.000000
24 6 D 25 0.5500 1.000000
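If you do want one dataframe per molecule, a hedged variation of the loop above could collect the rows into a dict keyed by the name line (same made-up column names, same input file 'x.txt'):

import pandas as pd

columns = ['basisnum', 'basistype', 'primitive', 'energy', 'delta']
frames = {}                                  # one dataframe per molecule name
name, rows = None, []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:                      # a name line starts a new molecule
        if name is not None:
            frames[name] = pd.DataFrame(rows, columns=columns)
        name, rows = parts[0], []
        counter1 = counter2 = 0
    elif len(parts) == 2:                    # shell header: letter and primitive count
        counter1 += 1
        shell = (counter1, parts[0])
    else:                                    # primitive line: index, exponent, coefficient
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))
if name is not None:                         # flush the last molecule
    frames[name] = pd.DataFrame(rows, columns=columns)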
Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's incomplete (another elif is needed to read SP/L basis functions), but it works.
import basis_set_exchange as bse
import pandas as pd

basis = bse.get_basis('cc-pVDZ', fmt='gamess_us', elements='C', header=False)
basis = basis[5:-4]
print(basis, '\n')

buf = basis.split('\n')
buf.pop(2)

shellNumber = 0
shellType = ''
rows = []
for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:        # shell header: letter and number of primitives
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:      # primitive line: index, exponent, coefficient
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])
print(df)
Output:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
SHELL TYPE SHELL NO EXPONENT CONTR COEF
0 S 1 6665.0000 0.000692
1 S 1 1000.0000 0.005329
2 S 1 228.0000 0.027077
3 S 1 64.7100 0.101718
4 S 1 21.0600 0.274740
5 S 1 7.4950 0.448564
6 S 1 2.7970 0.285074
7 S 1 0.5215 0.015204
8 S 1 0.1596 -0.003191
9 S 2 6665.0000 -0.000146
10 S 2 1000.0000 -0.001154
11 S 2 228.0000 -0.005725
12 S 2 64.7100 -0.023312
13 S 2 21.0600 -0.063955
14 S 2 7.4950 -0.149981
15 S 2 2.7970 -0.127262
16 S 2 0.5215 0.544529
17 S 2 0.1596 0.580496
18 S 3 0.1596 1.000000
19 P 4 9.4390 0.038109
20 P 4 2.0020 0.209480
21 P 4 0.5456 0.508557
22 P 4 0.1517 0.468842
23 P 5 0.1517 1.000000
24 D 6 0.5500 1.000000
After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
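For reference, a hedged sketch of the missing elif for SP/L shells, whose primitive lines carry a fourth column (a second contraction coefficient); rows_sp and the exact layout are placeholders:

rows = []
rows_sp = []
shellNumber = 0
shellType = ''
for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:            # shell header
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:          # ordinary shell: index, exponent, coefficient
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))
    elif len(parts) == 4:          # SP/L shell: index, exponent, s coefficient, p coefficient
        rows_sp.append((shellType, shellNumber, float(parts[1]), float(parts[2]), float(parts[3])))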
I asked a question about this earlier; it has since been deleted, and I have rewritten it here in a less verbose form so it is easier to read.
I have a dataframe as given below
df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''
I would like to fill in the fake_flag column based on the rules below:
1) If the preceding two rows are constant (e.g. 5, 5) or decreasing (e.g. 7, 5), pick the higher of the two values: 7 from (7, 5), 5 from (5, 5).
2) Check whether the current row exceeds the rule 1 output by 3 or more points (>= 3) and whether the same value repeats in the next row (two consecutive occurrences of the same value). For example, if the rule 1 output is 5, a value of 8 or above qualifies (8 in row n and 8 in row n+1, or 10 in row n and 10 in row n+1). If both conditions hold, write "fake VAC" in the fake_flag column.
This is what I tried
# here t1 refers to the PEEP series, i.e. t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]:  # rule 1 check
            # the max of the two is t1[i-2]; when they are constant it doesn't matter, both have the same value
            r1_output = t1[i-2]
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check for next value is same as current value", t1[i+1])
                if t1[i] == t1[i+1]:  # rule 2 check
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
This check should happen record by record for each subject_id. My dataset has a million records, so an efficient and elegant solution would be helpful; I can't afford to loop over a million rows.
I expect my output to be as shown below (expected-output screenshots for subject_id = 1 and subject_id = 2 are not reproduced here).
import pandas as pd
import numpy as np

df = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] , 'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['shift1'] = df['PEEP'].shift(1)
df['shift2'] = df['PEEP'].shift(2)
df['fake_flag'] = np.where((df['shift1'] == df['shift2']) | (df['shift1'] < df['shift2']), 'fake VAC', '')
df = df.drop(['shift1', 'shift2'], axis=1)
Output (columns: row index, subject_id, day, PEEP, fake_flag):
0 1 1 7
1 1 2 5
2 1 3 10 fake VAC
3 1 4 10
4 1 5 11 fake VAC
5 1 6 11
6 1 7 14 fake VAC
7 1 8 14
8 1 9 17 fake VAC
9 1 10 17
10 1 11 21 fake VAC
11 1 12 21
12 1 13 23 fake VAC
13 1 14 23
14 1 15 25 fake VAC
15 1 16 25
16 1 17 22 fake VAC
17 1 18 20 fake VAC
18 1 19 26 fake VAC
19 1 20 26
20 2 1 5 fake VAC
21 2 2 7 fake VAC
22 2 3 8
23 2 4 8
24 2 5 9 fake VAC
25 2 6 9
26 2 7 13 fake VAC
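For what it's worth, a vectorised sketch that applies both rules per subject_id (column names as in the question; this is an assumption-laden sketch rather than a drop-in replacement for the code above) could look like this:

import numpy as np

g = df.groupby('subject_id')['PEEP']
prev1 = g.shift(1)                    # PEEP in row n-1, per subject
prev2 = g.shift(2)                    # PEEP in row n-2, per subject
nxt = g.shift(-1)                     # PEEP in row n+1, per subject

rule1 = prev1 <= prev2                # preceding pair is constant or decreasing
rule1_out = np.maximum(prev1, prev2)  # higher of the two preceding values
rule2 = (df['PEEP'] >= rule1_out + 3) & (df['PEEP'] == nxt)

df['fake_flag'] = np.where(rule1 & rule2, 'fake VAC', '')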
I have these columns:
index, area, key0
I have to group by index (a regular column that happens to be called index) so that rows sharing the same value are taken together: all the ones, all the twos, etc. Some rows are unique, though.
For the rows that are not unique: within each group I have to find the row with the largest area and give its key0 value to the other rows of that group, in a new column called key1. The unique rows keep the value they already had in key0 in the new key1 column.
What I have done so far: first I checked which index values occur more than once, in order to know which rows are going to form groups.
df['index'].value_counts()[df['index'].value_counts()>1]
359 9
391 8
376 7
374 6
354 5
446 4
403 4
348 4
422 4
424 4
451 4
364 3
315 3
100 3
245 3
345 3
247 3
346 3
347 3
351 3
That worked fine. The question now is: how do I do the rest?
the dataset:
df = pd.DataFrame({"index": [1,2,3,5,1,2,3,3,3],
                   "area": [50,60,70,80,90,100,10,20,70],
                   "key0": ["1f",2,"3d",4,5,6,7,8,9]})
print(df)
# INPUT
area index key0
50 1 1f
60 2 2
70 3 3d
80 5 4
90 1 5
100 2 6
10 3 7
20 3 8
70 3 9
My attempt on the actual dataset (a shapefile read with GeoPandas):
import geopandas as gpd

inte = gpd.read_file('in.shp')
inte["rank_gr"] = inte.groupby("index")["area_of_poly"].rank(ascending=False, method="first")
inte["key1_temp"] = inte.apply(lambda row: str(row[""]) if row["rank_gr"] == 1.0 else "", axis=1)
inte["CAD_ADMIN_FINAL"] = inte.groupby("index")["key1_temp"].transform("sum")
print(inte[["area_of_poly", "index", "CAD_ADMIN", "CAD_ADMIN_FINAL"]])
I checked with the data you provided, and it works. I didn't find any "key0" column, so I assumed it corresponds to "CAD_ADMIN". "AREA" has only one value, so I took "AREA_2".
import geopandas as gpd

# set your path
path = r"p\in.shp"
p = gpd.read_file(path)
p["rank_gr"] = p.groupby("index")["AREA_2"].rank(ascending=False, method="first")
p["key1_temp"] = p.apply(lambda row: str(row["CAD_ADMIN"]) if row["rank_gr"] == 1.0 else "", axis=1)
p["key1"] = p.groupby("index")["key1_temp"].transform("sum")
p = p[["AREA_2", "index", "CAD_ADMIN", "key1"]]
print(p.sort_values(by=["index"]))
AREA_2 index CAD_ADMIN key1
1.866706e+06 0 0113924 0113924
1.559865e+06 1 0113927 0113926
1.593623e+06 1 0113926 0113926
1.927774e+06 2 0113922 0113922
1.927774e+06 3 0113922 0113922
Do you mean something like this?
import pandas as pd

df = pd.DataFrame({"index": [1,2,3,5,1,2,3,3,3],
                   "area": [50,60,70,80,90,100,10,20,70],
                   "key0": ["1f",2,"3d",4,5,6,7,8,9]})
print(df)
# INPUT
area index key0
50 1 1f
60 2 2
70 3 3d
80 5 4
90 1 5
100 2 6
10 3 7
20 3 8
70 3 9
df["rank_gr"] = df.groupby("index")["area"].rank(ascending = False, method =
"first")
df["key1_temp"] = df.apply(lambda row: str(row["key0"]) if row["rank_gr"] == 1.0
else "", axis = 1)
df["key1"] = df.groupby("index")["key1_temp"].transform("sum")
print df[["area", "index", "key0", "key1"]]
# OUTPUT
area index key0 key1
50 1 1f 5
60 2 2 6
70 3 3d 3d
80 5 4 4
90 1 5 5
100 2 6 6
10 3 7 3d
20 3 8 3d
70 3 9 3d
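As a side note, a shorter sketch that should produce the same key1 column on the toy data (using idxmax to look up the key0 of the largest-area row per group) would be:

# map each "index" group to the key0 of its largest-area row
largest = df.loc[df.groupby("index")["area"].idxmax(), ["index", "key0"]]
df["key1"] = df["index"].map(largest.set_index("index")["key0"])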
I have a pandas df which is more or less like
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am now trying to generate some descriptors to capture the temporal nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that particular key within the window. I did an implementation, but by my estimate the calculation for 23 different windows would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have the uneasy feeling, however, that iterating row by row is probably not the smartest way to do this aggregation. Is there a way to implement it so it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas .loc label slicing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of times key appears in df prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records this will still take a while, and the relative performance gains will likely be somewhat diminished compared with this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
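As a side note, the cumsum trick works because dist is a column of ones in this sample data; a sketch of an equivalent that does not rely on that would be cumcount(), which numbers the rows within each key group starting at 0 (i.e. the number of earlier rows with the same key):

# equivalent of groupby('key').cumsum().subtract(1) when dist is all ones
df['window'] = df.groupby('key').cumcount()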