Parsing a string line by line into arrays in Python
I'm quite a novice in Python, but I need to do some parsing for a research project, and right now it feels like the hardest part standing between me and the actual science. One of the basic things I need is to convert a string containing the data into NumPy arrays. An example of the data:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
This needs to be read for N arbitrary atoms, so it may be called in a for loop, and the first line may be omitted. The parser has to read the letter (S, L, P, D, F), the number Nc to the right of it, loop over the next Nc lines, and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class (a rough sketch of this structure follows the expected output below). That would form a contracted Gaussian-type orbital that I could then do some math with. If the letter is L, I would need a different class, because a 4th column appears. If Nc == 1, there is just one line to read, and yet another class. After all N strings have been read, the data should look something like this:
C
1 S 1 6665.000000 0.363803 ( 0.000692)
1 S 2 1000.000000 0.675392 ( 0.005329)
1 S 3 228.000000 1.132301 ( 0.027077)
1 S 4 64.710000 1.654004 ( 0.101718)
1 S 5 21.060000 1.924978 ( 0.274740)
1 S 6 7.495000 1.448149 ( 0.448564)
1 S 7 2.797000 0.439427 ( 0.285074)
1 S 8 0.521500 0.006650 ( 0.015204)
1 S 9 0.159600 -0.000574 ( -0.003191)
2 S 10 6665.000000 -0.076756 ( -0.000146)
2 S 11 1000.000000 -0.146257 ( -0.001154)
2 S 12 228.000000 -0.239407 ( -0.005725)
2 S 13 64.710000 -0.379069 ( -0.023312)
2 S 14 21.060000 -0.448104 ( -0.063955)
2 S 15 7.495000 -0.484201 ( -0.149981)
2 S 16 2.797000 -0.196168 ( -0.127262)
2 S 17 0.521500 0.238162 ( 0.544529)
2 S 18 0.159600 0.104468 ( 0.580496)
3 S 19 0.159600 0.179964 ( 1.000000)
4 P 20 9.439000 0.898722 ( 0.038109)
4 P 21 2.002000 0.711071 ( 0.209480)
4 P 22 0.545600 0.339917 ( 0.508557)
4 P 23 0.151700 0.063270 ( 0.468842)
5 P 24 0.151700 0.134950 ( 1.000000)
6 D 25 0.550000 0.578155 ( 1.000000)
C
7 S 26 6665.000000 0.363803 ( 0.000692)
7 S 27 1000.000000 0.675392 ( 0.005329)
7 S 28 228.000000 1.132301 ( 0.027077)
7 S 29 64.710000 1.654004 ( 0.101718)
7 S 30 21.060000 1.924978 ( 0.274740)
7 S 31 7.495000 1.448149 ( 0.448564)
7 S 32 2.797000 0.439427 ( 0.285074)
7 S 33 0.521500 0.006650 ( 0.015204)
7 S 34 0.159600 -0.000574 ( -0.003191)
8 S 35 6665.000000 -0.076756 ( -0.000146)
8 S 36 1000.000000 -0.146257 ( -0.001154)
8 S 37 228.000000 -0.239407 ( -0.005725)
8 S 38 64.710000 -0.379069 ( -0.023312)
8 S 39 21.060000 -0.448104 ( -0.063955)
8 S 40 7.495000 -0.484201 ( -0.149981)
8 S 41 2.797000 -0.196168 ( -0.127262)
8 S 42 0.521500 0.238162 ( 0.544529)
8 S 43 0.159600 0.104468 ( 0.580496)
9 S 44 0.159600 0.179964 ( 1.000000)
10 P 45 9.439000 0.898722 ( 0.038109)
10 P 46 2.002000 0.711071 ( 0.209480)
10 P 47 0.545600 0.339917 ( 0.508557)
10 P 48 0.151700 0.063270 ( 0.468842)
11 P 49 0.151700 0.134950 ( 1.000000)
12 D 50 0.550000 0.578155 ( 1.000000)
This is an example of a full basis set for a molecule, built from individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are the values read by the parser. How would you recommend doing this, so that I end up with ordered data like the above? And how exactly can a string be read into arrays line by line? Python's functionality is overwhelming. I tried to use Pandas to convert the string into some array and "filter" it, but I couldn't get it to work.
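For reference, here is the rough shape of the reader I have in mind. It is only a sketch (the class name and helper function are made up, and the L-shell and Nc == 1 cases would still need their own classes):

import numpy as np

class ContractedGTO:
    # One contracted shell: a letter (S, P, D, ...) plus NumPy arrays of the
    # exponents and contraction coefficients from the 2nd and 3rd columns.
    def __init__(self, letter, exponents, coefficients):
        self.letter = letter
        self.exponents = np.asarray(exponents, dtype=float)
        self.coefficients = np.asarray(coefficients, dtype=float)

def read_shells(lines):
    # Parse an iterable of text lines into a list of ContractedGTO objects.
    shells = []
    it = iter(lines)
    for line in it:
        parts = line.split()
        if len(parts) != 2:            # skip the element name and blank lines
            continue
        letter, nc = parts[0], int(parts[1])
        exps, coefs = [], []
        for _ in range(nc):            # consume the next Nc primitive lines
            cols = next(it).split()
            exps.append(float(cols[1]))
            coefs.append(float(cols[2]))
        shells.append(ContractedGTO(letter, exps, coefs))
    return shells

# usage, assuming the raw text above is in a string called data:
# shells = read_shells(data.splitlines())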
Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.
import pandas as pd

rows = []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:                # element name line, e.g. "CARBON"
        print(parts[0])
        counter1 = 0                   # contracted shell counter
        counter2 = 0                   # primitive counter
    elif len(parts) == 2:              # shell header, e.g. "S 9"
        counter1 += 1
        shell = (counter1, parts[0])
    else:                              # primitive: index, exponent, coefficient
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['basisnum', 'basistype', 'primitive', 'energy', 'delta'])
print(df)
Output:
CARBON
basisnum basistype primitive energy delta
0 1 S 1 6665.0000 0.000692
1 1 S 2 1000.0000 0.005329
2 1 S 3 228.0000 0.027077
3 1 S 4 64.7100 0.101718
4 1 S 5 21.0600 0.274740
5 1 S 6 7.4950 0.448564
6 1 S 7 2.7970 0.285074
7 1 S 8 0.5215 0.015204
8 1 S 9 0.1596 -0.003191
9 2 S 10 6665.0000 -0.000146
10 2 S 11 1000.0000 -0.001154
11 2 S 12 228.0000 -0.005725
12 2 S 13 64.7100 -0.023312
13 2 S 14 21.0600 -0.063955
14 2 S 15 7.4950 -0.149981
15 2 S 16 2.7970 -0.127262
16 2 S 17 0.5215 0.544529
17 2 S 18 0.1596 0.580496
18 3 S 19 0.1596 1.000000
19 4 P 20 9.4390 0.038109
20 4 P 21 2.0020 0.209480
21 4 P 22 0.5456 0.508557
22 4 P 23 0.1517 0.468842
23 5 P 24 0.1517 1.000000
24 6 D 25 0.5500 1.000000
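If you then want NumPy arrays per shell (the original goal) rather than one big dataframe, each group converts directly, for example:

# one (exponents, coefficients) pair of NumPy arrays per contracted shell
for (num, typ), grp in df.groupby(['basisnum', 'basistype']):
    exponents = grp['energy'].to_numpy()
    coefficients = grp['delta'].to_numpy()
    print(num, typ, exponents, coefficients)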
Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's still incomplete, since another elif is needed to read SP/L basis functions (a sketch of that branch follows the output below), but it works.
import basis_set_exchange as bse
import pandas as pd

# Fetch the cc-pVDZ basis set for carbon as a GAMESS-US formatted string.
basis = bse.get_basis('cc-pVDZ', fmt='gamess_us', elements='C', header=False)
basis = basis[5:-4]                    # strip the leading '$DATA' and trailing '$END'
print(basis, '\n')

buf = basis.split('\n')
buf.pop(2)                             # remove the third line, which is not shell data

shellNumber = 0
shellType = ''
rows = []
for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:                # shell header, e.g. "S 9"
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:              # primitive: index, exponent, coefficient
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])
print(df)
Output:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
SHELL TYPE SHELL NO EXPONENT CONTR COEF
0 S 1 6665.0000 0.000692
1 S 1 1000.0000 0.005329
2 S 1 228.0000 0.027077
3 S 1 64.7100 0.101718
4 S 1 21.0600 0.274740
5 S 1 7.4950 0.448564
6 S 1 2.7970 0.285074
7 S 1 0.5215 0.015204
8 S 1 0.1596 -0.003191
9 S 2 6665.0000 -0.000146
10 S 2 1000.0000 -0.001154
11 S 2 228.0000 -0.005725
12 S 2 64.7100 -0.023312
13 S 2 21.0600 -0.063955
14 S 2 7.4950 -0.149981
15 S 2 2.7970 -0.127262
16 S 2 0.5215 0.544529
17 S 2 0.1596 0.580496
18 S 3 0.1596 1.000000
19 P 4 9.4390 0.038109
20 P 4 2.0020 0.209480
21 P 4 0.5456 0.508557
22 P 4 0.1517 0.468842
23 P 5 0.1517 1.000000
24 D 6 0.5500 1.000000
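The missing SP/L branch mentioned above could look roughly like this, as an untested sketch (the helper name is made up). Since an L line carries both an s- and a p-coefficient, those rows have five fields and would need their own list or dataframe:

def parse_primitive(parts, shellType, shellNumber):
    # Ordinary shell line: index, exponent, one contraction coefficient.
    if len(parts) == 3:
        return (shellType, shellNumber, float(parts[1]), float(parts[2]))
    # SP/L shell line: index, exponent, s-coefficient, p-coefficient.
    if len(parts) == 4:
        return (shellType, shellNumber,
                float(parts[1]), float(parts[2]), float(parts[3]))
    return None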
After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
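A minimal sketch of that transformation, assuming the calculation needs one set of NumPy arrays per contracted shell:

# group the parsed dataframe into per-shell NumPy arrays
shells = {}
for num, grp in df.groupby('SHELL NO'):
    shells[num] = (grp['SHELL TYPE'].iloc[0],       # shell letter
                   grp['EXPONENT'].to_numpy(),      # Gaussian exponents
                   grp['CONTR COEF'].to_numpy())    # contraction coefficients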