Parsing a string line by line into arrays in Python
I'm quite a novice in Python, but I need to do some parsing for a research project, and right now it feels like the hardest part standing between me and the actual science. One of the basic things I need is to convert a string containing the data into NumPy arrays. An example of the data:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
This needs to be read for N arbitrary atoms, so it may be called in a for loop, and the first line may be omitted. The parser has to read the letter (S, L, P, D, F), the number Nc to the right of it, loop over the next Nc lines, and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class (a rough sketch of this structure follows the expected output below). That would form a contracted Gaussian-type orbital that I could then do some math with. If the letter is L, I would need a different class, because a 4th column appears. If Nc == 1, there is just one line to read, and yet another class. After all N strings have been read, the data should look something like this:
C
1 S 1 6665.000000 0.363803 ( 0.000692)
1 S 2 1000.000000 0.675392 ( 0.005329)
1 S 3 228.000000 1.132301 ( 0.027077)
1 S 4 64.710000 1.654004 ( 0.101718)
1 S 5 21.060000 1.924978 ( 0.274740)
1 S 6 7.495000 1.448149 ( 0.448564)
1 S 7 2.797000 0.439427 ( 0.285074)
1 S 8 0.521500 0.006650 ( 0.015204)
1 S 9 0.159600 -0.000574 ( -0.003191)
2 S 10 6665.000000 -0.076756 ( -0.000146)
2 S 11 1000.000000 -0.146257 ( -0.001154)
2 S 12 228.000000 -0.239407 ( -0.005725)
2 S 13 64.710000 -0.379069 ( -0.023312)
2 S 14 21.060000 -0.448104 ( -0.063955)
2 S 15 7.495000 -0.484201 ( -0.149981)
2 S 16 2.797000 -0.196168 ( -0.127262)
2 S 17 0.521500 0.238162 ( 0.544529)
2 S 18 0.159600 0.104468 ( 0.580496)
3 S 19 0.159600 0.179964 ( 1.000000)
4 P 20 9.439000 0.898722 ( 0.038109)
4 P 21 2.002000 0.711071 ( 0.209480)
4 P 22 0.545600 0.339917 ( 0.508557)
4 P 23 0.151700 0.063270 ( 0.468842)
5 P 24 0.151700 0.134950 ( 1.000000)
6 D 25 0.550000 0.578155 ( 1.000000)
C
7 S 26 6665.000000 0.363803 ( 0.000692)
7 S 27 1000.000000 0.675392 ( 0.005329)
7 S 28 228.000000 1.132301 ( 0.027077)
7 S 29 64.710000 1.654004 ( 0.101718)
7 S 30 21.060000 1.924978 ( 0.274740)
7 S 31 7.495000 1.448149 ( 0.448564)
7 S 32 2.797000 0.439427 ( 0.285074)
7 S 33 0.521500 0.006650 ( 0.015204)
7 S 34 0.159600 -0.000574 ( -0.003191)
8 S 35 6665.000000 -0.076756 ( -0.000146)
8 S 36 1000.000000 -0.146257 ( -0.001154)
8 S 37 228.000000 -0.239407 ( -0.005725)
8 S 38 64.710000 -0.379069 ( -0.023312)
8 S 39 21.060000 -0.448104 ( -0.063955)
8 S 40 7.495000 -0.484201 ( -0.149981)
8 S 41 2.797000 -0.196168 ( -0.127262)
8 S 42 0.521500 0.238162 ( 0.544529)
8 S 43 0.159600 0.104468 ( 0.580496)
9 S 44 0.159600 0.179964 ( 1.000000)
10 P 45 9.439000 0.898722 ( 0.038109)
10 P 46 2.002000 0.711071 ( 0.209480)
10 P 47 0.545600 0.339917 ( 0.508557)
10 P 48 0.151700 0.063270 ( 0.468842)
11 P 49 0.151700 0.134950 ( 1.000000)
12 D 50 0.550000 0.578155 ( 1.000000)
This is an example of a full basis set for a molecule, built from individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are the values read by the parser. How would you recommend doing this, so that I end up with ordered data like the above? And how exactly can a string be read into arrays line by line? Python's functionality is overwhelming. I tried to use Pandas to convert the string into some array and "filter" it, but I couldn't get it to work.
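For reference, here is the rough shape of the reader I have in mind. It is only a sketch (the class name and helper function are made up, and the L-shell and Nc == 1 cases would still need their own classes):

import numpy as np

class ContractedGTO:
    # One contracted shell: a letter (S, P, D, ...) plus NumPy arrays of the
    # exponents and contraction coefficients from the 2nd and 3rd columns.
    def __init__(self, letter, exponents, coefficients):
        self.letter = letter
        self.exponents = np.asarray(exponents, dtype=float)
        self.coefficients = np.asarray(coefficients, dtype=float)

def read_shells(lines):
    # Parse an iterable of text lines into a list of ContractedGTO objects.
    shells = []
    it = iter(lines)
    for line in it:
        parts = line.split()
        if len(parts) != 2:            # skip the element name and blank lines
            continue
        letter, nc = parts[0], int(parts[1])
        exps, coefs = [], []
        for _ in range(nc):            # consume the next Nc primitive lines
            cols = next(it).split()
            exps.append(float(cols[1]))
            coefs.append(float(cols[2]))
        shells.append(ContractedGTO(letter, exps, coefs))
    return shells

# usage, assuming the raw text above is in a string called data:
# shells = read_shells(data.splitlines())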
Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.
import pandas as pd

rows = []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:                # element name line, e.g. "CARBON"
        print(parts[0])
        counter1 = 0                   # contracted shell counter
        counter2 = 0                   # primitive counter
    elif len(parts) == 2:              # shell header, e.g. "S 9"
        counter1 += 1
        shell = (counter1, parts[0])
    else:                              # primitive: index, exponent, coefficient
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['basisnum', 'basistype', 'primitive', 'energy', 'delta'])
print(df)
Output:
CARBON
basisnum basistype primitive energy delta
0 1 S 1 6665.0000 0.000692
1 1 S 2 1000.0000 0.005329
2 1 S 3 228.0000 0.027077
3 1 S 4 64.7100 0.101718
4 1 S 5 21.0600 0.274740
5 1 S 6 7.4950 0.448564
6 1 S 7 2.7970 0.285074
7 1 S 8 0.5215 0.015204
8 1 S 9 0.1596 -0.003191
9 2 S 10 6665.0000 -0.000146
10 2 S 11 1000.0000 -0.001154
11 2 S 12 228.0000 -0.005725
12 2 S 13 64.7100 -0.023312
13 2 S 14 21.0600 -0.063955
14 2 S 15 7.4950 -0.149981
15 2 S 16 2.7970 -0.127262
16 2 S 17 0.5215 0.544529
17 2 S 18 0.1596 0.580496
18 3 S 19 0.1596 1.000000
19 4 P 20 9.4390 0.038109
20 4 P 21 2.0020 0.209480
21 4 P 22 0.5456 0.508557
22 4 P 23 0.1517 0.468842
23 5 P 24 0.1517 1.000000
24 6 D 25 0.5500 1.000000
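If you then want NumPy arrays per shell (the original goal) rather than one big dataframe, each group converts directly, for example:

# one (exponents, coefficients) pair of NumPy arrays per contracted shell
for (num, typ), grp in df.groupby(['basisnum', 'basistype']):
    exponents = grp['energy'].to_numpy()
    coefficients = grp['delta'].to_numpy()
    print(num, typ, exponents, coefficients)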
Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's still incomplete, since another elif is needed to read SP/L basis functions (a sketch of that branch follows the output below), but it works.
import basis_set_exchange as bse
import pandas as pd

# Fetch the cc-pVDZ basis set for carbon as a GAMESS-US formatted string.
basis = bse.get_basis('cc-pVDZ', fmt='gamess_us', elements='C', header=False)
basis = basis[5:-4]                    # strip the leading '$DATA' and trailing '$END'
print(basis, '\n')

buf = basis.split('\n')
buf.pop(2)                             # remove the third line, which is not shell data

shellNumber = 0
shellType = ''
rows = []
for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:                # shell header, e.g. "S 9"
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:              # primitive: index, exponent, coefficient
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])
print(df)
Output:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
SHELL TYPE SHELL NO EXPONENT CONTR COEF
0 S 1 6665.0000 0.000692
1 S 1 1000.0000 0.005329
2 S 1 228.0000 0.027077
3 S 1 64.7100 0.101718
4 S 1 21.0600 0.274740
5 S 1 7.4950 0.448564
6 S 1 2.7970 0.285074
7 S 1 0.5215 0.015204
8 S 1 0.1596 -0.003191
9 S 2 6665.0000 -0.000146
10 S 2 1000.0000 -0.001154
11 S 2 228.0000 -0.005725
12 S 2 64.7100 -0.023312
13 S 2 21.0600 -0.063955
14 S 2 7.4950 -0.149981
15 S 2 2.7970 -0.127262
16 S 2 0.5215 0.544529
17 S 2 0.1596 0.580496
18 S 3 0.1596 1.000000
19 P 4 9.4390 0.038109
20 P 4 2.0020 0.209480
21 P 4 0.5456 0.508557
22 P 4 0.1517 0.468842
23 P 5 0.1517 1.000000
24 D 6 0.5500 1.000000
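The missing SP/L branch mentioned above could look roughly like this, as an untested sketch (the helper name is made up). Since an L line carries both an s- and a p-coefficient, those rows have five fields and would need their own list or dataframe:

def parse_primitive(parts, shellType, shellNumber):
    # Ordinary shell line: index, exponent, one contraction coefficient.
    if len(parts) == 3:
        return (shellType, shellNumber, float(parts[1]), float(parts[2]))
    # SP/L shell line: index, exponent, s-coefficient, p-coefficient.
    if len(parts) == 4:
        return (shellType, shellNumber,
                float(parts[1]), float(parts[2]), float(parts[3]))
    return None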
After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
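A minimal sketch of that transformation, assuming the calculation needs one set of NumPy arrays per contracted shell:

# group the parsed dataframe into per-shell NumPy arrays
shells = {}
for num, grp in df.groupby('SHELL NO'):
    shells[num] = (grp['SHELL TYPE'].iloc[0],       # shell letter
                   grp['EXPONENT'].to_numpy(),      # Gaussian exponents
                   grp['CONTR COEF'].to_numpy())    # contraction coefficients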