How to count row-by-row with a condition - python

I want to count combinations of the STATE and ID columns row by row in my DataFrame, but I am getting a KeyError. My current code is below. For each ID number (1 to 12) I want to count the state values. Here is my data; I have thousands of rows like this.
# this code works for the STATE column
chars = "TGC-"
nums = {}
for char in chars:
    s = df["STATE"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num
ATnum = nums["T"]
AGnum = nums["G"]
ACnum = nums["C"]
A_num = nums["-"]
ATnum
Out[26]:
False 51919
True 1248
dtype: int64
# and this one works for the ID column
pt = df.sort("ID")["ID"]        # df.sort_values("ID") in newer pandas
pt_num = pt.value_counts()
pt_values = pt_num.order()      # pt_num.sort_values() in newer pandas
pt_index = pt_num.sort_index()
# these are the total counts for each ID
pt_num
Out[27]:
10 5241
6 5144
11 4561
2 4439
3 4346
5 4284
9 4244
12 4218
7 4217
1 4210
4 4199
8 4064
dtype: int64
# I combine the ID and STATE columns and try to read row by row
draft
Out[21]:
ID STATE
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
11 2 chr1:100671861:T:-
12 4 chr1:1021390:C:A
13 5 chr1:10228220:G:C
14 3 chr1:1026913:C:T
15 4 chr1:1026913:C:T
... ... ...
53152 6 chrY:21618583:G:C
53153 5 chrY:24443836:T:G
53154 6 chrY:24443836:T:G
53155 8 chrY:24443836:T:G
53156 10 chrY:24443836:T:G
53157 12 chrY:24443836:T:G
53158 3 chrY:5605924:C:T
53159 2 chrY:6932151:G:A
53160 10 chrY:7224175:G:T
53161 2 chrY:9197998:C:T
53162 3 chrY:9197998:C:T
53163 4 chrY:9197998:C:T
53164 11 chrY:9197998:C:T
53165 12 chrY:9197998:C:T
53166 11 chrY:9304866:G:A
[53167 rows x 2 columns]
draft = df[["ID", "STATE"]]
chars = "TGC-"
number = {}
d = draft
for i in d["ID"]:
    if i == 1:
        for item in chars:
            At = d["STATE"].str.contains("A:" + item)
            num = At.value_counts(sort=True)
            number[item] = num
            ATn = number["T"]
            AGn = number["G"]
            ACn = number["C"]
            A_n = number["-"]
KeyError: 'G'
In total, what I want is this: ID 1, for example, has 4210 rows; I want to determine how many of those rows have a state of A:T, A:G, A:C or A:-.
Where am I going wrong?
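For reference, a hedged sketch of one way to get these per-ID counts without an explicit row loop, assuming the DataFrame has the ID and STATE columns shown above (the toy data here is illustrative only):

```python
import pandas as pd

# toy stand-in for the real DataFrame (illustrative values only)
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 1],
    "STATE": ["chr1:100:A:T", "chr1:200:A:G", "chr1:300:A:T",
              "chr1:400:C:T", "chr1:500:A:-"],
})

counts = {}
for char in "TGC-":
    # boolean mask per row, then count the True values within each ID
    mask = df["STATE"].str.contains("A:" + char, regex=False)
    counts["A:" + char] = mask.groupby(df["ID"]).sum()

result = pd.DataFrame(counts).fillna(0).astype(int)
print(result)
```

Because the mask is grouped by the ID column, no inner loop over rows is needed; each column of `result` holds the per-ID count for one state.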


Parsing a string line by line into arrays

I'm quite a novice in Python, but I need to do some parsing for a research project. I see it as the most difficult part that I need to overcome before doing the actual science. One of the basic things I need is to convert a string of data into NumPy arrays. An example of the data:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
This needs to be read for N arbitrary atoms, so it may be called in a for loop, and the first line may be omitted. The parser has to read the letter (S, L, P, D, F), the number Nc to the right of it, loop over the next Nc lines, and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class. That would form a contracted Gaussian-type orbital, and I would do some math with it. If the letter is L, I would need a different class because a 4th column appears. If Nc == 1, there is just one line to read and yet another class. After all N strings are read, the data should look something like this:
C
1 S 1 6665.000000 0.363803 ( 0.000692)
1 S 2 1000.000000 0.675392 ( 0.005329)
1 S 3 228.000000 1.132301 ( 0.027077)
1 S 4 64.710000 1.654004 ( 0.101718)
1 S 5 21.060000 1.924978 ( 0.274740)
1 S 6 7.495000 1.448149 ( 0.448564)
1 S 7 2.797000 0.439427 ( 0.285074)
1 S 8 0.521500 0.006650 ( 0.015204)
1 S 9 0.159600 -0.000574 ( -0.003191)
2 S 10 6665.000000 -0.076756 ( -0.000146)
2 S 11 1000.000000 -0.146257 ( -0.001154)
2 S 12 228.000000 -0.239407 ( -0.005725)
2 S 13 64.710000 -0.379069 ( -0.023312)
2 S 14 21.060000 -0.448104 ( -0.063955)
2 S 15 7.495000 -0.484201 ( -0.149981)
2 S 16 2.797000 -0.196168 ( -0.127262)
2 S 17 0.521500 0.238162 ( 0.544529)
2 S 18 0.159600 0.104468 ( 0.580496)
3 S 19 0.159600 0.179964 ( 1.000000)
4 P 20 9.439000 0.898722 ( 0.038109)
4 P 21 2.002000 0.711071 ( 0.209480)
4 P 22 0.545600 0.339917 ( 0.508557)
4 P 23 0.151700 0.063270 ( 0.468842)
5 P 24 0.151700 0.134950 ( 1.000000)
6 D 25 0.550000 0.578155 ( 1.000000)
C
7 S 26 6665.000000 0.363803 ( 0.000692)
7 S 27 1000.000000 0.675392 ( 0.005329)
7 S 28 228.000000 1.132301 ( 0.027077)
7 S 29 64.710000 1.654004 ( 0.101718)
7 S 30 21.060000 1.924978 ( 0.274740)
7 S 31 7.495000 1.448149 ( 0.448564)
7 S 32 2.797000 0.439427 ( 0.285074)
7 S 33 0.521500 0.006650 ( 0.015204)
7 S 34 0.159600 -0.000574 ( -0.003191)
8 S 35 6665.000000 -0.076756 ( -0.000146)
8 S 36 1000.000000 -0.146257 ( -0.001154)
8 S 37 228.000000 -0.239407 ( -0.005725)
8 S 38 64.710000 -0.379069 ( -0.023312)
8 S 39 21.060000 -0.448104 ( -0.063955)
8 S 40 7.495000 -0.484201 ( -0.149981)
8 S 41 2.797000 -0.196168 ( -0.127262)
8 S 42 0.521500 0.238162 ( 0.544529)
8 S 43 0.159600 0.104468 ( 0.580496)
9 S 44 0.159600 0.179964 ( 1.000000)
10 P 45 9.439000 0.898722 ( 0.038109)
10 P 46 2.002000 0.711071 ( 0.209480)
10 P 47 0.545600 0.339917 ( 0.508557)
10 P 48 0.151700 0.063270 ( 0.468842)
11 P 49 0.151700 0.134950 ( 1.000000)
12 D 50 0.550000 0.578155 ( 1.000000)
This is an example of a full basis set of a molecule, made of individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are those read by the parser. How would you recommend I do this, so I get ordered data like the above? And how exactly can strings be read into arrays line by line? Python's functionality is overwhelming. I tried to use Pandas to convert the string into some array to "filter" it, but I couldn't make it work.
Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.
import pandas as pd

rows = []
for line in open('x.txt'):
    parts = line.strip().split()
    if len(parts) == 1:
        print(parts[0])
        counter1 = 0
        counter2 = 0
    elif len(parts) == 2:
        counter1 += 1
        shell = (counter1, parts[0])
    else:
        counter2 += 1
        rows.append(shell + (counter2, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['basisnum', 'basistype', 'primitive', 'energy', 'delta'])
print(df)
Output:
CARBON
basisnum basistype primitive energy delta
0 1 S 1 6665.0000 0.000692
1 1 S 2 1000.0000 0.005329
2 1 S 3 228.0000 0.027077
3 1 S 4 64.7100 0.101718
4 1 S 5 21.0600 0.274740
5 1 S 6 7.4950 0.448564
6 1 S 7 2.7970 0.285074
7 1 S 8 0.5215 0.015204
8 1 S 9 0.1596 -0.003191
9 2 S 10 6665.0000 -0.000146
10 2 S 11 1000.0000 -0.001154
11 2 S 12 228.0000 -0.005725
12 2 S 13 64.7100 -0.023312
13 2 S 14 21.0600 -0.063955
14 2 S 15 7.4950 -0.149981
15 2 S 16 2.7970 -0.127262
16 2 S 17 0.5215 0.544529
17 2 S 18 0.1596 0.580496
18 3 S 19 0.1596 1.000000
19 4 P 20 9.4390 0.038109
20 4 P 21 2.0020 0.209480
21 4 P 22 0.5456 0.508557
22 4 P 23 0.1517 0.468842
23 5 P 24 0.1517 1.000000
24 6 D 25 0.5500 1.000000
Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's incomplete (another elif is needed to read SP/L basis functions), but it works.
import basis_set_exchange as bse
import pandas as pd

basis = bse.get_basis('cc-pVDZ', fmt='gamess_us', elements='C', header=False)
basis = basis[5:-4]
print(basis, '\n')
buf = basis.split('\n')
buf.pop(2)
shellNumber = 0
shellType = ''
rows = []
for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))
df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])
print(df)
Output:
CARBON
S 9
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
3 2.280000E+02 2.707700E-02
4 6.471000E+01 1.017180E-01
5 2.106000E+01 2.747400E-01
6 7.495000E+00 4.485640E-01
7 2.797000E+00 2.850740E-01
8 5.215000E-01 1.520400E-02
9 1.596000E-01 -3.191000E-03
S 9
1 6.665000E+03 -1.460000E-04
2 1.000000E+03 -1.154000E-03
3 2.280000E+02 -5.725000E-03
4 6.471000E+01 -2.331200E-02
5 2.106000E+01 -6.395500E-02
6 7.495000E+00 -1.499810E-01
7 2.797000E+00 -1.272620E-01
8 5.215000E-01 5.445290E-01
9 1.596000E-01 5.804960E-01
S 1
1 1.596000E-01 1.000000E+00
P 4
1 9.439000E+00 3.810900E-02
2 2.002000E+00 2.094800E-01
3 5.456000E-01 5.085570E-01
4 1.517000E-01 4.688420E-01
P 1
1 1.517000E-01 1.000000E+00
D 1
1 5.500000E-01 1.0000000
SHELL TYPE SHELL NO EXPONENT CONTR COEF
0 S 1 6665.0000 0.000692
1 S 1 1000.0000 0.005329
2 S 1 228.0000 0.027077
3 S 1 64.7100 0.101718
4 S 1 21.0600 0.274740
5 S 1 7.4950 0.448564
6 S 1 2.7970 0.285074
7 S 1 0.5215 0.015204
8 S 1 0.1596 -0.003191
9 S 2 6665.0000 -0.000146
10 S 2 1000.0000 -0.001154
11 S 2 228.0000 -0.005725
12 S 2 64.7100 -0.023312
13 S 2 21.0600 -0.063955
14 S 2 7.4950 -0.149981
15 S 2 2.7970 -0.127262
16 S 2 0.5215 0.544529
17 S 2 0.1596 0.580496
18 S 3 0.1596 1.000000
19 P 4 9.4390 0.038109
20 P 4 2.0020 0.209480
21 P 4 0.5456 0.508557
22 P 4 0.1517 0.468842
23 P 5 0.1517 1.000000
24 D 6 0.5500 1.000000
After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
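The missing branch mentioned above could look roughly like this. It is only a sketch, assuming that L (SP) lines carry a fourth column holding a second contraction coefficient; the column name CONTR COEF 2 and the toy input are made up here:

```python
import pandas as pd

# toy input with an L (SP) shell: a fourth column appears (assumed layout)
text = """\
S 2
1 6.665000E+03 6.920000E-04
2 1.000000E+03 5.329000E-03
L 2
1 9.439000E+00 3.810900E-02 1.234000E-02
2 2.002000E+00 2.094800E-01 5.678000E-02
"""

shellNumber = 0
shellType = ''
rows = []
for line in text.splitlines():
    parts = line.strip().split()
    if len(parts) == 2:
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2]), None))
    elif len(parts) == 4:
        # L/SP shells: keep the extra coefficient in a separate field
        rows.append((shellType, shellNumber, float(parts[1]),
                     float(parts[2]), float(parts[3])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT',
                                 'CONTR COEF', 'CONTR COEF 2'])
print(df)
```

S/P/D rows simply leave the extra column empty, so the rest of the pipeline is unchanged.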

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to: make use of the fact that the index contains consecutive integers; find the index values where the class changes; for each such index i, generate the sequence of indices from i-2 to i+1 and concatenate them; then retrieve the rows whose indices are not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in [range(i - 2, i + 2) for i in ind] for item in sublist])]
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])

df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
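For completeness, a NumPy-based sketch of the same idea, assuming (as above) a default consecutive integer index and n = 2; the clipping step guards against negative indices when a change occurs near the start:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'my_class': [1] * 3 + [2] * 6 + [3] * 3,
                   'value': range(1, 13)})
n = 2

# first row of each new class
mask = df['my_class'].ne(df['my_class'].shift()).to_numpy()
mask[0] = False                    # row 0 starts no change
change = np.flatnonzero(mask)      # -> [3, 9] for this data

# n rows before and n rows after each change point
drop = np.concatenate([np.arange(i - n, i + n) for i in change])
drop = np.unique(drop[(drop >= 0) & (drop < len(df))])
print(df.drop(index=drop))
```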

Reassign index of a dataframe

I have the following dataframe:
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
Name: Passengers, dtype: float64
As you can see, the numbers are listed twice, 1-12 and then 1-12 again; instead, I would like to change the index to 1-24. The problem shows up when plotting:
plt.figure(figsize=(15,5))
plt.plot(esta2,color='orange')
plt.show()
I would like to see a continuous line from 1 to 24.
esta2 = esta2.reset_index(drop=True) will get you 0-23 (drop=True keeps it a Series instead of moving the old index into a column). If you need 1-24, then you could just do esta2.index = np.arange(1, len(esta2) + 1).
Quite simply:
df.index = [i for i in range(1,len(df.index)+1)]
df.index.name = 'Month'
print(df)
Val
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
13 -0.075844
14 -0.089111
15 0.042705
16 0.002147
17 -0.010528
18 0.109443
19 0.198334
20 0.209830
21 0.075139
22 -0.062405
23 -0.211774
24 -0.109167
Just reassign the index:
df.index = pd.Index(range(1, len(df) + 1), name='Month')

Replacing the first string character in python 3

I have a pandas Series like this:
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
I want to get rid of the dollar sign so I can convert the values to numeric. I wrote a function to do this:
def strip_dollar(series):
    for number in dollar:
        if number[0] == '$':
            number[0].replace('$', ' ')
    return dollar
This function returns the original series untouched, and I don't know why.
Any ideas about how to get this right?
Thanks in advance
Use lstrip and convert to floats:
s = s.str.lstrip('$').astype(float)
print (s)
0 233.94
1 214.14
2 208.74
3 232.14
4 187.15
5 262.73
6 176.35
7 266.33
8 174.55
9 221.34
10 199.74
11 228.54
12 228.54
13 196.15
14 269.93
15 257.33
16 246.53
17 226.74
Name: A, dtype: float64
Setup:
s = pd.Series(['$233.94', '$214.14', '$208.74', '$232.14', '$187.15', '$262.73', '$176.35', '$266.33', '$174.55', '$221.34', '$199.74', '$228.54', '$228.54', '$196.15', '$269.93', '$257.33', '$246.53', '$226.74'])
print (s)
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
dtype: object
Using str.replace("$", "", regex=False) (regex=False, because $ would otherwise be treated as a regex end-of-string anchor):
Ex:
import pandas as pd

df = pd.DataFrame({"Col": ["$233.94", "$214.14"]})
df["Col"] = pd.to_numeric(df["Col"].str.replace("$", "", regex=False))
print(df)
Output:
Col
0 233.94
1 214.14
CODE:
ser = pd.Series(data=['$123', '$234', '$232', '$6767'])

def rmDollar(x):
    return x[1:]

serWithoutDollar = ser.apply(rmDollar)
serWithoutDollar
OUTPUT:
0 123
1 234
2 232
3 6767
dtype: object
Hope it helps!
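For reference, the loop in the question fails because Python strings are immutable: number[0].replace('$', ' ') builds a new string and throws it away (and the loop also iterates over an undefined name, dollar, instead of the series argument). A minimal corrected version of that row-by-row idea might look like this:

```python
import pandas as pd

def strip_dollar(series):
    # replace() returns a new string, so the results must be collected
    return pd.Series([x.replace('$', '') for x in series],
                     index=series.index).astype(float)

s = pd.Series(['$233.94', '$214.14', '$208.74'])
print(strip_dollar(s))
```

The vectorized str.lstrip and str.replace answers above are still the idiomatic choices; this just shows why the original returned the data unchanged.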

In Pandas, how to operate between columns with max performance

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every clienthostid, look for the usersidid with the highest LoginDaysSum, and check whether some usersidid has the highest LoginDaysSum in more than one clienthostid (for instance, usersidid 9 has the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In this case, I want to choose the row with the highest LoginDaysSum (in the example, the row with 1728); let's call it maxRT.
I then want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, rows 4 and 7).
If the ratio is below 0.8 then I want to drop the row:
index 4: LoginDaysSumLast7Days ratio = 7/32 < 0.8, so the row is dropped
index 7: LoginDaysSumLast7Days ratio = 7/3 > 0.8, so the row stays
The same condition will also be applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not make it work right, it also ran way too slow :(
My code so far (forgive me, it's written terribly wrong; I only started working with SQL, Pandas and Python last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
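The same logic can also be written without the index round-trip. This is only a sketch of an equivalent vectorized approach (the intermediate names top, maxrt, etc. are made up), using idxmax to find the top row per clienthostid and then per usersidid:

```python
import pandas as pd

df = pd.DataFrame({
    'usersidid':             [9, 3, 6, 4, 9, 12, 8, 9],
    'clienthostid':          [1, 1, 1, 1, 2, 3, 3, 3],
    'LoginDaysSumLastMonth': [50, 43, 98, 10, 64, 45, 87, 16],
    'LoginDaysSumLast7Days': [7, 3, 9, 6, 32, 43, 76, 3],
    'LoginDaysSum':          [1728, 1331, 216, 64, 343, 1000, 512, 1200],
})

# row with the highest LoginDaysSum within each clienthostid
top = df.loc[df.groupby('clienthostid')['LoginDaysSum'].idxmax()]
# among those, the single best row per usersidid (maxRT)
maxrt = top.loc[top.groupby('usersidid')['LoginDaysSum'].idxmax()]

# ratios of maxRT's values to each top row's values, aligned by usersidid
m = maxrt.set_index('usersidid')
r7 = (m['LoginDaysSumLast7Days'].reindex(top['usersidid']).to_numpy()
      / top['LoginDaysSumLast7Days'].to_numpy())
rm = (m['LoginDaysSumLastMonth'].reindex(top['usersidid']).to_numpy()
      / top['LoginDaysSumLastMonth'].to_numpy())

# drop the top rows whose ratio falls below 0.8 on either measure
result = df.drop(top.index[(r7 < 0.8) | (rm < 0.8)])
print(result)
```

On the sample data this drops only row 4, matching the expected output; maxRT itself always has ratio 1 and is never dropped.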
