I have a DataFrame, let's say 3000x3000, with integer values from 0 to 10, and I want to break it down into categories and save each category to a separate file.
The categories should be something like 0-3, 4-5, 6-10, for example.
As a result I want three files of the same shape, each containing only the values relevant to its category, with those values kept at their original positions.
At first I thought of copying the df for each category and using replace to remove all irrelevant values, but that doesn't sound right.
Hope this is not too confusing.
df example:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 7 0
2 0 0 2 3 0 0 0 0 6 7
3 0 0 2 3 0 0 0 0 9 6
4 0 0 0 1 0 0 5 4 8 7
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
as the result I want 3 dataframes:
cat1:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
cat2:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 5 4 0 0
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
cat3:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 7 0
2 0 0 0 0 0 0 0 0 6 7
3 0 0 0 0 0 0 0 0 9 6
4 0 0 0 0 0 0 0 0 8 7
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You want DataFrame.where:
df1 = df.where((df > 0) & (df <=3), 0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You can write similar logic for df2 and df3
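To cover all three categories in one go, here is a minimal sketch built on the same where pattern, using a toy frame and placeholder file names (the bin edges and "cat*.csv" names are assumptions):

```python
import pandas as pd

# toy stand-in for the 3000x3000 frame
df = pd.DataFrame([[0, 1, 7],
                   [3, 5, 9],
                   [0, 4, 6]])

# value range per category; "cat1.csv" etc. are placeholder names
bins = {"cat1": (1, 3), "cat2": (4, 5), "cat3": (6, 10)}
for name, (lo, hi) in bins.items():
    # keep the original shape, zero out everything outside the range
    part = df.where((df >= lo) & (df <= hi), 0)
    part.to_csv(f"{name}.csv")
```

Each file keeps the original shape, with out-of-range cells replaced by 0.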
I have the following data frame:
Company Firm
125911 1
125911 2
32679 3
32679 5
32679 5
32679 8
32679 10
32679 12
43805 14
67734 8
67734 9
67734 10
67734 10
67734 11
67734 12
67734 13
74240 4
74240 6
74240 7
Here each firm makes an investment in a company in a specific year, which in this case is the same year for all companies. What I want to do in Python is create a simple adjacency matrix with only 0's and 1's: 1 if two firms have invested in the same company. Even if two firms, say 10 and 8, have invested together in more than one company, the entry should still just be 1.
The resulting matrix I am looking for looks like:
Firm 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I have seen similar questions that use crosstab; however, in those cases each company ends up with a single row and the firms spread across columns, which is not what I need. So I am wondering what the best and most efficient way to tackle this specific problem is. Any help is greatly appreciated.
import numpy as np
import pandas as pd

dfs = []
for s in df.groupby("Company").agg(list).values:
    dfs.append(pd.DataFrame(index=set(s[0]), columns=set(s[0])).fillna(1))
out = pd.concat(dfs).groupby(level=0).sum().gt(0).astype(int)
np.fill_diagonal(out.values, 0)
print(out)
Prints:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Another option is a self-merge on Company, which pairs every firm with its co-investors:
dfm = df.merge(df, on="Company").query("Firm_x != Firm_y")
out = pd.crosstab(dfm['Firm_x'], dfm['Firm_y'])
crosstab counts how often each pair co-occurs, so entries can exceed 1 when two firms share more than one company (e.g. firms 8 and 10). Clip the counts to 1, and re-add firms that never co-invest (here firm 14, which is dropped by the query):
out = out.clip(upper=1)
firms = sorted(df['Firm'].unique())
out = out.reindex(index=firms, columns=firms, fill_value=0)
This reproduces the expected matrix above.
I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the day of the week. Now I want to create this dataframe:
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
Depending on the DOW column: if it is 1 then MON_FLAG will be 1, if it is 2 then TUE_FLAG will be 1, and so on. I have kept Sunday as 0, which is why all the flag columns are zero in that case.
Use get_dummies and rename the columns with a dictionary:
d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG',
     3: 'WED_FLAG', 4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print(df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0
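One caveat if you are on a recent pandas: since pandas 2.0, get_dummies returns boolean columns by default, so passing dtype=int keeps the 0/1 output shown above. A self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"DOW": [0, 1, 2, 3, 4, 5, 6, 0, 1]})
d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG', 3: 'WED_FLAG',
     4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
# dtype=int forces 0/1 integer flags instead of True/False
out = df.join(pd.get_dummies(df['DOW'], dtype=int).rename(columns=d))
```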
I am looking for a way to calculate, for each cell in a dataframe, the sum of the values of all surrounding cells (including diagonals), without using a loop.
I have come up with something like the following, but it does not include diagonals, and as soon as I include diagonals some cells are counted too many times.
# Initializing matrix a
columns = [x for x in range(10)]
rows = [x for x in range(10)]
matrix = pd.DataFrame(index=rows, columns=columns).fillna(0)
# filling up with mock values
matrix.iloc[5,4] = 1
matrix.iloc[5,5] = 1
matrix.iloc[5,6] = 1
matrix.iloc[4,5] = 1
matrix1 = matrix.apply(lambda x: x.shift(1)).fillna(0)
matrix2 = matrix.T.apply(lambda x: x.shift(1)).T.fillna(0)
matrix3 = matrix.apply(lambda x: x.shift(-1)).fillna(0)
matrix4 = matrix.T.apply(lambda x: x.shift(-1)).T.fillna(0)
matrix_out = matrix1 + matrix2 + matrix3 + matrix4
To be more precise, I plan to populate the dataframe only with 0 or 1 values. The input built by the test above is:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 1 1 1 1 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
The expected output for this input is:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 1 1 0 0 0
4 0 0 0 1 3 3 4 2 1 0
5 0 0 0 1 2 3 3 1 1 0
6 0 0 0 1 2 3 3 2 1 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
Am I in the right direction with this shift() function used within apply, or would you suggest doing otherwise?
Thanks a lot!
Seems like you need:
def sum_diag(matrix):
    return (matrix.shift(1, axis=1).shift(1, axis=0)
            + matrix.shift(-1, axis=1).shift(1, axis=0)
            + matrix.shift(1, axis=1).shift(-1, axis=0)
            + matrix.shift(-1, axis=1).shift(-1, axis=0))

def sum_nxt(matrix):
    return (matrix.shift(-1) + matrix.shift(1)
            + matrix.shift(1, axis=1) + matrix.shift(-1, axis=1))

final = sum_nxt(matrix) + sum_diag(matrix)
Outputs
print(final.fillna(0).astype(int))
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 1 1 0 0 0
4 0 0 0 1 3 3 4 2 1 0
5 0 0 0 1 2 3 3 1 1 0
6 0 0 0 1 2 3 3 2 1 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
Notice that you might want to add .fillna(0) to all shift operations to ensure the borders behave well too, if numbers in the borders are not zero.
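The same neighbourhood sum can also be written as one pass over a zero-padded NumPy array, which sidesteps the NaN borders entirely. A sketch (the helper name neighbour_sum is mine):

```python
import numpy as np
import pandas as pd

def neighbour_sum(df):
    # pad with a border of zeros, then add the 8 shifted views
    a = np.pad(df.to_numpy(), 1)
    n, m = df.shape
    out = sum(a[1 + di:1 + di + n, 1 + dj:1 + dj + m]
              for di in (-1, 0, 1) for dj in (-1, 0, 1)
              if (di, dj) != (0, 0))
    return pd.DataFrame(out, index=df.index, columns=df.columns)

# same mock input as in the question
matrix = pd.DataFrame(0, index=range(10), columns=range(10))
matrix.iloc[5, 4:7] = 1
matrix.iloc[4, 5] = 1
result = neighbour_sum(matrix)
```

Because the padding supplies the zeros, no fillna is needed and nonzero values on the borders are handled correctly.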
I have a directory of 60,000 files that are named by their molid. I have a second file, in CSV format, with molids in column 1 and their respective CHEMBLIDs in column 2. I need to match each file name's molid in the directory with a molid in the CSV file. If a match is found, the CHEMBLID is added to the file (the file is rewritten to include the CHEMBLID). I am also using RDKit to calculate some properties that need to be written to the modified file too. I need to find a way to optimize this, since I will soon have to run it on 2 million files.
The way I am driving argparse is by listing all of the molid.sdf files in my directory with the bash parallel command.
The csv file looks like:
molid,chembl
319855,CHEMBLtest
187481,CHEMBL1527718
https://www.dropbox.com/s/6ynd9vbwwf6lqka/output_2.csv?dl=0
The file that needs to be modified looks like:
298512 from gamess16 based ATb pipeline
OpenBabel06141815083D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0 0 0 0
1 17 1 0 0 0 0
2 20 1 1 0 0 0
2 1 1 0 0 0 0
3 21 1 0 0 0 0
3 2 1 0 0 0 0
4 2 1 0 0 0 0
4 22 1 0 0 0 0
5 6 1 0 0 0 0
5 4 1 0 0 0 0
6 7 1 0 0 0 0
7 26 1 0 0 0 0
7 8 2 0 0 0 0
8 27 1 0 0 0 0
9 8 1 0 0 0 0
9 10 2 0 0 0 0
10 28 1 0 0 0 0
11 10 1 0 0 0 0
11 12 1 0 0 0 0
12 31 1 0 0 0 0
12 30 1 0 0 0 0
13 11 2 0 0 0 0
14 15 2 0 0 0 0
14 13 1 0 0 0 0
15 9 1 0 0 0 0
16 6 2 0 0 0 0
16 15 1 0 0 0 0
18 1 1 0 0 0 0
23 4 1 0 0 0 0
24 5 1 0 0 0 0
25 5 1 0 0 0 0
29 12 1 0 0 0 0
32 13 1 0 0 0 0
33 14 1 0 0 0 0
34 16 1 0 0 0 0
M END
> <molid>
298512
$$$$
https://www.dropbox.com/s/9r9kandkbahgexj/298512.sdf?dl=0
And a modified file with how the current script works looks like:
298512 from gamess16 based ATb pipeline
RDKit 3D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0
1 17 1 0
2 20 1 1
2 1 1 0
3 21 1 0
3 2 1 0
4 2 1 0
4 22 1 0
5 6 1 0
5 4 1 0
6 7 1 0
7 26 1 0
7 8 2 0
8 27 1 0
9 8 1 0
9 10 2 0
10 28 1 0
11 10 1 0
11 12 1 0
12 31 1 0
12 30 1 0
13 11 2 0
14 15 2 0
14 13 1 0
15 9 1 0
16 6 2 0
16 15 1 0
18 1 1 0
23 4 1 0
24 5 1 0
25 5 1 0
29 12 1 0
32 13 1 0
33 14 1 0
34 16 1 0
M END
> <molid> (1)
298512
> <CHEMBLID> (1)
CHEMBL3278713
> <i_user_TOTAL_CHARGE> (1)
0
> <SMILES> (1)
'[H]OC([H])(C([H])([H])[H])C([H])([H])C([H])([H])C1C([H])=C([H])C2=C([H])C(=C([H])C([H])=C2C=1[H])C([H])([H])[H]'
> <InChI> (1)
'InChI=1S/C15H18O/c1-11-3-7-15-10-13(5-4-12(2)16)6-8-14(15)9-11/h3,6-10,12,16H,4-5H2,1-2H3/t12-/m0/s1'
$$$$
https://www.dropbox.com/s/dfcmiv7d298s1fl/298512.chembl.sdf?dl=0
import os, shutil, csv
import argparse
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw

parser = argparse.ArgumentParser()
parser.add_argument("-molid", help="molids from file names", type=str)
args = parser.parse_args()
print(args)
fn = args.molid
print(fn)

suppl = Chem.SDMolSupplier(fn, removeHs=False, sanitize=False)
ms = [x for x in suppl if x is not None]  # sanity check to make sure the files were read
print("This is the number of entries read in")
print(len(ms))
print(len(suppl))

# writes a new <molid>.chembl.sdf file with the chemblid included
new_files_with_chembl_id = os.path.splitext(fn)[0]
w = Chem.SDWriter(new_files_with_chembl_id + '.chembl.sdf')

molid_chemblid = open('output_2.csv', 'r')
csv_f = csv.reader(molid_chemblid)
header = next(csv_f)
molidIndex = header.index("molid")
chemblidIndex = header.index("chembl")
molid_chemblidList = []
for line in csv_f:
    molid = line[molidIndex]
    chembl = line[chemblidIndex]
    molid_chemblidList.append([molid, chembl])

for m in suppl:  # molecule in MoleculeSet
    atbname = m.GetProp("_Name")
    fillmein = atbname.split()[0]
    moleculeCharge = Chem.GetFormalCharge(m)
    smiles_string = Chem.MolToSmiles(m)
    inchi_string = Chem.MolToInchi(m)
    print("molecularCharge", moleculeCharge)
    print("smiles_string", smiles_string)
    print("inchi_string", inchi_string)
    for line in molid_chemblidList:
        molid, chembl = line
        if fillmein == molid:
            print("this is line in molid_chemblid", line)
            m.SetProp("CHEMBLID", chembl)
            m.SetProp("i_user_TOTAL_CHARGE", repr(moleculeCharge))
            m.SetProp("SMILES", repr(smiles_string))
            m.SetProp("InChI", repr(inchi_string))
            w.write(m)
The molid in the CSV file sounds like a unique key. Read the CSV file into a map/associative array where the molid is the key and the rest of the row is the value, parsed or not as you need it. Python has a built-in CSV parser (import csv).
Then loop just once over the files, finding the chemblid by looking up the molid from the file name in the map.
This reduces the overall effort to roughly k*N, where N is the number of files and molids and k is a small constant.
Your current algorithm has a loop within a loop, which makes it N*N complexity. That would indeed take some time with N = 2 million :-)
2 million files is still a lot and may take between a few hours and a few days, depending on how big the files are, how fast your disk access is, and so on. Running a few threads in parallel will then help, until the I/O saturates. But test first, since implementing a parallel approach can get complicated. If you only have to get through this once, it might be OK to wait a bit longer.
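A minimal sketch of that lookup-table idea, using the CSV layout shown above (the helper names are mine, and the RDKit property-writing part is elided):

```python
import csv
import os

def load_lookup(csv_path):
    # one pass over the CSV: molid -> chemblid dict for O(1) lookups
    with open(csv_path, newline="") as f:
        return {row["molid"]: row["chembl"] for row in csv.DictReader(f)}

def molid_from_path(path):
    # "some/dir/298512.sdf" -> "298512"
    return os.path.splitext(os.path.basename(path))[0]

# then a single loop over the files instead of files x CSV rows:
# lookup = load_lookup("output_2.csv")
# for path in sdf_paths:
#     chemblid = lookup.get(molid_from_path(path))
#     if chemblid is not None:
#         ...  # set CHEMBLID with RDKit and write the new file
```

Building the dict is O(N) once, and each file lookup is O(1), instead of re-scanning the CSV list for every molecule.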
I want to convert a dataframe to a 3D np.array.
I have tried df = df.as_matrix(), but that gives a 2D matrix.
The Dataframe is df:
days 0 1 2 3 4 5 6 7 8 9 ... 20 21 \
enrollment_id event ...
1 access 0 0 3 0 0 0 0 8 0 4 ... 20 0
discussion 0 0 0 0 0 0 0 0 0 0 ... 0 0
navigate 0 0 1 0 0 0 0 4 0 1 ... 0 0
page_close 0 0 1 0 0 0 0 6 0 2 ... 17 0
problem 0 0 8 0 0 0 0 6 0 0 ... 0 0
video 0 0 0 0 0 0 0 0 0 0 ... 14 0
wiki 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 access 7 0 0 0 2 0 0 11 0 0 ... 0 0
discussion 0 0 0 0 0 0 0 0 0 0 ... 0 0
navigate 4 0 0 0 1 0 0 1 0 0 ... 0 0
page_close 2 0 0 0 0 0 0 5 0 0 ... 0 0
problem 14 0 0 0 0 0 0 13 0 0 ... 0 0
video 1 0 0 0 0 0 0 0 0 0 ... 0 0
wiki 0 0 0 0 0 0 0 0 0 0 ... 0 0
As an array is just the bare values of the dataframe, simply do
arr = df.values
(or the equivalent df.to_numpy() in current pandas, where as_matrix has been removed). If the shape is not what you want, you can reshape it with NumPy's reshape method/function.
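For the MultiIndexed frame above, the rows are ordered (enrollment_id, event), so a plain reshape recovers the third axis. A toy sketch with made-up numbers and a reduced day count:

```python
import numpy as np
import pandas as pd

# small frame mimicking the (enrollment_id, event) x days layout above
ids, events = [1, 3], ["access", "video", "wiki"]
days = list(range(4))
idx = pd.MultiIndex.from_product([ids, events],
                                 names=["enrollment_id", "event"])
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=days)

# 2-D values reshaped to (n_enrollments, n_events, n_days)
arr = df.to_numpy().reshape(len(ids), len(events), len(days))
```

This assumes every enrollment_id has the same set of event rows in the same order; otherwise the frame should be reindexed first.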