Python: Creating an adjacency matrix from a dataframe

I have the following data frame:
Company Firm
125911 1
125911 2
32679 3
32679 5
32679 5
32679 8
32679 10
32679 12
43805 14
67734 8
67734 9
67734 10
67734 10
67734 11
67734 12
67734 13
74240 4
74240 6
74240 7
Each row records a firm making an investment in a company in a specific year, which in this case is the same year for all companies. What I want to do in Python is create a simple adjacency matrix with only 0s and 1s: 1 if two firms have invested in the same company. The matrix is binary, so even if two firms (say, firms 8 and 10) have co-invested in more than one company, the entry is still 1, not a count.
The resulting matrix I am looking for looks like:
Firm 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I have seen similar questions where you can use crosstab; however, in those cases each company ends up as a single row with the firms spread across columns instead. So I am wondering what the best and most efficient way to tackle this specific problem is. Any help is greatly appreciated.
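For reference, a frame matching the sample data above can be built like this (the answers below assume it is named df):
import pandas as pd

df = pd.DataFrame({
    "Company": [125911, 125911, 32679, 32679, 32679, 32679, 32679, 32679,
                43805, 67734, 67734, 67734, 67734, 67734, 67734, 67734,
                74240, 74240, 74240],
    "Firm": [1, 2, 3, 5, 5, 8, 10, 12, 14,
             8, 9, 10, 10, 11, 12, 13, 4, 6, 7],
})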

One approach is to build, for each company, a small all-ones block indexed by the firms that invested in it, then overlay the blocks:
import numpy as np
import pandas as pd

dfs = []
for s in df.groupby("Company").agg(list).values:
    # one all-ones block per company, indexed and keyed by its investor firms
    dfs.append(pd.DataFrame(index=set(s[0]), columns=set(s[0])).fillna(1))
# overlay the blocks: any positive sum means the two firms co-invested somewhere
out = pd.concat(dfs).groupby(level=0).sum().gt(0).astype(int)
np.fill_diagonal(out.values, 0)  # a firm is not adjacent to itself
print(out)
Prints:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0

An alternative is to self-merge on Company, which pairs every firm with every other firm that invested in the same company, and then count the pairs with crosstab:
dfm = df.merge(df, on="Company").query("Firm_x != Firm_y")
out = pd.crosstab(dfm['Firm_x'], dfm['Firm_y'])
The raw crosstab holds co-investment counts rather than flags (for example, 3 for firms 8 and 10, which share company 32679 once and company 67734 twice, since firm 10 appears there twice), and firms that never co-invest (firm 14 here) drop out of the merge entirely. Binarizing and reindexing over all firms recovers the expected matrix above:
firms = sorted(df['Firm'].unique())
out = out.gt(0).astype(int).reindex(index=firms, columns=firms, fill_value=0)

Related

Generate an array with 0 and 1

I have two columns in a data frame (Startpoint and endpoint).
I would like to generate an array with 1 for the duration between the two points and 0 otherwise.
For example, with a total of 200 increments:
df = pd.DataFrame({'Startpoint': [100, 50, 40, 75, 52, 43, 90, 48, 56, 20], 'endpoint': [150, 70, 80, 90, 140, 160, 170, 120, 135, 170]})
df
I want to generate a 0/1 array per row: 1 if the position falls in the range between Startpoint and endpoint (between 100 and 150 for the first row, for example) and 0 otherwise.
Thank you,
You can use numpy broadcasting to create the desired array in a vectorized way:
rng = np.arange(200)
# compare each (10, 1) column vector against the (200,) range;
# broadcasting yields a (10, 200) boolean grid, which is cast to 0/1
out = ((df['Startpoint'].to_numpy()[:, None] <= rng) & (rng < df['endpoint'].to_numpy()[:, None])).astype(int)
Output:
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
To see that it's indeed the desired output, we check dimension and number of 1s in each row:
>>> out.shape
(10, 200)
>>> out.sum(axis=1)
array([ 50, 20, 40, 15, 88, 117, 80, 72, 79, 150])
Try this:
import numpy as np
import pandas as pd

increment = 200
array = np.zeros((10, increment), dtype=int)
for i in range(len(df)):
    # set the half-open interval [Startpoint, endpoint) of row i to 1
    array[i, df['Startpoint'][i]:df['endpoint'][i]] = 1
Output (note: each row is printed in full below; this is a single array of shape (10, 200), not 10 separate arrays):
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
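Both snippets produce the same (10, 200) array; a quick cross-check, assuming out from the first answer and array from the second are both in scope:
(out == array).all()  # True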

Copy Pandas DataFrame into multiple files by Value Range

I have a DataFrame, let's say 3000x3000 with int values from 0 to 10, and I want to break it down into categories and save each category into a separate file.
Categories should be something like 0-3, 4-5, 6-10, for example.
As a result I want to get 3 files of the same shape, but each containing only the values relevant to its category, and these values should stay at their original positions.
At first I thought to copy the df for each category and use replace to remove all irrelevant values, but that doesn't sound right.
Hope this is not very confusing.
df example:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 7 0
2 0 0 2 3 0 0 0 0 6 7
3 0 0 2 3 0 0 0 0 9 6
4 0 0 0 1 0 0 5 4 8 7
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
as the result I want 3 dataframes:
cat1:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
cat2:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 5 4 0 0
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
cat3:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 7 0
2 0 0 0 0 0 0 0 0 6 7
3 0 0 0 0 0 0 0 0 9 6
4 0 0 0 0 0 0 0 0 8 7
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You want DataFrame.where:
df1 = df.where((df > 0) & (df <= 3), 0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You can write similar logic for df2 and df3
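A compact sketch of that pattern for all three categories, writing each masked frame to its own file (the bounds and file names here are illustrative):
categories = {'cat1': (0, 3), 'cat2': (3, 5), 'cat3': (5, 10)}
for name, (lo, hi) in categories.items():
    masked = df.where((df > lo) & (df <= hi), 0)  # keep in-range values, zero out the rest
    masked.to_csv(f'{name}.csv')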

Creating week flags from DOW

I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the day of the week. Now I want to create this dataframe:
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
Depending on the DOW column: if it is 1, then MON_FLAG will be 1; if it is 2, then TUE_FLAG will be 1; and so on. I have kept Sunday as 0, which is why all the flag columns are zero in that case.
Use get_dummies and rename the columns with a dictionary:
d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG',
     3: 'WED_FLAG', 4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print(df)
print (df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0
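Note: recent pandas versions return boolean dummy columns from get_dummies, so to reproduce the 0/1 output above an explicit cast may be needed:
df = df.join(pd.get_dummies(df['DOW']).astype(int).rename(columns=d))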

Pandas sum every other column by index where names, and index size changes

Here is my current dataframe, named out:
Date David_Added David_Removed Malik_Added Malik_Removed Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed
02/19/2019 3 1 39 41 1 6 14 24
02/18/2019 0 0 8 6 0 3 0 0
02/16/2019 0 0 0 0 0 0 0 0
02/15/2019 0 0 0 0 0 0 0 0
02/14/2019 0 0 0 0 0 0 0 0
02/13/2019 0 0 0 0 0 0 0 0
02/12/2019 0 0 0 0 0 0 0 0
02/11/2019 0 0 0 0 0 0 0 0
02/08/2019 0 0 0 0 0 0 0 0
02/07/2019 0 0 0 0 0 0 0 0
I need to sum each person's data by date, obviously skipping the Date column. I would like each total to sit in a column next to the columns being summed, as in "User_Added, User_Removed, User_Total" shown below. The issue I face is that the name prefixes won't always be the same, and the total number of users changes.
My thought process would be to count the total columns, then loop through them doing the math and dumping the results into a new column for every user, and finally sort the columns alphabetically so they are grouped together.
Something along the lines of:
loop = 0
loops = out.shape[1]
while loop < loops:
    out['User_Total'] = out['User_Added'] + out['User_Removed']
    loop += 1
out.sort_index(axis=1, inplace=True)
However, I'm not sure how to call an entire column by index, or whether this is even a good way to handle it.
Here is what I'd like the output to look like.
Date David_Added David_Removed David_Total Malik_Added Malik_Removed Malik_Total Meghan_Added Meghan_Removed Meghan_Total Sucely_Added Sucely_Removed Sucely_Total
2/19/2019 3 1 4 39 41 80 1 6 7 14 24 38
2/18/2019 0 0 0 8 6 14 0 3 3 0 0 0
2/16/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/15/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/14/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/13/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/12/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/11/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/8/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/7/2019 0 0 0 0 0 0 0 0 0 0 0 0
Any help is much appreciated!
Using groupby on axis=1 with the column names split on '_':
# group columns by their name prefix, sum each group, then drop the Date group
s = df.groupby(df.columns.str.split('_').str[0], axis=1).sum().drop('Date', 1).add_suffix('_Total')
yourdf = pd.concat([df, s], 1).sort_index(level=0, axis=1)
yourdf
Out[455]:
Date David_Added ... Sucely_Removed Sucely_Total
0 02/19/2019 3 ... 24 38
1 02/18/2019 0 ... 0 0
2 02/16/2019 0 ... 0 0
3 02/15/2019 0 ... 0 0
4 02/14/2019 0 ... 0 0
5 02/13/2019 0 ... 0 0
6 02/12/2019 0 ... 0 0
7 02/11/2019 0 ... 0 0
8 02/08/2019 0 ... 0 0
9 02/07/2019 0 ... 0 0
[10 rows x 13 columns]
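In newer pandas (2.x), groupby(..., axis=1) and the positional axis argument to drop are deprecated or removed; a sketch of the same grouping done on the transpose instead:
s = (df.drop(columns='Date')
       .T.groupby(lambda c: c.split('_')[0]).sum()
       .T.add_suffix('_Total'))
yourdf = pd.concat([df, s], axis=1).sort_index(axis=1)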
Alternatively:
df.join(df.T.groupby(df.T.index.str.split("_").str[0]).sum().T.iloc[:,1:].add_suffix('_Total'))
Date David_Added David_Removed Malik_Added Malik_Removed \
0 02/19/2019 3 1 39 41
1 02/18/2019 0 0 8 6
2 02/16/2019 0 0 0 0
3 02/15/2019 0 0 0 0
4 02/14/2019 0 0 0 0
5 02/13/2019 0 0 0 0
6 02/12/2019 0 0 0 0
7 02/11/2019 0 0 0 0
8 02/08/2019 0 0 0 0
9 02/07/2019 0 0 0 0
Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed David_Total \
0 1 6 14 24 4
1 0 3 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
Malik_Total Meghan_Total Sucely_Total
0 80 7 38
1 14 3 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
I'm aware this is not an answer to the question as the OP posed it; it is advice on a better practice that would solve the problem he is facing.
You have a structural problem. Modeling your dataframe as:
Date User_Name User_Added User_Removed User_Total
would make the code you've written the solution to your problem, besides handling the variable number of users.
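A rough sketch of that long-format reshape, assuming the column layout from the example above (intermediate names are illustrative):
long = df.melt(id_vars='Date', var_name='col', value_name='count')
long[['User_Name', 'action']] = long['col'].str.split('_', expand=True)
tidy = (long.pivot_table(index=['Date', 'User_Name'], columns='action', values='count')
            .reset_index())
tidy['Total'] = tidy['Added'] + tidy['Removed']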

How can I optimize this script so it does not take a week to finish the task it is doing? (Used BASH PARALLEL too.)

I have a directory full of 60,000 files that are named by their molid. I have a second file in CSV format that has molids in column 1 and their respective CHEMBLIDs in column 2. I need to match each file-name molid in the directory against a molid in the CSV file. If a match is found, the chemblid is added to the file (the file is rewritten to include the chemblid). I am also using RDKit to calculate some properties that need to be written to the modified file too. I need to find a way to optimize this, since I will soon have to run it on 2 million files.
The way I am feeding argparse is by listing all of the molid.sdf files in my directory with GNU Parallel from bash.
The csv file looks like:
molid,chembl
319855,CHEMBLtest
187481,CHEMBL1527718
https://www.dropbox.com/s/6ynd9vbwwf6lqka/output_2.csv?dl=0
The file that needs to be modified looks like:
298512 from gamess16 based ATb pipeline
OpenBabel06141815083D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0 0 0 0
1 17 1 0 0 0 0
2 20 1 1 0 0 0
2 1 1 0 0 0 0
3 21 1 0 0 0 0
3 2 1 0 0 0 0
4 2 1 0 0 0 0
4 22 1 0 0 0 0
5 6 1 0 0 0 0
5 4 1 0 0 0 0
6 7 1 0 0 0 0
7 26 1 0 0 0 0
7 8 2 0 0 0 0
8 27 1 0 0 0 0
9 8 1 0 0 0 0
9 10 2 0 0 0 0
10 28 1 0 0 0 0
11 10 1 0 0 0 0
11 12 1 0 0 0 0
12 31 1 0 0 0 0
12 30 1 0 0 0 0
13 11 2 0 0 0 0
14 15 2 0 0 0 0
14 13 1 0 0 0 0
15 9 1 0 0 0 0
16 6 2 0 0 0 0
16 15 1 0 0 0 0
18 1 1 0 0 0 0
23 4 1 0 0 0 0
24 5 1 0 0 0 0
25 5 1 0 0 0 0
29 12 1 0 0 0 0
32 13 1 0 0 0 0
33 14 1 0 0 0 0
34 16 1 0 0 0 0
M END
> <molid>
298512
$$$$
https://www.dropbox.com/s/9r9kandkbahgexj/298512.sdf?dl=0
And a file modified by the current script looks like:
298512 from gamess16 based ATb pipeline
RDKit 3D
34 35 0 0 1 0 0 0 0 0999 V2000
4.3885 -1.0129 1.6972 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3885 -0.7157 0.5784 C 0 0 2 0 0 0 0 0 0 0 0 0
3.6479 -1.5425 -0.5699 O 0 0 0 0 0 0 0 0 0 0 0 0
3.4599 0.7380 0.1087 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4770 1.0889 -1.0314 C 0 0 0 0 0 0 0 0 0 0 0 0
1.0165 0.9826 -0.6438 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3679 2.0729 0.0029 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.9531 1.9980 0.3853 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7151 0.8214 0.1489 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.0800 0.7051 0.5321 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.8067 -0.4453 0.2969 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2581 -0.5636 0.6988 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1581 -1.5376 -0.3496 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8397 -1.4605 -0.7357 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0762 -0.2830 -0.5007 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2871 -0.1675 -0.8844 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1834 -0.3978 2.5815 H 0 0 0 0 0 0 0 0 0 0 0 0
5.4123 -0.8100 1.3616 H 0 0 0 0 0 0 0 0 0 0 0 0
4.3301 -2.0654 2.0016 H 0 0 0 0 0 0 0 0 0 0 0 0
2.3709 -0.9175 0.9451 H 0 0 0 0 0 0 0 0 0 0 0 0
3.4809 -2.4622 -0.3076 H 0 0 0 0 0 0 0 0 0 0 0 0
3.2757 1.3897 0.9729 H 0 0 0 0 0 0 0 0 0 0 0 0
4.4830 0.9450 -0.2346 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6837 0.4273 -1.8785 H 0 0 0 0 0 0 0 0 0 0 0 0
2.6901 2.1132 -1.3637 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9314 2.9850 0.1903 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.4318 2.8450 0.8726 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.5539 1.5524 1.0253 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9075 -0.6633 -0.1810 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.4288 -1.4505 1.3221 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.5904 0.3146 1.2616 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.7228 -2.4486 -0.5381 H 0 0 0 0 0 0 0 0 0 0 0 0
-1.3620 -2.3059 -1.2268 H 0 0 0 0 0 0 0 0 0 0 0 0
0.7671 -1.0133 -1.3738 H 0 0 0 0 0 0 0 0 0 0 0 0
1 19 1 0
1 17 1 0
2 20 1 1
2 1 1 0
3 21 1 0
3 2 1 0
4 2 1 0
4 22 1 0
5 6 1 0
5 4 1 0
6 7 1 0
7 26 1 0
7 8 2 0
8 27 1 0
9 8 1 0
9 10 2 0
10 28 1 0
11 10 1 0
11 12 1 0
12 31 1 0
12 30 1 0
13 11 2 0
14 15 2 0
14 13 1 0
15 9 1 0
16 6 2 0
16 15 1 0
18 1 1 0
23 4 1 0
24 5 1 0
25 5 1 0
29 12 1 0
32 13 1 0
33 14 1 0
34 16 1 0
M END
> <molid> (1)
298512
> <CHEMBLID> (1)
CHEMBL3278713
> <i_user_TOTAL_CHARGE> (1)
0
> <SMILES> (1)
'[H]OC([H])(C([H])([H])[H])C([H])([H])C([H])([H])C1C([H])=C([H])C2=C([H])C(=C([H])C([H])=C2C=1[H])C([H])([H])[H]'
> <InChI> (1)
'InChI=1S/C15H18O/c1-11-3-7-15-10-13(5-4-12(2)16)6-8-14(15)9-11/h3,6-10,12,16H,4-5H2,1-2H3/t12-/m0/s1'
$$$$
https://www.dropbox.com/s/dfcmiv7d298s1fl/298512.chembl.sdf?dl=0
import os, shutil, csv
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-molid", help="molids from file names", type=str)
args = parser.parse_args()
print(args)
fn = args.molid
print(fn)

suppl = Chem.SDMolSupplier(fn, removeHs=False, sanitize=False)
ms = [x for x in suppl if x is not None]  # sanity check to make sure the files were read
print("This is the number of entries read in")
print(len(ms))
print(len(suppl))

w = Chem.SDWriter('totaltest_with_chembl.sdf')  # writes a new file with all of the chemblids
new_files_with_chembl_id = os.path.splitext(fn)[0]
w = Chem.SDWriter(new_files_with_chembl_id + '.chembl.sdf')

# read the whole molid -> chemblid CSV into a list
molid_chemblid = open('output_2.csv', 'r')
csv_f = csv.reader(molid_chemblid)
header = next(csv_f)
molidIndex = header.index("molid")
chemblidIndex = header.index("chembl")
molid_chemblidList = []
for line in csv_f:
    molid = line[molidIndex]
    chembl = line[chemblidIndex]
    molid_chemblidList.append([molid, chembl])

for m in suppl:  # molecule in MoleculeSet
    print(m)
    atbname = m.GetProp("_Name")
    fillmein = atbname.split()[0]
    moleculeCharge = Chem.GetFormalCharge(m)
    smiles_string = Chem.MolToSmiles(m)
    inchi_string = Chem.MolToInchi(m)
    print("molecularCharge")
    print(moleculeCharge)
    print("smile_string")
    print(smiles_string)
    print("inchi string")
    print(inchi_string)
    # inner scan over the whole CSV list for every molecule -- this is the slow part
    for line in molid_chemblidList:
        molid, chembl = line
        if fillmein == molid:
            print("this is line in molid_chemblid", line)
            m.SetProp("CHEMBLID", chembl)
            m.SetProp("i_user_TOTAL_CHARGE", repr(moleculeCharge))
            m.SetProp("SMILES", repr(smiles_string))
            m.SetProp("InChI", repr(inchi_string))
            w.write(m)
The molid in the CSV file sounds like a unique key. Read the CSV file into a map/associative array where the molid is the key and the rest of the row is the value, parsed or not as you need it. Python has built-in CSV parsing via import csv.
Then loop just once over the files and find the chemblid by looking up the molid from the file name in the map.
This reduces the overall effort to roughly k*N, where N is the number of files and molids and k is a small constant.
Your current algorithm has a loop within a loop, which makes it N*N complexity. That would indeed take some time with N = 2 million :-)
2 million files is still a lot and may take between a few hours and a few days, depending on how big the files are, how fast your disk access is, and so on. Running a few threads in parallel will help, until the I/O saturates. But test first, since implementing a parallel approach can get complicated. If you only have to get through this once, it might be fine to wait a bit longer.
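A minimal sketch of that lookup, reusing the CSV file and column names from the question (the surrounding molecule loop is the one already in the script):
import csv

# build the molid -> chemblid map once, before the molecule loop
with open('output_2.csv', newline='') as f:
    chembl_by_molid = {row['molid']: row['chembl'] for row in csv.DictReader(f)}

# then, per molecule, replace the inner scan with an O(1) lookup
chembl = chembl_by_molid.get(fillmein)
if chembl is not None:
    m.SetProp("CHEMBLID", chembl)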
