I have the following data frame:
Company Firm
125911 1
125911 2
32679 3
32679 5
32679 5
32679 8
32679 10
32679 12
43805 14
67734 8
67734 9
67734 10
67734 10
67734 11
67734 12
67734 13
74240 4
74240 6
74240 7
Where basically the firm makes an investment into the company at a specific year which in this case is the same year for all companies. What I want to do in python is to create a simple adjacency matrix with only 0's and 1's. 1 if two firms has made an investment into the same company. So even if firm 10 and 8 for example have invested in two different firms at the same it will still be a 1.
The resulting matrix I am looking for looks like:
Firm 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I have seen similar questions where you can use crosstab, however in that case each company will only have one row with all the firms in different columns instead. So I am wondering what the best and most efficient way to tackle this specific problem is? Any help is greatly appreciated.
dfs = []
for s in df.groupby("Company").agg(list).values:
dfs.append(pd.DataFrame(index=set(s[0]), columns=set(s[0])).fillna(1))
out = pd.concat(dfs).groupby(level=0).sum().gt(0).astype(int)
np.fill_diagonal(out.values, 0)
print(out)
Prints:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 1 0 1 0 0
4 0 0 0 0 0 1 1 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 1 0 1 0 0
6 0 0 0 1 0 0 1 0 0 0 0 0 0 0
7 0 0 0 1 0 1 0 0 0 0 0 0 0 0
8 0 0 1 0 1 0 0 0 1 1 1 1 1 0
9 0 0 0 0 0 0 0 1 0 1 1 1 1 0
10 0 0 1 0 1 0 0 1 1 0 1 1 1 0
11 0 0 0 0 0 0 0 1 1 1 0 1 1 0
12 0 0 1 0 1 0 0 1 1 1 1 0 1 0
13 0 0 0 0 0 0 0 1 1 1 1 1 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dfm = df.merge(df, on="Company").query("Firm_x != Firm_y")
out = pd.crosstab(dfm['Firm_x'], dfm['Firm_y'])
>>> out
Firm_y 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Firm_x
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 4 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 2 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 5 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 1 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 2 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 1 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 1
I have a DataFrame, lets say 3000x3000 with int values from 0 to 10 and I want to break it down into categories and save into separate files.
Categories should be something like 0-3, 4-5, 5-10 for example.
As a result I want to get 3 files of the same shape but only with relevant values per category and these values should stay at the original positions.
At first I thought to copy df for each category and use replace to remove all irrelevant values, but it doesn't sound right.
Hope this is not very confusing.
df example:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 7 0
2 0 0 2 3 0 0 0 0 6 7
3 0 0 2 3 0 0 0 0 9 6
4 0 0 0 1 0 0 5 4 8 7
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
as the result I want 3 dataframes:
cat1:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
cat2:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 5 4 0 0
5 0 0 0 0 0 0 5 4 0 0
6 0 0 0 0 0 0 4 5 0 0
7 0 0 0 0 0 0 4 4 0 0
8 0 0 0 0 0 0 0 4 0 0
9 0 0 0 0 0 0 0 0 0 0
cat3:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 7 0
2 0 0 0 0 0 0 0 0 6 7
3 0 0 0 0 0 0 0 0 9 6
4 0 0 0 0 0 0 0 0 8 7
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You want where
df1 = df.where((df > 0) & (df <=3), 0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
2 0 0 2 3 0 0 0 0 0 0
3 0 0 2 3 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
You can write similar logic for df2 and df3
I am new to the python and pandas. Here, I have a following dataframe .
did features offset word JAPE_feature manual_feature
0 200 0 aa 200 200
0 200 11 bf 200 200
0 200 12 vf 100 100
0 100 13 rw 2200 2200
0 100 14 asd 2600 100
0 2200 16 dsdd 2200 2200
0 2600 18 wd 2200 2600
0 2600 20 wsw 2600 2600
0 4600 21 sd 4600 4600
Now , I have an array which has all the feature values which can appear for that id.
feat = [100,200,2200,2600,156,162,4600,100]
Now, I am trying to create a dataframe whic will look like,
id Features
100 200 2200 2600 156 162 4600 100
0 0 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 1 0
so, while doing comparision ,
feature_manual
1
1
0
0
1
1
1
1
1
Here compairing the features and the manual_feature columns. if values are same then 1 or else 0. so 200 and 200 for 0 is same in both so 1
So, this is the expected output. Here I am trying to add the value 1 for that feature in the new csv and for other 0.
So, it is by row by row.
So, If we check in the first row the feature is 200 so there is 1 at 200 and others are 0.
can any one help me with this ?
what I tried is
mux = pd.MultiIndex.from_product([['features'],feat)
df = pd.DataFrame(data, columns=mux)
SO, Here creatig subcolumns but removing all other values . can any one help me ?
Use get_dummies with DataFrame.reindex:
feat = [100,200,2200,2600,156,162,4600,100]
df = df.join(pd.get_dummies(df.pop('features')).reindex(feat, axis=1, fill_value=0))
print (df)
id 100 200 2200 2600 156 162 4600 100
0 0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0
3 4 1 0 0 0 0 0 0 1
4 5 1 0 0 0 0 0 0 1
5 7 0 0 1 0 0 0 0 0
6 8 0 0 0 1 0 0 0 0
7 9 0 0 0 1 0 0 0 0
8 10 0 0 0 0 0 0 1 0
If need MultiIndex only pass mux to reindex, but also convert id column to index:
feat = [100,200,2200,2600,156,162,4600,100]
mux = pd.MultiIndex.from_product([['features'],feat])
df = pd.get_dummies(df.set_index('id')['features']).reindex(mux, axis=1, fill_value=0)
print (df)
features
100 200 2200 2600 156 162 4600 100
id
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
EDIT:
cols = ['features', 'JAPE_feature', 'manual_feature']
df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_',1, expand=True)
print (df)
did offset word features JAPE_feature \
NaN NaN NaN 100 200 2200 2600 4600 100 200 2200 2600
0 0 0 aa 0 1 0 0 0 0 1 0 0
1 0 11 bf 0 1 0 0 0 0 1 0 0
2 0 12 vf 0 1 0 0 0 1 0 0 0
3 0 13 rw 1 0 0 0 0 0 0 1 0
4 0 14 asd 1 0 0 0 0 0 0 0 1
5 0 16 dsdd 0 0 1 0 0 0 0 1 0
6 0 18 wd 0 0 0 1 0 0 0 1 0
7 0 20 wsw 0 0 0 1 0 0 0 0 1
8 0 21 sd 0 0 0 0 1 0 0 0 0
manual_feature
4600 100 200 2200 2600 4600
0 0 0 1 0 0 0
1 0 0 1 0 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 1 0 0 0 0
5 0 0 0 1 0 0
6 0 0 0 0 1 0
7 0 0 0 0 1 0
8 1 0 0 0 0 1
If want avoid missing values in MultIndex in columns for columns with no MultiIndex:
cols = ['features', 'JAPE_feature', 'manual_feature']
df = df.set_index(df.columns.difference(cols).tolist())
df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_',1, expand=True)
print (df)
features JAPE_feature \
100 200 2200 2600 4600 100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0 0 1 0 0 0
11 bf 0 1 0 0 0 0 1 0 0 0
12 vf 0 1 0 0 0 1 0 0 0 0
13 rw 1 0 0 0 0 0 0 1 0 0
14 asd 1 0 0 0 0 0 0 0 1 0
16 dsdd 0 0 1 0 0 0 0 1 0 0
18 wd 0 0 0 1 0 0 0 1 0 0
20 wsw 0 0 0 1 0 0 0 0 1 0
21 sd 0 0 0 0 1 0 0 0 0 1
manual_feature
100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0
11 bf 0 1 0 0 0
12 vf 1 0 0 0 0
13 rw 0 0 1 0 0
14 asd 1 0 0 0 0
16 dsdd 0 0 1 0 0
18 wd 0 0 0 1 0
20 wsw 0 0 0 1 0
21 sd 0 0 0 0 1
EDIT:
If want compare some column from list by manual_feature column use DataFrame.eq with converting to integers:
cols = ['JAPE_feature', 'features']
df1 = df[cols].eq(df['manual_feature'], axis=0).astype(int)
print (df1)
JAPE_feature features
0 1 1
1 1 1
2 1 0
3 1 0
4 0 1
5 1 1
6 0 1
7 1 1
8 1 1
Less fancy solution, but maybe easier to understand:
First of all put the features that will decide which feature you choose on each row in a list called for example list_features.
Then:
# List all the features possible and create an empty df
feat = [100,200,2200,2600,156,162,4600,100]
df_final= pd.DataFrame({x:[] for x in feat})
# Fill the df little by little
for x in list_features:
df_final = df_final.append({y:1 if x==y else 0 for y in feat }, ignore_index=True)
These types of problems can be solved in many ways. But here I am using simple way to solve it. Creating df with those features list as a column names and the using some comparison logic to update df with 0 and 1. You can use some other logic to avoid use of for loops.
import pandas as pd
data = {'id':[0,1,2,3,4,5,7,8,9,10],
'features':[200, 200, 200, 200, 100, 100, 2200, 2600, 2600, 4600]}
df1 = pd.DataFrame(data)
features_list = [100,200,2200,2600,156,162,4600]
id_list = df1.id.to_list()
df2 = pd.DataFrame(columns=features_list)
list2 = list()
for i in id_list:
list1 = list()
for k in df2.columns:
if df1[df1.id == i].features.iloc[0] == k:
list1.append(1)
else:
list1.append(0)
list2.append(list1)
for i in range (0,len(list2)):
df2.loc[i] = list2[i]
df2.insert(0, "id", id_list)
>>>(df2)
id 100 200 2200 2600 156 162 4600
0 0 0 1 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 0 0 0 0
4 4 1 0 0 0 0 0 0
5 5 1 0 0 0 0 0 0
6 7 0 0 1 0 0 0 0
7 8 0 0 0 1 0 0 0
8 9 0 0 0 1 0 0 0
9 10 0 0 0 0 0 0 1
Here is my current dataframe named out
Date David_Added David_Removed Malik_Added Malik_Removed Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed
02/19/2019 3 1 39 41 1 6 14 24
02/18/2019 0 0 8 6 0 3 0 0
02/16/2019 0 0 0 0 0 0 0 0
02/15/2019 0 0 0 0 0 0 0 0
02/14/2019 0 0 0 0 0 0 0 0
02/13/2019 0 0 0 0 0 0 0 0
02/12/2019 0 0 0 0 0 0 0 0
02/11/2019 0 0 0 0 0 0 0 0
02/08/2019 0 0 0 0 0 0 0 0
02/07/2019 0 0 0 0 0 0 0 0
I need to sum every persons data by date obviously skipping the Date column. I would like the total to be the column next to the columns summed. "User_Add, User_Removed, User_Total" as shown below. My issue I face is that the prefix names won't always be the same, and the total amount of users changes.
My thought process would be count the total columns. Then loop through them doing the math, and dumping the results to a new column for every user. Then sort the columns alphabetically so they are grouped together.
something along the line of
loops = out.shape[1]
while loop < loops:
out['User_Total'] = out['User_Added']+out['User_Removed']
loop += 1
out.sort_index(axis=1, inplace=True)
However I'm not sure how to call an entire column by index, or if this is even a good way to handle it.
Here is what I'd like the output to look like.
Date David_Added David_Removed David_Total Malik_Added Malik_Removed Malik_Total Meghan_Added Meghan_Removed Meghan_Total Sucely_Added Sucely_Removed Sucely_Total
2/19/2019 3 1 4 39 41 80 1 6 7 14 24 38
2/18/2019 0 0 0 8 6 14 0 3 3 0 0 0
2/16/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/15/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/14/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/13/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/12/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/11/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/8/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/7/2019 0 0 0 0 0 0 0 0 0 0 0 0
Any help is much appreciated!
Using groupby with columns split
s=df.groupby(df.columns.str.split('_').str[0],axis=1).sum().drop('Date',1).add_suffix('_Total')
yourdf=pd.concat([df,s],1).sort_index(level=0,axis=1)
yourdf
Out[455]:
Date David_Added ... Sucely_Removed Sucely_Total
0 02/19/2019 3 ... 24 38
1 02/18/2019 0 ... 0 0
2 02/16/2019 0 ... 0 0
3 02/15/2019 0 ... 0 0
4 02/14/2019 0 ... 0 0
5 02/13/2019 0 ... 0 0
6 02/12/2019 0 ... 0 0
7 02/11/2019 0 ... 0 0
8 02/08/2019 0 ... 0 0
9 02/07/2019 0 ... 0 0
[10 rows x 13 columns]
Alternatively:
df.join(df.T.groupby(df.T.index.str.split("_").str[0]).sum().T.iloc[:,1:].add_suffix('_Total'))
Date David_Added David_Removed Malik_Added Malik_Removed \
0 02/19/2019 3 1 39 41
1 02/18/2019 0 0 8 6
2 02/16/2019 0 0 0 0
3 02/15/2019 0 0 0 0
4 02/14/2019 0 0 0 0
5 02/13/2019 0 0 0 0
6 02/12/2019 0 0 0 0
7 02/11/2019 0 0 0 0
8 02/08/2019 0 0 0 0
9 02/07/2019 0 0 0 0
Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed David_Total \
0 1 6 14 24 4
1 0 3 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
Malik_Total Meghan_Total Sucely_Total
0 80 7 38
1 14 3 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
I'm aware my this is not an answer for the question the OP posed, it is an advice on better practices that would solve the problem he is facing.
You have a structural problem. Having your dataframe modeled as such:
Date User_Name User_Added User_Removed User_Total
would make the code you've entered the solution to your problem, besides handling the variable number of users.
I want to convert a dataframe to a 3D np.array
I have tried df = df.as_matrix(). But it is a 2D matrix.
The Dataframe is df:
days 0 1 2 3 4 5 6 7 8 9 ... 20 21 \
enrollment_id event ...
1 access 0 0 3 0 0 0 0 8 0 4 ... 20 0
discussion 0 0 0 0 0 0 0 0 0 0 ... 0 0
navigate 0 0 1 0 0 0 0 4 0 1 ... 0 0
page_close 0 0 1 0 0 0 0 6 0 2 ... 17 0
problem 0 0 8 0 0 0 0 6 0 0 ... 0 0
video 0 0 0 0 0 0 0 0 0 0 ... 14 0
wiki 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 access 7 0 0 0 2 0 0 11 0 0 ... 0 0
discussion 0 0 0 0 0 0 0 0 0 0 ... 0 0
navigate 4 0 0 0 1 0 0 1 0 0 ... 0 0
page_close 2 0 0 0 0 0 0 5 0 0 ... 0 0
problem 14 0 0 0 0 0 0 13 0 0 ... 0 0
video 1 0 0 0 0 0 0 0 0 0 ... 0 0
wiki 0 0 0 0 0 0 0 0 0 0 ... 0 0
As an array is just the bare values of the dataframe, simply do
arr = df.values
If the shape is not what you want, you can play around with the NumPy reshape method/function.