I have the following DataFrame with two columns:
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (counting the pairwise intersections):
A B C
A 0 1 1
B 1 0 2
C 1 2 0
I have the following code, but it doesn't work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can I obtain the adjacency matrix? Thanks.
Try a self-merge on the second column, then a crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0
Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
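For reference, the whole approach runs end to end on the sample data (with the two columns assumed to be named col1 and col2, as above):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'col1': list('AAABBCCC'),
                   'col2': ['x', 'y', 'z', 'x', 'w', 'x', 'w', 'i']})

# pair up rows that share a value in col2, drop self-pairs, then count
s = df.merge(df, on='col2').query('col1_x != col1_y')
adj = pd.crosstab(s['col1_x'], s['col1_y'])
print(adj)
```

Because the self-merge produces both (A, B) and (B, A) for every shared value, the resulting crosstab is symmetric, and the diagonal is zero since self-pairs were filtered out.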
Merge on Gene before the crosstab, then fill the diagonal with zeros:
import numpy as np

df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0
I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, so that each element of s becomes a column (named after that element) repeated down every row.
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
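A quick runnable check of this assignment on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(0, 5), 'y': range(1, 6)})
s = np.array(['a', 'b', 'c'])

# one new row-sized value per existing row; column names come from s itself
df[s] = [s] * len(df)
print(df)
```

Here `[s] * len(df)` builds a list of five identical arrays, which pandas broadcasts into three new columns named 'a', 'b', 'c'.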
Try this:
for i in s:
    df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
I have a distance matrix with IDs as column and row names:
A B C D
A 0 1 2 3
B 1 0 4 5
C 2 4 0 6
D 3 5 6 0
How to efficiently extract values from a large matrix, e.g. for IDs A and C to get this matrix:
A C
A 0 2
C 2 0
Edit: missing IDs in the matrix should be ignored.
Use DataFrame.loc to get values by labels:
vals = ['A','C']
df = df.loc[vals, vals]
print (df)
A C
A 0 2
C 2 0
EDIT: If some values don't match and you need to omit them, add Index.intersection:
vals = ['J','A','C']
new = df.columns.intersection(vals, sort=False)
df = df.loc[new, new]
print (df)
A C
A 0 2
C 2 0
I am trying to create a new column based on my first column. For example,
I have a list of a = ["A", "B", "C"] and existing dataframe
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add value to it base on W, B, H interval. So that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The real data is longer than this.
I have tried several approaches but couldn't come up with working code. I would be glad if someone could help and also explain the process. Thanks
Here is what you can do: use a loop to build a repetitive list for the new column, then assign it.
letters = ['A', 'B', 'C']
new_col = []
for i in range(len(dataframe['Race'])):
    new_col.append(letters[i // 3])  # one letter per block of 3 rows
dataframe['New Column'] = new_col
Maybe this works (note == for comparison, and avoid shadowing the built-in list):
letters = ['A', 'B', 'C']
i = -1
for idx, row in dataframe.iterrows():
    if row['Race'] == 'W':
        i += 1
    dataframe.loc[idx, 'new column'] = letters[i]
Also, if the new-column list is very long to type, you can build it from the alphabet:
letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
If your W, B, H rows always come in this exact order and in complete intervals, you can use np.repeat. As noted in the comments, np.repeat alone suffices.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
(df['Race'] == 'W') # True (1) for W; False (0) otherwise
.cumsum() # increments each time you hit True (1)
.map({1: 'A', 2: 'B', 3: 'C'}) # 1->A, 2->B, ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
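If the number of 'W' blocks isn't known up front, the same cumsum trick can index into the alphabet instead of a fixed map (a sketch; assumes at most 26 blocks):

```python
import pandas as pd
from string import ascii_uppercase

# the question's data
df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3,
                   'Boy':  [0, 1, 1, 1, 0, 0, 1, 1, 0],
                   'Girl': [1, 0, 1, 0, 0, 1, 0, 1, 1]})

# 0-based block number: increments at every 'W'
group = (df['Race'] == 'W').cumsum() - 1
df['New Col'] = group.map(lambda i: ascii_uppercase[i])
print(df)
```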
There are multiple ways to solve this problem statement. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
#setting up the DataFrame you referred to in the example
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H','W','B','H','W','B','H'],
                   'Boy':  [0,1,1,1,0,0,1,1,0],
                   'Girl': [1,0,1,0,0,1,0,1,1]})
#if you have 3 values to assign, create a list, say A, B, C
#by creating a list, you have to manage only the list and the frequency
a = ['A','B','C']
#iterate through the dataframe and assign the values in batches
for i, row in df.iterrows():      #the trick is to assign via loc[i]
    df.loc[i,'New'] = a[i // 3]   #i is the index; integer division by 3 distributes equally
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H']*17,
                   'Boy':  [0,1,1]*17,
                   'Girl': [1,0,1]*17})
#in the DataFrame, you can define the Boy and Girl values
#I think the Race values repeat, so I just repeated them 17 times
#define a variable from A through Z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for i, row in df.iterrows():
    df.loc[i,'New'] = a[i // 3]   #still dividing into equal batches of 3
print(df)
I didn't print for all 17 sets. I just did with 7 sets. It is still the same result.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
The old-fashioned Pythonic way: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    # assumes the column starts with 'W', as in the data
    abc = iter(abc)
    new_col = []
    for race in race_col:
        if race == 'W':              # advance to the next letter at each new 'W'
            abc_value = next(abc)
        new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
So far, I have this code that adds a row of zeros every other row (from this question):
import pandas as pd
import numpy as np
def Add_Zeros(df):
    zeros = np.where(np.empty_like(df.values), 0, 0)  # zero block with df's shape
    data = np.hstack([df.values, zeros]).reshape(-1, df.shape[1])
    df_ordered = pd.DataFrame(data, columns=df.columns)
    return df_ordered
Which results in the following data frame:
A B
0 a a
1 0 0
2 b b
3 0 0
4 c c
5 0 0
6 d d
But I need it to add the row of zeros every 2nd row instead, like this:
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I've tried altering the code, but each time, I get an error that says that zeros and df don't match in size.
I should also point out that I have a lot more rows and columns than I wrote here.
How can I do this?
Option 1
Using groupby
s = pd.Series(0, df.columns)
f = lambda d: d.append(s, ignore_index=True)
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
Option 2
from itertools import repeat, chain
v = df.values
z = [0] * df.shape[1]  # the zero row to interleave
pd.DataFrame(
    np.row_stack(list(chain(*zip(v[0::2], v[1::2], repeat(z))))),
    columns=df.columns
)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
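Both options above predate pandas 2.0, where the append used in Option 1 was removed; a version-independent sketch using plain NumPy reshaping (assumes the row count is divisible by 2):

```python
import numpy as np
import pandas as pd

# illustrative frame matching the question's shape
df = pd.DataFrame({'A': list('abcd'), 'B': list('abcd')})

v = df.to_numpy()
pairs = v.reshape(-1, 2, df.shape[1])                      # consecutive row pairs
zeros = np.zeros((pairs.shape[0], 1, df.shape[1]), dtype=object)
out = pd.DataFrame(np.concatenate([pairs, zeros], axis=1).reshape(-1, df.shape[1]),
                   columns=df.columns)
print(out)
```

The reshape groups the rows into pairs, a zero row is appended to each pair, and flattening back out yields a zero row after every second row.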
Hi all, I have a CSV file which contains data in the format below:
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items; the second column contains the available features from the feature vector [a,b,c,d,e,f,g,h].
I want to convert this to an occurrence matrix like the one below:
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).max()
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
It's unclear whether your data has a typo, but you can use crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data, the second column has the value a as its name, but in your expected output a appears inside the column as a value.
EDIT
OK, I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result as a SciPy sparse COO matrix, much faster:
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes,
                                 df.col2.cat.codes)))
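A runnable sketch of the sparse route on the occurrence-matrix sample data (the DataFrame construction here is illustrative), densified at the end only to verify it matches the expected matrix:

```python
import pandas as pd
from scipy import sparse

# sample data: item in col1, feature in col2
df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})

df['col1'] = df['col1'].astype('category')
df['col2'] = df['col2'].astype('category')
df['ones'] = 1

# rows follow col1's categories (A, B, C); columns follow col2's (a..h)
user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes,
                                 df.col2.cat.codes)))
print(user_items.toarray())
```

Since the matrix is built directly from category codes, it never materialises the dense table, which is what makes this approach fast on large inputs.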