Hi all I have a csv file which contains data as the format below
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items second column contains available feature from feature vector=[a,b,c,d,e,f,g,h]
I want to convert this to occurence matrix look like below
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).apply(max)
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
Unclear if your data has a typo or not but you can crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data your second column has value a as the name of that column but in your expected output it's in the column as a value
EDIT
OK I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result in a scipy sparse coo matrix much faster
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
(df.col1.cat.codes,
df.col2.cat.codes)))
Related
I have the following example data set
A
B
C
D
foo
0
1
1
bar
0
0
1
baz
1
1
0
How could extract the column names of each 1 occurrence in a row and put that into another column E so that I get the following table:
A
B
C
D
E
foo
0
1
1
C, D
bar
0
0
1
D
baz
1
1
0
B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
Another way is that you can convert each row to boolean and use it as a selection mask to filter the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, such that I got a data frame as follows:
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
Try this:
for i in s:
df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
I have the next DF with two columns
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (count the intersection)
A B C
A 0 1 2
B 1 0 2
C 2 2 0
I have the next code but doesnt work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can i obtain the adjacency matrix?
Thanks
Try self merge on column 2 and then crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0
Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
Merge on gene before crosstab and fill diagonal with zeros
df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:
col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h
What I would like to get as output would be the following matrix of counts, for e.g.:
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
I am getting totally confused with pandas iterating over columns, rows, indexes and such. Appreciate some guidance here.
Pandas often has simple functions built in - in this case, you want crosstab:
pd.crosstab(dat['col1'], dat['col2'])
full code:
import pandas as pd
from io import StringIO
x = '''col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h'''
dat = pd.read_csv(StringIO(x), sep = '\s+')
pd.crosstab(dat['col1'], dat['col2'])
You're looking for a crosstab:
count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])
print(count_matrix)
col2 e f g h
col1
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you don't like the column/index names in (e.g. still seeing "col1" and "col2"), then you can remove them with rename_axis:
count_matrix = count_matrix.rename_axis(index=None, columns=None)
print(count_matrix)
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you want that all together in one snippet:
count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
.rename_axis(index=None, columns=None))
So far, I have this code that adds a row of zeros every other row (from this question):
import pandas as pd
import numpy as np
def Add_Zeros(df):
zeros = np.where(np.empty_like(df.values), 0, 0)
data = np.hstack([df.values, zeros]).reshape(-1, df.shape[1])
df_ordered = pd.DataFrame(data, columns=df.columns)
return df_ordered
Which results in the following data frame:
A B
0 a a
1 0 0
2 b b
3 0 0
4 c c
5 0 0
6 d d
But I need it to add the row of zeros every 2nd row instead, like this:
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I've tried altering the code, but each time, I get an error that says that zeros and df don't match in size.
I should also point out that I have a lot more rows and columns than I wrote here.
How can I do this?
Option 1
Using groupby
s = pd.Series(0, df.columns)
f = lambda d: d.append(s, ignore_index=True)
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
Option 2
from itertools import repeat, chain
v = df.values
pd.DataFrame(
np.row_stack(list(chain(*zip(v[0::2], v[1::2], repeat(z))))),
columns=df.columns
)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0