Convert Two column data frame to occurrence matrix in pandas

Hi all, I have a CSV file which contains data in the format below:
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items; the second column contains a feature available for that item, taken from the feature vector [a,b,c,d,e,f,g,h].
I want to convert this to an occurrence matrix that looks like this:
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?

Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
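# dtype=int keeps 0/1 output; .max() replaces .apply(max), whose handling of builtins is deprecated in recent pandas
pd.get_dummies(df.col2, dtype=int).groupby(df.col1).max()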
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0

It is unclear whether your data has a typo, but you can use crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data the value a in the first row of the second column was read as that column's name, whereas in your expected output it appears as a value.
EDIT
OK I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
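
Neither output above contains a column for c, because no item has that feature, while the expected output in the question lists all eight features. Here is a minimal sketch of one way to get the full matrix, assuming the feature list from the question is known up front. The names `item`, `feature`, `features` and `occurrence` are illustrative, not from the original posts.

```python
import io
import pandas as pd

raw = """A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["item", "feature"])

# Full feature vector from the question; 'c' never occurs in the data.
features = list("abcdefgh")

# crosstab counts the (item, feature) pairs; reindex adds any
# missing feature columns and fills them with 0.
occurrence = (pd.crosstab(df["item"], df["feature"])
                .reindex(columns=features, fill_value=0))
print(occurrence)
```

If a pair can appear more than once in the file, crosstab returns counts rather than 0/1 flags, so a final `.clip(upper=1)` may be needed.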

This approach yields the same result as a SciPy sparse COO matrix, and much faster:
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes,
                                 df.col2.cat.codes)))
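
The sparse matrix itself carries no row or column labels; those live in the categories of the two columns. The sketch below, which assumes the same col1/col2 layout as above, shows one way to attach the labels again for inspection. The names `rows`, `cols`, `mat` and `dense` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "col1": list("AABBBBCCC"),
    "col2": list("abfgehdef"),
})

rows = df["col1"].astype("category")
cols = df["col2"].astype("category")

# Each (row code, column code) pair contributes a 1.0 entry.
mat = sparse.coo_matrix(
    (np.ones(len(df)),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Densify only for inspection; the categories supply the labels.
dense = pd.DataFrame(mat.toarray(),
                     index=rows.cat.categories,
                     columns=cols.cat.categories)
print(dense)
```

Converting to a dense frame gives up the memory advantage, so on large data it is probably better to keep `mat` sparse and look labels up through the categories only when needed.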

Related

How to convert binary columns with multiple occurrences into categorical data in Pandas

I have the following example data set:
A    B  C  D
foo  0  1  1
bar  0  0  1
baz  1  1  0
How could I extract the column names of each 1 occurrence in a row and put them into another column E, so that I get the following table:
A    B  C  D  E
foo  0  1  1  C, D
bar  0  0  1  D
baz  1  1  0  B, C
Note that there can be more than two 1s per row.

You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
Another way is to convert each row to boolean and use it as a selection mask over the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
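
The dot trick relies on string arithmetic: multiplying a string by 0 or 1 gives an empty string or the string itself, and summing concatenates the pieces. For comparison, here is a sketch that builds the same column with a plain list comprehension. It is not taken from the answers above, and `flag_cols` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "baz"],
    "B": [0, 0, 1],
    "C": [1, 0, 1],
    "D": [1, 1, 0],
})

flag_cols = ["B", "C", "D"]

# For each row, keep the names of the columns whose flag is non-zero.
df["E"] = [
    ", ".join(name for name, flag in zip(flag_cols, row) if flag)
    for row in df[flag_cols].to_numpy()
]
print(df)
```

This version also leaves an empty string for rows without any 1, with no trailing separator to strip.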

Join an array to every row in the pandas dataframe

I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0,5), 'y' : range(1,6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, so that each row also carries the values a, b and c.
What would be the most efficient way to do this?

Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)

Try this:
for i in s:
    df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c

You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c

You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
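
All of the answers above modify df in place or produce integer column names. If a new frame with named columns is preferred, a sketch using DataFrame.assign might look like this, assuming the array elements are valid column names. The name `out` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(0, 5), "y": range(1, 6)})
s = np.array(["a", "b", "c"])

# One constant column per array element; each scalar value is
# broadcast to every row, and df itself is left unchanged.
out = df.assign(**{str(v): str(v) for v in s})
print(out)
```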

Create adjacency matrix from adjacency list

I have the following DataFrame with two columns:
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (counting the size of the intersection):
A B C
A 0 1 2
B 1 0 2
C 2 2 0
I have the following code, but it doesn't work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can I obtain the adjacency matrix?
Thanks

Try self merge on column 2 and then crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0

Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
Merge on Gene before the crosstab, then fill the diagonal with zeros:
import numpy as np
df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0
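
The same counts can also be read as a matrix product: if M is the Drug x Gene occurrence matrix, entry (i, j) of M times its transpose is the number of genes that drugs i and j share. Below is a minimal sketch of that idea, assuming each (Drug, Gene) pair appears at most once. The names `occ`, `shared` and `adj` are illustrative.

```python
import numpy as np
import pandas as pd

drugs = pd.DataFrame({
    "Drug": list("AAABBCCC"),
    "Gene": list("xyzxwxwi"),
})

# Drug x Gene occurrence matrix (1 where the pair exists).
occ = pd.crosstab(drugs["Drug"], drugs["Gene"])

# (occ @ occ.T)[i, j] counts the genes shared by drug i and drug j.
shared = occ.to_numpy() @ occ.to_numpy().T
np.fill_diagonal(shared, 0)  # a drug is not its own neighbour

adj = pd.DataFrame(shared, index=occ.index, columns=occ.index)
print(adj)
```

This avoids the self merge, whose intermediate result can grow large when many drugs share the same gene.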

Pairwise matrix counts of two columns using pandas [duplicate]

I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:
col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h
What I would like to get as output would be the following matrix of counts, for e.g.:
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
I am getting totally confused with pandas iterating over columns, rows, indexes and such. I would appreciate some guidance here.

Pandas often has simple functions built in - in this case, you want crosstab:
pd.crosstab(dat['col1'], dat['col2'])
Full code:
import pandas as pd
from io import StringIO
x = '''col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h'''
dat = pd.read_csv(StringIO(x), sep=r'\s+')
pd.crosstab(dat['col1'], dat['col2'])

You're looking for a crosstab:
count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])
print(count_matrix)
col2 e f g h
col1
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you don't like the column/index names in the output (i.e. still seeing "col1" and "col2"), you can remove them with rename_axis:
count_matrix = count_matrix.rename_axis(index=None, columns=None)
print(count_matrix)
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you want that all together in one snippet:
count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
                  .rename_axis(index=None, columns=None))
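
To make clearer what crosstab computes here, the sketch below rebuilds the same table by hand: group by both columns, count the rows in each group, and pivot the second key into columns. It uses the data from the question; `counts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": list("abcdabcdabcdabcdb"),
    "col2": list("eghfghfefggheeghh"),
})

# size() counts rows per (col1, col2) pair; unstack moves col2 into
# the columns, and fill_value=0 covers pairs that never occur.
counts = df.groupby(["col1", "col2"]).size().unstack(fill_value=0)
print(counts)
```

For a plain count, crosstab is the shorter spelling; the groupby form may be handy when the aggregation needs to change later.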

Pandas: Inserting an empty row after every 2nd row in a data frame

So far, I have this code that adds a row of zeros every other row (from this question):
import pandas as pd
import numpy as np
def Add_Zeros(df):
    zeros = np.where(np.empty_like(df.values), 0, 0)
    data = np.hstack([df.values, zeros]).reshape(-1, df.shape[1])
    df_ordered = pd.DataFrame(data, columns=df.columns)
    return df_ordered
Which results in the following data frame:
A B
0 a a
1 0 0
2 b b
3 0 0
4 c c
5 0 0
6 d d
But I need it to add the row of zeros every 2nd row instead, like this:
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I've tried altering the code, but each time, I get an error that says that zeros and df don't match in size.
I should also point out that I have a lot more rows and columns than I wrote here.
How can I do this?

Option 1
Using groupby
s = pd.Series(0, df.columns)
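# DataFrame.append was removed in pandas 2.0; pd.concat with a one-row frame does the same job
f = lambda d: pd.concat([d, s.to_frame().T], ignore_index=True)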
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
Option 2
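from itertools import repeat, chain
v = df.values
# z was left undefined in the original snippet; a row of zeros is assumed
z = np.zeros(v.shape[1], dtype=int)
pd.DataFrame(
    # np.vstack replaces np.row_stack, which is deprecated in NumPy 2.0
    np.vstack(list(chain(*zip(v[0::2], v[1::2], repeat(z))))),
    columns=df.columns
)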
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
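
Note that Option 2 pairs rows with zip, so a trailing unpaired row is dropped when the frame has an odd number of rows. As an alternative sketch that keeps such a row, np.insert can place a zero row before every second position of the original array. The 5-row frame and the names `positions` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcde"), "B": list("abcde")})

v = df.to_numpy(dtype=object)

# Indices into the ORIGINAL array before which a zero row is
# inserted: 2, 4, ... (an index equal to len(v) appends at the end).
positions = np.arange(2, len(v) + 1, 2)

out = pd.DataFrame(np.insert(v, positions, 0, axis=0), columns=df.columns)
print(out)
```

Changing the step in `np.arange` adjusts how many rows sit between the inserted zero rows.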
