Pandas extract columns and rows by ID from a distance matrix

Pandas extract columns and rows by ID from a distance matrix - python

I have a distance matrix with IDs as column and row names:
A B C D
A 0 1 2 3
B 1 0 4 5
C 2 4 0 6
D 3 5 6 0
How to efficiently extract values from a large matrix, e.g. for IDs A and C to get this matrix:
A C
A 0 2
C 2 0
Edit, missing IDs in the matrix should be ignored.

Use DataFrame.loc for get values by labels:
vals = ['A','C']
df = df.loc[vals, vals]
print (df)
A C
A 0 2
C 2 0
EDIT: If some values not match and need omit them add Index.intersection:
vals = ['J','A','C']
new = df.columns.intersection(vals, sort=False)
df = df.loc[new, new]
print (df)
A C
A 0 2
C 2 0

Related

Create adjacency matrix from adjacency list

I have the next DF with two columns
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (count the intersection)
A B C
A 0 1 2
B 1 0 2
C 2 2 0
I have the next code but doesnt work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can i obtain the adjacency matrix?
Thanks

Try self merge on column 2 and then crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0

Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
Merge on gene before crosstab and fill diagonal with zeros
df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0

Grouping the columns and identifying values which are not part of this group

I have a DataFrame which looks like this:
df:-
A B
1 a
1 a
1 b
2 c
3 d
Now using this dataFrame i want to get the following new_df:
new_df:-
item val_not_present
1 c #1 doesn't have values c and d(values not part of group 1)
1 d
2 a #2 doesn't have values a,b and d(values not part of group 2)
2 b
2 d
3 a #3 doesn't have values a,b and c(values not part of group 3)
3 b
3 c
or an individual DataFrame for each items like:
df1:
item val_not_present
1 c
1 d
df2:-
item val_not_present
2 a
2 b
2 d
df3:-
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.

You can use np.setdiff and explode:
values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique().apply(lambda x: np.setdiff1d(values_b,x)).rename("val_not_present").explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c

Another approach is using crosstab/pivot_table to get counts and then filter on where count is 0 and transform to dataframe:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(),columns=['A','val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c

You could convert B to a categorical datatype and then compute the value counts. Categorical variables will show categories that have frequency counts of zero so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
df.groupby('A')
.apply(lambda x: x['B'].value_counts())
.reset_index()
.query('B == 0')
.drop(labels='B', axis=1)
.rename(columns={'level_1':'val_not_present',
'A':'item'})
)

merging rows with repeating column values

I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following but it puts each repeating value in its own column
df.groupby('data')

Seems like a pivot problem, but since you missing the column(create by cumcount) and index(create by factorize) columns , it is hard to figure out
pd.crosstab(pd.factorize(df.data)[0],df.groupby('data').cumcount(),df.data,aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b

Something like
index = ((df['data'] != df['data'].shift()).cumsum() - 1).rename(columns= {'data':''})
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b

You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b

Pandas: Inserting an empty row after every 2nd row in a data frame

So far, I have this code that adds a row of zeros every other row (from this question):
import pandas as pd
import numpy as np
def Add_Zeros(df):
zeros = np.where(np.empty_like(df.values), 0, 0)
data = np.hstack([df.values, zeros]).reshape(-1, df.shape[1])
df_ordered = pd.DataFrame(data, columns=df.columns)
return df_ordered
Which results in the following data frame:
A B
0 a a
1 0 0
2 b b
3 0 0
4 c c
5 0 0
6 d d
But I need it to add the row of zeros every 2nd row instead, like this:
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I've tried altering the code, but each time, I get an error that says that zeros and df don't match in size.
I should also point out that I have a lot more rows and columns than I wrote here.
How can I do this?

Option 1
Using groupby
s = pd.Series(0, df.columns)
f = lambda d: d.append(s, ignore_index=True)
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
Option 2
from itertools import repeat, chain
v = df.values
pd.DataFrame(
np.row_stack(list(chain(*zip(v[0::2], v[1::2], repeat(z))))),
columns=df.columns
)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0

Python method to compare 1 value_id against another columns' value_ids in separate dataframes?

I have 2 csv files. Each contains a data set with multiple columns and an ASSET_ID column. I used pandas to read each csv file in as a df1 and df2. My problem has been trying to define a function to iterate over the ASSET_ID value in df1 and compare each value against all the ASSET_ID values in df2. From there I want to return all the corresponding rows from df1's ASSET_ID's that matched df2. Any help would be appreciated I've been working on this for hours with little to show for it. dtypes are float or int.
My configuration = windows xp, python 2.7 anaconda distribution

Create a boolean mask of the values will index the rows where the 2 df's match, no need to iterate and much faster.
Example:
# define a list of values
a = list('abcdef')
b = range(6)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
# c has x values for 'a' and 'd' so these should not match
c = list('xbcxef')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(b)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
[6 rows x 2 columns]
In [4]:
# now index your df using boolean condition on the values
df[df.X == df1.X]
Out[4]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]
EDIT:
So if you have different length series then that won't work, in which case you can use isin:
So create 2 dataframes of different lengths:
a = list('abcdef')
b = range(6)
d = range(10)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
c = list('xbcxefxghi')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(d)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
6 x 6
7 g 7
8 h 8
9 i 9
[10 rows x 2 columns]
Now use isin to select rows from df1 where the id's exist in df:
In [7]:
df1[df1.X.isin(df.X)]
Out[7]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas extract columns and rows by ID from a distance matrix - python

I have a distance matrix with IDs as column and row names: A B C D A 0 1 2 3 B 1 0 4 5 C 2 4 0 6 D 3 5 6 0 How to efficiently extract values from a large matrix, e.g. for IDs A and C to get this matrix: A C A 0 2 C 2 0 Edit, missing IDs in the matrix should be ignored.

Related

Create adjacency matrix from adjacency list

Grouping the columns and identifying values which are not part of this group

merging rows with repeating column values

Pandas: Inserting an empty row after every 2nd row in a data frame

Python method to compare 1 value_id against another columns' value_ids in separate dataframes?

Categories

Resources