Compute co-occurrence matrix by counting values in cells - python

I have a dataframe like this
df = pd.DataFrame({'a' : [1,1,0,0], 'b': [0,1,1,0], 'c': [0,0,1,1]})
I want to get
a b c
a 2 1 0
b 1 2 1
c 0 1 2
where a,b,c are column names, and I get the values counting '1' in all columns when the filter is '1' in another column.
For example, when df.a == 1, we count a = 2, b = 1, c = 0, etc.
I made a loop to solve it:
matrix = []
for name, values in df.iteritems():  # .items() in newer pandas
    matrix.append(pd.DataFrame(df.groupby(name, as_index=False).apply(lambda x: x[x == 1].count())).values.tolist()[1])
pd.DataFrame(matrix)
But I think there must be a simpler solution. Is there?

You appear to want the matrix product, so leverage DataFrame.dot:
df.T.dot(df)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
Alternatively, if you want the same level of performance without the overhead of pandas, you could compute the product with np.dot:
v = df.values
pd.DataFrame(v.T.dot(v), index=df.columns, columns=df.columns)
Or, if you want to get cute,
(lambda a, c: pd.DataFrame(a.T.dot(a), c, c))(df.values, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
—piRSquared
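Why this works: entry (i, j) of df.T.dot(df) is the dot product of columns i and j, and for 0/1 indicator columns that is exactly the number of rows where both are 1. A minimal check, using the df from the question:
cooc = df.T.dot(df)
assert cooc.loc['a', 'b'] == (df['a'] * df['b']).sum()  # rows where both a and b are 1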

np.einsum
Not as pretty as df.T.dot(df) but how often do you see np.einsum amirite?
pd.DataFrame(np.einsum('ij,ik->jk', df, df), df.columns, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2

You can do the multiplication using the @ operator for numpy arrays.
df = pd.DataFrame(df.values.T @ df.values, df.columns, df.columns)

Numpy matmul
np.matmul(df.values.T,df.values)
Out[87]:
array([[2, 1, 0],
[1, 2, 1],
[0, 1, 2]], dtype=int64)
#pd.DataFrame(np.matmul(df.values.T,df.values), df.columns, df.columns)
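Note that with any of these products, the diagonal simply holds each column's total count of 1s. If you only want the cross co-occurrences, a small sketch (same df as above) zeroes it out:
m = df.values.T @ df.values
np.fill_diagonal(m, 0)  # drop the self-counts on the diagonal
pd.DataFrame(m, df.columns, df.columns)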

Related

Add comma with Variable name as suffix

I am working on a scenario where I append all the scores of a correlation matrix one below the other, combining the two variable names in one column and the scores in another, and finally sorting in descending order to find the variables with the maximum scores.
I am almost there, but I am finding it hard to prepend a comma (,) to the variable-name suffix inside the for loop (i.e. line 6 in the code below should produce something like series.add_suffix(', Temp9am'), where var is the variable name from the for loop and I need the comma in front of it).
Please find the code below; I have attached screenshots of the dataframe I am working with.
df_sorted_
corre_score = pd.DataFrame()
for var in df_sorted_.columns:
    series = df_sorted_[var]
    series_ = series.add_suffix(', var')
    series1 = pd.DataFrame(series_)
    series1.columns = ['Score_']
    series1
(Screenshot of the dataframe omitted.)
The expected output has all variables appended one below the other.
Use stack to stack the values into one column, then rename the index:
dfn = df.stack()
dfn.index = pd.Series(list(dfn.index)).str.join(', ')
If I got your question right, you can do it like this:
df.index=df.index+","
pd.DataFrame(df.stack(),columns=["Score_"]).sort_values(by='Score_', ascending=False)
Try using stack and <dataframe>.index.map(','.join):
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(5, 5), columns= [ chr(i+66) for i in range(5)])
corr = df.corr()
corr:
B C D E F
B 1.000000 0.861882 -0.687578 0.220408 -0.965567
C 0.861882 1.000000 -0.770134 -0.283359 -0.807373
D -0.687578 -0.770134 1.000000 0.341363 0.701713
E 0.220408 -0.283359 0.341363 1.000000 -0.262302
F -0.965567 -0.807373 0.701713 -0.262302 1.000000
corr[corr == 1] = np.nan
s = corr.stack()
s.index = s.index.map(','.join)
pd.DataFrame(s, columns=['Score'])
Score
B,C 0.861882
B,D -0.687578
B,E 0.220408
B,F -0.965567
C,B 0.861882
C,D -0.770134
C,E -0.283359
C,F -0.807373
D,B -0.687578
D,C -0.770134
D,E 0.341363
D,F 0.701713
E,B 0.220408
E,C -0.283359
E,D 0.341363
E,F -0.262302
F,B -0.965567
F,C -0.807373
F,D 0.701713
F,E -0.262302
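Since the stated goal was to sort the scores in descending order, a possible follow-up (building on the corr above) keeps each unordered pair only once via the upper triangle, because the matrix is symmetric:
import numpy as np
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack()  # NaNs (lower triangle and diagonal) are dropped
pairs.index = pairs.index.map(','.join)
pairs.sort_values(ascending=False).to_frame('Score')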
You can use pd.DataFrame.melt; this avoids the NaN omissions of df.stack:
>>> df = pd.get_dummies(list('abcd'))
>>> df
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
>>> df = df.melt(value_name='Score', ignore_index=False)
>>> df.index = df.index.astype(str) + ', ' + df.pop('variable')
Score
0, a 1
1, a 0
2, a 0
3, a 0
0, b 0
1, b 1
2, b 0
3, b 0
0, c 0
1, c 0
2, c 1
3, c 0
0, d 0
1, d 0
2, d 0
3, d 1
In your case you won't need to convert df.index to astype(str).
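Alternatively, depending on your pandas version, df.stack itself accepts dropna=False, which keeps the NaN entries without switching to melt. A sketch on the original dummies frame (before the melt):
s = df.stack(dropna=False)  # keep NaN entries instead of silently dropping them
s.index = s.index.map(lambda ix: ', '.join(map(str, ix)))  # str() handles the integer level
pd.DataFrame(s, columns=['Score'])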

How to conditionally add one hot vector to a Pandas DataFrame

I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
columns=['a', 'b', 'c'])
df
It looks as the following when you output it:
a b c
0 1 2 3
1 3 2 1
2 2 1 1
I need to add 3 new columns, as column "d", column "e", and column "f".
Values in each new column will be determined based on the values of column "b" and column "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look as the following:
a b c d e f
0 1 2 3 0 0 1 # Since b smaller than c
1 3 2 1 1 0 0 # Since b bigger than c
2 2 1 1 0 1 0 # Since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to create a condition vector and str.get_dummies to create the dummies:
df['vec'] = np.where(df.b>df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop('vec', axis=1)
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
Let us try np.sign with get_dummies: -1 means c < b, 0 means c == b, 1 means c > b.
df=df.join(np.sign(df.eval('c-b')).map({-1:'d',0:'e',1:'f'}).astype(str).str.get_dummies())
df
Out[29]:
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
You simply harness the Boolean conditions you've already specified.
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)

How do I delete a column that contains only zeros from a given row in pandas

I've found how to remove columns that are zero for all rows using df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row number.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
Give me the columns with non-zero values for row 0, and I would expect the result:
a b
0 1 1
And for row 1:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need to visualize its result better.
Below is pseudo-code showing what I need:
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
You can change data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
If you need a new column with the non-zero column names joined by ', ', compare values for not-equal with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print('Row {}'.format(i))
    print(a)
Or:
def f(x):
    print('Row {}'.format(x.name))
    print(x[x != 0].to_frame().T)

df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(get, axis=1)
print(df)
Also, if you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row == True].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
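A shorter variant of the same idea, selecting the non-zero columns of given_row with a boolean mask in a single loc call:
given_row = 1
df_filtered = df.loc[[given_row], df.loc[given_row].ne(0)]  # one-row frame, non-zero columns only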

How to efficiently rearrange pandas data as follows?

I need some help with a concise and, above all, efficient pandas formulation of the following operation:
Given a data frame of the format
id a b c d
1 0 -1 1 1
42 0 1 0 0
128 1 -1 0 1
Construct a data frame of the format:
id one_entries
1 "c d"
42 "b"
128 "a d"
That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.
Here's one way, using a boolean mask and apply with a lambda:
In [58]: df
Out[58]:
id a b c d
0 1 0 -1 1 1
1 42 0 1 0 0
2 128 1 -1 0 1
In [59]: cols = list('abcd')
In [60]: (df[cols] > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
Out[60]:
0 c d
1 b
2 a d
dtype: object
You can assign the result to df['one_entries'].
Details of the apply function:
Take the first row.
In [83]: x = df[cols].iloc[0] > 0
In [84]: x
Out[84]:
a False
b False
c True
d True
Name: 0, dtype: bool
x gives you the boolean values for the row (True where the value is greater than zero). x[x] returns only the True entries, essentially a series with column names as the index.
In [85]: x[x]
Out[85]:
c True
d True
Name: 0, dtype: bool
x[x].index gives you the column names.
In [86]: x[x].index
Out[86]: Index([u'c', u'd'], dtype='object')
Same reasoning as John Galt's, but a bit shorter, constructing a new DataFrame from a dict.
pd.DataFrame({
'one_entries': (test_df > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
})
# one_entries
# 1 c d
# 42 b
# 128 a d
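The DataFrame.dot trick from the zero-columns question above works here too and avoids the row-wise apply entirely. A sketch, assuming df holds the id column plus the a..d columns:
cols = list('abcd')
df['one_entries'] = (df[cols] > 0).dot(pd.Index(cols) + ' ').str.rstrip()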

Find the max of two or more columns with pandas

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
@DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along the first axis.
# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph (omitted here) was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)
To find the overall max among multiple columns:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
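For exactly two columns there is also the elementwise ufunc itself, np.maximum (the same function whose reduce was shown above):
df['C'] = np.maximum(df['A'], df['B'])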
