Add comma with Variable name as suffix - python

I am working on a scenario where I append all the scores of a correlation matrix one below the other, with the two variable names combined in one column and the score in another column, and finally sort them in descending order to find the variable pairs with the maximum scores.
I am almost there, but I am finding it hard to append a comma (,) together with the variable name as a suffix inside the for loop (i.e. the series.add_suffix(', Temp9am') call in the code below), where var is the variable name from the for loop and I need the comma in front of it.
Please find the code below; I have also attached screenshots of the dataframe I am working with.
df_sorted_corre_score = pd.DataFrame()
for var in df_sorted_.columns:
    series = df_sorted_[var]
    series_ = series.add_suffix(', var')
    series1 = pd.DataFrame(series_)
    series1.columns = ['Score_']
series1
Dataframe screenshot: [image attached]
Expected output (all variables appended one below the other): [image attached]
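For the literal suffix question, here is a minimal sketch (the small frame below is only a stand-in for df_sorted_ from the screenshots): build the suffix with an f-string so the loop variable var is interpolated rather than the literal text 'var', then concatenate and sort.
import pandas as pd

# stand-in for df_sorted_ from the screenshots; any numeric dataframe works the same way
df_sorted_ = pd.DataFrame({'Temp9am': [1.0, 0.8], 'Rainfall': [0.8, 1.0]},
                          index=['Temp9am', 'Rainfall'])

pieces = []
for var in df_sorted_.columns:
    # the f-string interpolates the loop variable, giving labels like 'Rainfall, Temp9am'
    pieces.append(df_sorted_[var].add_suffix(f', {var}'))

corre_score = pd.concat(pieces).to_frame('Score_').sort_values('Score_', ascending=False)
print(corre_score)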

Use stack to stack the values into one column, then rename the index:
dfn = df.stack()
dfn.index = pd.Series(list(dfn.index)).str.join(', ')
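To finish with the sorted Score_ column the question asks for, a short follow-up on top of dfn (an addition, not part of the original answer):
result = dfn.sort_values(ascending=False).to_frame('Score_')   # strongest pairs first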

If I got your question right, you can do it like this:
df.index=df.index+","
pd.DataFrame(df.stack(),columns=["Score_"]).sort_values(by='Score_', ascending=False)

Try using stack and <dataframe>.index.map(','.join):
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(5, 5), columns=[chr(i + 66) for i in range(5)])
corr = df.corr()
corr:
B C D E F
B 1.000000 0.861882 -0.687578 0.220408 -0.965567
C 0.861882 1.000000 -0.770134 -0.283359 -0.807373
D -0.687578 -0.770134 1.000000 0.341363 0.701713
E 0.220408 -0.283359 0.341363 1.000000 -0.262302
F -0.965567 -0.807373 0.701713 -0.262302 1.000000
corr[corr==1]=np.nan
s = corr.stack()
s.index = s.index.map(','.join)
pd.DataFrame(s, columns=['Score'])
Score
B,C 0.861882
B,D -0.687578
B,E 0.220408
B,F -0.965567
C,B 0.861882
C,D -0.770134
C,E -0.283359
C,F -0.807373
D,B -0.687578
D,C -0.770134
D,E 0.341363
D,F 0.701713
E,B 0.220408
E,C -0.283359
E,D 0.341363
E,F -0.262302
F,B -0.965567
F,C -0.807373
F,D 0.701713
F,E -0.262302
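To rank the pairs by score, as the question ultimately wants, one extra call can be chained onto the frame above (a hedged addition to this answer):
pd.DataFrame(s, columns=['Score']).sort_values('Score', ascending=False)   # strongest pairs first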

You can use pd.DataFrame.melt; this avoids the issue of NaN rows being dropped by df.stack:
>>> df = pd.get_dummies(list('abcd'))
>>> df
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
>>> df = df.melt(value_name='Score', ignore_index=False)
>>> df.index = df.index.astype(str) + ', ' + df.pop('variable')
Score
0, a 1
1, a 0
2, a 0
3, a 0
0, b 0
1, b 1
2, b 0
3, b 0
0, c 0
1, c 0
2, c 1
3, c 0
0, d 0
1, d 0
2, d 0
3, d 1
In your case you won't need to convert df.index to astype(str).
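Applied to a correlation matrix such as corr from the earlier answer (an assumption about the asker's actual frame), the same melt pattern would be:
m = corr.melt(value_name='Score', ignore_index=False)   # keeps the NaN diagonal
m.index = m.index + ', ' + m.pop('variable')            # labels like 'C, B', 'D, B', ...
m = m.sort_values('Score', ascending=False)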

Related

How to conditionally add one hot vector to a Pandas DataFrame

I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
columns=['a', 'b', 'c'])
df
It looks as the following when you output it:
a b c
0 1 2 3
1 3 2 1
2 2 1 1
I need to add 3 new columns: column "d", column "e", and column "f".
Values in each new column will be determined based on the values of column "b" and column "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look as the following:
a b c d e f
0 1 2 3 0 0 1 # Since b smaller than c
1 3 2 1 1 0 0 # Since b bigger than c
2 2 1 1 0 1 0 # Since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to create a condition vector and str.get_dummies to create the dummies:
df['vec'] = np.where(df.b>df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop(columns='vec')
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
Let us try np.sign with get_dummies: -1 means c < b, 0 means c = b, 1 means c > b.
df=df.join(np.sign(df.eval('c-b')).map({-1:'d',0:'e',1:'f'}).astype(str).str.get_dummies())
df
Out[29]:
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
You simply harness the Boolean conditions you've already specified.
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)

How do I delete a column that contains only zeros from a given row in pandas

I've found how to remove columns that contain only zeros across all rows using the command df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row number.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
Give me the columns with non-zero values for row 0, and I would expect the result:
a b
0 1 1
And for row 1:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need to better visualize its result.
Below is pseudo-code trying to show what I need:
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
Best Regards.
Kleyson Rios.
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
If you need a new column with the non-zero column names joined by commas, compare the values for inequality with DataFrame.ne and use matrix multiplication with DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x!=0].to_frame().T)

df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])
def get(row):
    return list(df.columns[row.ne(0)])
df['non zero column'] = df.apply(lambda x: get(x), axis=1)
print(df)
Also, if you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row == True].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
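A compact variant of the same mask idea, as a hedged one-liner on the original example frame:
import pandas as pd

df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])
given_row = 1
print(df.loc[[given_row], df.loc[given_row].ne(0)])   # only columns a and c remain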

How to get maximum column condensed in one row with Pandas?

For the following dataframe
df = pd.DataFrame({"a": [1, 0, 11], "b": [7, 0, 0], "c": [0,10,0], "d": [1,0,0],
"e": [0,0,0], "name":["b","c","a"]})
print(df)
a b c d e name
0 1 7 0 1 0 b
1 0 0 10 0 0 c
2 11 0 0 0 0 a
I would like to get one row back that comprises the maximum values of each column plus the name of that column.
E.g. in this case:
a b c d e name
11 7 10 1 0 a
How can this be performed?
First get the maximum values into a one-row DataFrame with max, to_frame and a transpose with T, and then get the name of the maximum value with idxmax:
a = df.max().to_frame().T
a.loc[0, 'name'] = df.set_index('name').max(axis=1).idxmax()
print (a)
a b c d e name
0 11 7 10 1 0 a
Detail:
print (df.set_index('name').max(axis=1))
name
b 7
c 10
a 11
dtype: int64
print (df.set_index('name').max(axis=1).idxmax())
a
Use df.max(), create a DataFrame and transpose it:
pd.DataFrame(df.max()).T
a b c d e name
0 11 7 10 1 0 c
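Note that df.max() also takes the lexicographic maximum of the name column ('c' here), so if, as in the expected output, the name should come from the row holding the overall numeric maximum, a hedged sketch:
num = df.drop(columns='name')                            # numeric part only
out = num.max().to_frame().T                             # one row of column maxima
out['name'] = df.loc[num.max(axis=1).idxmax(), 'name']   # name of the row with the overall max (11 -> 'a')
print(out)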

Compute co-occurrence matrix by counting values in cells

I have a dataframe like this
df = pd.DataFrame({'a' : [1,1,0,0], 'b': [0,1,1,0], 'c': [0,0,1,1]})
I want to get
a b c
a 2 1 0
b 1 2 1
c 0 1 2
where a, b, c are the column names, and each value counts the rows that have '1' in both columns (i.e. filter on '1' in one column and count the '1's in the others).
For example, when df.a == 1, we count a = 2, b = 1, c = 0, etc.
I made a loop to solve it:
matrix = []
for name, values in df.iteritems():
    matrix.append(pd.DataFrame(df.groupby(name, as_index=False).apply(lambda x: x[x == 1].count())).values.tolist()[1])
pd.DataFrame(matrix)
But I think there is a simpler solution, isn't there?
You appear to want the matrix product, so leverage DataFrame.dot:
df.T.dot(df)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
Alternatively, if you want the same level of performance without the overhead of pandas, you could compute the product with np.dot:
v = df.values
pd.DataFrame(v.T.dot(v), index=df.columns, columns=df.columns)
Or, if you want to get cute,
(lambda a, c: pd.DataFrame(a.T.dot(a), c, c))(df.values, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
np.einsum
Not as pretty as df.T.dot(df) but how often do you see np.einsum amirite?
pd.DataFrame(np.einsum('ij,ik->jk', df, df), df.columns, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
You can do the multiplication using the @ operator for numpy arrays.
df = pd.DataFrame(df.values.T @ df.values, df.columns, df.columns)
Numpy matmul
np.matmul(df.values.T,df.values)
Out[87]:
array([[2, 1, 0],
[1, 2, 1],
[0, 1, 2]], dtype=int64)
#pd.DataFrame(np.matmul(df.values.T,df.values), df.columns, df.columns)
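As a small follow-up (not from the answers above): the @ operator also works directly on DataFrames, which keeps the labels without rebuilding the frame. Assuming df is still the original 4x3 frame from the question:
cooc = df.T @ df   # same result as df.T.dot(df), with index and columns preserved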

how do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert at the beginning:
df.insert(loc, column, value)
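Applied to the question's frame, a minimal sketch of that call:
import pandas as pd

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})
df.insert(0, 'n', 0)   # insert column 'n' with value 0 at position 0
print(df)              # columns are now n, l, v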
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column with an already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
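Note that reindex returns a new DataFrame rather than reordering df in place, so assign the result back if you want to keep the new order:
df = df.reindex(columns=['n'] + df.columns[:-1].tolist())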
Here is a very simple answer to this (only one line).
You can do it after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your column names instead of letters, you should include two sets of brackets around your column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
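Wrapped as a small reusable helper (an illustrative sketch, not part of the original answer):
import pandas as pd

def insert_at(df, loc, name, value):
    # adds the column in place, then returns a copy with it moved to position loc
    df[name] = value
    cols = df.columns.tolist()
    cols.insert(loc, cols.pop())
    return df[cols]

df = insert_at(pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]}), 0, 'n', 0)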
