I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
columns=['a', 'b', 'c'])
df
It looks like this when you output it:
a b c
0 1 2 3
1 3 2 1
2 2 1 1
I need to add three new columns: "d", "e", and "f".
The value in each new column is determined by the values of columns "b" and "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look as the following:
a b c d e f
0 1 2 3 0 0 1 # Since b smaller than c
1 3 2 1 1 0 0 # Since b bigger than c
2 2 1 1 0 1 0 # Since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to create a condition vector, then str.get_dummies to expand it into dummy columns:
df['vec'] = np.where(df.b>df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop(columns='vec')
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
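A related sketch (not from the answer above) that builds the label column with np.select, which takes all three conditions in one call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2], 'b': [2, 2, 1], 'c': [3, 1, 1]})

# np.select picks the first matching label per row; 'f' is the fallback
labels = np.select([df.b > df.c, df.b == df.c], ['d', 'e'], default='f')
df = df.join(pd.Series(labels, index=df.index).str.get_dummies())
```

This avoids nesting np.where calls when there are more than two conditions.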
Let us try np.sign with get_dummies: np.sign(df.eval('c - b')) is -1 where c < b, 0 where c == b, and 1 where c > b
df = df.join(np.sign(df.eval('c - b')).map({-1: 'd', 0: 'e', 1: 'f'}).str.get_dummies())
df
Out[29]:
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
You simply harness the Boolean conditions you've already specified.
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)
Related
I have a problem splitting a column into multiple columns.
Column B of my data contains lists of values.
I want to split the values of column B into their own columns, where each new column holds the number of occurrences of that value in the row's list.
input:
A B
a [1, 2]
b [3, 4, 5]
c [1, 5]
expected output:
A 1 2 3 4 5
a 1 1 0 0 0
b 0 0 1 1 1
c 1 0 0 0 1
You can explode the column of lists and use crosstab:
df2 = df.explode('B')
out = pd.crosstab(df2['A'], df2['B']).reset_index().rename_axis(columns=None)
output:
A 1 2 3 4 5
0 a 1 1 0 0 0
1 b 0 0 1 1 1
2 c 1 0 0 0 1
used input:
df = pd.DataFrame({'A': list('abc'), 'B': [[1,2], [3,4,5], [1,5]]})
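A closely related sketch: one-hot encode the exploded column with pd.get_dummies and sum per group, which gives the same counts:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abc'), 'B': [[1, 2], [3, 4, 5], [1, 5]]})

df2 = df.explode('B')
# one row per list element; dummies summed per original row give counts
out = pd.get_dummies(df2['B']).groupby(df2['A']).sum().reset_index()
```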
I have a pandas DataFrame like below:
df = pd.DataFrame({"type": ["A", "B", "C"],
"A": [0, 0, 12],
"B": [1, 3, 0],
"C": [0, 1, 1]}
)
I want to transform this to a DataFrame that is N X 2, where I concatenate the column and type values with " - " as delimiter. The output should look like this:
pair value
A - A 0
A - B 0
A - C 12
B - A 1
B - B 3
B - C 0
C - A 0
C - B 1
C - C 1
I don't know if there is a name for what I want to accomplish (I thought about pivoting but I believe that is something else), so that didn't help me in googling the solution for this. How to solve this problem efficiently?
First set 'type' as the index, then unstack and convert the result to a DataFrame.
try:
x = df.set_index('type').unstack().to_frame('value')
x.index = x.index.map(' - '.join)
res = x.rename_axis('pair').reset_index()
res:
pair value
0 A - A 0
1 A - B 0
2 A - C 12
3 B - A 1
4 B - B 3
5 B - C 0
6 C - A 0
7 C - B 1
8 C - C 1
First melt on the type column, then join the variable and type columns with ' - ', and keep only the required columns:
>>> out = df.melt(id_vars='type')
>>> out.assign(pair=out['variable'] + ' - ' + out['type'])[['pair', 'value']]
pair value
0 A - A 0
1 A - B 0
2 A - C 12
3 B - A 1
4 B - B 3
5 B - C 0
6 C - A 0
7 C - B 1
8 C - C 1
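A third sketch using stack, which yields the same long format (the row order differs from the answers above because stack iterates row-wise):

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'B', 'C'],
                   'A': [0, 0, 12], 'B': [1, 3, 0], 'C': [0, 1, 1]})

# stack produces a Series indexed by (type, column); format each pair
s = df.set_index('type').stack()
out = pd.DataFrame({'pair': [f'{col} - {typ}' for typ, col in s.index],
                    'value': s.to_numpy()})
```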
In a pandas dataframe, I need to find columns that contain a zero in any row, and drop that whole column.
For example, if my dataframe looks like this:
A B C D E F G H
0 1 0 1 0 1 1 1 1
1 0 1 1 1 1 0 1 1
I need to drop columns A, B, D, and F. I know how to drop the columns, but identifying the ones with zeros programmatically is eluding me.
You can use .loc to slice the dataframe and perform boolean indexing on the columns, keeping only those without any 0 in them:
df.loc[:,~(df==0).any()]
C E G H
0 1 1 1 1
1 1 1 1 1
Or equivalently you can do:
df.loc[:,(df!=0).all()]
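A drop-based sketch of the same idea, which first names the offending columns explicitly:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0], 'B': [0, 1], 'C': [1, 1], 'D': [0, 1],
                   'E': [1, 1], 'F': [1, 0], 'G': [1, 1], 'H': [1, 1]})

# columns that contain at least one zero anywhere
bad = df.columns[df.eq(0).any()]
out = df.drop(columns=bad)
```

Keeping the intermediate `bad` index can be handy for logging which columns were removed.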
Try this:
Code:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 0, 1]})
for col in df.columns:
    if 0 in df[col].tolist():
        df = df.drop(columns=col)
df
I am interested in generating a string composed of pandas row and column data. Given the following pandas data frame, I want to generate a string only from columns with positive values.
index A B C
1 0 1 2
2 0 0 3
3 0 0 0
4 1 0 0
I would like to create a new column that appends a string that lists which columns in a row were positive. Then I would drop all of the rows that the data came from:
index Positives
1 B-1, C-2
2 C-3
4 A-1
Here is one way using pd.DataFrame.apply + pd.Series.apply:
df = pd.DataFrame([[1, 0, 1, 2], [2, 0, 0, 3], [3, 0, 0, 0], [4, 1, 0, 0]],
columns=['index', 'A', 'B', 'C'])
def formatter(x):
    x = x[x > 0]
    return x.index[1:].astype(str) + '-' + x[1:].astype(str)
df['Positives'] = df.apply(formatter, axis=1).apply(', '.join)
print(df)
index A B C Positives
0 1 0 1 2 B-1, C-2
1 2 0 0 3 C-3
2 3 0 0 0
3 4 1 0 0 A-1
If you need to filter out zero-length strings, you can use the fact that empty strings evaluate to False with bool:
res = df[df['Positives'].astype(bool)]
print(res)
index A B C Positives
0 1 0 1 2 B-1, C-2
1 2 0 0 3 C-3
3 4 1 0 0 A-1
I'd replace the zeros with np.nan to drop the entries you don't care about, then stack. Then form the strings you want and groupby(...).apply(list):
import numpy as np
df = df.set_index('index')  # if 'index' is not already your index
stacked = df.replace(0, np.nan).stack().reset_index()
stacked['Positives'] = stacked['level_1'] + '-' + stacked[0].astype(int).astype('str')
stacked = stacked.groupby('index').Positives.apply(list).reset_index()
stacked is now:
index Positives
0 1 [B-1, C-2]
1 2 [C-3]
2 4 [A-1]
Or if you just want one string and not a list, change the last line:
stacked.groupby('index').Positives.apply(lambda x: ', '.join(list(x))).reset_index()
# index Positives
#0 1 B-1, C-2
#1 2 C-3
#2 4 A-1
Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert the column at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0, 'name_of_column', '')
df['name_of_column'] = value
Edit:
You can also do it in one step:
df.insert(0, 'name_of_column', value)
df.insert(loc, column_name, value)
This works as long as no other column has the same name. If a column with the provided name already exists in the dataframe, insert raises a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column with an already-existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
    self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
  File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
    raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could extract the columns as a list, rearrange it as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come along...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this (only one line).
You can do this after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words instead of single letters, list('nlv') won't work, since list() splits a string into individual characters. Pass the column names explicitly instead:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add the line below and it should work
df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific position loc:
df['new_column'] = ...  # new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop())  # loc is the position you want to insert the column at
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
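As a runnable sketch of the routine above (using the DataFrame from the original question):

```python
import pandas as pd

df = pd.DataFrame({'l': ['a', 'b', 'c', 'd'], 'v': [1, 2, 1, 2]})

df['n'] = 0                   # new column lands at the end
col = df.columns.tolist()
col.insert(0, col.pop())      # move the last column name to position 0
df = df[col]
```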