Pandas select on multiple columns then replace - python

I am trying to select on multiple columns and then replace values in pandas.
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
Select the cells where any of a, b, c, d are non-zero:
i, j = np.where(df)
s=pd.Series(dict(zip(zip(i, j),
df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the length of the series differs from the number of rows in the dataframe.
Any idea on how I can do this?

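Use DataFrame.dot to take the matrix product of the 0/1 columns with the column names (each followed by a separator), strip the trailing separator with str.rstrip, and finally use numpy.where to replace empty strings with None. This assumes df holds only the columns a, b, c and d at this point:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print(df)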
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
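A self-contained sketch of the dot approach, assuming the question's frame is rebuilt by hand. The `cols` list and the `names` variable are introduced here for illustration only. Multiplying a boolean by a string yields either the string or an empty string, so each row's product concatenates the names of its non-zero columns:

```python
import pandas as pd

# Rebuild the question's frame; 'e' starts out as a placeholder string column.
df = pd.DataFrame(
    {"a": [0, 0, 1, 0], "b": [1, 0, 0, 0], "c": [1, 0, 0, 0], "d": [0, 1, 0, 0]}
)
df["e"] = "none"

cols = ["a", "b", "c", "d"]  # only the 0/1 columns take part in the product

# Compare with 0 first so that every non-zero value counts exactly once.
mask = df[cols].ne(0)

# Row-wise concatenation of "name, " for each True cell, minus the trailing separator.
names = mask.dot(pd.Index(cols) + ", ").str.rstrip(", ")

# Rows without any non-zero cell produce an empty string; keep 'none' for those.
df["e"] = names.mask(names.eq(""), "none")
print(df)
```

Restricting the product to `cols` keeps the string column `e` out of the multiplication, which would otherwise fail.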

You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df==1).apply(lambda x: df.columns[x], axis=1)\
.str.join(",").replace('','none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
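The length mismatch in the question arises because np.where yields one entry per non-zero cell rather than one per row. A minimal sketch of one way to finish that attempt, assuming the same sample data: label each hit with its row, join the names per row, and reindex so that rows without a hit come back. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 0, 1, 0], "b": [1, 0, 0, 0], "c": [1, 0, 0, 0], "d": [0, 1, 0, 0]}
)

# Row and column positions of every non-zero cell (row-major order).
i, j = np.where(df.to_numpy() != 0)

# One column name per non-zero cell, labelled by the row it came from.
s = pd.Series(df.columns.to_numpy()[j], index=df.index.to_numpy()[i])

# Join the names per row; rows without a hit are missing from this result.
joined = s.groupby(level=0).agg(", ".join)

# Reindex against the full row index so the result has one entry per row.
df["e"] = joined.reindex(df.index, fill_value="none")
print(df)
```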

Related

How to convert binary columns with multiple occurrences into categorical data in Pandas

I have the following example data set
A B C D
foo 0 1 1
bar 0 0 1
baz 1 1 0
How could I extract the column names of each 1 occurrence in a row and put them into another column E, so that I get the following table:
A B C D E
foo 0 1 1 C, D
bar 0 0 1 D
baz 1 1 0 B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
Alternatively, convert each row to boolean and use it as a mask to select the matching column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C

python pandas index of ones (1s) at row-wise

From a pandas DataFrame, how do I get the column labels of all the ones in each row?
My data frame has around a hundred columns. Here is an example:
a b c d
0 1 0 1 0
1 0 0 0 1
2 1 1 0 1
3 1 1 0 0
4 1 1 1 1
The expected result is
0 a,c
1 d
2 a,b,d
3 a,b
4 a,b,c,d
I found this question on Stack Overflow, "index of non "NaN" values in Pandas", but it works at the column level.
Thanks in advance.
If there are only 1 and 0 values, use DataFrame.dot for matrix multiplication with the column names plus a separator, then remove the trailing separator with Series.str.rstrip:
df['e'] = df.dot(df.columns + ', ').str.rstrip(', ')
# if other values besides 0 and 1 exist, compare with 1 first
#df['e'] = df.eq(1).dot(df.columns + ', ').str.rstrip(', ')
print(df)
a b c d e
0 1 0 1 0 a, c
1 0 0 0 1 d
2 1 1 0 1 a, b, d
3 1 1 0 0 a, b
4 1 1 1 1 a, b, c, d
To get the result as a separate Series instead:
s = df.dot(df.columns + ', ').str.rstrip(', ')
print (s)
0 a, c
1 d
2 a, b, d
3 a, b
4 a, b, c, d
dtype: object
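Try stacking the frame, keeping the entries equal to 1, and joining the column names per row:
df = df.stack()
# after reset_index, 'level_1' holds the column names; select it so the 0/1 values are not aggregated
df = df.loc[df.eq(1)].reset_index(level=1).groupby(level=0)[['level_1']].agg(', '.join)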
Outputs:
level_1
0 a, c
1 d
2 a, b, d
3 a, b
4 a, b, c, d
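One caveat of the stack approach is that a row without any 1 has no entries left after filtering, so it is absent from the grouped result. A small sketch of restoring such rows, assuming a hypothetical three-row frame whose middle row is all zeros (this frame is not from the question):

```python
import pandas as pd

# Hypothetical data: the middle row contains no 1 at all.
df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 0, 1], "c": [1, 0, 1]})

stacked = df.stack()                # Series indexed by (row label, column name)
hits = stacked[stacked.eq(1)]       # keep only the cells equal to 1

# Move the column-name level into a regular column called 'level_1'.
names = hits.reset_index(level=1)["level_1"]

# Join the names per row; row 1 is missing because it had no hits.
joined = names.groupby(level=0).agg(", ".join)

# Reindex against the original rows and fill the gaps with an empty string.
result = joined.reindex(df.index, fill_value="")
print(result)
```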

How to delete row when iterating into pandas Dataframe column?

This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time, I need to select two columns and delete every row in which both values are 0. For example, I select A and B:
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
Then I select A and C, and so on.
I used this code for A and B, but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a=df['A']
b=df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a==0) & (b==0)] :
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True )
Any help please!
First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
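For reference, a self-contained sketch of the mask-based filtering, assuming the CSV contents from the question are typed in directly instead of being read from Book1.csv. The helper name `drop_all_zero_rows` is made up for this example.

```python
from itertools import combinations

import pandas as pd

# The question's data, entered inline rather than loaded from the CSV file.
df = pd.DataFrame(
    {
        "A": [0, 0, 1, 0, 0],
        "B": [1, 0, 1, 0, 0],
        "C": [0, 0, 1, 0, 7],
        "D": [0, 0, 0, 0, 0],
        "J": [0, 0, 0, 0, 7],
    }
)


def drop_all_zero_rows(frame, cols):
    """Return frame[cols] without the rows where every selected column is 0."""
    cols = list(cols)
    keep = frame[cols].ne(0).any(axis=1)  # True where at least one value is non-zero
    return frame.loc[keep, cols]


# One filtered frame per pair of columns, keyed by the pair.
pairs = {pair: drop_all_zero_rows(df, pair) for pair in combinations(df.columns, 2)}

print(pairs[("A", "B")])
print(pairs[("C", "J")])
```

Because the mask is computed on the original frame each time, no rows are ever deleted in place, and the original row labels survive in every result.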

Changing dummy variable value from 1 to column name, and then creating a list that I can compare rows with

I have a dataframe that looks like this :
A B C
1 0 0
1 1 0
0 1 0
0 0 1
I want to replace all values with the respective column name, so that the data looks like:
A B C
A 0 0
A B 0
0 B 0
0 0 C
Afterwards, I want to create a column that is a list of all column values like so:
A B C D
A 0 0 ['A','0','0']
A B 0 ['A','B','0']
0 B 0 ['0','B','0']
0 0 C ['0','0','C']
Finally, I want to group by column D and count the number of occurrences for each pattern.
You can do this with mul:
df.mul(df.columns).replace('',0)
Out[63]:
A B C
0 A 0 0
1 A B 0
2 0 B 0
3 0 0 C
#df['D']=df.mul(df.columns).replace('',0).values.tolist()
There must be cleaner ways to achieve this, but you can use:
for column in df:
    df[column] = df[column].astype(str).replace("1", column)
df["D"] = df.values.tolist()
Output:
A B C D
0 A 0 0 [A, 0, 0]
1 A B 0 [A, B, 0]
2 0 B 0 [0, B, 0]
3 0 0 C [0, 0, C]
PS: W-B's answer is the cleaner way.
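Neither answer covers the last step of the question, counting how often each pattern occurs. Lists are not hashable, so a column of lists cannot be grouped directly. A possible sketch, assuming tuples are an acceptable stand-in for the lists and that a fifth row (a repeat of the second one) is added purely so the counts are not all 1:

```python
import pandas as pd

# The question's data plus one hypothetical duplicate row (the last one).
df = pd.DataFrame({"A": [1, 1, 0, 0, 1], "B": [0, 1, 1, 0, 1], "C": [0, 0, 0, 1, 0]})

# Replace each 1 with its column name and each 0 with the string "0".
labelled = df.apply(lambda col: col.map({1: col.name, 0: "0"}))

# Store each row's pattern as a tuple, which is hashable and can be counted.
patterns = list(labelled.itertuples(index=False, name=None))
df["D"] = pd.Series(patterns, index=df.index)

# Number of occurrences of each pattern.
counts = df["D"].value_counts()
print(counts)
```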

Changing values in multiple columns of a pandas DataFrame using known column values

Suppose I have a dataframe like this:
Knownvalue A B C D E F G H
17.3413 0 0 0 0 0 0 0 0
33.4534 0 0 0 0 0 0 0 0
What I want to do is this: when Knownvalue is between 0 and 10, A is changed from 0 to 1; when Knownvalue is between 10 and 20, B is changed from 0 to 1; and so on.
It should be like this after changing:
Knownvalue A B C D E F G H
17.3413 0 1 0 0 0 0 0 0
33.4534 0 0 0 1 0 0 0 0
Anyone know how to apply a method to change it?
I first bucket the Knownvalue Series into a list of integers equal to its truncated value divided by ten (e.g. 27.87 // 10 = 2). These buckets represent the integer for the desired column location. Because the Knownvalue is in the first column, I add one to these values.
Next, I enumerate these bin values, which effectively gives me pairs of row and column integer positions. I use iat to set the values at these locations to 1.
import pandas as pd
import numpy as np
# Create some sample data.
df_vals = pd.DataFrame({'Knownvalue': np.random.random(5) * 50})
df = pd.concat([df_vals, pd.DataFrame(np.zeros((5, 5)), columns=list('ABCDE'))], axis=1)
# Create desired column locations based on the `Knownvalue`.
bins = (df.Knownvalue // 10).astype('int').tolist()
>>> bins
[4, 3, 0, 1, 0]
# Set these locations equal to 1.
for idx, col in enumerate(bins):
    df.iat[idx, col + 1] = 1  # The first column is the `Knownvalue`, hence col + 1
>>> df
Knownvalue A B C D E
0 47.353937 0 0 0 0 1
1 37.460338 0 0 0 1 0
2 3.797964 1 0 0 0 0
3 18.323131 0 1 0 0 0
4 7.927030 1 0 0 0 0
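The loop over iat can also be written without explicit iteration. The following is a sketch using NumPy integer indexing rather than the answer's own code, and it assumes every Knownvalue lies in the range 0 to 80 so that the bucket always maps to one of the eight columns A to H:

```python
import numpy as np
import pandas as pd

# The two sample values from the question.
df = pd.DataFrame({"Knownvalue": [17.3413, 33.4534]})
letters = list("ABCDEFGH")

# Bucket number per row: 0 for [0, 10), 1 for [10, 20), and so on.
bins = (df["Knownvalue"] // 10).astype(int).to_numpy()

# Start from all zeros and set one cell per row in a single indexing step.
flags = np.zeros((len(df), len(letters)), dtype=int)
flags[np.arange(len(df)), bins] = 1

# Attach the indicator columns next to the original values.
df = df.join(pd.DataFrame(flags, columns=letters, index=df.index))
print(df)
```

A value of 80 or more would raise an IndexError here, so real data may need clipping or validation first.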
A different approach would be to reconstruct the frame from the Knownvalue column using get_dummies:
>>> import string
>>> new_cols = pd.get_dummies(df["Knownvalue"]//10, dtype=int).reindex(columns=range(8), fill_value=0)
>>> new_cols.columns = list(string.ascii_uppercase)[:len(new_cols.columns)]
>>> pd.concat([df[["Knownvalue"]], new_cols], axis=1)
Knownvalue A B C D E F G H
0 17.3413 0 1 0 0 0 0 0 0
1 33.4534 0 0 0 1 0 0 0 0
get_dummies does the hard work:
>>> (df.Knownvalue//10)
0 1
1 3
Name: Knownvalue, dtype: float64
>>> pd.get_dummies((df.Knownvalue//10))
1 3
0 1 0
1 0 1
