How to fill True values of a DataFrame with column names? - python

I have a DataFrame with True and False values.
A B C D
0 False True True False
1 False False True False
2 True True False False
I want to replace the True values with the column name and the False values with 0. How can I do that?
i.e To get the result as
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0

First convert the booleans to int, then use mask, or where with the mask inverted by ~:
df = df.astype(int).mask(df, df.columns.to_series(), axis=1)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
df = df.astype(int).where(~df, df.columns.to_series(), axis=1)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
Thanks to John Galt for an improvement for newer pandas versions (0.21.x):
df = df.astype(int).mask(df, df.columns, axis=1)
numpy solution:
a = np.tile(df.columns, [len(df.index),1])
print (a)
[['A' 'B' 'C' 'D']
['A' 'B' 'C' 'D']
['A' 'B' 'C' 'D']]
df = pd.DataFrame(np.where(df.astype(int), a, 0), columns=df.columns, index = df.index)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0

Update for pandas 1.5.2:
df = df.mask(df.astype(bool), df.columns.to_series(), axis=1)
Use astype(bool) or astype('bool'), not astype(int), otherwise mask raises ValueError: Boolean array expected for the condition, not uint8.
Also, .to_series() can't be removed, otherwise it raises ValueError: other must be the same shape as self when an ndarray.
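Putting those notes together, a minimal end-to-end sketch that should work on recent pandas (the sample frame is reconstructed from the question, so treat it as illustrative):
import pandas as pd

df = pd.DataFrame({'A': [False, False, True],
                   'B': [True, False, True],
                   'C': [True, True, False],
                   'D': [False, False, False]})

# the condition here is already boolean, so mask() accepts it directly
out = df.astype(int).mask(df, df.columns.to_series(), axis=1)
print (out)
   A  B  C  D
0  0  B  C  0
1  0  0  C  0
2  A  B  0  0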

Related

How do I delete a column that contains only zeros from a given row in pandas

I've found how to remove columns that are zero for all rows using df.loc[:, (df != 0).any(axis=0)], but I need to do the same for a given row number.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
For row 0 I want only the columns with non-zeros, so I would expect the result:
a b
0 1 1
And for the row 1 get:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and need a better way to visualize its contents.
Below is pseudo-code showing what I need:
for i in range(len(df.index)):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
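If the goal is the per-row printout from the UPDATE, the melted frame can be grouped and printed row by row; a sketch (column names follow the melt call above):
for idx, g in df.groupby('index'):
    print ('Row:', idx)
    print (g[['columns', 'value']].to_string(index=False))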
If you need a new column with the matching column names joined by ', ', compare values for inequality with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x != 0].to_frame().T)

df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(get, axis=1)
print(df)
If you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
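For reference, the same result can likely be had in one line, assuming given_row as defined above; a sketch:
# select the given row, keeping only the columns that are non-zero there
df_filtered = df.loc[[given_row], df.loc[given_row] != 0]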

Pandas select on multiple columns then replace

I am trying to do a multiple column select then replace in pandas
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
Select where any of a, b, c, d are non-zero:
i, j = np.where(df)
s = pd.Series(dict(zip(zip(i, j),
                       df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the length of the series differs from the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for a product with the column names, add rstrip, and finally use numpy.where to replace empty strings with None:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df == 1).apply(lambda x: df.columns[x], axis=1)\
                   .str.join(",").replace('', 'none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none

Move data from one column to one of two others based on a fourth column in pandas

So in Pandas I have the following dataframe
A B C D
0 X
1 Y
0 Y
1 Y
0 X
1 X
I want to move the value in A to either C or D depending on B. The output should be something like this;
A B C D
0 X 0
1 Y 1
0 Y 0
1 Y 1
0 X 0
1 X 1
I've tried using multiple where statements like
df['C'] = np.where(str(df.B).find('X'), df.A, '')
df['D'] = np.where(str(df.B).find('Y'), df.A, '')
But that results in;
A B C D
0 X 0 0
1 Y 1 1
0 Y 0 0
1 Y 1 1
0 X 0 0
1 X 1 1
So I guess it's checking if the value exists in the column at all, which makes sense. Do I need to iterate row by row?
Don't convert to str and use find, because str(df.B) is a single string and find returns a scalar; 0 converts to False and any other integer to True:
print (str(df.B).find('X'))
5
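Because that 5 is a single truthy scalar, np.where broadcasts it over the whole column, so every row receives the value from df.A; a quick sketch of the failure:
print (np.where(str(df.B).find('X'), df.A, ''))
# the condition is one truthy scalar, so this returns all of df.A
# (cast to strings) instead of a per-row selection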
Simplest is to compare values to get a boolean Series:
print (df.B == 'X')
0 True
1 False
2 False
3 False
4 True
5 True
Name: B, dtype: bool
df['C'] = np.where(df.B == 'X', df.A, '')
df['D'] = np.where(df.B == 'Y', df.A, '')
Another solution with assign + where:
df = df.assign(C=df.A.where(df.B == 'X', ''),
               D=df.A.where(df.B == 'Y', ''))
And if you need to check substrings, use str.contains:
df['C'] = np.where(df.B.str.contains('X'), df.A, '')
df['D'] = np.where(df.B.str.contains('Y'), df.A, '')
Or:
df['C'] = df.A.where(df.B.str.contains('X'), '')
df['D'] = df.A.where(df.B.str.contains('Y'), '')
All return:
print (df)
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
Using slice assignment
n = len(df)
f, u = pd.factorize(df.B.values)
a = np.empty((n, 2), dtype=object)
a.fill('')
a[np.arange(n), f] = df.A.values
df.loc[:, ['C', 'D']] = a
df
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
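One caveat with the factorize approach: pd.factorize numbers labels in order of first appearance, so column C receives the 'X' values here only because 'X' happens to appear first. A sketch of an order-independent variant with an explicit mapping (the col_of dict is illustrative, not part of the original answer):
import numpy as np

n = len(df)
col_of = {'X': 0, 'Y': 1}  # explicit label -> target column position
a = np.empty((n, 2), dtype=object)
a.fill('')
a[np.arange(n), df.B.map(col_of).to_numpy()] = df.A.values
df.loc[:, ['C', 'D']] = a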

How to efficiently rearrange pandas data as follows?

I need some help with a concise and, above all, efficient pandas formulation of the following operation:
Given a data frame of the format
id a b c d
1 0 -1 1 1
42 0 1 0 0
128 1 -1 0 1
Construct a data frame of the format:
id one_entries
1 "c d"
42 "b"
128 "a d"
That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.
Here's one way, using a boolean mask and applying a lambda function.
In [58]: df
Out[58]:
id a b c d
0 1 0 -1 1 1
1 42 0 1 0 0
2 128 1 -1 0 1
In [59]: cols = list('abcd')
In [60]: (df[cols] > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
Out[60]:
0 c d
1 b
2 a d
dtype: object
You can assign the result to df['one_entries'].
Details of the apply function. Take the first row:
In [83]: x = df[cols].iloc[0] > 0
In [84]: x
Out[84]:
a False
b False
c True
d True
Name: 0, dtype: bool
x gives you boolean values for the row, marking values greater than zero. x[x] returns only the True entries, essentially a series with the matching column names as its index.
In [85]: x[x]
Out[85]:
c True
d True
Name: 0, dtype: bool
x[x].index gives you the column names.
In [86]: x[x].index
Out[86]: Index([u'c', u'd'], dtype='object')
Same reasoning as John Galt's, but a bit shorter, constructing a new DataFrame from a dict.
pd.DataFrame({
    'one_entries': (test_df > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
})
# one_entries
# 1 c d
# 42 b
# 128 a d
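For larger frames, the matrix-multiplication trick from the earlier answers avoids a Python-level apply; a sketch under the same setup (cols as defined above):
df['one_entries'] = (df[cols] > 0).dot(df[cols].columns + ' ').str.rstrip()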

select identical entries in two pandas dataframe

I have two dataframes; (a, b, c, d) and (i, j, k) are their column names.
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the entries where df1 matches df2.
I want to obtain
df1=
a b c
0 1 2
0 1 2
0 1 2
You can use isin to compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
    df = df1.isin(df2.iloc[:, i])
    dfs.append(df)
Then concat all the masks together and reduce them per cell with any:
mask = pd.concat(dfs).groupby(level=0).any()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.loc[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2
Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
You get only those rows from df1 where the values of the first three columns are identical to those of df2.
Then you just select the columns 'a', 'b', 'c' from df1.
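If the columns to compare are not known in advance, the same check can be written generically, assuming the first len(df2.columns) columns of df1 line up positionally with df2 (a sketch):
cols = df1.columns[:len(df2.columns)]
match = (df1[cols].to_numpy() == df2.to_numpy()).all(axis=1)
df1 = df1.loc[match, cols]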
