I have a DataFrame with True and False values.
A B C D
0 False True True False
1 False False True False
2 True True False False
I want to fill the True values with the column names and the False values with 0. How can I do that?
I.e., to get this result:
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
First convert the booleans to int, then use mask, or where with the mask inverted by ~:
df = df.astype(int).mask(df, df.columns.to_series(), axis=1)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
df = df.astype(int).where(~df, df.columns.to_series(), axis=1)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
Thanks to John Galt for an improvement for newer versions of pandas (0.21.x):
df = df.astype(int).mask(df, df.columns, axis=1)
numpy solution:
a = np.tile(df.columns, [len(df.index),1])
print (a)
[['A' 'B' 'C' 'D']
['A' 'B' 'C' 'D']
['A' 'B' 'C' 'D']]
df = pd.DataFrame(np.where(df.astype(int), a, 0), columns=df.columns, index = df.index)
print (df)
A B C D
0 0 B C 0
1 0 0 C 0
2 A B 0 0
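As a side note, the np.tile step is optional: NumPy broadcasting compares the 2-D boolean mask against the 1-D column array directly. A minimal sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [False, False, True],
                   'B': [True, False, True],
                   'C': [True, True, False],
                   'D': [False, False, False]})

# np.where broadcasts the length-4 column array across each row of the mask,
# so the explicit np.tile copy is not needed.
out = pd.DataFrame(np.where(df, df.columns, 0),
                   columns=df.columns, index=df.index)
print(out)
```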
For pandas 1.5.2:
df = df.mask(df.astype(bool), df.columns.to_series(), axis=1)
Use astype(bool) (or astype('bool')) for the condition, not astype(int); otherwise you get ValueError: Boolean array expected for the condition, not uint8.
Also, .to_series() cannot be removed; otherwise you get ValueError: other must be the same shape as self when an ndarray.
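Putting the pieces together, a minimal runnable sketch on recent pandas (data rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [False, False, True],
                   'B': [True, False, True],
                   'C': [True, True, False],
                   'D': [False, False, False]})

# mask() replaces values where the condition is True: the int frame supplies
# the zeros, the boolean frame is the condition, and the column names
# (as a Series, aligned along axis=1) are the replacement values.
out = df.astype(int).mask(df, df.columns.to_series(), axis=1)
print(out)
```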
Related
I've found how to remove columns that are zero for all rows using the command df.loc[:, (df != 0).any(axis=0)], and I need to do the same for a given row number.
For example, for the following df:
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
Give me the columns with non-zeros for row 0, and I would expect the result:
a b
0 1 1
And for row 1:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and need to visualize its result better.
Below a pseudo-code trying to show what I need
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
Best Regards.
Kleyson Rios.
You can change data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
If you need a new column with the matching names joined by ", ", compare values for not-equal-to-zero with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
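For completeness, the one-liner above as a self-contained sketch (frame taken from the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

# In the dot product, each True cell contributes its column name string,
# and string addition concatenates them row-wise.
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print(df)
```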
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x!=0].to_frame().T)
df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])
def get(row):
    return list(df.columns[row.ne(0)])
df['non zero column'] = df.apply(lambda x: get(x), axis=1)
print(df)
If you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
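A self-contained version of the snippet above, using the question's frame and a chosen given_row (a sketch; the boolean row mask is built directly rather than via apply):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

given_row = 1
mask_row = df.loc[given_row] != 0                # boolean Series for that row
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df.loc[[given_row], cols_to_keep]  # keep only the given row
print(df_filtered)
```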
I am trying to do a multiple-column select, then replace, in pandas.
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
Select where any of a, b, c, d are non-zero:
i, j = np.where(df)
s = pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the lengths of the series are different to the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for the product with the column names, add rstrip, and finally use numpy.where to replace empty strings with None:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
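The same answer as a runnable sketch (frame rebuilt from the question's output, without the pre-filled e column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 0, 0, 0],
                   'c': [1, 0, 0, 0], 'd': [0, 1, 0, 0]})

# 1 * 'b, ' repeats the string once and 0 * 'b, ' gives '', so the dot
# product concatenates the names of the non-zero columns per row.
e = df.dot(df.columns + ', ').str.rstrip(', ')
# The all-zero row produces an empty string, which is falsy.
df['e'] = np.where(e.astype(bool), e, None)
print(df)
```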
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df==1).apply(lambda x: df.columns[x], axis=1)\
.str.join(",").replace('','none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
So in Pandas I have the following dataframe
A B C D
0 X
1 Y
0 Y
1 Y
0 X
1 X
I want to move the value in A to either C or D depending on B. The output should be something like this;
A B C D
0 X 0
1 Y 1
0 Y 0
1 Y 1
0 X 0
1 X 1
I've tried using multiple where statements like
df['C'] = np.where(str(df.B).find('X'), df.A, '')
df['D'] = np.where(str(df.B).find('Y'), df.A, '')
But that results in;
A B C D
0 X 0 0
1 Y 1 1
0 Y 0 0
1 Y 1 1
0 X 0 0
1 X 1 1
So I guess it's checking if the value exists in the column at all, which makes sense. Do I need to iterate row by row?
Don't convert to str and use find, because it returns a scalar position; 0 converts to False and other integers to True:
print (str(df.B).find('X'))
5
Simplest is to compare values to get a boolean Series:
print (df.B == 'X')
0 True
1 False
2 False
3 False
4 True
5 True
Name: B, dtype: bool
df['C'] = np.where(df.B == 'X', df.A, '')
df['D'] = np.where(df.B == 'Y', df.A, '')
Another solution with assign + where:
df = df.assign(C=df.A.where(df.B == 'X', ''),
D=df.A.where(df.B == 'Y', ''))
And if you need to check substrings, use str.contains:
df['C'] = np.where(df.B.str.contains('X'), df.A, '')
df['D'] = np.where(df.B.str.contains('Y'), df.A, '')
Or:
df['C'] = df.A.where(df.B.str.contains('X'), '')
df['D'] = df.A.where(df.B.str.contains('Y'), '')
All return:
print (df)
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
Using slice assignment
n = len(df)
f, u = pd.factorize(df.B.values)
a = np.empty((n, 2), dtype=object)
a.fill('')
a[np.arange(n), f] = df.A.values
df.loc[:, ['C', 'D']] = a
df
A B C D
0 0 X 0
1 1 Y 1
2 0 Y 0
3 1 Y 1
4 0 X 0
5 1 X 1
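A commented, self-contained sketch of the factorize approach (same assumed data; note it relies on 'X' appearing before 'Y' so the codes line up with ['C', 'D']):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 1, 0, 1],
                   'B': ['X', 'Y', 'Y', 'Y', 'X', 'X']})

n = len(df)
# factorize assigns codes in order of first appearance: 'X' -> 0, 'Y' -> 1,
# which here doubles as the target column offset within ['C', 'D'].
f, u = pd.factorize(df.B.values)
a = np.empty((n, 2), dtype=object)
a.fill('')
a[np.arange(n), f] = df.A.values     # scatter A into column C or D
df[['C', 'D']] = a
print(df)
```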
I need some help with a concise and, above all, efficient formulation in pandas of the following operation:
Given a data frame of the format
id a b c d
1 0 -1 1 1
42 0 1 0 0
128 1 -1 0 1
Construct a data frame of the format:
id one_entries
1 "c d"
42 "b"
128 "a d"
That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.
Here's one way, using a boolean mask and applying a lambda function.
In [58]: df
Out[58]:
id a b c d
0 1 0 -1 1 1
1 42 0 1 0 0
2 128 1 -1 0 1
In [59]: cols = list('abcd')
In [60]: (df[cols] > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
Out[60]:
0 c d
1 b
2 a d
dtype: object
You can assign the result to df['one_entries'].
Details of the apply function, taking the first row:
In [83]: x = df[cols].ix[0] > 0
In [84]: x
Out[84]:
a False
b False
c True
d True
Name: 0, dtype: bool
x gives you the boolean values for the row (values greater than zero). x[x] returns only the True entries, essentially a Series with the column names as index.
In [85]: x[x]
Out[85]:
c True
d True
Name: 0, dtype: bool
x[x].index gives you the column names.
In [86]: x[x].index
Out[86]: Index([u'c', u'd'], dtype='object')
Same reasoning as John Galt's, but a bit shorter, constructing a new DataFrame from a dict.
pd.DataFrame({
'one_entries': (test_df > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
})
# one_entries
# 1 c d
# 42 b
# 128 a d
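Following the same dot trick used in earlier answers on this page, the apply can be avoided entirely (a sketch; the frame and the id column layout are assumptions from the question):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 42, 128],
                   'a': [0, 0, 1], 'b': [-1, 1, -1],
                   'c': [1, 0, 0], 'd': [1, 0, 1]})

cols = ['a', 'b', 'c', 'd']
# The boolean frame dotted with 'name ' strings concatenates, per row,
# the names of the columns whose value is strictly positive.
df['one_entries'] = (df[cols] > 0).dot(df[cols].columns + ' ').str.rstrip()
print(df[['id', 'one_entries']])
```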
I have two dataframes; (a, b, c, d) and (i, j, k) are their column names.
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the columns of df1 whose values appear in df2.
I want to obtain
df1=
a b c
0 1 2
0 1 2
0 1 2
You can use isin to compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
    df = df1.isin(df2.iloc[:,i])
    dfs.append(df)
Then concat all masks together:
mask = pd.concat(dfs).groupby(level=0).sum()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.loc[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2
Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
You get only those rows from df1 where the values of the first three columns are identical to those of df2.
Then you just select the columns 'a', 'b', 'c' from df1.
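A compact, self-contained variant of the membership check (data rebuilt from the question; flattening df2 into one pool of values is an assumption that matches this example):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1],
                    'c': [2, 2, 2], 'd': [3, 3, 3]})
df2 = pd.DataFrame({'i': [0, 0, 0], 'j': [1, 1, 1], 'k': [2, 2, 2]})

# Keep a column of df1 only if all of its values occur somewhere in df2.
mask = df1.apply(lambda col: col.isin(df2.values.ravel()).all())
res = df1.loc[:, mask]
print(res)
```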