How to efficiently rearrange pandas data as follows? - python

I need some help with a concise and first of all efficient formulation in pandas of the following operation:
Given a data frame of the format
id    a   b   c   d
1     0  -1   1   1
42    0   1   0   0
128   1  -1   0   1
Construct a data frame of the format:
id   one_entries
1    "c d"
42   "b"
128  "a d"
That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.

Here's one way, using a boolean mask and applying a lambda function.
In [58]: df
Out[58]:
    id  a  b  c  d
0    1  0 -1  1  1
1   42  0  1  0  0
2  128  1 -1  0  1

In [59]: cols = list('abcd')

In [60]: (df[cols] > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
Out[60]:
0    c d
1      b
2    a d
dtype: object
You can assign the result back with df['one_entries'] = ....
Details of the apply function, taking the first row as an example:
In [83]: x = df[cols].iloc[0] > 0

In [84]: x
Out[84]:
a    False
b    False
c     True
d     True
Name: 0, dtype: bool
x gives you the boolean values for the row (values greater than zero). x[x] returns only the True entries, essentially a Series whose index is the matching column names.
In [85]: x[x]
Out[85]:
c    True
d    True
Name: 0, dtype: bool
x[x].index gives you the column names.
In [86]: x[x].index
Out[86]: Index(['c', 'd'], dtype='object')
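If the frame is large and the row-wise apply becomes a bottleneck, the same boolean mask can be fed to DataFrame.dot for a fully vectorized version; a sketch, assuming the same df and cols as above:
# Each True cell contributes its column name (with a trailing space); dot
# concatenates them row-wise, and rstrip drops the final space.
df['one_entries'] = (df[cols] > 0).dot(pd.Index(cols) + ' ').str.rstrip()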

Same reasoning as John Galt's, but a bit shorter, constructing a new DataFrame from a dict (assuming test_df is the original frame with id as its index):
pd.DataFrame({
    'one_entries': (test_df > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
})
#     one_entries
# 1           c d
# 42            b
# 128         a d

Related

How do I delete a column that contains only zeros from a given row in pandas

I've found how to remove columns that contain only zeros across all rows using df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
   a  b  c  d
0  1  1  0  0
1  1  0  1  0
Selecting the columns with non-zero values for row 0, I would expect the result:
   a  b
0  1  1
And for row 1:
   a  c
1  1  1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need a better way to visualize its contents.
Below is pseudo-code showing what I need:
for i in range(len(df.index)):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)   # pseudo-code: drop the zero columns
    print('Row: ', i)
    print(_df)
Result:
Row: 0
   a  b
0  1  1
Row: 1
   a  c   f
1  1  5  10
Row: 2
    e
2  20
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
   index columns  value
0      0       a      1
1      1       a      1
2      0       b      1
5      1       c      1
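From this long format it is then easy to collect, for example, the non-zero column names per original row (a small sketch using the melted frame above):
print (df.groupby('index')['columns'].apply(list))
index
0    [a, b]
1    [a, c]
Name: columns, dtype: object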
If you need a new column with the matching names joined by ', ', compare values for inequality with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
   a  b  c  d   new
0  1  1  0  0  a, b
1  1  0  1  0  a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x != 0].to_frame().T)

df.apply(f, axis=1)

Row 0
   a  b
0  1  1
Row 1
   a  c
1  1  1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(get, axis=1)
print(df)
If you want a one-liner, use this instead:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
Output:
   a  b  c  d non zero column
0  1  1  0  0          [a, b]
1  1  0  1  0          [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1

mask_all_rows = df.apply(lambda x: x != 0, axis=0)   # boolean mask of the whole frame
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
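For just one row, plain boolean indexing on that row itself is a compact alternative; a sketch, assuming the original df from the question:
given_row = 1
# Select the given row, keeping only the columns whose value there is non-zero.
df.loc[[given_row], df.loc[given_row].ne(0)]
#    a  c
# 1  1  1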

How can I tell if a dataframe is of mixed type?

I want to assign values to the diagonal of a dataframe. The fastest way I can think of is to use numpy's np.diag_indices and do a slice assignment on the values array. However, the values array is only a view, ready to accept assignment, when the dataframe is of a single dtype.
Consider the dataframes d1 and d2
d1 = pd.DataFrame(np.ones((3, 3), dtype=int), columns=['A', 'B', 'C'])
d2 = pd.DataFrame(dict(A=[1, 1, 1], B=[1., 1., 1.], C=[1, 1, 1]))
d1
   A  B  C
0  1  1  1
1  1  1  1
2  1  1  1
d2
   A    B  C
0  1  1.0  1
1  1  1.0  1
2  1  1.0  1
Then let's get our indices:
i, j = np.diag_indices(3)
d1 is of a single dtype, and therefore this works:
d1.values[i, j] = 0
d1
   A  B  C
0  0  1  1
1  1  0  1
2  1  1  0
But not on d2:
d2.values[i, j] = 0
d2
   A    B  C
0  1  1.0  1
1  1  1.0  1
2  1  1.0  1
I need to write a function and make it fail when df is of mixed dtype. How do I test for that? And if it is not mixed, should I trust that this assignment via the view will always work?
You could use the internal _is_mixed_type attribute:
In [3600]: d2._is_mixed_type
Out[3600]: True
In [3601]: d1._is_mixed_type
Out[3601]: False
Or, check the number of unique dtypes:
In [3602]: d1.dtypes.nunique()>1
Out[3602]: False
In [3603]: d2.dtypes.nunique()>1
Out[3603]: True
As a bit of a detour, _is_mixed_type reflects how the underlying blocks are consolidated (note that .blocks and .as_blocks() were deprecated and removed in later pandas versions):
In [3618]: len(d1.blocks)>1
Out[3618]: False
In [3619]: len(d2.blocks)>1
Out[3619]: True
In [3620]: d1.blocks # same as d1.as_blocks()
Out[3620]:
{'int32': A B C
0 0 1 1
1 1 0 1
2 1 1 0}
In [3621]: d2.blocks
Out[3621]:
{'float64': B
0 1.0
1 1.0
2 1.0, 'int64': A C
0 1 1
1 1 1
2 1 1}
def check_type(df):
    return len(set(df.dtypes)) == 1
or
def check_type(df):
    return df.dtypes.nunique() == 1
You can inspect DataFrame.dtypes to check the types of the columns. For instance:
>>> d1.dtypes
A int64
B int64
C int64
dtype: object
>>> d2.dtypes
A int64
B float64
C int64
dtype: object
Given that there is at least one column, you can thus check this with:
np.all(d1.dtypes == d1.dtypes.iloc[0])
For your dataframes:
>>> np.all(d1.dtypes == d1.dtypes.iloc[0])
True
>>> np.all(d2.dtypes == d2.dtypes.iloc[0])
False
You can of course first check whether there is at least one column. So we can construct a function:
def all_columns_same_type(df):
    dtypes = df.dtypes
    return not dtypes.empty and np.all(dtypes == dtypes.iloc[0])
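Putting the check to use, a minimal sketch of a setter that refuses mixed-dtype frames (the fill_diagonal name is just for illustration; note also that Copy-on-Write in newer pandas can make .values a read-only copy even for a single dtype):
import numpy as np

def fill_diagonal(df, value):
    # With mixed dtypes, .values is a freshly converted copy, so the
    # slice assignment below would be silently lost.
    if df.dtypes.nunique() > 1:
        raise TypeError('mixed-dtype DataFrame: cannot assign through .values')
    i, j = np.diag_indices(min(df.shape))
    df.values[i, j] = value

fill_diagonal(d1, 0)   # works: d1 has a single dtype
fill_diagonal(d2, 0)   # raises TypeError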

How to fill true values of a dataframe with column names?

I have a DataFrame with True and False values.
       A      B      C      D
0  False   True   True  False
1  False  False   True  False
2   True   True  False  False
I want to fill the true values with column names and false values with 0. How can I do that?
i.e. to get the result:
   A  B  C  D
0  0  B  C  0
1  0  0  C  0
2  A  B  0  0
First convert the booleans to int and then use mask, or where with the mask inverted by ~:
df = df.astype(int).mask(df, df.columns.to_series(), axis=1)
print (df)
   A  B  C  D
0  0  B  C  0
1  0  0  C  0
2  A  B  0  0
df = df.astype(int).where(~df, df.columns.to_series(), axis=1)
print (df)
   A  B  C  D
0  0  B  C  0
1  0  0  C  0
2  A  B  0  0
Thanks to John Galt for this improvement in pandas 0.21.x and newer:
df = df.astype(int).mask(df, df.columns, axis=1)
A numpy solution:
a = np.tile(df.columns, [len(df.index), 1])
print (a)
[['A' 'B' 'C' 'D']
 ['A' 'B' 'C' 'D']
 ['A' 'B' 'C' 'D']]
df = pd.DataFrame(np.where(df.astype(int), a, 0), columns=df.columns, index=df.index)
print (df)
   A  B  C  D
0  0  B  C  0
1  0  0  C  0
2  A  B  0  0
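The tile step is not strictly necessary, since np.where broadcasts the 1-D array of column names across the rows; a slightly shorter sketch of the same idea, starting again from the original boolean df:
# df.columns has shape (4,) and broadcasts against df's shape (3, 4).
pd.DataFrame(np.where(df, df.columns, 0), columns=df.columns, index=df.index)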
Update for pandas 1.5.2:
df = df.mask(df.astype(bool), df.columns.to_series(), axis=1)
Use astype(bool) or astype('bool') rather than astype(int); otherwise you get
ValueError: Boolean array expected for the condition, not uint8
The .to_series() call also cannot be removed; otherwise you get
ValueError: other must be the same shape as self when an ndarray

pandas merge columns to a single time series

I have a data frame with 3 boolean columns:
       A      B      C
0   True  False  False
1  False   True  False
2   True    NaN  False
3  False  False   True
...
Only one column is True in each row, but there can be NaN.
I would like to get a list of column names where the name is chosen based on the boolean. So for the example above:
['A', 'B', 'A', 'C']
It's a simple matrix operation; I'm just not sure how to map it to pandas...
You can use the mul operator between the dataframe and its columns. That results in the True cells containing their column name and the False cells being empty. Then you can just sum along the rows:
df.mul(df.columns).sum(axis=1)
Out[44]:
0    A
1    B
2    A
3    C
dtype: object
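Since at most one column is True per row, DataFrame.idxmax is another option: it returns the label of the first True in each row. A sketch, filling the NaN with False first so the frame is cleanly boolean:
df.fillna(False).astype(bool).idxmax(axis=1).tolist()
# ['A', 'B', 'A', 'C']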
You can index the column names, i.e. df.columns, with the proper integer indexes:
>>> import numpy as np
>>> df.columns[(df * np.arange(df.values.shape[1])).sum(axis=1)]
Index(['A', 'B', 'A', 'C'], dtype='object')
Explanation. The expression
>>> df * np.arange(df.values.shape[1])
   A  B  C
0  0  0  0
1  0  1  0
2  0  0  0
3  0  0  2
calculates a proper index for each column; then the matrix is summed row-wise with
>>> (df * np.arange(df.values.shape[1])).sum(axis=1)
0    0
1    1
2    0
3    2
dtype: int32
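With NaN present, as in the question, the intermediate products and the sum go to float; filling with False first keeps everything integral (a sketch):
idx = (df.fillna(False) * np.arange(df.shape[1])).sum(axis=1).astype(int)
df.columns[idx]   # Index(['A', 'B', 'A', 'C'], dtype='object')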

Assign to selection in pandas

I have a pandas dataframe and I want to create a new column, that is computed differently for different groups of rows. Here is a quick example:
import pandas as pd
data = {'foo': list('aaade'), 'bar': range(5)}
df = pd.DataFrame(data)
The dataframe looks like this:
   bar foo
0    0   a
1    1   a
2    2   a
3    3   d
4    4   e
Now I am adding a new column and try to assign some values to selected rows:
df['xyz'] = 0
df.loc[(df['foo'] == 'a'), 'xyz'] = df.loc[(df['foo'] == 'a')].apply(lambda x: x['bar'] * 2, axis=1)
The dataframe has not changed. What I would expect is the dataframe to look like this:
   bar foo  xyz
0    0   a    0
1    1   a    2
2    2   a    4
3    3   d    0
4    4   e    0
In my real-world problem, the 'xyz' column is also computed for the other rows, but using a different function. In fact, I am also using different columns for the computation. So my questions:
Why does the assignment in the above example not work?
Is it necessary to write df.loc[(df['foo'] == 'a')] twice (as I am doing now)?
You're changing a copy of df (indexing a DataFrame with a boolean mask returns a copy; see the docs on returning a view versus a copy).
Another way to achieve the desired result is as follows:
In [11]: df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
Out[11]:
0    0
1    2
2    4
3    0
4    0
dtype: int64
In [12]: df['xyz'] = df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
In [13]: df
Out[13]:
   bar foo  xyz
0    0   a    0
1    1   a    2
2    2   a    4
3    3   d    0
4    4   e    0
Perhaps a neater way is just:
In [21]: 2 * df.bar * (df.foo == 'a')
Out[21]:
0    0
1    2
2    4
3    0
4    0
dtype: int64
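For the real-world case where each group of rows gets its own formula, numpy.select lines the conditions and per-group computations up nicely; a sketch in which the rule for the 'd' rows is made up purely for illustration:
import numpy as np

conditions = [df['foo'] == 'a', df['foo'] == 'd']
choices = [df['bar'] * 2, df['bar'] + 10]   # hypothetical per-group formulas
# Rows matching no condition fall back to the default.
df['xyz'] = np.select(conditions, choices, default=0)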
