pandas merge columns to a single time series

pandas merge columns to a single time series - python

I have a data frame with 3 boolean columns:
A B C
0 True False False
1 False True False
2 True Nan False
3 False False True
...
Only one column is true at each time, but there can be Nan.
I would like to get a list of column names where the name is chosen based on the boolean. So for the example above:
['A', 'B', 'A', 'C']
it's a simple matrix operation, not sure how to map it to pandas...

You can use the mul operator between the dataframe and the dataframe columns. That results in True cells containing the column name and False cells empty. Eventually you can just sum the row data:
df.mul(df.columns).sum(axis=1)
Out[44]:
0 A
1 B
2 A
3 C

You can index columns names, i.e. df.columns, with proper indexes:
>>> import numpy as np
>>> df.columns[(df * np.arange(df.values.shape[1])).sum(axis=1)]
Index([u'A', u'B', u'A', u'C'], dtype=object)
Explanation.
Expression
>>> df * np.arange(df.values.shape[1])
A B C
0 0 0 0
1 0 1 0
2 0 0 0
3 0 0 2
calculates for each column a proper index, then matrix is summed row-wize with
>>> (df * np.arange(df.values.shape[1])).sum(axis=1)
0 0
1 1
2 0
3 2
dtype: int32

Related

Pandas groupby and replace duplicates with empty string

I have a dataframe like the following:
import pandas as pd
d = {'one':[1,1,1,1,2, 2, 2, 2],
'two':['a','a','a','b', 'a','a','b','b'],
'letter':[' a','b','c','a', 'a', 'b', 'a', 'b']}
df = pd.DataFrame(d)
> one two letter
0 1 a a
1 1 a b
2 1 a c
3 1 b a
4 2 a a
5 2 a b
6 2 b a
7 2 b b
And I am trying to convert it to a dataframe like the following, where empty cells are filled with empty string '':
one two letter
1 a a
b
c
b a
2 a a
b
b a
b
When I perform groupby with all columns I get a series object that is basically exactly what I am looking for, but not a dataframe:
df.groupby(df.columns.tolist()).size()
1 a a 1
b 1
c 1
b a 1
2 a a 1
b 1
b a 1
b 1
How can I get the desired dataframe?

You can mask your columns where the value is not the same as the value below, then use where to change it to a blank string:
df[['one','two']] = df[['one','two']].where(df[['one', 'two']].apply(lambda x: x != x.shift()), '')
>>> df
one two letter
0 1 a a
1 b
2 c
3 b a
4 2 a a
5 b
6 b a
7 b
some explanation:
Your mask looks like this:
>>> df[['one', 'two']].apply(lambda x: x != x.shift())
one two
0 True True
1 False False
2 False False
3 False True
4 True True
5 False False
6 False True
7 False False
All that where is doing is finding the values where that is true, and replacing the rest with ''

The solution to the original problem is to find the dublicated cells in each of the first two columns and set them to empty:
df.loc[df.duplicated(subset=['one', 'two']), 'two'] = ''
df.loc[df.duplicated(subset=['one']), 'one'] = ''
However, the purpose of this transformation is unclear. Perhaps you are trying to solve a wrong problem.

How to efficiently rearrange pandas data as follows?

I need some help with a concise and first of all efficient formulation in pandas of the following operation:
Given a data frame of the format
id a b c d
1 0 -1 1 1
42 0 1 0 0
128 1 -1 0 1
Construct a data frame of the format:
id one_entries
1 "c d"
42 "b"
128 "a d"
That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.

Here's one way using boolean rule and applying lambda func.
In [58]: df
Out[58]:
id a b c d
0 1 0 -1 1 1
1 42 0 1 0 0
2 128 1 -1 0 1
In [59]: cols = list('abcd')
In [60]: (df[cols] > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
Out[60]:
0 c d
1 b
2 a d
dtype: object
You can assign the result to df['one_entries'] =
Details of apply func.
Take first row.
In [83]: x = df[cols].ix[0] > 0
In [84]: x
Out[84]:
a False
b False
c True
d True
Name: 0, dtype: bool
x gives you Boolean values for the row, values greater than zero. x[x] will return only True. Essentially a series with column names as index.
In [85]: x[x]
Out[85]:
c True
d True
Name: 0, dtype: bool
x[x].index gives you the column names.
In [86]: x[x].index
Out[86]: Index([u'c', u'd'], dtype='object')

Same reasoning as John Galt's, but a bit shorter, constructing a new DataFrame from a dict.
pd.DataFrame({
'one_entries': (test_df > 0).apply(lambda x: ' '.join(x[x].index), axis=1)
})
# one_entries
# 1 c d
# 42 b
# 128 a d

select identical entries in two pandas dataframe

I have two dataframes. (a,b,c,d ) and (i,j,k) are the name columns dataframes
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the entries that df1 is df2
I want to obtain
df1=
a b c
0 1 2
0 1 2
0 1 2

You can use isin for compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
df = df1.isin(df2.iloc[:,i])
dfs.append(df)
Then concat all mask together:
mask = pd.concat(dfs).groupby(level=0).sum()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.ix[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2

Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
You get only those rows from df1 where the values of the first three columns are identical to those of df2.
Then you just select the rows 'a','b','c' from df1.

how do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?

see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
using loc = 0 will insert at the beginning
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6

If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)

df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column, with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass an optional parameter allow_duplicates with True value to create a new column with already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4

You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line ; however, this looks a bit ugly. Maybe some cleaner proposal may come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2

Here is a very simple answer to this(only one line).
You can do that after you added the 'n' column into your df as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your columns names instead of letters. It should include two brackets around your column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2

A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Assign to selection in pandas

I have a pandas dataframe and I want to create a new column, that is computed differently for different groups of rows. Here is a quick example:
import pandas as pd
data = {'foo': list('aaade'), 'bar': range(5)}
df = pd.DataFrame(data)
The dataframe looks like this:
bar foo
0 0 a
1 1 a
2 2 a
3 3 d
4 4 e
Now I am adding a new column and try to assign some values to selected rows:
df['xyz'] = 0
df.loc[(df['foo'] == 'a'), 'xyz'] = df.loc[(df['foo'] == 'a')].apply(lambda x: x['bar'] * 2, axis=1)
The dataframe has not changed. What I would expect is the dataframe to look like this:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
In my real-world problem, the 'xyz' column is also computated for the other rows, but using a different function. In fact, I am also using different columns for the computation. So my questions:
Why does the assignment in the above example not work?
Is it neccessary to do df.loc[(df['foo'] == 'a') twice (as I am doing it now)?

You're changing a copy of df (a boolean mask of the DataFrame is a copy, see docs).
Another way to achieve the desired result is as follows:
In [11]: df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
Out[11]:
0 0
1 2
2 4
3 0
4 0
dtype: int64
In [12]: df['xyz'] = df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
In [13]: df
Out[13]:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
Perhaps a neater way is just to:
In [21]: 2 * (df1.bar) * (df1.foo == 'a')
Out[21]:
0 0
1 2
2 4
3 0
4 0
dtype: int64

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas merge columns to a single time series - python

You can use the mul operator between the dataframe and the dataframe columns. That results in True cells containing the column name and False cells empty. Eventually you can just sum the row data: df.mul(df.columns).sum(axis=1) Out[44]: 0 A 1 B 2 A 3 C

Related

Pandas groupby and replace duplicates with empty string

How to efficiently rearrange pandas data as follows?

select identical entries in two pandas dataframe

how do I insert a column at a specific column index in pandas?

Assign to selection in pandas

Categories

Resources