Concatenate columns into one column with conditions - pandas / python

I would like to concatenate all the columns of the dataset:
df = pd.DataFrame([['0987', 4, 'j'], ['9', 4, 'y'], ['9', 6, 't'],
                   ['4', '', 'o'], ['', 9, 'o']],
                  columns=['col_a', 'col_b', 'col_c'])

  col_a col_b col_c
0  0987     4     j
1     9     4     y
2     9     6     t
3     4           o
4           9     o
I want to combine these into one column, with two added conditions. The first is that all empty or null entries must be dropped rather than added to the new column. The second is that if an entry in the new column (col_new) comes from col_a or col_c it must be labelled 1; otherwise it must be labelled 0.
So I would like the result to look like this:
   col_new  label
0     0987      1
1        9      1
2        9      1
3        4      1
4        4      0
5        4      0
6        6      0
7        9      0
8        j      1
9        y      1
10       t      1
11       o      1
12       o      1

Use DataFrame.melt; create the new label column first with rename and a lambda function, and finally filter the rows with DataFrame.query:
df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')
        .query('col_new != ""')[['col_new', 'label']])
print(df)
col_new label
0 0987 1
1 9 1
2 9 1
3 4 1
5 4 0
6 4 0
7 6 0
9 9 0
10 j 1
11 y 1
12 t 1
13 o 1
14 o 1
If there are missing values:
import numpy as np

df = pd.DataFrame([['0987', 4, 'j'], ['9', 4, 'y'], ['9', 6, 't'],
                   ['4', np.nan, 'o'], [np.nan, 9, 'o']],
                  columns=['col_a', 'col_b', 'col_c'])

df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')
        .query('col_new == col_new')[['col_new', 'label']])
Or use DataFrame.dropna for filtering:
df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')[['col_new', 'label']])
df = df.dropna(subset=['col_new'])
print(df)
col_new label
0 0987 1
1 9 1
2 9 1
3 4 1
5 4 0
6 4 0
7 6 0
9 9 0
10 j 1
11 y 1
12 t 1
13 o 1
14 o 1
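For reference, the same reshape can also be done without melt, by stacking the columns with pd.concat. This is only a rough equivalent sketch (not part of the original answer), assuming the empty-string version of the data:

import pandas as pd

df = pd.DataFrame([['0987', 4, 'j'], ['9', 4, 'y'], ['9', 6, 't'],
                   ['4', '', 'o'], ['', 9, 'o']],
                  columns=['col_a', 'col_b', 'col_c'])

# Stack each column into one Series and record whether it came from col_a/col_c.
parts = [pd.DataFrame({'col_new': df[c], 'label': int(c in ['col_a', 'col_c'])})
         for c in df.columns]
out = pd.concat(parts, ignore_index=True)

# Drop the empty strings, mirroring the melt-based result.
out = out[out['col_new'] != ''].reset_index(drop=True)
print(out)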

Related

How to get the cumulative count based on two columns

Let's say we have the following dataframe. If we wanted to find the count of consecutive 1's, we could use the code below.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
But the problem I see is when you need to use groupby with an id field. If we add an id field to the dataframe (below), it becomes more complicated, and we can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, I've seen the case made for building a helper series to use in the groupby, like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
Which works, but the problem is that the first group contains the first row, which does not fit the criteria. This usually isn't a problem, but it is if we wanted to find the count. Replacing cumsum() at the end of the last groupby() with .transform('count') would actually give us 6 instead of 5 for the count of consecutive 1's in the first B group.
The only solution I can come up with for this problem is the following code:
df['col'].groupby([df['id'],
                   df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())
                   ]).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but uses transform() twice, which I've heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.
Indexes 1, 2, 3, 6 and 7 are all id B with the same value in the 'col' column, so the count should not be reset; they should all be part of the same group.
Can this be done without using multiple .transform()?
The following code uses only one .transform() and relies on ordering the index to get the correct counts.
The original index is kept, so the final result can be reindexed back to the original order.
Use cum_counts['cum_counts'] to get the exact desired output, without the other column.
import pandas as pd
# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
# reset the index, then set the index and sort
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# flag value changes and take a cumulative sum to label each run of equal values
g = df.col.ne(df.col.shift()).cumsum()
# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
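As a small follow-up (not in the original answer): assuming cum_counts is the frame printed above, pulling out just the counts in the original row order is a single column selection, since sort_index() already restored the original order.

# The values match the expected output from the question.
result = cum_counts['cum_counts']
print(result.tolist())  # [1, 5, 5, 5, 2, 2, 5, 5, 1, 2, 2, 2, 2, 1, 2, 2]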
After looking at @TrentonMcKinney's solution, I came up with:
df = df.sort_values(['id'])
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
IIUC, do you want this?
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
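For completeness, here is a sketch (not from the original answers) that avoids .transform() entirely by mapping run sizes back onto the rows, assuming df holds the original id/col data and reusing the sort-by-id trick from the earlier answer:

# Sort by id so runs of the same id/col value that are split by other ids become
# contiguous, then label each run and map its size back to every row.
tmp = df.sort_values('id', kind='stable')
run = (tmp[['id', 'col']] != tmp[['id', 'col']].shift()).any(axis=1).cumsum()
df['count'] = run.map(run.value_counts()).sort_index()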

Combine Row Index and Row Value (String) For Specific Rows in Pandas DataFrame

Let's say that I have a Pandas DataFrame:
df = pd.DataFrame({'col1': range(10), 'col2': ['a', 'b', 'c', 'a', 'e', 'f', 'g', 'a', 'h', 'i']})
and it looks like:
col1 col2
0 0 a
1 1 b
2 2 c
3 3 a
4 4 e
5 5 f
6 6 g
7 7 a
8 8 h
9 9 i
I want to update all values where df['col2'] == 'a' by appending the row index to 'a', so that we get:
col1 col2
0 0 a_0
1 1 b
2 2 c
3 3 a_3
4 4 e
5 5 f
6 6 g
7 7 a_7
8 8 h
9 9 i
Use Series.mask with Series.eq to test whether the value equals 'a', and add the index (or col1) after converting it to string:
df['col2'] = df['col2'].mask(df['col2'].eq('a'), df['col2'].add('_' + df.index.astype(str)))
#df['col2'] = df['col2'].mask(df['col2'].eq('a'), df['col2'].add('_' + df['col1'].astype(str)))
print(df)
col1 col2
0 0 a_0
1 1 b
2 2 c
3 3 a_3
4 4 e
5 5 f
6 6 g
7 7 a_7
8 8 h
9 9 i
Comprehension
df.assign(col2=[f'{v}_{i}' if v == 'a' else v for i, v in df.col2.items()])
col1 col2
0 0 a_0
1 1 b
2 2 c
3 3 a_3
4 4 e
5 5 f
6 6 g
7 7 a_7
8 8 h
9 9 i
A very low-memory-footprint loop that edits in place:
for i, v in df.col2.items():
    if v == 'a':
        df.at[i, 'col2'] = f'a_{i}'
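Another common vectorised idiom for this kind of conditional update (a sketch, not from the original answers, assuming a fresh copy of the original df) is numpy.where, which picks between the suffixed values and the originals:

import numpy as np

# Choose the suffixed value where col2 is 'a', otherwise keep the original value.
df['col2'] = np.where(df['col2'].eq('a'),
                      df['col2'] + '_' + df.index.astype(str),
                      df['col2'])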

Compute a new column as delta to another value in pandas dataframe

I have this data frame:
rank cost brand city
0 1 1 a x
1 2 2 a x
2 3 3 a x
3 4 4 a x
4 5 5 a x
5 1 2 b y
6 2 4 b y
7 3 6 b y
8 4 8 b y
9 5 10 b y
I want to create a new column 'delta' which contains the cost difference compared to rank 1 for a certain brand-city combination.
Desired outcome:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
This answer gave me some hints, but I am stuck on the fact that I cannot map a series to a multi-index.
To save on typing, here is some code:
data = {'rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        'cost': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
        'brand': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
        'city': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
        'delta': ['0', '1', '2', '3', '4', '0', '2', '4', '6', '8']}
This is transform + 'first':
df['delta'] = df.cost - df.groupby(['brand', 'city'])['cost'].transform('first')
df
Out[291]:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
Use groupby with apply
data['delta'] = (data.groupby(['brand', 'city'], group_keys=False)
                     .apply(lambda x: x['cost'] - x[x['rank'].eq(1)]['cost'].values[0]))
data
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
A solution without using groupby: it sorts by rank and uses pd.merge_ordered and assign to create the delta column.
In [1077]: pd.merge_ordered(data.sort_values(['brand', 'city', 'rank']), data.query('rank == 1'), how='left', on=['brand', 'city', 'rank'], fill_method='ffill').assign(delta=lambda x: x.cost_x - x.cost_y).drop(columns='cost_y')
Out[1077]:
brand city cost_x rank delta
0 a x 1 1 0
1 a x 2 2 1
2 a x 3 3 2
3 a x 4 4 3
4 a x 5 5 4
5 b y 2 1 0
6 b y 4 2 2
7 b y 6 3 4
8 b y 8 4 6
9 b y 10 5 8
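Since the question mentions being stuck on mapping a Series to a multi-index, a plain merge sidesteps that entirely. This is only a rough sketch (assuming df is a DataFrame built from data), and it also works when the rank-1 row is not the first row of its group:

# Take the rank-1 cost per (brand, city) and join it back as a baseline column.
base = (df.loc[df['rank'].eq(1), ['brand', 'city', 'cost']]
          .rename(columns={'cost': 'base_cost'}))
out = df.merge(base, on=['brand', 'city'], how='left')
out['delta'] = out['cost'] - out['base_cost']
out = out.drop(columns='base_cost')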

Selecting the top three columns for each row and saving the results along with index in a dictionary in python

I am using the code below to iterate through the rows of a dataframe.
Here is the sample dataset:
device_id s2 s41 s47 s14 s24 s36 s4 s23 s10
3 0 0 0 0.002507676 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0
42 0 0 0 0 0 0 0 0 0
61 0 0 0 0 0 0 0 0 0
49 0 0 0 0 0 0 0 0 7.564063476
54 0 0 0 0 0 0 0 0.001098988 0
and sort to get the top 3 values from each row:
for index, row in df.iterrows():
    row_sorted = row.sort_values(ascending=False)
    print(index, row_sorted)
here is a sample output
123 s16 1.054018
s17 0.000000
s26 0.000000
I have also tried with the below code:
top_n = 3
pd.DataFrame({n: df.T[col].nlargest(top_n).index.tolist()
              for n, col in enumerate(df.T)}).T
to do it all at once, but here is the output:
49 s16 s1 s37 -- 49 is the row number here.
As you can see the outputs do not match and the first output is the correct one.
What I am looking for is a final dictionary which contains the index as key and the top 3 columns as values:
{123: ['s16', 's17', 's26']}
These will be used further down the line to iterate through another dictionary to_map which has the following structure:
ID": ["s26", "International", "E", "B_TV"] from where I will select "E" and "B_TV"
Try this vectorized approach:
Sample DF:
In [80]: df = pd.DataFrame(np.random.randint(10, size=(5,7)), columns=['id']+list('abcdef'))
...: df = df.set_index('id')
...:
In [81]: df
Out[81]:
a b c d e f
id
4 4 0 8 8 4 8
0 2 4 7 3 1 4
9 3 6 5 7 3 4
5 7 6 3 8 9 1
6 3 7 6 1 7 9
Solution:
In [82]: idx = np.argsort(df.values, axis=1)[:, ::-1][:, :3]
In [83]: pd.DataFrame(np.take(df.columns, idx), index=df.index).T.to_dict('list')
Out[83]:
{0: ['c', 'f', 'b'],
4: ['f', 'd', 'c'],
5: ['e', 'd', 'a'],
6: ['f', 'e', 'b'],
9: ['d', 'b', 'c']}
PS replace [:, :3] with [:, :top_n]
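For comparison (not part of the original answer), a slower but very readable row-wise variant using nlargest per row would look roughly like this, assuming device_id has been set as the index:

top_n = 3
# Each row is a Series whose index holds the column labels, so nlargest
# returns the top values together with their column names.
result = {idx: row.nlargest(top_n).index.tolist() for idx, row in df.iterrows()}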

cumcount groupby --- how to list by groups

My question is related to this question
import pandas as pd

df = pd.DataFrame(
    [['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
     ['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
     ['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
    columns=['c1', 'c2', 'v1'])

df['CNT'] = df.groupby(['c1', 'c2']).cumcount() + 1
This gives me the column 'CNT'. But I'd like to break it apart according to group 'c2', to obtain the running counts of 'X' and 'Y' respectively:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
Any suggestions? I am just starting to explore Pandas and appreciate your help.
I don't know a way to do this directly, but starting from the calculated CNT column, you can do it as follows.
Make the Xcnt and Ycnt columns:
In [13]: df['Xcnt'] = df['CNT'][df['c2']=='X']
In [14]: df['Ycnt'] = df['CNT'][df['c2']=='Y']
In [15]: df
Out[15]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 NaN
1 A X 5 2 2 NaN
2 A Y 7 1 NaN 1
3 A Y 1 2 NaN 2
4 B X 3 1 1 NaN
5 B X 1 2 2 NaN
6 B X 3 3 3 NaN
7 B Y 1 1 NaN 1
8 C X 7 1 1 NaN
9 C Y 4 1 NaN 1
10 C Y 1 2 NaN 2
11 C Y 6 3 NaN 3
Next, we want to fill the NaN's per group of c1 by forward filling:
In [23]: df['Xcnt'] = df.groupby('c1')['Xcnt'].ffill()
In [24]: df['Ycnt'] = df.groupby('c1')['Ycnt'].ffill().fillna(0)
In [25]: df
Out[25]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
For Ycnt, an extra fillna was needed to convert the NaN's to 0's where a group started with NaNs (there was nothing to forward-fill from).
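As a side note, a more direct sketch (not from the original answer) skips the intermediate NaN columns entirely: the running count of 'X' (or 'Y') within each 'c1' group is just the group-wise cumulative sum of a boolean flag.

# For each target value, flag matching rows and take a cumulative sum within c1.
for val in ['X', 'Y']:
    df[f'{val}cnt'] = df['c2'].eq(val).groupby(df['c1']).cumsum()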
