dynamic concatenation of columns for finding max - python

Here's my data:
ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74
The goal is to calculate the max Pay of each row over the range of Pay columns given by the Low and High columns.
For example, for row 1, calculate the max of Pay1 and Pay2, because Low and High are 1 and 2.
I have tried building a dynamic expression string and running it through eval, but it does not perform well.
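
For reference, the sample can be loaded like this (a minimal setup, assuming pandas; the CSV above is inlined via io.StringIO, and expected_output is left out since the answers below compute it):
import io
import pandas as pd

csv = """ID,Pay1,Pay2,Pay3,Low,High
1,12,21,23,1,2
2,21,34,54,1,3
3,74,56,76,1,1"""
df = pd.read_csv(io.StringIO(csv))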

The idea is to filter only the Pay columns, build boolean masks with NumPy broadcasting that select the columns between Low and High, pass the combined mask to DataFrame.where, and finally take the row-wise max:
import numpy as np

# keep only the Pay columns
df1 = df.filter(like='Pay')
# boolean masks: column position falls within [Low-1, High-1] for each row
m1 = np.arange(len(df1.columns)) >= df['Low'].to_numpy()[:, None] - 1
m2 = np.arange(len(df1.columns)) <= df['High'].to_numpy()[:, None] - 1
# zero out columns outside the range, then take the row-wise max
df['expected_output'] = df1.where(m1 & m2, 0).max(axis=1)
print(df)
   ID  Pay1  Pay2  Pay3  Low  High  expected_output
0   1    12    21    23    1     2               21
1   2    21    34    54    1     3               54
2   3    74    56    76    1     1               74
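
As a readable cross-check, the same result row by row with apply (slow, but fine at this size; a sketch, kept out of df so the later outputs are unchanged):
# per-row: max of Pay{Low} .. Pay{High}, inclusive
check = df.apply(
    lambda r: max(r[f'Pay{i}'] for i in range(int(r['Low']), int(r['High']) + 1)),
    axis=1)
assert check.eq(df['expected_output']).all()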

An alternative; I expect @jezrael's solution to be faster, as it stays within NumPy and pd.wide_to_long is not particularly fast:
grouping = (
    pd.wide_to_long(df.filter(regex="^Pay|Low|High"),
                    i=["Low", "High"],
                    stubnames="Pay",
                    j="num")
      # keep every Pay column whose index falls in [Low, High],
      # not just the two endpoints
      .query("Low <= num <= High")
      .groupby(level=["Low", "High"])
      .Pay.max()
)
grouping
grouping
Low  High
1    1       74
     2       21
     3       54
Name: Pay, dtype: int64
df.join(grouping.rename("expected_output"), on=["Low", "High"])
   ID  Pay1  Pay2  Pay3  Low  High  expected_output
0   1    12    21    23    1     2               21
1   2    21    34    54    1     3               54
2   3    74    56    76    1     1               74

Related

Dataframe grouped by ID and perform clustering [duplicate]

I would like to cluster X2 and X3 for each month group using k-means clustering; the two variables need to be clustered together. I would also like to label clusters 0, 1 and 2 as "strong", "average" and "weak" according to each cluster's mean: the cluster with the highest mean is the strong one. Below is my sample data set.
df = pd.DataFrame({'month': ['1','1','1','1','1','2','2','2','2','2','2','2'],
                   'X1': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'X2': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'X3': [23,78,95,52,60,76,68,92,34,76,34,12]})
df
I need to automate this, and since I have many columns, I would like to do it on two columns at a time (df.iloc[:, 2:4] here) in general. The cluster-to-label assignment is:
cluster 2 = "best"
cluster 1 = "average"
cluster 0 = "weak"
To find the best cluster, take the mean of each column within the cluster and sum them; the cluster with the highest sum is "best", the next "average", and the lowest "weak".
Please help, thank you.
groupby and apply a clustering function
We can group the dataframe by month and cluster the columns X2 and X3 with a custom clustering function:
from sklearn.cluster import KMeans

cols = df.columns[2:4]  # X2 and X3
mapping = {0: 'weak', 1: 'average', 2: 'best'}

def cluster(X):
    k_means = KMeans(n_clusters=3).fit(X)
    # rank the clusters by the sum of their feature means so that
    # 0 = lowest (weak) and 2 = highest (best), regardless of the
    # arbitrary labels KMeans happens to assign
    return (X.groupby(k_means.labels_)
             .transform('mean').sum(axis=1)
             .rank(method='dense').sub(1)
             .astype(int).to_frame())

df['Cluster_id'] = df.groupby('month')[cols].apply(cluster)
df['Cluster_cat'] = df['Cluster_id'].map(mapping)
    month  X1   X2  X3  Cluster_id Cluster_cat
0       1  30   10  23           0        weak
1       1  42   76  78           1     average
2       1  25  100  95           2        best
3       1  32   23  52           0        weak
4       1  12   65  60           1     average
5       2  10   94  76           2        best
6       2   4   67  68           2        best
7       2   6   24  92           1     average
8       2   5   67  34           0        weak
9       2  10   54  76           2        best
10      2  24   87  34           0        weak
11      2  21   81  12           0        weak
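
Because KMeans labels are arbitrary (a rerun can swap 0 and 2), the function ranks clusters by their summed feature means instead of trusting the raw labels. A minimal sketch of that re-ranking step in isolation, with hypothetical per-cluster means:
import pandas as pd

# suppose the summed feature means per cluster label came out like this
means = pd.Series({0: 150.0, 1: 90.0, 2: 40.0})
# dense-rank so the weakest cluster maps to 0 and the best to 2
print(means.rank(method='dense').sub(1).astype(int))
# 0    2
# 1    1
# 2    0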

Ranking with no duplicates

I am trying to rank a large dataset using Python. I do not want duplicate ranks; rather than using the 'first' method, I would like ties to be broken by the value in another column.
It should only look at the second column when the rank in the first column is tied.
Name   CountA  CountB
Alpha      15       3
Beta       20      52
Delta      20      31
Gamma      45      43
I would like the ranking to end up as:
Name   CountA  CountB  Rank
Alpha      15       3     4
Beta       20      52     2
Delta      20      31     3
Gamma      45      43     1
Currently, I am using df.rank(ascending=False, method='first')
Maybe use sort and pull the ranks out of the sorted index, assigning them back by index alignment so each row receives its true position:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'CountA': [15, 20, 20, 45],
                   'CountB': [3, 52, 31, 43]})
# sort by CountA, breaking ties with CountB; the sorted index lists
# the original row labels in rank order
order = df.sort_values(['CountA', 'CountB'], ascending=False).index
df['rank'] = pd.Series(range(1, len(df) + 1), index=order)
  Name  CountA  CountB  rank
0    A      15       3     4
1    B      20      52     2
2    C      20      31     3
3    D      45      43     1
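
A fully vectorized variant of the same idea, sketched with NumPy's lexsort (note that the last key passed to lexsort is the primary sort key):
import numpy as np

# primary key CountA, tie-breaker CountB, both descending
order = np.lexsort((df['CountB'], df['CountA']))[::-1]
ranks = np.empty(len(df), dtype=int)
ranks[order] = np.arange(1, len(df) + 1)
df['rank'] = ranks
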
You can take the value counts of CountA and, for each row where CountA is duplicated, rank by CountB instead of CountA. Note that this ranks tied rows purely by their CountB, so it can place (20, 52) above (45, 43); that differs from the expected output above, where CountB only breaks ties.
df = pd.DataFrame([[15, 3], [20, 52], [20, 31], [45, 43]],
                  columns=['CountA', 'CountB'])
colAcount = df['CountA'].value_counts()
# where CountA is duplicated, fall back to CountB
df['final'] = df['CountA'].where(
    ~df['CountA'].isin(colAcount[colAcount > 1].index), df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# after the sort, the rank is the index
   CountA  CountB  final
0      20      52     52
1      45      43     45
2      20      31     31
3      15       3     15

Pandas: Group by and aggregation with function

Assuming that I have a dataframe with the following values:
  name  start  end description
0   ag     20   30        None
1  bgb     21  111         'a'
2  cdd     31  101        None
3  bgb     17   19       'Bla'
4   ag     20   22        None
I want to group by name and then get the average of (end - start) per group.
I can use mean (df.groupby(['name'], as_index=False).mean()),
but how can I feed the mean function the difference of the two columns (end - start)?
You can subtract the columns and then group by the df['name'] column:
df1 = df['end'].sub(df['start']).groupby(df['name']).mean().reset_index(name='diff')
print(df1)
  name  diff
0   ag     6
1  bgb    46
2  cdd    70
Another idea, using a new diff column:
df1 = (df.assign(diff=df['end'].sub(df['start']))
         .groupby('name', as_index=False)['diff']
         .mean())
print(df1)
  name  diff
0   ag     6
1  bgb    46
2  cdd    70
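
For completeness, a sketch of the apply route, which reads exactly as the question phrases it (slower than the vectorized versions above):
df1 = (df.groupby('name')
         .apply(lambda g: (g['end'] - g['start']).mean())
         .reset_index(name='diff'))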

How to cbind a vector to a dataframe in python and keep the vector name as the name of the new column?

In R, cbind(dataframe, new_column) returns the original dataframe with an extra column called "new_column".
What is the best practice for achieving this in Python (preferably using base Python or pandas)?
To make the question more concrete, suppose
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
And the final output should be:
   Day  Visitors  Bounce Rate  new_column
0    1        43           65           2
1    2        34           67           4
2    3        65           78           6
3    4        56           65           8
4    5        29           45          10
5    6        76           52          12
You can assign the list directly to a new column on the DataFrame (assigning into the web_stats dict would not change the already-created df):
df['new_column'] = new_column
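
If you want the column name to travel with the data, as it does with R's cbind, wrap the vector in a named Series first; concat (or join) then picks up the Series name as the column name. A small sketch:
new_column = pd.Series([2, 4, 6, 8, 10, 12], name='new_column')
# the Series name becomes the new column's name
df = pd.concat([df, new_column], axis=1)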

Pandas: How to find the first valid column among a series of columns

I have a dataset of different sections of a race in a pandas dataframe from which I need to calculate certain features. It looks something like this:
id  distance  timeto1000m  timeto800m  timeto600m  timeto400m  timeto200m  timetoFinish
 1     1400m           10          21          30          39          50            60
 2     1200m            0          19          31          42          49            57
 3     1800m            0           0           0          38          49            62
 4     1000m            0           0          29          40          48            61
So, for each row I need to find the first timetoXXm column that is non-zero and the corresponding distance XX. For instance, for id=1 that would be 1000m, for id=3 that would be 400m, etc.
I can do this with a chain of if..elif..else conditions, but I was wondering whether there is a better way of doing this kind of lookup in pandas/numpy.
You can do it like this: first filter to the columns of interest, mask out the zeros, then call idxmin, which returns the column holding each row's smallest remaining value; because the times increase left to right, that is also the first non-zero column:
In [11]:
df_slice = df.loc[:, df.columns.str.startswith('time')]
df_slice[df_slice != 0].idxmin(axis=1)
Out[11]:
0    timeto1000m
1     timeto800m
2     timeto400m
3     timeto600m
dtype: object
In [15]:
df['first_valid'] = df_slice[df_slice != 0].idxmin(axis=1)
df[['id', 'first_valid']]
Out[15]:
   id  first_valid
0   1  timeto1000m
1   2   timeto800m
2   3   timeto400m
3   4   timeto600m
Use idxmax(axis=1) on a non-zero mask; idxmax returns the first column in which the mask is True:
df.set_index(['id', 'distance']).ne(0).idxmax(axis=1)
id  distance
1   1400m       timeto1000m
2   1200m        timeto800m
3   1800m        timeto400m
4   1000m       timeto600m
dtype: object
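
Either answer gives the column name; the question also asks for the corresponding distance, which can be pulled out of that name with a regex (a sketch; note that timetoFinish contains no digits and would yield NaN):
first_valid = df.set_index(['id', 'distance']).ne(0).idxmax(axis=1)
# grab the digits from e.g. 'timeto1000m' -> '1000'
df['first_valid_distance'] = first_valid.str.extract(r'(\d+)', expand=False).to_numpy()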
