I would like to create groups from my data frame.
Teams with 1 in the corresponding row/column cannot stay in the same group.
How can I create the largest possible groups and find the minimum number of groups?
Idea
There are 5 teams (50 in the original dataframe); some teams have players in common. The data frame contains a 1 where two teams share a player; otherwise the cell is NaN.
How many and which teams can play together at the same time?
Here is a sample data frame:
import numpy as np
import pandas as pd

pd.DataFrame.from_dict(data={'team1': [np.nan, 1.0, 1.0, np.nan, np.nan],
                             'team2': [1.0, np.nan, np.nan, np.nan, np.nan],
                             'team3': [1.0, np.nan, np.nan, np.nan, 1.0],
                             'team4': [np.nan, np.nan, np.nan, np.nan, np.nan],
                             'team5': [np.nan, np.nan, 1.0, np.nan, np.nan]},
                        orient='index',
                        columns=['team1', 'team2', 'team3', 'team4', 'team5'])
       team1  team2  team3  team4  team5
team1    NaN    1.0    1.0    NaN    NaN
team2    1.0    NaN    NaN    NaN    NaN
team3    1.0    NaN    NaN    NaN    1.0
team4    NaN    NaN    NaN    NaN    NaN
team5    NaN    NaN    1.0    NaN    NaN
Expected output
In this easy case the minimum number of groups is 2, and a possible solution is:
group1 = ['team1', 'team4', 'team5']
group2 = ['team2', 'team3']
Creating these groups is a combinatorial problem, and therefore not one you should try to solve using only pandas. One way of creating the groups is to model it as an assignment problem (for more information, see for instance https://developers.google.com/optimization/assignment/overview), which is what I tried below. As such, I introduce some additional technology, specifically ortools' CP-SAT solver.
This is the code:
import numpy as np
import pandas as pd
from ortools.sat.python import cp_model
def solve(teams):
    model = cp_model.CpModel()
    num_teams = len(teams)

    # Objective function: minimize Z, the total number of groups, which is an integer.
    # Incrementing Z amounts to creating a new group, with a new integer value.
    z = model.NewIntVar(1, num_teams, '')
    model.Minimize(z)

    # Create integer variables: one for each team, storing the integer representing
    # the group to which that team belongs.
    x = {}
    for i in range(num_teams):
        x[i] = model.NewIntVar(1, num_teams, '')

    # Constraint set 1: one constraint for each pair of teams that is not allowed
    # to be in the same group.
    idx = np.triu_indices(num_teams, k=1)
    for i, j in zip(*idx):
        if teams[i][j]:
            model.Add(x[i] != x[j])

    # Constraint set 2: link z, the group "counter", to our team variables.
    for i in range(num_teams):
        model.Add(z - x[i] >= 0)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
        print(f"Total number of groups required: {solver.ObjectiveValue()}")
        for i in range(num_teams):
            print(f"Team {i + 1}'s group: {solver.Value(x[i])}")


if __name__ == '__main__':
    team_data = pd.DataFrame.from_dict(data={'team1': [np.nan, 1.0, 1.0, np.nan, np.nan],
                                             'team2': [1.0, np.nan, np.nan, np.nan, np.nan],
                                             'team3': [1.0, np.nan, np.nan, np.nan, 1.0],
                                             'team4': [np.nan, np.nan, np.nan, np.nan, np.nan],
                                             'team5': [np.nan, np.nan, 1.0, np.nan, np.nan]},
                                        orient='index',
                                        columns=['team1', 'team2', 'team3', 'team4', 'team5'])
    team_data = team_data.fillna(0)
    solve(team_data.to_numpy())
Running this on the input data you provided produces:
Total number of groups required: 2.0
Team 1's group: 1
Team 2's group: 2
Team 3's group: 2
Team 4's group: 1
Team 5's group: 1
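If you want the result in the list form from your expected output, you can collect the solver's assignments per group. A minimal sketch, assuming you gather the values as {i: solver.Value(x[i])} inside solve(); the assignments dict below is hard-coded from the printed output purely for illustration:
from collections import defaultdict

# Hypothetical mapping: team index -> group number, e.g. collected inside solve()
# as {i: solver.Value(x[i]) for i in range(num_teams)}.
assignments = {0: 1, 1: 2, 2: 2, 3: 1, 4: 1}

groups = defaultdict(list)
for team_idx, group in assignments.items():
    groups[group].append(f"team{team_idx + 1}")

print(dict(groups))  # {1: ['team1', 'team4', 'team5'], 2: ['team2', 'team3']}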
When I run this for 50 teams, with randomly generated constraints on which teams cannot join the same group, the solver finishes in 15 to 20 seconds. I don't know what your use case is, so I can't say whether that is acceptable for you.
Disclaimer: this solution will probably not scale well. If you need to tackle larger problem instances, you'll want a more sophisticated model. For help with that, you might consider posting somewhere like https://cs.stackexchange.com/ or https://math.stackexchange.com/, or you can do some more research by yourself.
I still decided to post my solution, since I think it's relatively easy to understand and can handle the size of problem you seem to be facing (at least if you're planning to run it offline).
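As a side note (my own addition, not part of the model above): this is equivalent to a graph colouring problem, so if you ever need to handle much larger instances and can accept a solution that is not guaranteed to be minimal, a greedy colouring heuristic runs almost instantly. A rough sketch using networkx, assuming the same 0/1 conflict matrix as above:
import numpy as np
import networkx as nx
from networkx.algorithms.coloring import greedy_color

def greedy_groups(teams):
    """Greedy graph colouring: teams are nodes, an edge means they share a player."""
    num_teams = len(teams)
    graph = nx.Graph()
    graph.add_nodes_from(range(num_teams))
    idx = np.triu_indices(num_teams, k=1)
    for i, j in zip(*idx):
        if teams[i][j]:
            graph.add_edge(i, j)
    # Returns a dict mapping node (team index) -> colour (group index, starting at 0).
    return greedy_color(graph, strategy="largest_first")

# Usage (reusing team_data from the main script):
# print(greedy_groups(team_data.fillna(0).to_numpy()))
The number of colours used is only an upper bound on the true minimum, so if you need the exact answer, stick with the CP-SAT model.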
Related
I'm importing the following .xlsx file into a dataframe.
dfMenu = pd.read_excel("/Users/FoodTrucks.xlsx")
Price  Quantity  FoodTruck  FoodTruck_ID
 3.00        10    Burgers             1
 1.20        50      Tacos             2
 0.60        30      Tacos             2
 1.12        40     Drinks             4
 2.00        20    Burgers             1
My goal is to show the total revenue for each food truck with its ID and name in a new column, called "Revenue".
I am currently trying the code below, but I'm struggling to multiply the columns "Price" and "Quantity" into a new column while grouping by "FoodTruck" and "FoodTruck_ID" in an elegant way.
df = df.groupby((['FoodTruck', 'FoodTruck_ID'])(df['Revenue'] = df['Price'] * q9['Quantity']))
I am getting a syntax error "SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?"
What would be the most elegant way to solve it?
It will be easier to first calculate Price*Quantity before doing the groupby:
import pandas as pd
df = pd.DataFrame({
'Price': [3.0, 1.2, 0.6, 1.12, 2.0],
'Quantity': [10, 50, 30, 40, 20],
'FoodTruck': ['Burgers', 'Tacos', 'Tacos', 'Drinks', 'Burgers'],
'FoodTruck_ID': [1, 2, 2, 4, 1]
})
df['Revenue'] = df['Price']*df['Quantity']
df.groupby(['FoodTruck','FoodTruck_ID'])['Revenue'].sum()
Output
FoodTruck FoodTruck_ID
Burgers 1 70.0
Drinks 4 44.8
Tacos 2 78.0
Name: Revenue, dtype: float64
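If you prefer a single chain without mutating df first, the same result can be obtained with assign (just a stylistic variation on the answer above):
(df.assign(Revenue=df['Price'] * df['Quantity'])
   .groupby(['FoodTruck', 'FoodTruck_ID'], as_index=False)['Revenue']
   .sum())
And if you instead want the per-truck total repeated as a new column on every original row, df.groupby(['FoodTruck', 'FoodTruck_ID'])['Revenue'].transform('sum') does that.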
I have the following dataframe dfstart where the first column holds different comments containing a variety of different topics. The labels column contains keywords that are associated with the topics.
Using a second dataframe matchlist
I want to create the final dataframe dffinal where for each comment you can see both the labels and the topics that occur in that comment. I also want the labels to only occur once per row.
I tried eliminating the duplicate labels through a for loop:
for label in matchlist['label']:
    if dfstart[label[n]] == dfstart[label[n-1]]:
        dfstart['label'] == np.nan
However, this doesn't seem to work. Further, I have managed to merge dfstart with matchlist to have the first topic displayed in the dataframe. The code I used for that is:
df2 = pd.merge(df, matchlist, on='label1')
Of course, I could keep renaming the label column in matchlist and keep repeating the process, but this would take a long time and would not be efficient because my real dataframe is much larger than this toy example. So I was wondering if there was a more elegant way to do this.
Here are three toy dataframes:
import numpy as np
import pandas as pd

d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
dfstart = pd.DataFrame(data=d)
dfstart[['label1','label2', 'label3']] = dfstart.label.str.split(",",expand=True,)
d3 = {'label':["boxing","election","rain"], 'topic': ["sport","politics","weather"]}
matchlist = pd.DataFrame(data=d3)
d2 = {'comment':["comment1","comment2","comment3"],'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"], 'label1':["boxing", "boxing", "election"], 'label2':["election", np.nan, "rain"], 'label3':["rain", np.nan, np.nan], 'topic1':["sports", "sports", "politics"], 'topic2':["politics", np.nan, "weather"], 'topic3':["weather", np.nan, np.nan]}
dffinal = pd.DataFrame(data=d2)
Thanks for your help!
Use str.extractall instead of str.split so you can obtain all matches in one go, then flatten the results, map them to your matchlist, and finally concat everything together:
d = {'comment':["comment1","comment2","comment3"],
'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
df = pd.DataFrame(d)
matchlist = pd.DataFrame({'label':["boxing","election","rain"], 'topic':["sport","politics","weather"]})
s = matchlist.set_index("label")["topic"]
found = (df["label"].str.extractall("|".join(f"(?P<label{num}>{i})"
                                             for num, i in enumerate(s.index, 1)))
           .groupby(level=0).first())

print(pd.concat([df, found,
                 found.apply(lambda d: d.map(s))
                      .rename(columns={f"label{i}": f"topic{i}" for i in range(1, 4)})],
                axis=1))
    comment                     label   label1    label2  label3  topic1    topic2   topic3
0  comment1    boxing, election, rain   boxing  election    rain   sport  politics  weather
1  comment2            boxing, boxing   boxing       NaN     NaN   sport       NaN      NaN
2  comment3  election, rain, election      NaN  election    rain     NaN  politics  weather
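To make the dense one-liner a bit easier to follow, this is the alternation pattern that the generator expression builds and hands to extractall (shown purely for illustration):
pattern = "|".join(f"(?P<label{num}>{i})" for num, i in enumerate(s.index, 1))
print(pattern)  # (?P<label1>boxing)|(?P<label2>election)|(?P<label3>rain)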
Basically:
Is there a way to apply a function that uses the column name of a dataframe in Pandas?
Like this:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
Where column name is the column that the apply is 'processing'.
Details:
I'd like to create a label for each row of a dataframe, based on a dictionary.
Let's take the dataframe df:
df = pd.DataFrame({ 'Application': ['Compressors', 'Fans', 'Fans', 'Material Handling'],
'HP': ['0.25', '0.25', '3.0', '15.0'],
'Sector': ['Commercial', 'Industrial', 'Commercial', 'Residential']},
index=[0, 1, 2, 3])
After I apply the label:
In [139]: df['label'] = df.apply(lambda x: '_'.join(x), axis=1)
In [140]: df
Out[140]:
Application HP Sector label
0 Compressors 0.25 Commercial Compressors_0.25_Commercial
1 Fans 0.25 Industrial Fans_0.25_Industrial
2 Fans 3.0 Commercial Fans_3.0_Commercial
3 Material Handling 15.0 Residential Material Handling_15.0_Residential
But the label is too long, especially when I consider the full dataframe, which contains a lot more columns. What I want is to use a dictionary to shorten the fields that come from the columns (I pasted the code for the dictionary at the end of the question).
I can do that for one field:
In [145]: df['application_label'] = df['Application'].apply(
lambda x: labels_dict['Application'][x])
In [146]: df
Out[146]:
Application HP Sector application_label
0 Compressors 0.25 Commercial cmp
1 Fans 0.25 Industrial fan
2 Fans 3.0 Commercial fan
3 Material Handling 15.0 Residential mat
But I want to do it for all the fields, like I did in snippet #2. So I'd like to do something like:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
Where column name is the column of df to which the function is being applied. Is there a way to access that information?
Thank you for your help!
I defined the dictionary as:
In [141]: labels_dict
Out[141]:
{u'Application': {u'Compressors': u'cmp',
u'Fans': u'fan',
u'Material Handling': u'mat',
u'Other/General': u'oth',
u'Pumps': u'pum'},
u'ECG': {u'Polyphase': u'pol',
u'Single-Phase (High LRT)': u'sph',
u'Single-Phase (Low LRT)': u'spl',
u'Single-Phase (Med LRT)': u'spm'},
u'Efficiency Level': {u'EL0': u'el0',
u'EL1': u'el1',
u'EL2': u'el2',
u'EL3': u'el3',
u'EL4': u'el4'},
u'HP': {0.25: 1.0,
0.33: 2.0,
0.5: 3.0,
0.75: 4.0,
1.0: 5.0,
1.5: 6.0,
2.0: 7.0,
3.0: 8.0,
10.0: 9.0,
15.0: 10.0},
u'Sector': {u'Commercial': u'com',
u'Industrial': u'ind',
u'Residential': u'res'}}
I worked out one way to do it, but it seems clunky. I'm hoping there's something more elegant out there.
df['label'] = pd.DataFrame([df[column_name].apply(lambda x: labels_dict[column_name][x])
for column_name in df.columns]).apply('_'.join)
I would say this is a bit more elegant
df.apply(lambda x: '_'.join([str(labels_dict[col][v]) for col, v in zip(df.columns, x)]), axis=1)
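One caveat with the sample data in the question: the HP column holds strings ('0.25', ...) while labels_dict['HP'] is keyed by floats, so the lookup above would raise a KeyError for that column. A defensive variant (my own tweak, not the original answer) that falls back to the raw value when no mapping exists:
# Assumes df only contains the original Application/HP/Sector columns.
df['label'] = df.apply(
    lambda x: '_'.join(str(labels_dict.get(col, {}).get(v, v))
                       for col, v in zip(df.columns, x)),
    axis=1)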
I was hoping someone could help me convert my current dataframe from a wide to long format. I am using Pandas 0.18.0 and I can't seem to find any other solution on stackoverflow that fits my need.
Any help would be greatly appreciated!
I have 50 steps, each with two categories (status/time), that I need to melt; these categories alternate in my dataframe. Below is an example with only 3 sets, but this pattern continues until it reaches 50.
status can be either: yes/no/NaN
time can be either: timestamp/NaN
Current Dataframe:
cl_id cl_template_id status-1 time-1 status-2 time-2 status-3 time-3
0 18434 107 NaN NaN NaN NaN NaN NaN
1 18280 117 yes 2016-12-28T18:21:58+00:00 yes 2016-12-28T20:47:31+00:00 yes 2016-12-28T20:47:32+00:00
2 18356 413 yes 2017-01-11T19:23:10+00:00 yes 2017-01-11T19:23:11+00:00 yes 2017-01-11T19:23:11+00:00
3 18358 430 NaN NaN NaN NaN NaN NaN
4 18359 430 yes 2017-01-11T19:20:32+00:00 yes 2017-01-11T19:20:34+00:00 NaN NaN
.
.
.
Target Dataframe:
cl_id cl_template_id step status time
18434 107 1 NaN NaN
18434 107 2 NaN NaN
18434 107 3 NaN NaN
18280 117 1 yes 2016-12-28T18:21:58+00:00
18280 117 2 yes 2016-12-28T20:47:31+00:00
18280 117 3 yes 2016-12-28T20:47:32+00:00
18356 413 1 yes 2017-01-11T19:23:10+00:00
18356 413 2 yes 2017-01-11T19:23:11+00:00
18356 413 3 yes 2017-01-11T19:23:11+00:00
.
.
.
Hopefully this answer provides some insight into the problem.
First, I'll recreate an example from your dataframe:
import numpy as np
import pandas as pd

# Make example dataframe
df = pd.DataFrame({'cl_id' : [18434, 18280, 18356, 18358, 18359],
'cl_template_id' : [107, 117, 413, 430, 430],
'status_1' : [np.NaN, 'yes', 'yes', np.NaN, 'yes'],
'time_1' : [np.NaN, '2016-12-28T18:21:58+00:00', '2017-01-11T19:23:10+00:00', np.NaN, '2017-01-11T19:20:32+00:00'],
'status_2' : [np.NaN, 'yes', 'yes', np.NaN, 'yes'],
'time_2' : [np.NaN, '2016-12-28T20:47:31+00:00', '2017-01-11T19:23:11+00:00', np.NaN, '2017-01-11T19:20:34+00:00'],
'status_3' : [np.NaN, 'yes', 'yes', np.NaN, np.NaN],
'time_3' : [np.NaN, '2016-12-28T20:47:32+00:00', '2017-01-11T19:23:11+00:00', np.NaN, np.NaN]})
Second, convert time_1,2,3 into datetimes:
# Convert time_1,2,3 to datetime
df.loc[:, 'time_1'] = pd.to_datetime(df.loc[:, 'time_1'])
df.loc[:, 'time_2'] = pd.to_datetime(df.loc[:, 'time_2'])
df.loc[:, 'time_3'] = pd.to_datetime(df.loc[:, 'time_3'])
Third, split the dataframe into two, one with status and the other with time:
# Split df into a status dataframe and a time dataframe.
# Note: the label slice below assumes all status_* columns come before the time_*
# columns (as they do when the dict keys are sorted alphabetically, which older
# pandas versions did on construction).
df_status = df.loc[:, :'status_3']
df_time = df.loc[:, ['cl_id', 'cl_template_id']].merge(df.loc[:, 'time_1':],
                                                       left_index = True,
                                                       right_index = True)
Fourth, melt the status and time dataframes:
# Melt status
df_status = df_status.melt(id_vars = ['cl_id', 'cl_template_id'],
                           value_vars = ['status_1', 'status_2', 'status_3'],
                           var_name = 'step',
                           value_name = 'status')

# Melt time
df_time = df_time.melt(id_vars = ['cl_id', 'cl_template_id'],
                       value_vars = ['time_1', 'time_2', 'time_3'],
                       var_name = 'step',
                       value_name = 'time')
Fifth, clean up the 'step' column in both the status and time dataframes to only keep the number:
# Clean step in status, time
df_status.loc[:, 'step'] = df_status.loc[:, 'step'].str.partition('_')[2]
df_time.loc[:, 'step'] = df_time.loc[:, 'step'].str.partition('_')[2]
Sixth, merge the status and time dataframes back together into the final dataframe:
# Merge status, time back together on cl_id, cl_template_id, step
final = (df_status.merge(df_time,
                         how = 'inner',
                         on = ['cl_id', 'cl_template_id', 'step'])
                  .sort_values(by = ['cl_template_id', 'cl_id'])
                  .reset_index(drop = True))
Voila! final now holds the answer you were looking for.
Old thread, but I was facing the same issue and I think this answer by Ted Petrou could have helped you perfectly here: Pandas Melt several groups of columns into multiple target columns by name
pd.wide_to_long(df, stubnames, i, j, sep, suffix)
In short: The pd.wide_to_long() function allows you to specify the common component between the various columns you're trying to unpivot.
My dataframe was similar to yours: pd.melt and pd.unstack get you close, but they don't allow you to target these incremental groups of columns by their common denominator, whereas pd.wide_to_long does.
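For example, with the underscore-separated column layout reconstructed earlier in this thread (cl_id, cl_template_id, status_1, time_1, ...), a sketch of the call could look like the following; the exact column names are an assumption on my part, and with the original hyphenated names (status-1, time-1) you would pass sep='-' instead:
long_df = pd.wide_to_long(df,
                          stubnames=['status', 'time'],
                          i=['cl_id', 'cl_template_id'],
                          j='step',
                          sep='_',
                          suffix=r'\d+').reset_index()
This gives one row per (cl_id, cl_template_id, step) with status and time columns, which matches the target dataframe.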
I use data from a past kaggle challenge based on panel data across a number of stores and a period spanning 2.5 years. Each observation includes the number of customers for a given store-date. For each store-date, my objective is to compute the average number of customers that visited this store during the past 60 days.
Below is code that does exactly what I need. However, it takes forever - it would take a whole night to process the c. 800k rows. I am looking for a clever way to achieve the same objective faster.
I have included 5 observations of the initial dataset with the relevant variables: store id (Store), Date and number of customers ("Customers").
Note:
For each row in the iteration, I end up writing the results using .loc instead of e.g. row["Lagged No of customers"] because "row" does not write anything in the cells. I wonder why that's the case.
I normally populate new columns using "apply, axis=1", so I would really appreciate any solution based on that. I found that "apply" works fine when, for each row, the computation is done across columns using values at the same row level. However, I don't know how an "apply" function can involve different rows, which is what this problem requires. The only exception I have seen so far is "diff", which is not useful here.
Thanks.
Sample data:
pd.DataFrame({
'Store': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Customers': {0: 668, 1: 578, 2: 619, 3: 635, 4: 785},
'Date': {
0: pd.Timestamp('2013-01-02 00:00:00'),
1: pd.Timestamp('2013-01-03 00:00:00'),
2: pd.Timestamp('2013-01-04 00:00:00'),
3: pd.Timestamp('2013-01-05 00:00:00'),
4: pd.Timestamp('2013-01-07 00:00:00')
}
})
Code that works but is incredibly slow:
import pandas as pd
import numpy as np
data = pd.read_csv("Rossman - no of cust/dataset.csv")
data.Date = pd.to_datetime(data.Date)
data.Customers = data.Customers.astype(int)
for index, row in data.iterrows():
    d = row["Date"]
    store = row["Store"]
    time_condition = (d - data["Date"] < np.timedelta64(60, 'D')) & (d > data["Date"])
    sub_df = data.loc[time_condition & (data["Store"] == store), :]
    data.loc[(data["Date"] == d) & (data["Store"] == store), "Lagged No customers"] = sub_df["Customers"].sum()
    data.loc[(data["Date"] == d) & (data["Store"] == store), "No of days"] = len(sub_df["Customers"])
    if len(sub_df["Customers"]) > 0:
        data.loc[(data["Date"] == d) & (data["Store"] == store), "Av No of customers"] = int(sub_df["Customers"].sum() / len(sub_df["Customers"]))
Given your small sample data, I used a two day rolling average instead of 60 days.
>>> (pd.rolling_mean(data.pivot(columns='Store', index='Date', values='Customers'), window=2)
.stack('Store'))
Date Store
2013-01-03 1 623.0
2013-01-04 1 598.5
2013-01-05 1 627.0
2013-01-07 1 710.0
dtype: float64
By taking a pivot of the data with dates as your index and stores as your columns, you can simply take a rolling average. You then need to stack the stores to get the data back into the correct shape.
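Side note (not part of the original answer): pd.rolling_mean was removed in later pandas versions; on anything reasonably recent the equivalent is the .rolling accessor, for example:
(data.pivot(columns='Store', index='Date', values='Customers')
     .rolling(window=2)
     .mean()
     .stack('Store'))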
Here is some sample output of the original data prior to the final stack:
Store 1 2 3
Date
2015-07-29 541.5 686.5 767.0
2015-07-30 534.5 664.0 769.5
2015-07-31 550.5 613.0 822.0
After .stack('Store'), this becomes:
Date Store
2015-07-29 1 541.5
2 686.5
3 767.0
2015-07-30 1 534.5
2 664.0
3 769.5
2015-07-31 1 550.5
2 613.0
3 822.0
dtype: float64
Assuming the above is named df, you can then merge it back into your original data as follows:
data.merge(df.reset_index(),
how='left',
on=['Date', 'Store'])
EDIT:
There is a clear seasonal pattern in the data for which you may want to make adjustments. In any case, you probably want your rolling average to be in multiples of seven to represent even weeks. I've used a time window of 63 days in the example below (9 weeks).
In order to avoid losing data on stores that have just opened (and those at the start of the time period), you can specify min_periods=1 in the rolling mean function. This will give you the average value over all available observations within your given time window.
df = data.loc[data.Customers > 0, ['Date', 'Store', 'Customers']]
result = (pd.rolling_mean(df.pivot(columns='Store', index='Date', values='Customers'),
                          window=63, min_periods=1)
            .stack('Store'))
result.name = 'Customers_63d_mvg_avg'
df = df.merge(result.reset_index(), on=['Store', 'Date'], how='left')
>>> df.sort_values(['Store', 'Date']).head(8)
Date Store Customers Customers_63d_mvg_avg
843212 2013-01-02 1 668 668.000000
842103 2013-01-03 1 578 623.000000
840995 2013-01-04 1 619 621.666667
839888 2013-01-05 1 635 625.000000
838763 2013-01-07 1 785 657.000000
837658 2013-01-08 1 654 656.500000
836553 2013-01-09 1 626 652.142857
835448 2013-01-10 1 615 647.500000
To more clearly see what is going on, here is a toy example:
s = pd.Series([1,2,3,4,5] + [np.NaN] * 2 + [6])
>>> pd.concat([s, pd.rolling_mean(s, window=4, min_periods=1)], axis=1)
0 1
0 1 1.0
1 2 1.5
2 3 2.0
3 4 2.5
4 5 3.5
5 NaN 4.0
6 NaN 4.5
7 6 5.5
The window is four observations, but note that the final value of 5.5 equals (5 + 6) / 2. The 4.0 and 4.5 values are (3 + 4 + 5) / 3 and (4 + 5) / 2, respectively.
In our example, the NaN rows of the pivot table do not get merged back into df because we did a left join and all the rows in df have one or more Customers.
You can view a chart of the rolling data as follows:
df.set_index(['Date', 'Store']).unstack('Store').plot(legend=False)