Pandas: How to delete duplicates in rows and do multiple topic matching - python

I have the following dataframe dfstart where the first column holds different comments containing a variety of different topics. The labels column contains keywords that are associated with the topics.
Using a second dataframe matchlist
I want to create the final dataframe dffinal where for each comment you can see both the labels and the topics that occur in that comment. I also want the labels to only occur once per row.
I tried eliminating the duplicate labels through a for loop:
for label in matchlist['label']:
    if dfstart[label[n]] == dfstart[label[n-1]]:
        dfstart['label'] == np.nan
However, this doesn't seem to work. Further, I have managed to merge dfstart with matchlist to have the first topic displayed in the dataframe. The code I used for that is
df2 = pd.merge(df, matchlist, on='label1')
Of course, I could keep renaming the label column in matchlist and keep repeating the process, but this would take a long time and would not be efficient because my real dataframe is much larger than this toy example. So I was wondering if there was a more elegant way to do this.
Here are three toy dataframes:
import numpy as np
import pandas as pd

d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
dfstart = pd.DataFrame(data=d)
dfstart[['label1', 'label2', 'label3']] = dfstart.label.str.split(",", expand=True)
d3 = {'label':["boxing","election","rain"], 'topic': ["sport","politics","weather"]}
matchlist = pd.DataFrame(data=d3)
d2 = {'comment': ["comment1", "comment2", "comment3"],
      'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"],
      'label1': ["boxing", "boxing", "election"], 'label2': ["election", np.nan, "rain"],
      'label3': ["rain", np.nan, np.nan], 'topic1': ["sport", "sport", "politics"],
      'topic2': ["politics", np.nan, "weather"], 'topic3': ["weather", np.nan, np.nan]}
dffinal = pd.DataFrame(data=d2)
Thanks for your help!

Use str.extractall instead of str.split so you can obtain all matches in one go, then flatten the results, map them to your matchlist, and finally concat everything together:
d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
df = pd.DataFrame(d)
matchlist = pd.DataFrame({'label': ["boxing", "election", "rain"],
                          'topic': ["sport", "politics", "weather"]})

s = matchlist.set_index("label")["topic"]
pattern = "|".join(f"(?P<label{num}>{i})" for num, i in enumerate(s.index, 1))
found = df["label"].str.extractall(pattern).groupby(level=0).first()
topics = (found.apply(lambda d: d.map(s))
               .rename(columns={f"label{i}": f"topic{i}" for i in range(1, 4)}))
print(pd.concat([df, found, topics], axis=1))
comment label label1 label2 label3 topic1 topic2 topic3
0 comment1 boxing, election, rain boxing election rain sport politics weather
1 comment2 boxing, boxing boxing NaN NaN sport NaN NaN
2 comment3 election, rain, election NaN election rain NaN politics weather
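If you would rather stay with str.split, here is a hedged alternative sketch (not part of the answer above) that dedupes the labels per comment and then maps them to topics; it reuses dfstart and matchlist from the question:
s = matchlist.set_index('label')['topic']

# Split each comment's labels, strip whitespace, and drop duplicates while keeping order.
labels = (dfstart['label'].str.split(',')
          .apply(lambda lst: list(dict.fromkeys(x.strip() for x in lst))))

# Spread the deduplicated lists into label1, label2, ... columns (shorter lists are padded with None).
wide = pd.DataFrame(labels.tolist(), index=dfstart.index)
wide.columns = [f'label{i + 1}' for i in range(wide.shape[1])]

# Map every label column to its topic and rename label -> topic.
topics = wide.apply(lambda col: col.map(s))
topics.columns = [c.replace('label', 'topic') for c in wide.columns]

dffinal = pd.concat([dfstart[['comment', 'label']], wide, topics], axis=1)
This fills the label/topic columns in first-seen order per comment, which matches the dffinal layout shown in the question.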

Related

Pandas Dataframe Getting a count of semi-unique values from columns in a CSV

I don't think my title accurately conveys my question but I struggled on it for a bit.
I have a range of CSV files. These files contain column names and values. My current code works exactly as I want it to, in that it groups the data by time and then gets me a count of uses per hour and revenue per hour.
However, I now want to refine this: in my CSV there is a column called Machine Name. Each value in this column is unique, but they share the same naming scheme; they can be, for example, Dryer #39, Dryer #38, Washer #1, or Washer #12. What I want is a count of dryers and washers used per hour; I do not care what number washer or dryer it was, just that it was a washer or dryer.
Here is my code.
for i in range(1):  # len(csvList)
    df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
        machines_used_per_hour=('Machine Name', 'count'),
        revenue_per_hour=('Total Revenue', 'sum')
    ).reset_index()  # Reset the index for the timestamp column
    for j in df.iterrows():
        dbInsert = """INSERT INTO `store-machine-use`(store_id, timestamp, machines_used_per_hour, revenue_per_hour, notes) VALUES (%s, %s, %s, %s, %s)"""
        values = (int(storeNumberList[i]), str(j[1]['Timestamp']), int(j[1]['machines_used_per_hour']), int(j[1]['revenue_per_hour']), '')
        cursor.execute(dbInsert, values)
        cnx.commit()
This data enters the database and looks like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, Test
I want to get an individual count of the types of machines used every hour, in the case of my example it would look like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, washers_per_hour, dryers_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, 1, 3, Test
You could use pd.Series.str.startswith and then sum in the aggregation:
df['is_dryer'] = df['Machine Name'].str.startswith('Dryer')
df['is_washer'] = df['Machine Name'].str.startswith('Washer')
df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
    machines_used_per_hour=('Machine Name', 'count'),
    revenue_per_hour=('Total Revenue', 'sum'),
    washers_per_hour=('is_washer', 'sum'),
    dryers_per_hour=('is_dryer', 'sum')
).reset_index()  # Reset the index for the timestamp column
Note that if you need more complex pattern matching to determine which machine belongs to which category, you can use regexes with pd.Series.str.match.
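For instance, a quick hedged sketch with str.match (the 'Tumble Dryer' name is hypothetical, only to show why a regex can help):
# Hypothetical: count both "Dryer #12" and "Tumble Dryer #12" as dryers.
df['is_dryer'] = df['Machine Name'].str.match(r'(?:Tumble )?Dryer\b')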
example
for instance with some fake data, if I have:
dataframe = pd.DataFrame(
    {"machine": ["Dryer #1", "Dryer #2", "Washer #43", "Washer #89", "Washer #33"],
     "aggregation_key": [1, 2, 1, 2, 2]}
)
after creating the boolean columns with
dataframe["is_dryer"] = dataframe.machine.str.startswith("Dryer")
dataframe["is_washer"] = dataframe.machine.str.startswith("Washer")
dataframe will be
machine aggregation_key is_dryer is_washer
0 Dryer #1 1 True False
1 Dryer #2 2 True False
2 Washer #43 1 False True
3 Washer #89 2 False True
4 Washer #33 2 False True
and then aggregation gives you what you want:
dataframe.groupby(dataframe["aggregation_key"]).agg(
    washers_per_hour=('is_washer', 'sum'),
    dryers_per_hour=('is_dryer', 'sum')
).reset_index()
result will be
aggregation_key washers_per_hour dryers_per_hour
0 1 1 1
1 2 2 1
You can use a regex to strip the common machine-number pattern, creating a Machine Type series which you can then use to aggregate on.
df['Machine Type'] = df['Machine Name'].str.replace(r' #[0-9]+', '', regex=True)
You can then group on the Machine Type together with the hourly timestamp:
df = df.groupby([df['Timestamp'].dt.floor('h'), 'Machine Type']).agg(
    machines_used_per_hour=('Machine Type', 'count'),
    revenue_per_hour=('Total Revenue', 'sum')
).reset_index()
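If you need the wide layout from the question, with washers_per_hour and dryers_per_hour as their own columns, a hedged follow-up sketch is to pivot this long result; the target column names are assumed from the question:
# Pivot the per-hour, per-type counts into one column per machine type.
wide = (df.pivot(index='Timestamp', columns='Machine Type', values='machines_used_per_hour')
          .fillna(0).astype(int)
          .rename(columns={'Washer': 'washers_per_hour', 'Dryer': 'dryers_per_hour'}))
# Total revenue per hour can be recovered by summing the per-type revenue.
wide['revenue_per_hour'] = df.groupby('Timestamp')['revenue_per_hour'].sum()
wide = wide.reset_index()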

Create set with exclusion rules from a dataframe

I would like to create groups from my data frame.
Teams with a 1 in the corresponding row/column cannot be in the same group.
How can I create the largest possible groups and find the minimum number of groups?
Idea
There are 5 teams (50 in the original dataframe). For some reason, some teams have players in common. The dataframe shows a 1 if two teams have a player in common; if not, the cell is filled with NaN.
How many and which teams can play together at the same time?
Here is a sample dataframe:
pd.DataFrame.from_dict(data={'team1': [np.nan, 1.0, 1.0, np.nan, np.nan],
                             'team2': [1.0, np.nan, np.nan, np.nan, np.nan],
                             'team3': [1.0, np.nan, np.nan, np.nan, 1.0],
                             'team4': [np.nan, np.nan, np.nan, np.nan, np.nan],
                             'team5': [np.nan, np.nan, 1.0, np.nan, np.nan]},
                       orient='index',
                       columns=['team1', 'team2', 'team3', 'team4', 'team5'])
       team1  team2  team3  team4  team5
team1    NaN    1.0    1.0    NaN    NaN
team2    1.0    NaN    NaN    NaN    NaN
team3    1.0    NaN    NaN    NaN    1.0
team4    NaN    NaN    NaN    NaN    NaN
team5    NaN    NaN    1.0    NaN    NaN
Expected output
In this easy case the minimum number of groups is 2, and a possible solution is:
group1 = ['team1', 'team4', 'team5']
group2 = ['team2', 'team3']
Creating these groups is a combinatorial problem, and therefore not one you should try to solve using only pandas. One way of creating the groups is to model it as an assignment problem (for more information, see for instance https://developers.google.com/optimization/assignment/overview), which is what I tried below. As such, I introduce some additional technology, specifically ortools' CP-SAT solver.
This is the code:
import numpy as np
import pandas as pd
from ortools.sat.python import cp_model


def solve(teams):
    model = cp_model.CpModel()
    num_teams = len(teams)

    # Objective function: minimize Z, the total number of groups, which is an integer.
    # Incrementing Z amounts to creating a new group, with a new integer value.
    z = model.NewIntVar(1, num_teams, '')
    model.Minimize(z)

    # Create integer variables: one for each team, storing the integer representing
    # the group to which that team belongs.
    x = {}
    for i in range(num_teams):
        x[i] = model.NewIntVar(1, num_teams, '')

    # Constraint set 1: one constraint for each combination of teams that is not
    # allowed to be in the same group.
    idx = np.triu_indices(num_teams, k=1)
    for i, j in zip(*idx):
        if teams[i][j]:
            model.Add(x[i] != x[j])

    # Constraint 2: linking z, the group "counter", to our team variables.
    for i in range(num_teams):
        model.Add(z - x[i] >= 0)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
        print(f"Total number of groups required: {solver.ObjectiveValue()}")
        for i in range(num_teams):
            print(f"Team {i + 1}'s group: {solver.Value(x[i])}")


if __name__ == '__main__':
    team_data = pd.DataFrame.from_dict(data={'team1': [np.nan, 1.0, 1.0, np.nan, np.nan],
                                             'team2': [1.0, np.nan, np.nan, np.nan, np.nan],
                                             'team3': [1.0, np.nan, np.nan, np.nan, 1.0],
                                             'team4': [np.nan, np.nan, np.nan, np.nan, np.nan],
                                             'team5': [np.nan, np.nan, 1.0, np.nan, np.nan]},
                                        orient='index',
                                        columns=['team1', 'team2', 'team3', 'team4', 'team5'])
    team_data = team_data.fillna(0)
    solve(team_data.to_numpy())
Running this on the input data you provided produces:
Total number of groups required: 2.0
Team 1's group: 1
Team 2's group: 2
Team 3's group: 2
Team 4's group: 1
Team 5's group: 1
When I run this for 50 teams, with randomly generated constraints regarding which teams cannot join the same group, the solver finishes in 15 ~ 20 seconds. I don't know what your use case is, so I don't know whether this is acceptable for you.
Disclaimer: this solution will probably not scale well. If you need to tackle larger problem instances, you'll want a more sophisticated model. For help with that, you might consider posting somewhere like https://cs.stackexchange.com/ or https://math.stackexchange.com/, or you can do some more research by yourself.
I still decided to post my solution, since I think it's relatively easy to understand and can handle the size of problem you seem to be facing (at least if you're planning to run it offline).
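If exact minimality is not required, this can also be treated as greedy graph coloring; a minimal sketch, assuming networkx is installed and reusing team_data from the snippet above (a greedy heuristic may use more groups than the optimum on some inputs):
import networkx as nx

# Build a conflict graph: an edge means two teams share a player and cannot be grouped together.
G = nx.Graph()
G.add_nodes_from(team_data.index)
conflicts = team_data.to_numpy()
for i, a in enumerate(team_data.index):
    for j, b in enumerate(team_data.index):
        if j > i and conflicts[i][j]:
            G.add_edge(a, b)

# Teams assigned the same color can play at the same time.
coloring = nx.coloring.greedy_color(G, strategy='largest_first')
print(coloring)  # e.g. {'team1': 0, 'team3': 1, 'team2': 1, 'team5': 0, 'team4': 0}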

How to handle records in dataframe with same ID but some different values in columns in python

I am working on a pandas dataframe with bank (loan) details for customers. The problem is that some unique loan IDs have been recorded twice, with different values for some of the features. I am attaching a screenshot to be more specific.
As you can see, this unique Loan ID has been recorded twice. I want to drop the second record, the one with NaN values, but I can't do it manually because there are 4900 similar cases. Any ideas?
The problem is not the NaN values themselves but the duplicate records: I want to drop rows with NaN values only for the duplicated records, not for the entire dataframe.
Thanks in advance
Count how many rows each ID has, and then drop the NaN rows only where there is more than one row:
df['flag'] = df.groupby(['Loan ID', 'Credit ID'])['Loan ID'].transform('count')
dupes = df.loc[df['flag'] > 1].dropna(subset=['Credit Score', 'Annual Income'])
df = pd.concat([df.loc[df['flag'] == 1], dupes]).sort_index().drop('flag', axis=1)
Instead of dropping nan rows, just take the rows where credit score or annual income is not nan:
df = df[df['Credit Score'].notna()]
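Another hedged option, assuming the goal is simply to keep, for each Loan ID, the populated record rather than its NaN duplicate (column names taken from the question): sort so that rows with a Credit Score come first, then drop the duplicates.
# Rows with a non-NaN Credit Score sort first within each Loan ID, so keep='first'
# retains the populated record and discards the NaN duplicate.
df = (df.sort_values('Credit Score', na_position='last')
        .drop_duplicates(subset='Loan ID', keep='first')
        .sort_index())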

Melt dataframe with first two rows as variables

Apologies, but this has me stumped. I thought I could pass the following dataframe into a simple pd.melt using iloc to reference my variables, but it wasn't working for me (I'll post the error in a moment).
sample df
Date, 0151, 0561, 0522, 0912
0,Date, AVG Review, AVG Review, Review, Review
1,Date NaN NaN NaN NaN
2,01/01/18 2 2.5 4 5
So as you can see, my ID is in the top row, the type of review is in the 2nd row, the date sits in the first column, and the review observations are in rows by date.
What I'm trying to do is melt this df to get the following:
ID, Date, Review, Score
0151, 01/01/18, Average Review 2
I thought I could be cheeky and just pass the following
pd.melt(df, id_vars=[df.iloc[0]], value_vars=df.iloc[1])
but this threw the error 'Series' objects are mutable, thus they cannot be hashed
I've had a look at similar answers using pd.melt, and perhaps reshape or unpivot, but I'm lost on how I should proceed.
Any help is much appreciated.
Edit for Nixon :
My first Row has my unique IDs
2nd row has my observation, which in this case is a type of review (average, normal)
3rd row onward has the variables assigned to the above observation - let's call this score.
1st column has my dates which have the score across by row.
An alternative to pd.melt is to set your rows as column levels of a multiindex and then stack them. Your metadata will be stored as an index rather than column though. Not sure if that matters.
df = pd.DataFrame([
    ['Date', '0151', '0561', '0522', '0912'],
    ['Date', 'AVG Review', 'AVG Review', 'Review', 'Review'],
    ['Date', 'NaN', 'NaN', 'NaN', 'NaN'],
    ['01/01/18', 2, 2.5, 4, 5],
])
df = df.set_index(0)
df.index.name = 'Date'
df.columns = pd.MultiIndex.from_arrays([df.iloc[0, :], df.iloc[1, :]], names=['ID', 'Review'])
df = df.drop(df.index[[0, 1, 2]])
df.stack('ID').stack('Review')
Output:
Date ID Review
01/01/18 0151 AVG Review 2
0522 Review 4
0561 AVG Review 2.5
0912 Review 5
dtype: object
You can easily revert index to columns with reset_index.
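For example (a hedged follow-up), to get flat columns close to the ID, Date, Review, Score layout asked for in the question ('Score' is an assumed column name):
# Turn the stacked Series back into ordinary columns.
long_df = df.stack('ID').stack('Review').rename('Score').reset_index()
print(long_df[['ID', 'Date', 'Review', 'Score']])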

Extend a pandas dataframe to include 'missing' weeks

I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals; each date falls on the Monday of its calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe, which includes an entry for each week that I want with a zero quantity and then merging these two dataframes and dropping the dummy dataframe column. This results in a 3rd padded dataframe.
I don't feel this is a great solution to the problem and, being new to pandas, I wanted to know if there is a more specific and/or pythonic way to achieve this, probably without having to create a dummy dataframe and then merge.
The code I used is below to get my current solution:
# Create the dummy product
# Week holds the week date of the order; want to set this as the index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.head(1) # First date in entire dataset
last_date = group_by_product_name.tail().index[-1] # last date in the data set
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week':bdates, 'DummyQty':qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], axis=1, inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the index of one of the dataframes loses its dtype and becomes object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week': ['2018-05-28', '2018-06-04', '2018-06-11', '2018-06-25'],
                    'Qty': [100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates between the date range
df_zero = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-06-04', '2018-06-11',
                                 '2018-06-18', '2018-06-25', '2018-07-02'],
                        'Dummy Qty': [0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11',
                                         '2018-06-18', '2018-06-25', '2018-07-02'],
                                'Product Name': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
                                'Qty': [100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected results by using pivot and resample to fill in any missing weeks, like the following (Week needs to be a datetime for resample to work):
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])
results = df_all_products.groupby(
    ['Week', 'Product Name']
)['Qty'].sum().reset_index().pivot(
    index='Week', columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name      A      B
Week
2018-05-21    100.0  300.0
2018-05-28    200.0    0.0
2018-06-04      0.0    0.0
2018-06-11    400.0    0.0
2018-06-18      0.0  500.0
2018-06-25    600.0    0.0
2018-07-02    700.0    0.0
So if you want to get the df for Product Name A, you can do results['A'].
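Alternatively, for a single already-aggregated series like df1 in the question, reindexing against a complete weekly date range is a common pattern; a minimal sketch, assuming the index is first converted to datetime (extend the endpoints if you also need padding before the first or after the last order):
# Build the full set of Mondays covering the observed range, then fill the gaps with 0.
df1.index = pd.to_datetime(df1.index)
full_weeks = pd.date_range(df1.index.min(), df1.index.max(), freq='W-MON')
df_padded = df1.reindex(full_weeks, fill_value=0)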
