Grouping and Summing Multiple Columns in a DataFrame


I have taken the following sample of data.
trip_id,vehicle_id,customer_id,fleet,trip_start,distance_miles,journey_duration
1,d3550e496af4,442342ac078e,Salt Lake City,2020-06-02 16:12:22,2.30266927956152,0 days 00:13:12.549351
2,2afc10228a2b,4d3ea6d8bb4b,Provo,2020-06-02 16:17:21,0.495335235709548,0 days 00:02:48.407770
3,442342ac078e,442342ac078e,Salt Lake City,2020-06-02 16:43:05,0.7933172567617909,0 days 00:15:33.417755
4,8701da8e6582,567c93d144ed,Provo,2020-06-02 19:34:40,0.9158009891104686,0 days 00:07:04.912849
5,b70fa4bc1486,391526cd2b71,Provo,2020-06-02 20:02:37,1.6248457639858709,0 days 00:11:51.821411
6,f6f0a689fc3a,2b9d754d1c4f,Provo,2020-06-02 20:57:27,0.8310125874177197,0 days 00:07:37.959237
I read this data into a df using:
df = pd.read_clipboard(sep=',')
What I'm struggling to figure out is how to create a summary table using this information. The output df I would like is below:
Here, I want to group by city, while being able to calculate the total number of unique vehicles, unique customers, trips, and the sum of the total distance and duration (in minutes) of every journey in that city.
For example, rows 0 and 2 involve 2 unique vehicles, but both trips belong to the same customer.
I have tried using groupby/summing/unique methods but have had issues when it comes to certain values I want to obtain. Any idea of where to go next? Cheers

You need to convert a few columns, and then you can just group and summarise:
df['trip_start'] = pd.to_datetime(df['trip_start'], format='%Y-%m-%d %H:%M:%S')
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])
df['Date'] = df['trip_start'].dt.strftime('%b %Y')
df.groupby(['Date', 'fleet']).agg(
    Total_Customers = ('customer_id', 'nunique'),
    Total_Vehicles = ('vehicle_id', 'nunique'),
    Total_Trips = ('trip_id', 'nunique'),
    Total_Distance = ('distance_miles', 'sum'),
    Total_Duration = ('journey_duration', 'sum'),
)
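The question also asks for the duration in minutes and for grouping by city only. A minimal sketch of that variant on the sample rows is shown here. The `duration_minutes` column and the `Total_Duration_Minutes` name are illustrative assumptions, not part of the original answer, and `io.StringIO` merely stands in for the clipboard.

```python
import io
import pandas as pd

# Sample rows copied from the question
csv_text = """trip_id,vehicle_id,customer_id,fleet,trip_start,distance_miles,journey_duration
1,d3550e496af4,442342ac078e,Salt Lake City,2020-06-02 16:12:22,2.30266927956152,0 days 00:13:12.549351
2,2afc10228a2b,4d3ea6d8bb4b,Provo,2020-06-02 16:17:21,0.495335235709548,0 days 00:02:48.407770
3,442342ac078e,442342ac078e,Salt Lake City,2020-06-02 16:43:05,0.7933172567617909,0 days 00:15:33.417755
4,8701da8e6582,567c93d144ed,Provo,2020-06-02 19:34:40,0.9158009891104686,0 days 00:07:04.912849
5,b70fa4bc1486,391526cd2b71,Provo,2020-06-02 20:02:37,1.6248457639858709,0 days 00:11:51.821411
6,f6f0a689fc3a,2b9d754d1c4f,Provo,2020-06-02 20:57:27,0.8310125874177197,0 days 00:07:37.959237
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse the duration strings, then express them as a plain number of minutes
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])
df['duration_minutes'] = df['journey_duration'].dt.total_seconds() / 60

# One row per city, using named aggregation
summary = df.groupby('fleet').agg(
    Total_Vehicles=('vehicle_id', 'nunique'),
    Total_Customers=('customer_id', 'nunique'),
    Total_Trips=('trip_id', 'count'),
    Total_Distance=('distance_miles', 'sum'),
    Total_Duration_Minutes=('duration_minutes', 'sum'),
)
print(summary)
```

Converting to minutes before aggregating keeps the result a float column, which is usually easier to format or plot than a timedelta.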

Related

Keep rows based in more than 2 columns

I have a database similar to the one I created below (with 982977 rows × 10 columns). I want to keep the rows where a patient's (ID) non-"COVID" exams were performed within a specific period around the date of that patient's "COVID" exam.
I created 2 columns, one with the date 7 days before and one with the date 30 days after the original exam date.
Ex: If the patient had an iron exam between 7 days before and 30 days after the date of their COVID exam, I would keep that patient; otherwise, I would remove them.
I wrote a for loop, but since the database is big, it took almost 6 hours to complete. When it finished, I lost the connection to the server and couldn't continue to manipulate the data.
Is there a simpler and/or faster way to do this?
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = pd.to_datetime(['2021-02-22', '2021-02-20', '2021-06-22', '2021-05-22', '2021-05-29'])
Date7 = pd.to_datetime(['2021-02-15', '2021-02-13', '2021-06-15', '2021-05-15', '2021-05-22'])
Date30 = pd.to_datetime(['2021-03-24', '2021-03-22', '2021-07-24', '2021-05-22', '2021-06-29'])
teste = list(zip(ID, Exam, Date, Date7, Date30))
teste2 = pd.DataFrame(teste, columns=['ID','Exam','Date', 'Date7', 'Date30'])
All the date columns are already in datetime format.
pacients = []
for pacient in teste2.ID.unique():
    a = teste2[teste2.ID == pacient]
    b = a[a.Exam != "COVID"]
    c = a[a.Exam == "COVID"]
    for exam_date in b.Date:
        for covid_7 in c.Date7:
            for covid_30 in c.Date30:
                if covid_7 < exam_date < covid_30:
                    pacients.append(pacient)
pacients = set(pacients)
pacients = list(pacients)

With the following sample dataframe named df
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = ['2021-02-22','2021-02-20','2021-06-22','2021-05-22','2021-06-29']
df = pd.DataFrame({'ID': ID, 'Exam': Exam, 'Date': pd.to_datetime(Date)})
you could try the following:
Step 1: Create a dataframe df_cov that covers all the time intervals around the Covid exams:
df_cov = df[['ID', 'Date']][df.Exam.eq('COVID')]
df_cov = df_cov.assign(
    Before=df_cov.Date - pd.Timedelta(days=7),
    After=df_cov.Date + pd.Timedelta(days=30)
).drop(columns='Date')
Step 2: Merge the non-Covid exams in df with df_cov on the column ID, then select the exams that are within the intervals (here with query), and then extract the remaining unique IDs:
patients = (
    df[df.Exam.ne('COVID')].merge(df_cov, on='ID', how='left')
    .query('(Before < Date) & (Date < After)')
    .ID.unique()
)
Result for the sample (I've changed the last exam date such that it won't fall into the required time interval):
array(['1'], dtype=object)
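Since the question asks to keep rows rather than just list IDs, a possible last step is to filter the original frame with `isin`. The sketch below is one self-contained way to do that on the same sample. The names `covid`, `others`, `offset` and `kept` are illustrative, and it assumes the window boundaries are exclusive, as in the query above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '1', '2', '2'],
    'Exam': ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron'],
    'Date': pd.to_datetime(['2021-02-22', '2021-02-20', '2021-06-22',
                            '2021-05-22', '2021-06-29']),
})

# Date of each patient's COVID exam, under a separate column name
covid = df.loc[df['Exam'].eq('COVID'), ['ID', 'Date']].rename(columns={'Date': 'CovidDate'})

# Pair every non-COVID exam with the COVID exam(s) of the same patient
others = df[df['Exam'].ne('COVID')].merge(covid, on='ID', how='inner')

# Signed distance from the COVID exam; keep exams from 7 days before to 30 days after
offset = others['Date'] - others['CovidDate']
in_window = (offset > pd.Timedelta(days=-7)) & (offset < pd.Timedelta(days=30))
patients = others.loc[in_window, 'ID'].unique()

# Keep every row of the qualifying patients
kept = df[df['ID'].isin(patients)]
print(kept)
```

For this sample only patient '1' qualifies, so `kept` holds that patient's three rows.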

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column that means 1-the customer complained about the purchase, 0-customer didn't complain)
claim_value (for claim = 1 it means how much the claim cost to the company, for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older. If the customer had 2 purchases on the same date, they shouldn't count for each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I have never used them for something like this.
Expected outcome:

Maybe you can try using a cumsum over customers, provided the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
new_temp_columns = ['claim_s', 'claim_value_s']
# Shift within each customer so every row only sees that customer's earlier rows
df[new_temp_columns] = df.groupby('customer_ID')[['claim', 'claim_value']].shift().fillna(0)
df['past_claims'] = df.groupby('customer_ID')['claim_s'].cumsum()
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].cumsum()
# Set the min value for the groups, so same-day purchases do not count for each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this only works if the dates are sorted.
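As an alternative sketch that also produces past_purchases, one could first aggregate per customer and date, take a running total, and subtract the current day's own totals so that same-day purchases never count for each other. The sample data and the names `daily`, `totals` and `value_cols` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: customer 'a' has two purchases on the same day
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-05',
                            '2021-01-03', '2021-01-07']),
    'customer_ID': ['a', 'a', 'a', 'b', 'b'],
    'claim': [1, 0, 1, 0, 1],
    'claim_value': [10.0, np.nan, 5.0, np.nan, 7.0],
})

value_cols = ['purchases', 'claim', 'claim_value']

# Totals per customer and day (NaN claim values count as 0)
daily = (
    df.assign(purchases=1, claim_value=df['claim_value'].fillna(0))
      .groupby(['customer_ID', 'date'], as_index=False)[value_cols]
      .sum()
      .sort_values(['customer_ID', 'date'])
)

# Running totals per customer, minus the current day = strictly earlier days only
totals = daily.groupby('customer_ID')[value_cols].cumsum()
past = totals - daily[value_cols]
past.columns = ['past_purchases', 'past_claims', 'past_claims_value']
daily = pd.concat([daily[['customer_ID', 'date']], past], axis=1)

# Attach the per-day results back to every purchase row
result = df.merge(daily, on=['customer_ID', 'date'], how='left')
print(result)
```

This replaces the per-row filtering with one groupby and one merge, which should scale much better than the loop.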

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
   orderid customerid   orderdate
0    10315      ISLAT  1996-09-26
1    10318      ISLAT  1996-10-01
2    10321      ISLAT  1996-10-03
3    10473      ISLAT  1997-03-13
4    10621      ISLAT  1997-08-05
5    10253      HANAR  1996-07-10
6    10541      HANAR  1997-05-19
7    10645      HANAR  1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only the customer ISLAT ordered within a 5-day period, and did so twice.
I would like to get the output in the following format:
Required Output
customerid  initial_order_id  initial_order_date  nextorderid  nextorderdate  daysbetween
ISLAT       10315             1996-09-26          10318        1996-10-01     5
ISLAT       10318             1996-10-01          10321        1996-10-03     2

First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
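The filter above returns only the initial order of each qualifying pair. To also get the next order alongside it, in the layout the question asks for, one possible sketch uses a per-customer `shift(-1)`. The intermediate names `nxt` and `pairs` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
    'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
    'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13',
                  '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26'],
})
df['orderdate'] = pd.to_datetime(df['orderdate'])
df = df.sort_values(['customerid', 'orderdate'])

# For every order, look up the same customer's next order
nxt = df.groupby('customerid')[['orderid', 'orderdate']].shift(-1)

pairs = pd.DataFrame({
    'customerid': df['customerid'],
    'initial_order_id': df['orderid'],
    'initial_order_date': df['orderdate'],
    'nextorderid': nxt['orderid'],
    'nextorderdate': nxt['orderdate'],
})
pairs['daysbetween'] = (pairs['nextorderdate'] - pairs['initial_order_date']).dt.days

# Rows without a next order have NaN here and are dropped by the comparison
result = pairs[pairs['daysbetween'] <= 5].reset_index(drop=True)
result['nextorderid'] = result['nextorderid'].astype(int)
print(result)
```

Because the shift is done within each customer, the last order of one customer is never paired with the first order of the next.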

It is a bit tricky because there can be any number of purchase pairs within 5-day windows. It is a good use case for merge_asof, which allows approximate (but not exact) matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe from oldest to newest purchase (merge_asof needs ascending keys; groupby will not change this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days

You can create the column 'daysbetween' with sort_values and diff. Then, to get the following order, join df with itself after a groupby on customerid that shifts all the data. Finally, query for the rows where the number of days in 'daysbetween_next' meets the condition:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = df.join(df.groupby('customerid').shift(-1),
                   lsuffix='_initial', rsuffix='_next')\
             .drop('daysbetween_initial', axis=1)\
             .query('daysbetween_next <= 5 and daysbetween_next >=0')

It's quite simple. Let's write down the requirements one at the time and try to build upon.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution can also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["customerid"] == df["customerid"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the xth row has the same customer id as row x - 1 (i.e. the previous row).
Now, let's search for purchases within the 5 days by adding that condition to the previous piece of code (this assumes orderdate has already been converted to datetime and the rows are sorted by customer and date):
new_df = df[(df["customerid"] == df["customerid"].shift(1)) & ((df["orderdate"] - df["orderdate"].shift(1)) <= pd.Timedelta(days=5))]
This should do the job. I cannot test it right now, so some fixes may be needed. I'll try to test it as soon as I can.

check for date and time between two columns in pandas data frame

I have two data frames:
The first data frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo':['aaaa','bbbb','cccc','ffff','aaaa','bbbb','aaaa'],
'Name':['Sayonti','Ruchi','Tony','Gowtam','Toffee','Tom','Sayonti'],
'testName': [4402, 3747 ,5555,8754,1234,9876,3602],
'moduleName': ['singing', 'dance','booze', 'vocals','drama','paint','singing'],
'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED','WARNING','FAILED','WARNING'],
'Date':['2018-10-5','2018-10-6','2018-10-7','2018-10-8','2018-10-9','2018-10-10','2018-10-8'],
'Time_df1':['23:26:39','22:50:31','22:15:28','21:40:19','21:04:15','20:29:11','19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo':['aaaa','bbbb','aaaa','ffff','xyzy','aaaa'],
'Food':['Strawberry','Coke','Pepsi','Nuts','Apple','Candy'],
'Work': ['AP', 'TC','OD', 'PU','NO','PM'],
'Date':['2018-10-1','2018-10-6','2018-10-2','2018-10-3','2018-10-5','2018-10-10'],
'Time_df2':['09:00:00','10:00:00','11:00:00','12:00:00','13:00:00','14:00:00']
})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x, which means Date_x + (1, 2, or 3 days) should be Date_y. I can get that as below, but I also want to check the time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check for the time such that Time_df2 is within 6 hours of start time being Time_df1. Please help?

You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime
# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(
    result['Date_x'][0].date(),
    datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(
    result['Date_y'][0].date(),
    datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
Hope that helps!
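The same combination can be applied to whole columns at once instead of a single row. A minimal sketch follows, using two rows shaped like the merged result. The column names `dt_x`, `dt_y` and `hours_difference` are illustrative, and the 78-hour threshold follows the interpretation above rather than a confirmed requirement.

```python
import pandas as pd

# Two rows shaped like the merged frame from the question
result = pd.DataFrame({
    'Date_x': pd.to_datetime(['2018-10-05', '2018-10-06']),
    'Time_df1': ['23:26:39', '22:50:31'],
    'Date_y': pd.to_datetime(['2018-10-01', '2018-10-06']),
    'Time_df2': ['09:00:00', '10:00:00'],
})

# Date (at midnight) plus time-of-day as a timedelta gives a full timestamp
result['dt_x'] = result['Date_x'] + pd.to_timedelta(result['Time_df1'])
result['dt_y'] = result['Date_y'] + pd.to_timedelta(result['Time_df2'])

# Absolute difference in hours, computed for every row at once
result['hours_difference'] = (result['dt_x'] - result['dt_y']).abs().dt.total_seconds() / 3600

# Keep the rows within 3 days and 6 hours (78 hours)
within = result[result['hours_difference'] <= 78]
print(within)
```

Only the second row survives the filter here, since the first pair is about 110 hours apart.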

I need help making pandas perform better with dataframe interactions

I'm a newbie and have been studying pandas for a few days, and started my first project with it. I wanted to use it to create a product stock prediction timeline for the current month.
Basically I get the stock and predicted daily reduction and trace a line from today to the end of the month with the predicted stock. Also, if there is a purchase order to be delivered on day XYZ, I add the delivery amount on that day.
I have a dataframe that contains the stock for today and the predicted daily reduction for this month:
ITEM  STOCK  DAILY_DEDUCTION
A     1000   20
B     2000   15
C     800    8
D     10000  100
And another dataframe that contains pending purchase orders and the amount that will be delivered:
ITEM  DATE        RECEIVING_AMOUNT
A     2018-05-16  20
B     2018-05-23  15
A     2018-05-17  8
D     2018-05-29  100
I created this loop to iterate through the dataframe and do the following:
subtract the DAILY_DEDUCTION for the item
if the date is the same as a purchase order date, then add the RECEIVING_AMOUNT
df_dates = pd.date_range(start=today, end=endofmonth, freq='D')
temptable = []
for row in df_stock.itertuples(index=True):
    predicted_stock = getattr(row, "STOCK")
    item = getattr(row, "ITEM")
    for date in df_dates:
        date_format = date.strftime('%Y-%m-%d')
        predicted_stock = predicted_stock - getattr(row, "DAILY_DEDUCTION")
        order_qty = df_purchase_orders.loc[(df_purchase_orders['DATE'] == date_format)
                                           & (df_purchase_orders['ITEM'] == item), 'RECEIVING_AMOUNT']
        # Add the delivery only when an order for this item arrives on this date
        if len(order_qty.index) > 0:
            predicted_stock = predicted_stock + order_qty.item()
        lista = [date_format, item, int(predicted_stock)]
        temptable.append(lista)
And... well, it did the job, but it's quite slow. I run this on 100k rows give or take, and was hoping to find some insight on how I can solve this problem in a way that performs better?
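One possible vectorized sketch of the same calculation builds every (item, date) combination with a cross join and then uses cumulative sums instead of nested loops. The date window, the `grid` frame and the helper column names are assumptions made for illustration, and `how='cross'` needs pandas 1.2 or newer.

```python
import pandas as pd

df_stock = pd.DataFrame({
    'ITEM': ['A', 'B', 'C', 'D'],
    'STOCK': [1000, 2000, 800, 10000],
    'DAILY_DEDUCTION': [20, 15, 8, 100],
})
df_purchase_orders = pd.DataFrame({
    'ITEM': ['A', 'B', 'A', 'D'],
    'DATE': pd.to_datetime(['2018-05-16', '2018-05-23', '2018-05-17', '2018-05-29']),
    'RECEIVING_AMOUNT': [20, 15, 8, 100],
})

# Hypothetical window standing in for today .. end of month
dates = pd.date_range(start='2018-05-15', end='2018-05-31', freq='D')

# One row per (item, date); dates stay in ascending order within each item
grid = df_stock.merge(pd.DataFrame({'DATE': dates}), how='cross')

# Total amount received per item and day, attached to the grid (0 when nothing arrives)
received = df_purchase_orders.groupby(['ITEM', 'DATE'], as_index=False)['RECEIVING_AMOUNT'].sum()
grid = grid.merge(received, on=['ITEM', 'DATE'], how='left')
grid['RECEIVING_AMOUNT'] = grid['RECEIVING_AMOUNT'].fillna(0)

# Stock = start - deduction * days elapsed + everything received so far
grid['days_elapsed'] = grid.groupby('ITEM').cumcount() + 1
grid['cum_received'] = grid.groupby('ITEM')['RECEIVING_AMOUNT'].cumsum()
grid['PREDICTED_STOCK'] = (
    grid['STOCK'] - grid['DAILY_DEDUCTION'] * grid['days_elapsed'] + grid['cum_received']
)
print(grid[['DATE', 'ITEM', 'PREDICTED_STOCK']])
```

Like the loop, this deducts the daily amount on every day of the window, including the first, and adds a delivery on the day it arrives.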
