I need help making pandas perform better with dataframe interactions - python

I'm a newbie and have been studying pandas for a few days, and started my first project with it. I wanted to use it to create a product stock prediction timeline for the current month.
Basically I get the stock and predicted daily reduction and trace a line from today to the end of the month with the predicted stock. Also, if there is a purchase order to be delivered on day XYZ, I add the delivery amount on that day.
I have a dataframe that contains the stock for today and the predicted daily reduction for this month:
ITEM  STOCK  DAILY_DEDUCTION
A     1000   20
B     2000   15
C     800    8
D     10000  100
And another dataframe that contains pending purchase orders and amount that will be delivered.
ITEM  DATE        RECEIVING_AMOUNT
A     2018-05-16  20
B     2018-05-23  15
A     2018-05-17  8
D     2018-05-29  100
I created this loop to iterate through the dataframe and do the following:
subtract the DAILY_DEDUCTION for the item
if the date is the same as a purchase order date, then add the RECEIVING_AMOUNT
df_dates = pd.date_range(start=today, end=endofmonth, freq='D')
temptable = []
for row in df_stock.itertuples(index=True):
    predicted_stock = getattr(row, "STOCK")
    item = getattr(row, "ITEM")
    for date in df_dates:
        date_format = date.strftime('%Y-%m-%d')
        predicted_stock = predicted_stock - getattr(row, "DAILY_DEDUCTION")
        order_qty = df_purchase_orders.loc[(df_purchase_orders['DATE'] == date_format)
                                           & (df_purchase_orders['ITEM'] == item), 'RECEIVING_AMOUNT']
        if len(order_qty.index) > 0:
            predicted_stock = predicted_stock + order_qty.item()
        lista = [date_format, item, int(predicted_stock)]
        temptable.append(lista)
And... well, it did the job, but it's quite slow. I run this on roughly 100k rows, and I was hoping for some insight on how to solve this problem in a way that performs better.
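One vectorized direction, as a rough sketch rather than a drop-in answer (it assumes the df_stock and df_purchase_orders layouts shown above and that today/endofmonth are defined as in the loop): build one row per item and day with a cross join, turn the daily deduction into a cumulative one, and add the cumulative delivered amounts per item.

import pandas as pd

# Sketch only: assumes df_stock, df_purchase_orders, today and endofmonth as above,
# and pandas >= 1.2 for how='cross'.
df_dates = pd.date_range(start=today, end=endofmonth, freq='D')

# One row per (item, day).
grid = df_stock.merge(pd.DataFrame({'DATE': df_dates}), how='cross')

# Days elapsed, counting today as the first deduction day (matches the original loop).
grid['DAYS'] = (grid['DATE'] - grid['DATE'].min()).dt.days + 1

# Total received per item and day, attached to the grid.
po_daily = (df_purchase_orders
            .assign(DATE=pd.to_datetime(df_purchase_orders['DATE']))
            .groupby(['ITEM', 'DATE'], as_index=False)['RECEIVING_AMOUNT'].sum())
grid = grid.merge(po_daily, on=['ITEM', 'DATE'], how='left')
grid['RECEIVING_AMOUNT'] = grid['RECEIVING_AMOUNT'].fillna(0)

# Predicted stock = starting stock - cumulative deduction + cumulative deliveries.
grid = grid.sort_values(['ITEM', 'DATE'])
grid['PREDICTED_STOCK'] = (grid['STOCK']
                           - grid['DAYS'] * grid['DAILY_DEDUCTION']
                           + grid.groupby('ITEM')['RECEIVING_AMOUNT'].cumsum())

result = grid[['DATE', 'ITEM', 'PREDICTED_STOCK']]

This replaces the per-row lookups with a couple of merges and group operations, which is usually where the speedup comes from on 100k rows.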

Related

How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the date with recorded NAV data three months earlier. Should I use the max() function with filter() to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code  date        NAV
fund 1     2021-01-04  1.0000
fund 1     2021-01-05  1.0001
fund 1     2021-01-06  1.0023
...        ...         ...
fund 2     2020-02-08  1.0000
fund 2     2020-02-09  0.9998
fund 2     2020-02-10  1.0001
...        ...         ...
fund 3     2022-05-04  2.0021
fund 3     2022-05-05  2.0044
fund 3     2022-05-06  2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
If this were in Excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data. The white part is the procedure I need for the calculation, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 lines before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check first whether the data used for the division exists.
What you need is an implementation of the sliding-window maximum (for your example, a 1-week / 7-day window).
I could recreate your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

rows = []
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
import datetime
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # Drop smaller values from the back, and anything older than the window from the front.
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The results will look like this, with an added max column. Resetting the queue on a fund-code change assumes that each fund's rows are stored contiguously, but you could also keep a separate max_queue per fund instead.
Using a max queue to keep track of only the maximum in the window gives the correct O(n) complexity for this kind of solution, which matters if you are dealing with huge datasets and especially with bigger date ranges (instead of a week).
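If you would rather stay inside pandas, here is a hedged sketch using pd.merge_asof for the 3-month case from the question: for each row it looks up, within the same fund, the most recent NAV recorded on or before date - 91 days, and leaves NaN when no such record exists (so you can check before dividing). It assumes the df built above with datetime dates.

import pandas as pd

# For every row, find the latest NAV recorded on or before `date - 91 days` in the same fund.
df = df.sort_values('date')
lookup = df.rename(columns={'date': 'past_date', 'NAV': 'past_NAV'})
targets = df.assign(cutoff=df['date'] - pd.Timedelta(days=91))

merged = pd.merge_asof(
    targets.sort_values('cutoff'),
    lookup.sort_values('past_date'),
    left_on='cutoff', right_on='past_date',
    by='fund code', direction='backward')

# past_NAV is NaN where no record at least 91 days old exists, so the division is guarded.
merged['return_3m'] = merged['NAV'] / merged['past_NAV'] - 1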

Keep rows based on more than 2 columns

I have a database similar to the one I created below (982977 rows × 10 columns), and I want to keep the rows where the exams of the same patient (ID) that are different from "COVID" were performed within a specific period based on the date of the "COVID" exam.
I created 2 columns, one with the date 7 days before and one with the date 30 days after the original exam date.
Ex: If the patient had an iron exam between 7 days before and 30 days after the date of their COVID exam, then I keep that patient; otherwise, I remove them.
I did a for loop, but since the database is big it took almost 6 hours to complete, and when it finished I lost the connection to the server and couldn't continue to manipulate the data.
Is there a simpler and/or faster way to do this?
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = ['2021-02-22','2021-02-20','2021-06-22','2021-05-22','2021-05-29']
Date7 = ['2021-02-15','2021-02-13','2021-06-15','2021-05-15','2021-05-22']
Date30 = ['2021-03-24','2021-03-22','2021-07-24','2021-05-22','2021-06-29']
teste = list(zip(ID, Exam, Date, Date7, Date30))
teste2 = pd.DataFrame(teste, columns=['ID','Exam','Date', 'Date7', 'Date30'])
All the date columns are already in datetime.
pacients = []
for pacient in teste2.ID.unique():
    a = teste2[teste2.ID == pacient]
    b = a[a.Exam != "COVID"]
    c = a[a.Exam == "COVID"]
    for exam_date in b.Date:
        for covid_7 in c.Date7:
            for covid_30 in c.Date30:
                if covid_7 < exam_date < covid_30:
                    pacients.append(pacient)
pacients = set(pacients)
pacients = list(pacients)
With the following sample dataframe named df
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = ['2021-02-22','2021-02-20','2021-06-22','2021-05-22','2021-06-29']
df = pd.DataFrame({'ID': ID, 'Exam': Exam, 'Date': pd.to_datetime(Date)})
you could try the following:
Step 1: Create a dataframe df_cov that covers all the time intervals around the COVID exams:
df_cov = df[['ID', 'Date']][df.Exam.eq('COVID')]
df_cov = df_cov.assign(
    Before=df_cov.Date - pd.Timedelta(days=7),
    After=df_cov.Date + pd.Timedelta(days=30)
).drop(columns='Date')
Step 2: Merge the non-COVID exams in df with df_cov on the column ID, then select the exams that are within the intervals (here with query), and then extract the remaining unique IDs:
patients = (
    df[df.Exam.ne('COVID')].merge(df_cov, on='ID', how='left')
    .query('(Before < Date) & (Date < After)')
    .ID.unique()
)
Result for the sample (I've changed the last exam date such that it won't fall into the required time interval):
array(['1'], dtype=object)
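If you need the rows themselves rather than just the patient IDs, a short follow-up (assuming the frames above) is to filter the original dataframe with those IDs:

# Keep all rows of the patients whose non-COVID exams fell inside an interval.
kept = df[df['ID'].isin(patients)]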

Is there a way to join two datasets on timestamp with an offset such that it connects time_1 with time_2 where time_2 is 2hrs earlier than time_1?

I'm trying to predict delays based on the weather 2 hours before scheduled travel. I have one dataset of travel data (call it df1) and one dataset of weather (call it df2). In order to predict the delay, I am trying to join df1 and df2 with an offset of 2 hours. That is, I want to look at the weather data 2 hours before the scheduled travel time. A pared-down view of the data would look something like this:
example df1 (travel data):
travel_data  location  departure_time                delayed
blah         KPHX      2015-04-23T15:02:00.000+0000  1
bleh         KRDU      2015-04-27T15:19:00.000+0000  0
example df2 (weather data):
location  report_time          weather_data
KPHX      2015-01-01 01:53:00  blih
KRDU      2015-01-01 09:53:00  bloh
I would like to join the data first on location and then on the timestamp with a minimum 2-hour offset. If there are multiple weather reports more than 2 hours earlier than the departure time, I would like to join the travel data with the report as close to the 2-hour offset as possible.
So far I have used
joinedDF = airlines_6m_recode.join(weather_filtered, (col("location") == col("location")) & (col("departure_time") == (col("report_date") + f.expr('INTERVAL 2 HOURS'))), "inner")
This works only for the times when the departure time and (report date - 2hrs) match exactly, so I'm losing a large percentage of my data. Is there a way to join to the next closest report date outside the 2hr buffer?
I have looked into window functions but they don't describe how to do joins.
Change the equality join condition to an inequality (report time at most two hours before departure) and keep the latest report timestamp within each partition.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Join on location, keeping reports at least 2 hours before departure.
# 2. Partition by flight (location + departure time), order by report_time_ts desc, add row_number.
# 3. Filter row_number == 1 to keep the report closest to the 2-hour offset.
joinedDF = airlines_6m_recode.join(
    weather_filtered,
    (airlines_6m_recode["location"] == weather_filtered["location"])
    & (weather_filtered["report_time_ts"] <= airlines_6m_recode["departure_time_ts"] - F.expr("INTERVAL 2 HOURS")),
    "inner") \
    .withColumn("row_number", F.row_number().over(
        Window.partitionBy(airlines_6m_recode["location"], airlines_6m_recode["departure_time_ts"])
              .orderBy(weather_filtered["report_time_ts"].desc())))

# Just to print the intermediate result.
joinedDF.show()
joinedDF.filter('row_number == 1').show()
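As a small usage follow-up (same assumed column names as above), keep only the closest report per flight and drop the helper column:

# One weather row per travel row: the report closest to (departure - 2 hours).
closest = joinedDF.filter(F.col('row_number') == 1).drop('row_number')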

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column that means 1-the customer complained about the purchase, 0-customer didn't complain)
claim_value (for claim = 1 it means how much the claim cost to the company, for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count for each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them before in a similar way.
Expected outcome:
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
new_temp_columns = ['claim_s', 'claim_value_s']
# Shift within each customer so the current purchase is not counted in its own past.
shifted = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['claim_s'] = shifted['claim']
df['claim_value_s'] = shifted['claim_value']
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# Take the min within each (customer, date) group so same-day purchases don't count for each other.
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
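That covers past_claims and past_claims_value; for the remaining past_purchases column, a hedged sketch along the same lines (count earlier rows per customer, then take the per-date minimum so same-day purchases don't count for each other):

# Number of previous purchases per customer (cumcount gives 0, 1, 2, ... in date order).
df = df.sort_values(['customer_ID', 'date'])
df['past_purchases'] = df.groupby('customer_ID').cumcount()

# Same-day purchases should not count for each other: keep the minimum per (customer, date).
df['past_purchases'] = df.groupby(['customer_ID', 'date'])['past_purchases'].transform('min')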

Splitting a DataFrame into Multiple Frames by Dates in Python

I fully understand there are a few versions of this question out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot which words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if (start != '30'):
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter a end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()

def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: Dataframe with no frame. How can I break this down into 100 or so Dataframes to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I will probably have to somewhere.
Thank you.
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
   X        Date
0  328  2018-01-01
1  188  2018-01-02
2  709  2018-01-03
3  259  2018-01-04
4  131  2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following:
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
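From there, running the question's calcForDateRange on every segment is just a loop over the dict (a hedged usage sketch, assuming that function takes a dataframe as in the question):

# Run the tf_idf calculation on each 20-day (or n-day) segment.
for start_key, frame in df_dict.items():
    calcForDateRange(frame)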
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of the period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
