I have the following dataframe containing start & end dates of available slots (called 'operability_df_x' in the code):
I'm trying to calculate how many hours are available per month, let's call it the operability ratio. The output is the following (called 'g' in the code):
The current code I wrote is the following; however, since I'm new to pandas & Python, I get the impression there is a lot of redundancy. Is there an easier way to achieve my goal directly, without writing 30 lines of code? Thanks in advance for any piece of advice.
for i in operability_df_x.index:
    a = operability_df_x['Start Date'].loc[i].month
    b = operability_df_x['Start Date'].loc[i].year
    # first day of the following month (rolls over to January of the next year)
    c = datetime.datetime(b + a // 12, a % 12 + 1, 1, 0, 0, 0)
    # The following condition is in case there is an interval that stretches over two months:
    window = chain_seq_glob[(chain_seq_glob['Start Date'] < c) & (chain_seq_glob['End Date'] > c)].reset_index(drop=True)
    if not window.empty:
        if operability_df_x.iloc[i].equals(window.iloc[0]):
            operability_df_x.loc[i, 'End Date'] = c
            window.loc[0, 'Start Date'] = c
            operability_df_x = pd.concat([operability_df_x, window])
operability_df_x['Duration [Hours]'] = (operability_df_x['End Date'] - operability_df_x['Start Date']) / pd.to_timedelta(1, unit='h')
operability_df_x = operability_df_x[['Start Date','End Date', 'Duration [Hours]']]
operability_df_x = operability_df_x.sort_values(by="Start Date").reset_index(drop=True)
operability_df_x.to_csv('Operab.csv')
### Final operability table:
g = operability_df_x.set_index('Start Date').groupby(pd.Grouper(freq="M")).sum().reset_index(drop=False)
g['Year/Month'] = g['Start Date'].apply(lambda x: x.strftime('%Y-%m'))
g = g.reindex(columns=['Year/Month','Duration [Hours]','Start Date'])
g.columns = ['Year/Month','Total Available Hours','End month']
g['Total Monthly Hours'] = (g['End month'].apply(lambda x: int(x.strftime('%d'))))*24
g['Operability ratio'] = g['Total Available Hours'] / g['Total Monthly Hours']
g = g.drop(columns=['End month', 'Total Monthly Hours'])
Let's consider an example dataframe:
df = pd.DataFrame({
'start': pd.to_datetime(['20200103', '20200104', '20200123', '20200205']),
'end': pd.to_datetime(['20200105', '20200107', '20200203', '20200209']),
})
Let's also define a utility function:
def intervals(row):
    if row.start.month == row.end.month:
        return [(row.start, row.end)]
    middle = row.end.replace(day=1, hour=0, minute=0, second=0)
    return [(row.start, middle), (middle, row.end)]
Now, let's use it to get a list of intervals (either one or two) for each row depending on how many months the row spans:
df['intervals'] = df.apply(intervals, axis=1)
Now, let's explode this list so that each interval has a row of its own:
df = df.explode('intervals')['intervals']
df = pd.DataFrame(df.tolist(), columns=['start', 'end'])
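With the example data above, the frame now holds five single-month intervals; the row spanning January 23 to February 3 has been split at February 1:
print(df)
#        start        end
# 0 2020-01-03 2020-01-05
# 1 2020-01-04 2020-01-07
# 2 2020-01-23 2020-02-01
# 3 2020-02-01 2020-02-03
# 4 2020-02-05 2020-02-09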
Let's add a column we'll use later for grouping:
df['month'] = df['start'].dt.strftime('%Y-%m')
And one for the number of hours between start and end:
df['hours'] = (df['end'] - df['start']) / pd.Timedelta(hours=1)
I'm sure there is a better way to get the total number of hours for a given month. I'm doing it by adding two separate columns for the beginning of the current month and the beginning of the next month. Then, I add yet another column to store the difference between the two:
df['month_start'] = df['start'].apply(
    lambda d: d.replace(day=1, hour=0, minute=0, second=0))
df['month_end'] = df['month_start'].apply(
    lambda d: d.replace(month=d.month % 12 + 1, year=d.year + d.month // 12))
df['total_hours'] = (df['month_end'] - df['month_start']) / pd.Timedelta(hours=1)
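As an aside, pandas can report the calendar length of a month directly, which would collapse the three helper columns into a single line; a minimal sketch using the dt.days_in_month accessor:
# days_in_month gives the number of days in each start date's month
df['total_hours'] = df['start'].dt.days_in_month * 24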
Finally, perform the group-by and aggregate:
df = df.groupby('month').agg({'hours': 'sum', 'total_hours': 'first'})
df['ratio'] = df['hours'] / df['total_hours']
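With the example data this produces (February 2020 has 29 days, hence 696 total hours):
print(df)
#          hours  total_hours     ratio
# month
# 2020-01  336.0        744.0  0.451613
# 2020-02  144.0        696.0  0.206897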
There are lots of built-in utility functions for dates I am not familiar with so I'm sure some of the stages could be substituted for an idiomatic expression but this works and is quite readable.
I have a data set as follows:
I need to sort the activation dates in descending order and compute the time difference between each pair of adjacent rows, per NIC. So the final output should be as follows:
Here NA represents that there is no previous activation date for that NIC. I have used the following code to sort the activation dates in descending order per NIC:
df['Activation Date'] = pd.to_datetime(df['Activation Date'])
df.sort_values(['NICS', 'Activation Date'], ascending=[False, False], inplace=True)
Now I need to get the time difference as described above. As an example, the time difference for the first row (NIC=1689687896) is NA, meaning there is no previous activation date, but for the second row (NIC=1689687896) the time difference is 1 month and 1 hour. I hope it is clear.
How should I get that time difference, considering each NIC and its activation dates?
First, get the sub-DataFrame corresponding to each NICS ID.
Then sort that DataFrame by Activation Date.
Then calculate the date difference (we get it in seconds, which then has to be converted to months and hours as per the image shown above).
Here is a small snippet performing these steps:
uniq_nics_id = pd.unique(df['NICS'])
diff_map = {}
for nics_id in uniq_nics_id:
    # work on a copy of each NIC's rows, sorted latest-first
    temp_df = df[df['NICS'] == nics_id].copy()
    temp_df['Activation Date'] = pd.to_datetime(temp_df['Activation Date'])
    temp_df.sort_values(['Activation Date'], ascending=[False], inplace=True)
    for inx in range(len(temp_df)):
        if inx == 0:
            diff_map[temp_df.index[inx]] = "N/A"
        else:
            # the previous row holds the later date, so subtract in that order
            diff = int((temp_df.iloc[inx - 1]['Activation Date'] - temp_df.iloc[inx]['Activation Date']).total_seconds())
            month = diff // (3600 * 30 * 24)  # a month approximated as 30 days
            month_str = f"{month} months" if month != 1 else f"{month} month"
            month_rem = diff % (3600 * 30 * 24)
            hours = month_rem // 3600
            hr_str = f"{hours} hours" if hours != 1 else f"{hours} hour"
            diff_map[temp_df.index[inx]] = month_str + " " + hr_str
# map the strings back to the original rows via the index
df['Difference'] = pd.Series(diff_map)
I hope this solution works for you.
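As an aside, the explicit loops can be replaced with groupby plus diff; a minimal sketch, assuming df holds the NICS and Activation Date columns described in the question:
df['Activation Date'] = pd.to_datetime(df['Activation Date'])
df = df.sort_values(['NICS', 'Activation Date'], ascending=[False, False])
# Within each NIC, diff() is NaT for the first (latest) row, matching the NA
# in the expected output; negate because the dates are in descending order
df['Difference'] = -df.groupby('NICS')['Activation Date'].diff()
Formatting the resulting Timedelta values as "X months Y hours" strings would still need the 30-day arithmetic from the loop above.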
I don't know if there is a pandas-specific way of doing that, but you could do it with a for loop:
# Create a new column filled with NA's
df["Difference"] = pd.NA

# Reset the index so the positional lookups below match the sorted row order
df = df.reset_index(drop=True)

# Set starting reference values for NICS and the previous activation date
nics = df["NICS"][0]
prev_date = df["Activation Date"][0]

# Loop through the remaining rows (assumes df is sorted by NICS and by
# Activation Date in descending order, as in the question)
for i in range(1, len(df)):
    # If the NICS is the same, store the difference to the previous row;
    # the dates descend, so subtract the current date from the previous one
    if df["NICS"][i] == nics:
        df.at[i, "Difference"] = prev_date - df["Activation Date"][i]
    # If the NICS changed, leave NA and update the reference NICS
    else:
        nics = df["NICS"][i]
    prev_date = df["Activation Date"][i]

print(df)
I have 2 DataFrames:
dfMiss:
and
dfSuper:
I need to create a final output that summarises the data in the 2 tables, which I am able to do, as shown in the code below:
dfCity = dfSuper \
    .groupby(by='City').count() \
    .drop(columns='Superhero ID') \
    .rename(columns={'Superhero': 'Total count'})
print("This is the df city : ")
print(dfCity)

## Convert column MissionEndDate to DateTime format
for df in dfMiss:
    # Dates are interpreted as MM/dd/yyyy by default, dayfirst=False
    df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
    # Get Year and Quarter, given Q1 2020 starts in April
    date = df['Mission End date'] - pd.DateOffset(months=3)
    df['Mission End quarter'] = date.dt.year.astype(str) + ' Q' + date.dt.quarter.astype(str)

## Get no. Superheros working per City per Quarter
dfCount = []
for dfM in dfMiss:
    # Merge DataFrames
    df = dfSuper.merge(dfM, left_on='Superhero ID', right_on='SID')
    df = df.pivot_table(index=['City', 'Superhero'], columns='Mission End quarter', aggfunc='nunique')
    # Get the first group (all the groups have the same values)
    df = df[df.columns[0][0]]
    # Group the values by City (effectively "collapsing" the 'Superhero' column)
    df = df.groupby(by=['City']).count()
    dfCount += [df]

## Get no. Superheros available per City per Quarter
dfFree = []
for dfC in dfCount:
    # Merge DataFrames
    df = dfCity.merge(right=dfC, on='City', how='outer').fillna(0)  # convert NaN values to 0
    # Subtract no. working superheros from total no. superheros per city
    for col in df.columns[1:]:
        df[col] = df['Total count'] - df[col]
    dfFree += [df.astype(int)]
print(dfFree)

dfResult = pd.DataFrame(dfFree)
The problem is that when I try to convert dfFree into a DataFrame, I get the error:
"ValueError: Must pass 2-d input. shape=(1, 4, 5) "
The line that raises the error is
dfResult = pd.DataFrame(dfFree)
Anyone have any idea what this means and how I can convert the list into a df?
Thanks :)
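For what it's worth, the shape (1, 4, 5) in the message suggests dfFree is a list holding a single 4x5 DataFrame, so pd.DataFrame() is being handed 3-D input. A minimal sketch of a likely fix, assuming the frames share the same columns, is to concatenate the list instead:
# pd.concat stacks the frames row-wise, keeping the input 2-D
dfResult = pd.concat(dfFree, ignore_index=True)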
Separate your code using SOLID principles (separation of concerns); as written it is not easy to read.
import pandas as pd

sid = [665544, 665544, 2121, 665544, 212121, 123456, 666666]
mission_end_date = ["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020", "15/07/2021", "03/06/2021", "12/10/2020"]
superhero_sid = [212121, 364331, 678523, 432432, 665544, 123456, 555555, 666666, 432432]
hero = ["Spiderman", "Ironman", "Batman", "Dr. Strange", "Thor", "Superman", "Nightwing", "Loki", "Wolverine"]
city = ["New York", "New York", "Gotham", "New York", "Asgard", "Metropolis", "Gotham", "Asgard", "New York"]

df_mission = pd.DataFrame({'sid': sid, 'mission_end_date': mission_end_date})
df_super = pd.DataFrame({'sid': superhero_sid, 'hero': hero, 'city': city})

df = df_super.merge(df_mission, on="sid", how="left")
# the sample dates are day-first, so parse them accordingly
df['mission_end_date'] = pd.to_datetime(df['mission_end_date'], dayfirst=True)
df['mission_end_date_quarter'] = df['mission_end_date'].dt.quarter
df['mission_end_date_year'] = df['mission_end_date'].dt.year
print(df.head(20))

pivot = df.pivot_table(index=['city', 'hero'], columns='mission_end_date_quarter', aggfunc='nunique').fillna(0)
print(pivot.head())
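Since the question buckets missions by year as well as quarter, a possible refinement (a sketch, not part of the original answer) is to pivot on both derived columns:
# grouping on (year, quarter) keeps quarters from different years separate
pivot = df.pivot_table(index=['city', 'hero'],
                       columns=['mission_end_date_year', 'mission_end_date_quarter'],
                       aggfunc='nunique').fillna(0)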
I have the DataFrame below, called "df", and am calculating the sum by unique id, called "Id".
Can anyone help me optimize the code I have tried?
import pandas as pd
from datetime import datetime, timedelta
df = {'Date': ['2019-01-11 10:23:45', '2019-01-09 10:23:45', '2019-01-11 10:27:45',
               '2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
               '2019-02-09 10:25:45'],
      'Id': ['100', '200', '300', '100', '100', '100', '200'],
      'Amount': [200, 400, 330, 100, 300, 200, 500],
      }
df = pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])
You can try using groupby; after this, each adjustment happens within its sub-group rather than against the whole df:
s = {}
for x, y in df.groupby(['Id', 'NCC']):
    for i in y.index:
        start_date = y['Date'][i] - timedelta(seconds=300)
        end_date = y['Date'][i]
        # rows within the 5 minutes before (and excluding) the current row
        mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
        count = y.loc[mask]
        count = count[count['Sys'] == 1]
        if len(count) == 0:
            s.update({i: 0})
        else:
            s.update({i: count['Amount'].sum()})
df['New'] = pd.Series(s)
If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub-select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0

for end in df.index:
    start = end - pd.Timedelta('5min')
    current_window = df[start : end]  # data frame with 5-minute look-back
    sum_amt = <calc logic applied to `current_window` goes here>
    df.at[end, 'Sum_Amt'] = sum_amt

    print(current_window)
    print()
I'm not following the logic for calculating Sum_Amt, so I left that out.
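As an aside, if the goal were simply a rolling 5-minute sum of Amount, pandas' time-based rolling window would avoid the loop entirely; a minimal sketch that ignores the per-Id grouping and the Sys == 1 filter from the question:
# offset-based rolling requires a sorted datetime index; the window looks
# back 5 minutes and, unlike the loop above, includes the current row
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = df.rolling('5min')['Amount'].sum()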
I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating a date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of the date range, and then check if each date in tgt falls between the start and end of the interval. The code below implements this and produces a "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date': [dt.datetime(2018, 8, 25), dt.datetime(2018, 7, 21)],
                   'n': [10, 7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018, 8, 26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) & (tgt[0] < df.daterange_end), "target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
You should add axis=1 in apply, and pass n by keyword as periods= (the second positional argument of date_range is end, not periods):
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='d'), axis=1)
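With the column in place, the stated goal (checking whether each range contains any of the target dates) becomes a membership test; a small sketch, reusing the tgt list defined in the first answer:
# DatetimeIndex supports `in`, so any() over the targets works per row
df['target_date'] = df['date_range'].apply(lambda rng: int(any(t in rng for t in tgt)))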
I came up with a solution that works (but I'm sure there's a nicer way...)
import numpy as np

# define target
tgt = [dt.datetime(2018, 8, 26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
    df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
    df['target_date'] = np.where(df[col].isin(tgt), 1, df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if i not in new_cols]]
I have working code that achieves the desired calculation result, but I am currently using an algorithm that iterates over the pandas array. This is obviously slower than pure pandas DataFrame calculations. I would like some advice on how I can use pandas functions to speed up this calculation.
Code to generate dummy data
df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day+0.001)/10000
This is basically a pandas DataFrame with MTD figures for some value. This is purely given so that we have some data to play with.
Needed calculation
What I need is a new DataFrame that has starting (investment) dates as columns, populated with a few beginning-of-month values. The index is all possible dates, and the values should be the YTD figure. I am using this DataFrame as a lookup/cache for investment dates.
pseudocode
YTD = (1 + last MTD of month 1) * (1 + last MTD of month 2) * ... * (1 + last MTD up to the required date) - 1
Working function
def calculate_YTD(df):  # slow: takes 3.5s on my machine!
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x + 1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in the period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df.loc[date, investment_date] = h
    return YTD_df
I have hardcoded the investment dates list to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how I can speed it up?
Here's an approach that should be reasonably quick. It's quite possible there is something faster/cleaner, but this should be an improvement.
import numpy as np

# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')

# build a table, by month, which contains the cumulative MTD
# return for each investment date. We still have to loop over the investment
# dates, but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)
running_mtd_df = pd.concat(running_mtd, axis=1)
running_mtd_df = running_mtd_df.shift(1).fillna(1.)

# merge the running mtd returns with the base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)

# calculate the ytd return for each column / day, by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)