Pandas create rows based on interval between two dates - python

I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.
For this I am currently using a method that basically creates a cartesian product; it works well on small datasets but is very inefficient on large ones.
This method will be used on a DataFrame of roughly 2 million rows by 50 columns spanning multiple years from min to max date. The resulting dataset will be about 3 million rows, so a more efficient approach is required.
I have not succeeded in finding an alternative method which is less resource intensive.
What would be the best approach for this?
My current method here:
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
This gives the following result
Now to create a set containing all possible dates between the min and max date of the set:
df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})
This results in an expected frame containing all the possible dates
Finally to cross merge the original set with the date set and filter resulting rows based on start and end date per row
df_total = pd.merge(df, df_d,how='cross')
df = df_total[(df_total['date_start']<df_total['date']) & (df_total['date_end']>=df_total['date']) ]
This leads to the following final dataframe.
This final dataframe is exactly what is needed.

Efficient Solution
# if date_start / date_end are Python date objects (as in the sample frame),
# convert them to datetime64 first so the .dt accessor below is available
df[['date_start', 'date_end']] = df[['date_start', 'date_end']].apply(pd.to_datetime)
d = df['date_end'].sub(df['date_start']).dt.days
df1 = df.reindex(df.index.repeat(d))
i = df1.groupby(level=0).cumcount() + 1
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')
How does it work?
Subtract date_start from date_end to get the number of elapsed days per row (for the sample frame this is 1, 3, 2 and 2), then reindex the dataframe by repeating each index value exactly that many times. Group df1 by index and use cumcount to create a sequential counter per group, turn this counter into a timedelta series, and add it to date_start to get the result.
Result
id number color date_start date_end date
0 aa0 1 blue 2022-01-01 2022-01-02 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-03
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-04
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-08
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-09
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-13
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-14
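For reuse on the larger frame mentioned in the question, the same steps can be wrapped in a small helper. This is only a sketch with hypothetical names (expand_by_date_range, start_col, end_col), assuming the date columns are already datetime64:
def expand_by_date_range(frame, start_col='date_start', end_col='date_end'):
    # number of days between start and end; each row is repeated this many times
    days = frame[end_col].sub(frame[start_col]).dt.days
    out = frame.reindex(frame.index.repeat(days))
    # per-row day offset (1, 2, ...) within each repeated group
    offsets = out.groupby(level=0).cumcount() + 1
    out['date'] = out[start_col] + pd.to_timedelta(offsets, unit='d')
    return out
df_expanded = expand_by_date_range(df)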

I don't know if this is an improvement; here the pd.date_range is only created from each row's start and end date. The resulting list is exploded and joined back to the original df, giving the same rows as the filtered cross merge above.
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'], freq='1d',inclusive='right').date,axis=1).explode()
df.join(s.rename('date'))

Related

groupby and mean returning NaN

I am trying to use groupby to group by symbol and return the average of prior high volume days using pandas.
I create my data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
Filter by high volume and change:
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]
Then I get the last days volume (this works):
high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)
But when I try to calculate the average of all the days per symbol:
high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()
I am getting NaNs:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN NaN
4 2022-01-05 AAA 50 40000000 NaN NaN
5 2022-01-06 AAA 100 60000000 40000000.0 NaN
What am I missing here?
Desired output:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000
4 2022-01-05 AAA 50 40000000 NaN 40000000
5 2022-01-06 AAA 100 60000000 40000000.0 50000000
high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)
high_volume_days
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000.0
4 2022-01-05 AAA 50 40000000 NaN 40000000.0
5 2022-01-06 AAA 100 60000000 40000000.0 50000000.0
Index misalignment: high_volume_days is indexed by integers, while the result of df.groupby(...) is indexed by symbol.
Use merge instead:
high_volume_days = pd.merge(
    high_volume_days,
    df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
    left_on="symbol",
    right_index=True,
)
df.groupby('symbol')['volume'].mean() returns:
symbol
AAA 33333500.0
ABC 6667700.0
Name: volume, dtype: float64
which is an aggregation of each group to a single value. Note that the groups (symbol) are the index of this series. When you try to assign it back to high_volume_days, there is an index misalignment.
Instead of an aggregation (.mean() is equivalent to .agg("mean")), you should use a transformation: .transform("mean").
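For illustration, here is a minimal sketch of the difference on the question's df (the column name symbol_mean_volume is made up for this example):
# aggregation: one value per group, indexed by the group key (symbol)
df.groupby('symbol')['volume'].agg('mean')
# transformation: one value per original row, keeping the original integer index,
# so it can be assigned straight back as a new column
df['symbol_mean_volume'] = df.groupby('symbol')['volume'].transform('mean')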
==== EDIT ====
Instead of the mean for all values, you're looking for the mean "thus far". You can typically do that using .expanding().mean(), but since you're reassigning back to a column in high_volume_days, you need to either drop the level that contains the symbols, or use a lambda:
high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())

Counting Specific Values by Month

I have some data I want to count by month. The column I want to count has three different possible values, each representing a different type of car sold. Here is an example of my dataframe:
Date Type_Car_Sold
2015-01-01 00:00:00 2
2015-01-01 00:00:00 1
2015-01-01 00:00:00 1
2015-01-01 00:00:00 3
... ...
I want a dataframe that counts each specific car type sold per month separately, looking like this:
Month Car_Type_1 Car_Type_2 Car_Type_3 Total_Cars_Sold
1 15 12 17 44
2 9 18 20 47
... ... ... ... ...
How exactly would I go about doing this? I've tried doing:
cars_sold = car_data['Type_Car_Sold'].groupby(car_data.Date.dt.month).agg('count')
but that just counts all the cars sold in the month, rather than breaking it down by the number of each type sold. Any thoughts?
Maybe not the cleanest solution, but this should get you pretty close
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df['Value'] = 1
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count'))
Type 1 2
Date
2022-01 1.0 1.0
2022-02 2.0 NaN
Alternatively you can also pass multiple columns to groupby:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df.groupby(['Date', 'Type']).size()
Date Type
2022-01 1 1
2 1
2022-02 1 2
dtype: int64
This has the unfortunate side effect of excluding month/type combinations with zero sales. Also, the result has a MultiIndex on the rows rather than months as rows and types as columns; a workaround is sketched below.
For more information on this approach, check this question.
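Both issues can be worked around by unstacking the Type level into columns with a fill value and adding a total; a rough sketch, reusing the example frame from the code above (column names chosen to match the question's desired layout):
# one row per month, one column per type, missing month/type combinations filled with 0
counts = df.groupby(['Date', 'Type']).size().unstack(fill_value=0)
counts.columns = [f'Car_Type_{t}' for t in counts.columns]
counts['Total_Cars_Sold'] = counts.sum(axis=1)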

Replacing a for loop with something more efficient when comparing dates to a list

Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a data frame df which, for the purposes of this question, contains one column named Event Date holding the date on which an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date, and cumsum() to get a cumulative sum, i.e. to include all the dates up to and including a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge in the previous results. The fillna(method='ffill') fills gaps in the middle of the data, while the final fillna(0) handles any gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
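Building on that, the cumulative counts for every date in date_list can be computed without a Python loop, for example with numpy's searchsorted on the sorted event dates. A sketch (not part of the answers above), assuming Event Date can be parsed as a datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Event Date': pd.to_datetime(['02-01-20', '03-01-20', '03-01-20'], dayfirst=True)})
date_list = pd.to_datetime(['01-01-20', '02-01-20', '03-01-20'], dayfirst=True)

# sort the event dates once, then count how many events fall on or before each query date
events = np.sort(df['Event Date'].to_numpy())
counts = np.searchsorted(events, date_list.to_numpy(), side='right')
result = pd.DataFrame({'Date': date_list, 'Events': counts})
# for this sample data, result matches the desired output: 0, 1 and 3 events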

Calculating moving median within group

I want to perform a rolling median on the price column over the previous 4 days, with the data grouped by date. So basically I want to take the prices for a given day and all prices from the 4 days before it and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per index value, and given how the median works I cannot simply merge these rows afterwards to produce one result per date.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then simply printed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today and the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean the data frame so you only have a single date and a median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

indexing a data frame using multiple date time in Python

I have a data frame CaliSimNG
CaliSimNG
Date Sim1 Sim2 Sim3 Sim4 Sim5
0 2018-01-01 4.410628 5.181019 3.283512 2.289767 6.930455
1 2018-01-02 3.919023 5.572350 4.899945 1.858528 7.724655
2 2018-01-03 4.804969 4.477524 7.339943 1.963685 8.186425
3 2018-01-04 4.226408 4.208243 18.850381 1.967792 27.341537
4 2018-01-05 4.441108 3.731662 14.349406 2.000143 7.804742
I want to select the rows for certain dates. The dates are given by the datetime array DesiredDates:
DesiredDates
array(['2018-01-01T19:00:00.000000000-0500',
'2018-01-04T19:00:00.000000000-0500',
'2018-01-05T19:00:00.000000000-0500'],
dtype='datetime64[ns]')
How can I get a subset of CaliSimNG using the datetime index in DesiredDates?
Thanks
You can do an inner join using the pandas "merge" function as described here.
For example:
left = pd.DataFrame({'Date': ['date1', 'date2', 'date3'], 'v': [1, 2, 3]})
right = pd.DataFrame({'Date': ['date2']})
joined = pd.merge(left, right, on='Date')
Produces:
joined
Date v
0 date2 2
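If the dtypes already match, an isin-based selection is another option. Since DesiredDates in the question carries a time-of-day component, one possible approach (not part of the answer above, using illustrative values; timezone handling would still need to be resolved separately) is to normalize both sides to midnight first:
import numpy as np
import pandas as pd

CaliSimNG = pd.DataFrame({'Date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-04', '2018-01-05']),
                          'Sim1': [4.41, 3.92, 4.23, 4.44]})
DesiredDates = np.array(['2018-01-02T19:00:00', '2018-01-05T19:00:00'], dtype='datetime64[ns]')

# strip the time component from both sides so rows match on the calendar date alone
wanted = pd.to_datetime(DesiredDates).normalize()
subset = CaliSimNG[CaliSimNG['Date'].dt.normalize().isin(wanted)]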
