I have a pandas dataframe containing time series data, so its index is of type datetime64 at weekly intervals; each date falls on the Monday of a calendar week.
There are only entries in the dataframe when an order was recorded, so if there was no order placed, there isn't a corresponding record in the dataframe. I would like to "pad" this dataframe so that any weeks in a given date range are included in the dataframe and a corresponding zero quantity is entered.
I have managed to get this working by creating a dummy dataframe that includes an entry with a zero quantity for each week I want, merging the two dataframes, and dropping the dummy column. This results in a third, padded dataframe.
I don't feel this is a great solution to the problem, and being new to pandas I wanted to know whether there is a more idiomatic/pythonic way to achieve this, ideally without having to create a dummy dataframe and then merge.
The code for my current solution is below:
import numpy as np
import pandas as pd

# Create the dummy product.
# 'Week' holds the week date of the order; it is set as the index later.
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.index[0][0]   # first week date in the entire dataset
last_date = group_by_product_name.index[-1][0]   # last week date in the dataset
bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week': bdates, 'DummyQty': qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)
group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the dtype of the index on one of the dataframes ends up as object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week':['2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-25'], 'Qty':[100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padding (dummy) dataframe that contains all the dates within the date range, including the missing ones:
df_zero = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-06-04',
'2018-06-11', '2018-06-18', '2018-06-25', '2018-07-02'], 'Dummy Qty':[0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week':['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11', '2018-06-18',
'2018-06-25', '2018-07-02'],
'Product Name':['A', 'A', 'B', 'A', 'B', 'A', 'A'],
'Qty':[100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected result by using pivot and then resample to fill in any missing weeks, like the following (note that Week must be datetime64 for resample, so convert it first if it is still a string column):
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])
results = df_all_products.groupby(
    ['Week', 'Product Name']
)['Qty'].sum().reset_index().pivot(
    index='Week', columns='Product Name', values='Qty'
).resample('W-MON').asfreq().fillna(0)
Output results:
Product Name A B
Week
2018-05-21 100.0 300.0
2018-05-28 200.0 0.0
2018-06-04 0.0 0.0
2018-06-11 400.0 0.0
2018-06-18 0.0 500.0
2018-06-25 600.0 0.0
2018-07-02 700.0 0.0
So if you want to get the df for Product Name A, you can do results['A'].
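If you only need the total quantity per week rather than the per-product split, a plain reindex against a full weekly range gives the same zero padding without any dummy dataframe; a minimal sketch, assuming Week has already been converted to datetime64 as above:
weekly = df_all_products.groupby('Week')['Qty'].sum()
full_weeks = pd.date_range(weekly.index.min(), weekly.index.max(), freq='W-MON')
padded = weekly.reindex(full_weeks, fill_value=0)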
Related
My dataframe holds the number of red cars sold in each month, and I have to build a predictive model for monthly sales.
I want the current data frame to be converted into the format below for time series modeling.
How can I combine the column and row headers to create a date column in a new data frame?
You can use melt() to transform the dataframe from wide to long format, and then combine the YEAR and month information into an actual date:
import pandas as pd
df = pd.DataFrame({'YEAR' : [2021,2022],
'JAN' : [5, 232],
'FEB':[545, 48]})
df2 = df.melt(id_vars = ['YEAR'], var_name = 'month', value_name = 'sales')
df2['date'] = df2.apply(lambda row: pd.to_datetime(str(row['YEAR']) + row['month'], format = '%Y%b'), axis = 1)
df2.sort_values('date')[['date', 'sales']]
This gives the output:
date sales
0 2021-01-01 5
2 2021-02-01 545
1 2022-01-01 232
3 2022-02-01 48
(for time series analysis you would probably want to set the date column as index)
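For example, something along these lines would give a datetime-indexed series ready for modeling (just a sketch):
ts = df2.sort_values('date').set_index('date')['sales']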
I would like to compute the mean per ID using groupby and mean. However, I only need the rows where Date falls between 2016-01-01 and 2017-12-31.
import pandas as pd

d = {'ID': ['STCK123', 'STCK123', 'STCK123'], 'Amount': [250, 400, 350],
'Date': ['2016-01-20', '2017-09-25', '2018-05-15']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Amount', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
This gives following df:
ID Amount Date
STCK123 250 2016-01-20
STCK123 400 2017-09-25
STCK123 350 2018-05-15
When I use:
data.groupby(['ID'])['Amount'].agg('mean')
It takes all rows into account, resulting in a mean value of 333.3. How can I exclude the rows where Date is 2018 (yielding a mean value of (250+400)/2=325)?
You'll need a pre-filtering step with query:
data.query('Date.dt.year != 2018').groupby('ID').mean()
Amount
ID
STCK123 325
More uses for eval, query, and associated parameters can be found here in my writeup: Dynamic Expression Evaluation in pandas using pd.eval()
See here for more methods on dropping rows before calling groupby.
You can also mask those rows, without having to drop them. NaNs are excluded from the GroupBy aggregation.
data.mask(data.Date.dt.year == 2018).groupby('ID').mean()
Amount
ID
STCK123 325.0
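Plain boolean indexing is another option if you prefer to avoid query/mask; a small sketch on the same data:
data[data['Date'].dt.year != 2018].groupby('ID')['Amount'].mean()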
Apologies, but this has me stumped. I thought I could pass the following dataframe into a simple pd.melt using iloc to reference my variables, but it wasn't working for me (I'll post the error in a moment).
sample df
       Date        0151        0561    0522    0912
0      Date  AVG Review  AVG Review  Review  Review
1      Date         NaN         NaN     NaN     NaN
2  01/01/18           2         2.5       4       5
So as you can see, my IDs are in the top row, the type of review is in the second row, the dates sit in the first column, and the review observations are in the rows for each date.
What I'm trying to do is melt this df to get the following:
ID, Date, Review, Score
0151, 01/01/18, Average Review, 2
I thought I could be cheeky and just pass the following
pd.melt(df, id_vars=[df.iloc[0]], value_vars=df.iloc[1])
but this threw the error 'Series' objects are mutable, thus they cannot be hashed
I've had a look at similar answers involving pd.melt, and perhaps reshape or unpivot, but I'm lost on how I should proceed.
Any help is much appreciated.
Edit for Nixon:
The first row has my unique IDs.
The 2nd row has my observation, which in this case is a type of review (average, normal).
The 3rd row onward has the variables assigned to the above observation; let's call this the score.
The 1st column has my dates, with the scores running across each row.
An alternative to pd.melt is to set your rows as column levels of a multiindex and then stack them. Your metadata will be stored as an index rather than column though. Not sure if that matters.
import pandas as pd

df = pd.DataFrame([
    ['Date', '0151', '0561', '0522', '0912'],
    ['Date', 'AVG Review', 'AVG Review', 'Review', 'Review'],
    ['Date', 'NaN', 'NaN', 'NaN', 'NaN'],
    ['01/01/18', 2, 2.5, 4, 5],
])
df = df.set_index(0)
df.index.name = 'Date'
df.columns = pd.MultiIndex.from_arrays([df.iloc[0, :], df.iloc[1, :]], names=['ID', 'Review'])
df = df.drop(df.index[[0, 1, 2]])
df.stack('ID').stack('Review')
Output:
Date ID Review
01/01/18 0151 AVG Review 2
0522 Review 4
0561 AVG Review 2.5
0912 Review 5
dtype: object
You can easily revert index to columns with reset_index.
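For example, to get back to the long format the question asked for (using a hypothetical 'Score' name for the value column):
long_df = df.stack('ID').stack('Review').reset_index(name='Score')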
Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between the old and new index. For example:
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data, index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that would lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
One simple way of doing this is to reindex onto the union of the existing and desired indices, interpolate, and then reindex down to just the desired index.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337))
      .set_index(['SampleTime']))
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (backfilling, for example).
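For instance, one simple option is to tack a backfill onto the chain so that leading slot takes the first available value (whether that is acceptable depends on your data):
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
 .bfill()
)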
My methods:
# 'prices' and 'nyse_trading_dates' are pre-existing dataframes with datetime indexes
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.interpolate(method='linear')
Method 2:
# 'data' is the original series/frame; 'idx2' is the target (desired) index
prices = data.loc[~data.index.duplicated(keep='last')]
# prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df
New to multiindexing in Pandas. I have data that looks like this
Date        Time      value
2014-01-14  12:00:04  .424
            12:01:12  .342
            12:01:19  .341
            ...
            12:05:49  .23
2014-05-12  ...
            1:02:42   .23
            ...
For now, I want to access the last time for every single date and store the value in some array. I've made a multiindex like this
df = pd.read_csv("df.csv", index_col=0)
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date', 'Time'])
df = df[~df.index.duplicated(keep='first')]
dates = df.index.get_level_values(0)
So I have the dates saved as an array. I want to iterate through the dates, but I either can't get the syntax right or I'm accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and direct access (df.loc[dates[i]] or something like that) doesn't work either. Also, the number of time entries varies from date to date. Is there any way to fix this?
This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):
import pandas as pd
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index('Time', drop=False)
max_times = df.groupby(level=0)['Time'].max()
print(max_times)
yields
Date
2014-01-14 12:05:49
2014-05-12 1:02:42
Name: Time, dtype: object
If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'],
                   'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '1:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_timedelta(df['Time'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index()
idx = df.groupby(['Date'])['Time'].idxmax()
print(df.loc[idx])
yields
Date Time value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
I don't see a good way to do this while keeping the MultiIndex.
It is easier to perform the groupby operation before setting the MultiIndex.
Moreover, it is probably preferable to preserve the datetimes as one value instead of splitting it into two parts. Note that given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the Date without making a Date column:
df = pd.DataFrame({'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12', '2014-01-14 12:01:19', '2014-01-14 12:05:49', '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
                   'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
df['DateTime'] = pd.to_datetime(df['DateTime'])
# df = pd.read_csv('df.csv', parse_dates=[0])
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
result = df.loc[idx]
print(result)
yields
DateTime value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
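If the rows are already ordered by time within each date (as they are here), a groupby/tail(1) variant also returns the last full row per date; a quick sketch using the frame above:
last_rows = df.sort_values('DateTime').groupby(df['DateTime'].dt.date).tail(1)
print(last_rows)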