Python de-aggregation

I have a data set which is aggregated between two dates, and I want to de-aggregate it daily by dividing the total number by the number of days between these dates.
As a sample:
StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 79089
The data set I want is:
StoreID Date Number_Sales
78 12/04/2015 79089/38 (as there are 38 days in between)
78 13/04/2015 79089/38 (as there are 38 days in between)
78 14/04/2015 79089/38 (as there are 38 days in between)
78 ...
78 17/05/2015 79089/38 (as there are 38 days in between)
Any help would be useful.
Thanks

I'm not sure if this is exactly what you want, but you can try this (I've added another imaginary row):
import pandas as pd

df = pd.DataFrame({'date_start': ['12/04/2015', '17/05/2015'],
                   'date_end': ['18/05/2015', '10/06/2015'],
                   'sales': [79089, 1000]})

df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
df['days_diff'] = (df['date_end'] - df['date_start']).dt.days

master_df = pd.DataFrame(None)
for row in df.index:
    # one daily row for every date in the range, all carrying the averaged sales
    new_df = pd.DataFrame(index=pd.date_range(start=df['date_start'].iloc[row],
                                              end=df['date_end'].iloc[row],
                                              freq='d'))
    new_df['number_sales'] = df['sales'].iloc[row] / df['days_diff'].iloc[row]
    master_df = pd.concat([master_df, new_df], axis=0)
First convert the string dates to datetime objects (so you can calculate the number of days between them), then create a new index from the date range and divide the sales. The loop expands each row of your dataframe into a daily "expanded" dataframe and then concatenates them all into one master dataframe.

What about creating a new dataframe?
start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
idx = pd.date_range(start=start, end=end, freq='D')  # one entry per day, both endpoints included
res = pd.DataFrame(df['Total_Number_of_sales'].values[0] / len(idx), index=idx, columns=['Number_Sales'])
yields
In[42]: res.head(5)
Out[42]:
Number_Sales
2015-04-12 2196.916667
2015-04-13 2196.916667
2015-04-14 2196.916667
2015-04-15 2196.916667
2015-04-16 2196.916667
If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.
df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

to_concat = []
for _, row in df.iterrows():
    start = pd.to_datetime(row['Date_Start'], dayfirst=True)
    end = pd.to_datetime(row['Date_End'], dayfirst=True)
    idx = pd.date_range(start=start, end=end, freq='D')
    sales = [row['Total_Number_of_sales'] / len(idx)] * len(idx)
    store_id = [row['Store_ID']] * len(idx)
    res = pd.DataFrame({'Store_ID': store_id, 'Number_Sales': sales}, index=idx)
    to_concat.append(res)
res = pd.concat(to_concat)
There are definitely more elegant solutions; have a look, for example, at this thread.
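One such loop-free variant is sketched below (reusing the multi-store df built above; n_days and day_offset are just helper columns introduced here, not part of the original answer). It repeats each row once per day and offsets the start date by a running counter:
df['Date_Start'] = pd.to_datetime(df['Date_Start'], dayfirst=True)
df['Date_End'] = pd.to_datetime(df['Date_End'], dayfirst=True)

# number of daily rows each aggregate expands to (both endpoints included)
df['n_days'] = (df['Date_End'] - df['Date_Start']).dt.days + 1

# repeat every row once per day, then offset the start date by a running day counter
out = df.loc[df.index.repeat(df['n_days'])].copy()
out['day_offset'] = out.groupby(level=0).cumcount()
out = out.reset_index(drop=True)
out['Date'] = out['Date_Start'] + pd.to_timedelta(out['day_offset'], unit='D')
out['Number_Sales'] = out['Total_Number_of_sales'] / out['n_days']
out = out[['Store_ID', 'Date', 'Number_Sales']]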

Consider building a list of data frames with the DataFrame constructor, iterating through each row of the main data frame. Each iteration expands a sequence of days from Date_Start to the end of the range, with the total sales divided by the difference in days:
from io import StringIO
import pandas as pd
from datetime import timedelta

txt = '''StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 89089'''

df = pd.read_table(StringIO(txt), sep=r"\s+", parse_dates=[1, 2], dayfirst=True)
df['Diff_Days'] = (df['Date_End'] - df['Date_Start']).dt.days

def calc_days_sales(row):
    long_df = pd.DataFrame({'StoreID': row['StoreID'],
                            'Date': [row['Date_Start'] + timedelta(days=i)
                                     for i in range(row['Diff_Days'] + 1)],
                            'Number_Sales': row['Total_Number_of_sales'] / row['Diff_Days']})
    return long_df

df_list = [calc_days_sales(row) for _, row in df.iterrows()]
final_df = pd.concat(df_list).reindex(['StoreID', 'Date', 'Number_Sales'], axis='columns')
print(final_df.head(10))
# StoreID Date Number_Sales
# 0 78 2015-04-12 2259.685714
# 1 78 2015-04-13 2259.685714
# 2 78 2015-04-14 2259.685714
# 3 78 2015-04-15 2259.685714
# 4 78 2015-04-16 2259.685714
# 5 78 2015-04-17 2259.685714
# 6 78 2015-04-18 2259.685714
# 7 78 2015-04-19 2259.685714
# 8 78 2015-04-20 2259.685714
# 9 78 2015-04-21 2259.685714
The reindex at the end is not needed on Python 3.6+, since the data frame's input dictionary will already be ordered.
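If you would rather not depend on dictionary ordering at all, selecting the columns explicitly after the concat works the same way on any version (a small sketch):
final_df = pd.concat(df_list)[['StoreID', 'Date', 'Number_Sales']]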

Related

Price column lost when converting timestamp column data to datetime

I'm preparing my data for price analytics, so I created this code that pulls the price feed from the CoinGecko API, selects the required columns, renames the headers and converts the date.
The current blocker I'm facing is that once I convert the timestamp to datetime, I lose the price column. How can I get it back along with the new date format?
import pandas as pd
from pycoingecko import CoinGeckoAPI

cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates', 'prices'])
df2.rename(columns={'dates': 'ds', 'prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ', df2)
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds', 'y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ', df3)
DATAFRAME EXPLODED:
ds y
0 1618185600000 59988.020959
1 1618272000000 59911.020595
2 1618358400000 63576.676041
3 1618444800000 62807.123233
4 1618531200000 63179.772446
.. ... ...
86 1625616000000 34149.989815
87 1625702400000 33932.254638
88 1625788800000 32933.578199
89 1625875200000 33971.297750
90 1625895274000 33738.909080
[91 rows x 2 columns]
DATAFRAME TAILED:
86 2021-07-07 00:00:00
87 2021-07-08 00:00:00
88 2021-07-09 00:00:00
89 2021-07-10 00:00:00
90 2021-07-10 05:34:34
Name: ds, dtype: datetime64[ns]
ValueError: Shape of passed values is (91, 1), indices imply (91, 3)
Change:
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
to:
df2['ds_datetime'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
Try this:
import pandas as pd
from pycoingecko import CoinGeckoAPI

cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates', 'prices'])
df2.rename(columns={'dates': 'ds', 'prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ', df2)
df2['ds'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
# df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ', df3)
By writing df2 = df2['ds'].mul(1e6).apply(pd.Timestamp), you replaced the whole DataFrame with just the converted 'ds' Series, which is why the price column disappeared from df2. Assigning the result back to a single column, as above, keeps both columns.
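As a side note, a slightly simpler way to do the same conversion (a sketch, assuming the 'ds' values are Unix timestamps in milliseconds, as the exploded output above suggests) is to let pd.to_datetime handle the unit instead of scaling manually:
df2['ds'] = pd.to_datetime(df2['ds'], unit='ms')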

Calculating totals by month based on two dates in Pandas

I have a dataframe containing two date columns: a start date and an end date. I need to set up a dataframe where all months of the year are separate columns, based on the start and end date intervals, so I can sum values from another column for each month per name.
To illustrate:
Original df:
Start Date End Date Name Value
10/22/20 01/25/21 John 100
10/12/20 04/30/21 John 50
02/25/21 None John 20
Desired df:
Name Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 Jul_21 Aug_21 ...
John 150 150 150 150 70 70 70 20 20 20 20 ...
Any suggestions or pointers on how I could achieve that result would be greatly appreciated!
First convert the values to datetimes, coercing non-dates to missing values and filling those with some end date. Then, in a list comprehension, expand all the months into a Series, which is used for pivoting with DataFrame.pivot_table:
end = '2021-12-31'

df['Start'] = pd.to_datetime(df['Start Date'])
df['End'] = pd.to_datetime(df['End Date'], errors='coerce').fillna(end)

s = pd.concat([pd.Series(r.Index, pd.date_range(r.Start, r.End, freq='M'))
               for r in df.itertuples()])

df1 = pd.DataFrame({'Date': s.index}, s).join(df)

df2 = df1.pivot_table(index='Name',
                      columns='Date',
                      values='Value',
                      aggfunc='sum',
                      fill_value=0)
df2.columns = df2.columns.strftime('%b_%y')
print(df2)
Date Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 \
Name
John 150 150 150 50 70 70 70 20 20
Date Jul_21 Aug_21 Sep_21 Oct_21 Nov_21 Dec_21
Name
John 20 20 20 20 20 20
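Note that Jan_21 comes out as 50 rather than the 150 in the desired output: date_range(..., freq='M') only emits month-end dates, so the first row (ending 01/25/21) never reaches 31 January. If partial months should count as well, a variant of the same idea with pd.period_range (a sketch, reusing the df and the Start/End columns created above) includes every month the interval touches:
s = pd.concat([pd.Series(r.Index, pd.period_range(r.Start, r.End, freq='M'))
               for r in df.itertuples()])

df1 = pd.DataFrame({'Date': s.index}, s).join(df)

df2 = df1.pivot_table(index='Name',
                      columns='Date',
                      values='Value',
                      aggfunc='sum',
                      fill_value=0)
df2.columns = [p.strftime('%b_%y') for p in df2.columns]
With the sample data this should give Jan_21 = 150, since the October-to-January row now also contributes to January.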

groupby Date year-month

I read and transform data using the following code
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as dates
import numpy as np
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv', parse_dates=['Date'])
df.drop('ID', axis='columns', inplace = True)
df_min = df[(df['Date']<='2014-12') & (df['Date']>='2004-01') & (df['Element']=='TMIN')]
df_min.drop('Element', axis='columns', inplace = True)
df_min = df_min.groupby('Date').agg({'Data_Value': 'min'}).reset_index()
giving the following result
Date Data_Value
0 2005-01-01 -56
1 2005-01-02 -56
2 2005-01-03 0
3 2005-01-04 -39
4 2005-01-05 -94
Now I try to get the Date in Year-Month. So
Date Data_Value
0 2005-01 -94
1 2005-02 xx
2 2005-03 xx
3 2005-04 xx
4 2005-05 xx
Where xx is the minimum value for that year-month.
How do I have to change the groupby call, or is this not possible with this function?
Use pd.Grouper() to accumulate by yearly/monthly/daily frequencies.
Code
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(pd.Grouper(key="Date", freq="M")).min()
Result
print(df_ans)
Data_Value
Date
2005-01-31 -94
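If you also want the index shown as Year-Month (2005-01, 2005-02, ...) instead of the month-end timestamp, one option (a sketch, reusing the to_datetime conversion above) is to group on a monthly Period rather than pd.Grouper:
df_ans = df_min.groupby(df_min["Date"].dt.to_period("M"))["Data_Value"].min()
# the index is now a PeriodIndex: 2005-01, 2005-02, ...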
You can first map the Date column to get only the year and month, and then perform a groupby and take the min of each group:
# import libraries
import pandas as pd
# test data
data = [['2005-01-01', -56], ['2005-01-01', -3], ['2005-01-01', 6],
        ['2005-01-01', 26], ['2005-01-01', 56], ['2005-02-01', -26],
        ['2005-02-01', -2], ['2005-02-01', 6], ['2005-02-01', 26],
        ['2005-03-01', 56], ['2005-03-01', -33], ['2005-03-01', -5],
        ['2005-03-01', 6], ['2005-03-01', 26], ['2005-03-01', 56]]
# create dataframe
df_min = pd.DataFrame(data=data, columns=["Date", "Date_value"])
# convert 'Date' column to datetime datatype
df_min['Date'] = pd.to_datetime(df_min['Date'])
# keep only year and month
df_min['Date'] = df_min['Date'].map(lambda x: str(x.year) + '-' + str(x.month))
# get min value for each group
df_min = df_min.groupby('Date').min()
After printing df_min, the output is:
        Date_value
Date
2005-1         -56
2005-2         -26
2005-3         -33

PANDAS Time Series Window Labels

I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row where the transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, you can do this with an IntervalIndex: build it from the window bounds, set it as the index of the window frame, and then look each transaction_dt up against it.
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
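One caveat with the .loc lookup above: if a transaction_dt falls outside every window, it raises a KeyError. A more forgiving variant (a sketch, assuming the windows do not overlap; win_idx, pos and matched are helper names introduced here) uses get_indexer, which returns -1 for unmatched dates:
win_idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
pos = win_idx.get_indexer(df['transaction_dt'])   # -1 where no window contains the date
matched = pos >= 0

df['Start'] = pd.NaT
df['End'] = pd.NaT
df.loc[matched, 'Start'] = df2['Start'].iloc[pos[matched]].values
df.loc[matched, 'End'] = df2['End'].iloc[pos[matched]].values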
If you want the start and end of the calendar month, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
New approach:
import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

# Get all timestamps that are necessary.
# This assumes dates are sorted;
# if not, we should change [0] -> min_dt and [-1] -> max_dt.
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))

# Loop through all values and fill in the start and end columns
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if value >= start and value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop the ranges that were
            # already passed. This can be removed if dates are not
            # sorted, but it should speed things up for large datasets.
            for _ in range(i):
                ranges.pop(0)
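A vectorized alternative (a sketch, assuming the start_dts/end_dts lists hold parseable date strings for non-overlapping windows and df is the sample frame above) is pd.merge_asof, which matches each transaction to the most recent window start and then discards matches that fall past that window's end:
windows = pd.DataFrame({'start_dt': pd.to_datetime(start_dts),
                        'end_dt': pd.to_datetime(end_dts)}).sort_values('start_dt')

df = df.sort_values('transaction_dt')
df = pd.merge_asof(df, windows,
                   left_on='transaction_dt', right_on='start_dt',
                   direction='backward')

# merge_asof only matches on the start; blank out rows whose transaction
# actually falls after the end of the window it was matched to
outside = df['transaction_dt'] > df['end_dt']
df.loc[outside, ['start_dt', 'end_dt']] = pd.NaT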

Find maximum of column for each business quarter pandas

Assume that I have the following data set
import pandas as pd, numpy, datetime
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 12, 31)
date_list = pd.date_range(start, end, freq='B')
numdays = len(date_list)
value = numpy.random.normal(loc=1e3, scale=50, size=numdays)
ids = numpy.repeat([1], numdays)
test_df = pd.DataFrame({'Id': ids,
                        'Date': date_list,
                        'Value': value})
I would now like to calculate the maximum within each business quarter for test_df. One possibility is to use resample with rule='BQ', how='max'. However, I'd like to keep the structure of the frame and just generate another column with the maximum for each BQ. Have you got any suggestions on how to do this?
I think the following should work for you: it groups on the quarter, calls transform on the 'Value' column, and returns the maximum value as a Series with its index aligned to the original df:
In [26]:
test_df['max'] = test_df.groupby(test_df['Date'].dt.quarter)['Value'].transform('max')
test_df
Out[26]:
Date Id Value max
0 2015-01-01 1 1005.498555 1100.197059
1 2015-01-02 1 1032.235987 1100.197059
2 2015-01-05 1 986.906171 1100.197059
3 2015-01-06 1 984.473338 1100.197059
........
256 2015-12-25 1 997.965285 1145.215837
257 2015-12-28 1 929.652812 1145.215837
258 2015-12-29 1 1086.128017 1145.215837
259 2015-12-30 1 921.663949 1145.215837
260 2015-12-31 1 938.189566 1145.215837
[261 rows x 4 columns]
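One thing to be aware of (an observation about generalizing this, not part of the original answer): grouping on dt.quarter alone pools the same quarter from different years together. With a single year of data, as here, that is fine; for multi-year data, a quarterly Period key keeps the years separate (a sketch):
test_df['max'] = (test_df.groupby(test_df['Date'].dt.to_period('Q'))['Value']
                         .transform('max'))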
