Hi, I have a question regarding resampling in Pandas.
In my data I have a date range from 31/12/2018 to 25/3/2019 with an interval of 7 days (e.g. 31/12/2018, 7/1/2019, 14/1/2019, etc.). I want to resample the sales corresponding to those dates to a new range of dates, say 30/4/2020 to 24/9/2020, with the same 7-day interval. Is there a way to do it using the pandas resample function? As shown in the picture, I want to resample the sales from the dataframe on the left and populate the dataframe on the right.
Just to be clear: the left dataframe consists of 13 rows and the right consists of 22 rows.
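For concreteness, since the picture may not come through, the left dataframe can be reconstructed roughly like this (the sales numbers are placeholders):

import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range('2018-12-31', freq='7D', periods=13),
                    'sales': range(13)})  # 13 weekly rows; sales values are made up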
Let's try this:
dates = pd.date_range(start='30/4/2020', end='24/9/2020', freq='7D')  # 22 weekly dates; note this is a DatetimeIndex, not a DataFrame
The new DataFrame can be created from the old values; the index argument is necessary because of the different lengths. If you wish, you can apply df2.fillna(0), too.
df2 = pd.DataFrame({"date": pd.date_range("2020-04-30", freq="7D", periods=22),
                    "sales": df1.sales}, index=np.arange(22))
Or without using 'index':
df2 = pd.DataFrame({"date": pd.date_range("2020-04-30", freq="7D", periods=22),
                    "sales": np.concatenate([df1.sales.values, np.zeros(9)])})
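Another sketch, assuming df1 has a default 0..12 integer index: reindex the sales to 22 rows (the new rows become NaN), attach the new weekly dates, and fill with zeros:

df2 = (df1['sales']
       .reindex(range(22))
       .to_frame()
       .assign(date=pd.date_range('2020-04-30', freq='7D', periods=22))
       .fillna({'sales': 0}))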
I have a crime dataset where each row is a recorded crime, and the relevant columns are Date, Crime Type, District.
Here is an example with only 2 Districts and 2 Crime Types over a week:
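Since the screenshot does not come through here, a toy reconstruction (the district names and incidents are made up):

crime_df = pd.DataFrame({
    'Date': ['2011-01-01', '2011-01-01', '2011-01-02', '2011-01-05'],
    'Crime Type': ['Theft', 'Murder', 'Theft', 'Theft'],
    'District': ['A', 'B', 'A', 'B'],
})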
I want to expand it to a dataframe that can be used to run regressions. In this simple example, the columns need to be Date, District, Murder, and Theft. Each District would have a row for each date in the range, and each crime-type column would hold the number of crimes of that type committed on that Date in that District.
Here is the final dataframe:
I need a time series where #Rows = #Districts * #Dates, with a column for each crime type.
Are there any good ways to make this without looping through the dataframes?
I can create the date list like this:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
But how do I merge that with my other dataframe and create the columns described above?
I will try to answer my own question here. I think I figured it out, but I would appreciate any input on my method. I was able to do it without looping, using pivot_table and merge instead.
Import packages:
import pandas as pd
from datetime import datetime
import numpy as np
Import crime dataset:
crime_df = pd.read_csv("/Users/howard/Crime_Data.csv")
Create a list of dates in the range:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
Create variables for the length of this date list and the number of unique districts:
nd = len(datelist)
nu = len(crime_df['District'].unique())
Create dataframe combining dates and districts:
date_df = pd.DataFrame({'District': crime_df['District'].unique().tolist() * nd,
                        'Date': np.repeat(datelist, nu)})
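An equivalent, arguably tidier sketch builds the same scaffold from a MultiIndex product:

idx = pd.MultiIndex.from_product([datelist, crime_df['District'].unique()],
                                 names=['Date', 'District'])
date_df = idx.to_frame(index=False)  # one row per (Date, District) pair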
Now we turn to our crime dataset.
I added a column of 1s to have something to sum in the next step:
crime_df["ones"] = 1
Next we take our crime data and put it in wide form using Pandas pivot_table:
crime_df = pd.pivot_table(crime_df, index=["District", "Date"], columns="Crime Type", aggfunc="sum")
This gave me multi-level columns and an unnecessary index, so I removed them with the following (note that droplevel returns a new column index, so it has to be assigned back):
crime_df.columns = crime_df.columns.droplevel()
crime_df.reset_index(inplace=True)
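As a hedged alternative, pd.crosstab counts occurrences directly and skips both the helper "ones" column and the droplevel step; here raw_crime_df stands for the dataframe as originally read in, before the pivot overwrote crime_df:

counts = pd.crosstab([raw_crime_df['District'], raw_crime_df['Date']],
                     raw_crime_df['Crime Type']).reset_index()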
The final step is to merge the two datasets. I want to put date_df first and merge on it, because it includes all the dates in the range and all the districts for each date. Thus, this uses a left merge.
final_df = pd.merge(date_df, crime_df, on=["Date", "District"], how="left")
Now I can finish by filling in the NaNs with 0s:
final_df.fillna(0, inplace=True)
Our final dataframe is in the correct form for time series analyses: regressions, plotting, etc. Many of the plots in matplotlib.pyplot that I use are easier to make if the date column is the index. That can be done like this:
final_df = final_df.set_index(['Date'])
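For example, a minimal plotting sketch, assuming a 'Murder' column came out of the pivot; since each date has one row per district, summing across districts first keeps the line readable:

import matplotlib.pyplot as plt

final_df.groupby(level='Date')['Murder'].sum().plot()
plt.show()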
That's it! Hope this helps others and please comment on any way to improve.
I have a dataframe similar to the one below and I want to calculate the sum of the value column for the last seven days. The problem is that there isn't necessarily a row for each day.
df = pd.DataFrame({
'value': [2,3,7,14],
'date': ['10/20/2005','10/22/2005','10/25/2005','10/27/2005']
})
df['date'] = pd.to_datetime(df['date'])
df
   value       date
0      2 2005-10-20
1      3 2005-10-22
2      7 2005-10-25
3     14 2005-10-27
What I would like to do is something like
df['value'].sum('Last 7 days')
26
The solutions to this problem that I found always involved filling the df with the missing dates, using .asfreq() or .reindex(). Unfortunately, that is not an option for me, since I have far too many classes that are each represented in a df like the one above; filling each df with the missing dates would create thousands and thousands of extra rows.
Is there a way to use pd.Timedelta() (or similar), where I can treat the missing days as zeros?
Rolling has this intelligently built in for datetime-based columns:
df.rolling('7d', on='date').sum()
Note that 10/27 and 10/20 are 8 days apart, not 7, so the value from 10/20 falls outside the 7-day window ending on 10/27 :)
And if you want to put it into another column:
df['sum'] = df.rolling('7d', on='date').sum()['value']
If you just want the final value:
df.rolling('7d', on='date').sum()['value'].iloc[-1]
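If you specifically want the pd.Timedelta route the question asks about, a boolean mask relative to the most recent date is a simple sketch that gives the same "last 7 days" result:

cutoff = df['date'].max() - pd.Timedelta(days=7)
df.loc[df['date'] > cutoff, 'value'].sum()  # 24; 10/20 is 8 days back, so it is excluded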
I have this dataset, wind_modified. In this dataset, the columns are the locations and the index is the date, and the values in the columns are the wind speeds.
Let's say I want to find the average wind speed in January for each location. How do I use groupby or another method to find the average?
Would it be possible without resetting the INDEX?
Edit: [this][2] is the actual dataset. I have combined the three columns "Yr", "Mo", and "Dy" into one, i.e. "DATE", and made it the index.
I imported the dataset by using pd.read_fwf.
And "DATE" is of type datetime64[ns].
Sure. If you want all Januaries for all years, first filter them by boolean indexing and add mean:
#if necessary convert index to DatetimeIndex
#df.index = pd.to_datetime(df.index)
df1 = df[df.index.month == 1].mean().to_frame().T
Or if you need each January for each year separately, after the filter use groupby with DatetimeIndex.year and aggregate mean:
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
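If you later need the average for every month rather than just January, grouping by the index month generalizes the same idea (a sketch):

df4 = df.groupby(df.index.month).mean()  # one row per calendar month, averaged across years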
I currently have a data frame dfB which looks as follows.
My goal is now to plot the number of orders per week. I am not sure how to go about grouping my column most_recent_order_date by week, though. How can I do this?
Convert the date column to datetime dtype if you haven't already.
dfB['most_recent_order_date'] = pd.to_datetime(dfB.most_recent_order_date)
Then use resample
dfB.resample('W-MON', on='most_recent_order_date').sum()
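Since the goal is the number of orders per week, counting rows per weekly bin may match better than summing numeric columns; a sketch:

weekly_orders = dfB.resample('W-MON', on='most_recent_order_date').size()
weekly_orders.plot()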
I am trying to combine 2 separate data series of one-minute data to create a ratio, then create Open High Low Close (OHLC) files for the ratio for the entire day. I bring in the two time series and create associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files on the datetime variable using pd.merge. Up to this point everything is going fine.
Next I group the data by date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds it into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as its index, and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done on the date as a string (so effectively by month first), not on the whole date as a date, so it isn't chronological. My goal is to sort by date so the output is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas the index is a date, not just a value. I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and thus my output file is out of order. In more general terms, I am not sure how to reference/manipulate the actual unique-identifier index of a pandas dataframe, so any associated material would be useful.
Thank you
Years later...
This fixes the problem. Here df is the dataframe in question:
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the string index to a DatetimeIndex
df = df.sort_index()                 # sort on the converted index
This should get the sorting back into chronological order.
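One more hedged note: if the index strings are consistently MM/DD/YYYY, as the sample above suggests, passing an explicit format to to_datetime avoids ambiguous day/month parsing:

df.index = pd.to_datetime(df.index, format='%m/%d/%Y')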