Create Time-Series Dataframe for Regression in Python

I have a crime dataset where each row is a recorded crime, and the relevant columns are Date, Crime Type, and District.
Take, for example, a dataset with only 2 Districts and 2 Crime Types over a week.
I want to expand it to a dataframe that can be used to run regressions. In this simple example, I need the columns to be Date, District, Murder, and Theft. Each District would have a separate row for each date in the range, and each crime type column would hold the number of crimes of that type committed on that Date in that District.
In other words, I need a time series where #Rows = #Districts * #Dates, with a column for each crime type; a hypothetical sketch is below.
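For concreteness, here is a sketch of the two shapes; the district names, dates, and counts are invented purely for illustration:
import pandas as pd

# Hypothetical input: one row per recorded crime.
crime_df = pd.DataFrame({
    'Date':       ['2011-01-01', '2011-01-01', '2011-01-01'],
    'Crime Type': ['Murder', 'Theft', 'Theft'],
    'District':   ['A', 'A', 'B'],
})

# Desired output: one row per Date x District, one column per crime type.
#    Date        District  Murder  Theft
#    2011-01-01  A              1      1
#    2011-01-01  B              0      1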
Are there any good ways to make this without looping through the dataframes?
I can create the date list like this:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
But how do I merge that with my other dataframe and create the columns described above?

I will try to answer my own question here. I think I figured it out, but I would appreciate any input on my method. I was able to do it without looping by using pivot_table and merge.
Import packages:
import pandas as pd
from datetime import datetime
import numpy as np
Import crime dataset:
crime_df = pd.read_csv("/Users/howard/Crime_Data.csv")
Create a list of dates in the range:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
Create variables for the length of the date list and the number of unique districts (the dataset was read in as crime_df, so that name is used throughout):
nd = len(datelist)
nu = len(crime_df['District'].unique())
Create a dataframe pairing every date with every district:
date_df = pd.DataFrame({'District': crime_df['District'].unique().tolist()*nd,
                        'Date': np.repeat(datelist, nu)})
Now we turn to our crime dataset.
I added a column of 1s to have something to sum in the next step:
crime_df["ones"] = 1
Next we take our crime data and put it in wide form using Pandas pivot_table:
crime_df = pd.pivot_table(crime_df, index=["District", "Date"], columns="Crime Type", aggfunc="sum")
This gave me stacked-level columns and an unnecessary index, so I removed them with the following (note that droplevel returns a new column index, so it must be assigned back):
crime_df.columns = crime_df.columns.droplevel()
crime_df.reset_index(inplace=True)
The final step is to merge the two datasets. I want to put date_df first and merge on it because it includes all the dates in the range and all the districts for each date, so this uses a left merge:
final_df = pd.merge(date_df, crime_df, on=["Date", "District"], how="left")
Now I can finish by filling the NaNs with 0s:
final_df.fillna(0, inplace=True)
Our final dataframe is in the correct form for time series analyses - regressions, plotting, etc. Many of the plots in matplotlib.pyplot that I use are easier to make if the date column is the index, which can be set like this:
final_df = final_df.set_index(['Date'])
That's it! Hope this helps others and please comment on any way to improve.
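One possible streamlining (an untested sketch, not part of the method above; it assumes crime_df['Date'] has been parsed to datetime so it aligns with datelist): the 'ones' column, the pivot_table, and the hand-built date_df can all be replaced by groupby/size/unstack plus a reindex over the full Date x District product.
# Sketch: count crimes per (Date, District, Crime Type), spread crime types
# into columns, then reindex onto every Date x District combination.
counts = (crime_df.groupby(['Date', 'District', 'Crime Type'])
                  .size()
                  .unstack('Crime Type', fill_value=0))
full_index = pd.MultiIndex.from_product(
    [datelist, crime_df['District'].unique()], names=['Date', 'District'])
final_df = counts.reindex(full_index, fill_value=0).reset_index()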

Related

How to plot a graph for two CSVs

I have two CSVs which have 5 columns: id, product name, normal price, discount price, card price, and discount. Both CSVs have different prices from different timelines, so I am looking to plot a graph because I want to see the behavior of the prices for each product. Both have products that are repeated, so I would like those that are repeated to appear in the graph.
I tried converting to a dataframe but had no success.
This is what the CSV file looks like (the original screenshot is missing; a hypothetical stand-in is below):
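id,product name,normal price,discount price,card price,discount
1,Coffee 500g,5.99,4.99,4.49,0.10
2,Tea 250g,3.49,2.99,2.79,0.15
(The column names come from the question's description; the values are invented.)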
It would be helpful to see your second .csv file too.
From this answer.
Assuming both your files have the same column names, use pandas.concat() to make one dataframe from a list of dataframes.
Then, simply plot a graph of your new dataframe as usual.
import pandas as pd

# Read both files and stack them into one dataframe.
df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')
dfBoth = pd.concat([df1, df2], axis=0, ignore_index=True)
dfBoth.plot()
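Note that dfBoth.plot() plots every numeric column against the row index. If the goal is one line per product over time, a pivot may help; this is an untested sketch, and the 'date', 'product name', and 'normal price' column names are assumptions based on the question's description:
# Sketch: one line per product, assuming a 'date' column marks the timeline.
dfBoth['date'] = pd.to_datetime(dfBoth['date'])
byProduct = dfBoth.pivot_table(index='date', columns='product name',
                               values='normal price', aggfunc='mean')
byProduct.plot()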

How to aggregate geographically segregated data by date in python?

I am working with timeseries data that is broken down by several geographical levels (province and health region). The data are further broken down by age group. I would like to aggregate data first up to the health region, and then up to the province, depending on the need.
Sample data:
df = pd.DataFrame({'Date': ['1/1/2022']*9,
                   'province': [35]*9,
                   'health region': [1,1,1,2,2,2,3,3,3],
                   'age group': [1,2,3,1,2,3,1,2,3],
                   'cases': [6,1,9,7,9,0,4,2,2]})
Desired output when aggregating up to the health region level:
df_hr = pd.DataFrame({'Date': ['1/1/2022','1/1/2022','1/1/2022'],
                      'province': [35,35,35],
                      'health region': [1,2,3],
                      'cases': [16,16,8]})
When I use the following code:
df = df.groupby('health region').sum()
I lose dates.
When I try:
df = df.groupby(['health region','Date']).sum()
or
df = df.groupby(['health region', df['Date'].dt.date]).sum()
I get the error ValueError: mixed datetimes and integers in passed array.
Is there an easy way to do this? I was thinking of using a loop to split the data by health region, saving the unique dates, aggregating, merging the dates back, then stacking the health regions back together. But I'd rather avoid that if there is a simpler way.
Thank you,
i.
I figured out a way to do it.
First, convert Date to datetime and set it as the index:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
Then I can group by region and resample the data by day:
df1 = df.groupby('health region').resample('D').sum()
And reset the index:
df1 = df1.reset_index()
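One caveat with this (a note added here, not part of the original self-answer): .sum() also sums the province and age group columns, so province comes out as 105 rather than 35. Selecting just cases, and grouping on province as well, avoids that; the same pattern then handles the second level of aggregation up to the province:
# Health-region level: group on province too so it is preserved, not summed.
df_hr = (df.groupby(['province', 'health region'])
           .resample('D')['cases'].sum()
           .reset_index())

# Province level: collapse the health regions as well.
df_prov = (df.groupby('province')
             .resample('D')['cases'].sum()
             .reset_index())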

Assigning name to index when using groupby() in pandas

I have weather data spanning a number of years. From it I am trying to find the long-term average temperature for each month, which I achieved using the following:
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])
mh3 = mh3[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
However, in doing this, I get a two-level index for the dataframe (both month and day, which is fine). The issue is that both of these index levels are assigned the name date. Is there a way to manually add names? This causes problems later in my code when I need to do some data analysis by month. Thank you
The name of the Series you group with becomes the name of the Index levels, so rename them in the grouper.
mh3 = mh3.groupby([mh3.index.month.rename('month'), mh3.index.day.rename('day')])
Or, if you don't want to type as much, you can create the grouping with a list comprehension, using getattr and renaming each level to its attribute name.
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2010-01-01', freq='4H', periods=10),
                  data={'col1': range(10)})
grpr = [getattr(df.index, attr).rename(attr) for attr in ['month', 'day']]
df.groupby(grpr).sum()
#            col1
# month day
# 1     1      15
#       2      30
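Another option that may work (a sketch, not from the original answer): keep the simple groupby and rename the index levels afterwards with rename_axis.
# Sketch: rename the index levels after grouping instead of in the grouper.
mh3 = (mh3.groupby([mh3.index.month, mh3.index.day])
          [['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
          .rename_axis(['month', 'day']))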

Grouping a dataframe and reordering based on date and counts

I have the following dataframe, which is grouped by invoice cycle first, with a count of clinics in each invoice cycle added.
(Image of the dataframe after the groupby, not shown.)
I used the following code to add the count column:
df5 = df4.groupby(['Invoice Cycle', 'Clinic']).size().reset_index(name='counts')
and then this code to set the index and get the dataframe, as seen in the image above:
df5 = df5.set_index(['Invoice Cycle','Clinic'])
Now, I want to reorder the Invoice Cycle column so the dates are in order 16-Dec, 17-Jan, 17-Feb, 17-Mar, etc.
Then I want to reorder the clinics in each invoice cycle so clinic with the highest count is on the top and the clinic with the lowest count is on the bottom.
Given the values in Invoice Cycle are strings, and not timestamps, I can't seem to do both of the above tasks.
Is there a way to reorder the dataframe?
You can create a function to transform the date string into a datetime:
import pandas as pd
import datetime

def str_to_date(string):
    # This gives the date with the first day of the month (e.g. '16-Dec' -> 2016-12-01)
    return datetime.datetime.strptime(string, '%y-%b')

df['Invoice Cycle'] = df['Invoice Cycle'].apply(str_to_date)

# now you can sort correctly: cycles in date order, highest count first within each cycle
df = df.sort_values(['Invoice Cycle', 'counts'], ascending=[True, False])
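If the original '16-Dec' style labels are wanted back for display after sorting, converting the datetimes back should work (an addition here, not part of the original answer):
# Sketch: restore the original string labels once the rows are ordered.
df['Invoice Cycle'] = df['Invoice Cycle'].dt.strftime('%y-%b')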

Python Pandas Index Sorting/Grouping/DateTime

I am trying to combine 2 separate data series of one-minute data to create a ratio, then creating Open High Low Close (OHLC) files of the ratio for the entire day. I bring in the two time series and create associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files on that variable using pd.merge. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done on only the month, not the whole date, so it isn't chronological. My goal is to get it sorted by date so it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (I am not sure how), or maybe there is a way to tell pandas the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and thus my output file is out of order. In more general terms, I am not sure how to reference or manipulate the unique-identifier index of a pandas dataframe, so any associated material would be useful.
Thank you
Years later...
This fixes the problem, assuming df is the dataframe:
import pandas as pd
df.index = pd.to_datetime(df.index) #convert the index to a datetime object
df = df.sort_index() #sort the converted
This should get the sorting back into chronological order.
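As a quick check (a sketch added here, using the index values from the question; the col values are placeholders):
import pandas as pd

# String dates sort lexically (month first); datetimes sort chronologically.
df = pd.DataFrame({'col': range(5)},
                  index=['01/29/2013', '01/29/2014', '01/29/2015',
                         '12/2/2013', '12/2/2014'])
df.index = pd.to_datetime(df.index)
print(df.sort_index().index.strftime('%m/%d/%Y').tolist())
# ['01/29/2013', '12/02/2013', '01/29/2014', '12/02/2014', '01/29/2015']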
