I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.
Rather than each row having a date range, I'd like to expand each row to have as many records as there are dates in the date range, with each new row representing one of those dates.
For example, a row whose range runs from 2019-01-01 to 2019-01-03 should become three rows, one for each of those dates.
The code snippet below works, but it processes only about 250 input rows per second. Given my input table has 120,000,000 rows, it would take roughly a week to run.
pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
                         'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
                         'Pickup': row.Pickup,
                         'Dropoff': row.Dropoff,
                         'Price': row.Price},
                        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
           for i, row in df.iterrows()], ignore_index=True)
There has to be a faster way to do this. Any ideas? Thanks!
Building a separate DataFrame for every row and concatenating them with pd.concat gets pretty slow with a large dataset, as each small frame has to be constructed and then copied into the result, and you are attempting to do this 120m times. I would try to work with this data as a simple list of tuples instead, then convert to a dataframe at the end.
e.g.
Given an empty list, e.g. rows = [] (avoid naming it list, which shadows the built-in):
For each row in the dataframe:
get the dates in the range (you can still use pd.date_range here) and store them in a variable dates
for each date d in dates, add a tuple to the list: rows.append((row.Book_Dt, d, row.Pickup, row.Dropoff, row.Price))
Finally you can convert the list of tuples to a dataframe:
df = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
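Putting those steps together, a minimal sketch (assuming df is your original frame with the column names from the question; itertuples is used here instead of iterrows because it is noticeably faster for plain attribute access):

import pandas as pd

rows = []
for row in df.itertuples(index=False):
    # One output record per calendar day covered by the reservation (both ends inclusive)
    for d in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
        rows.append((row.Book_Dt, d, row.Pickup, row.Dropoff, row.Price))

expanded = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])

This avoids building millions of intermediate DataFrames, which is where most of the time goes.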
I realize this is probably a very trivial question, but I have a dataframe of 1000+ rows and I want to create a new column "Date" containing a single date, "2018-01-31". I tried the code below, but Python just returns "Length of values (1) does not match length of index".
I would really appreciate any help!
Date = ['2018-01-31']
for i in range(len(Output)):
    Output['Date'] = Date
Assuming Output is the name of your pandas dataframe with 1000+ rows you can do:
Output['Date'] = "2018-01-31"
or using the datetime library you could do:
from datetime import date
Output["Date"] = date(2018, 1, 31)
to store it as a date object rather than a string. You also do not need to iterate over the rows when every row gets the same value: assigning a scalar to a new column broadcasts that value to every row.
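If you want the column to end up with pandas' native datetime64 dtype (convenient for later date arithmetic), a minimal sketch, assuming your frame is called Output as above:

import pandas as pd

# A scalar assignment broadcasts to every row; pd.Timestamp keeps the column as datetime64
Output['Date'] = pd.Timestamp('2018-01-31')
print(Output['Date'].dtype)  # datetime64[ns]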
I have a dataframe with a column containing dates and a column containing cities; the dates are repeated every time a new city appears.
I want to keep only 4 specific dates and delete the others. I found an expression that does this, but it only handles one date at a time.
I want to create a function that does the whole process and leaves only the dates I want. Below is the expression that eliminates one date at a time.
df[df.column != '2020-06-19']
You can do it like this.
df = df[df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
or, if you want to remove those dates instead:
df = df[~df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
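For instance, with a small made-up frame (the column is literally named column here only to match the snippet above):

import pandas as pd

df = pd.DataFrame({'column': ['2020-06-18', '2020-06-19', '2020-06-20', '2020-06-22'],
                   'city': ['Rio', 'Sao Paulo', 'Recife', 'Natal']})

kept = df[df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]      # only those dates
removed = df[~df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]  # everything else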
So I ran a time series model on a small sales data set and forecasted sales for the next 12 periods, with the following code:
mod1=ARIMA(df1, order=(2,1,1)).fit(disp=0,transparams=True)
y_future=mod1.forecast(steps=12)[0]
where df1 contains the sales values with months being the index. Now I'm storing the predicted values in the following manner:
pred.append(y_future)
Now, I need to append the forecasted values to the original dataset df1, preferably with the same index. I'm trying to use the following code:
df1.append(pred, ignore_index=False)
But I'm getting the following error:
TypeError: cannot concatenate a non-NDFrame object
I've tried converting pred variable to list and then appending, but to no avail.
Any help will be appreciated. Thanks.
One solution could be appending the new array to the last position of your DataFrame using df.loc:
df.loc[len(df)] = your_array
But this is not efficient, because if you do it several times it has to look up the length of the DataFrame for each new append.
A better solution would be to create a dictionary of the values that you need to append and append that to the DataFrame:
df = df.append(dict(zip(df.columns, your_array)), ignore_index=True)
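As a concrete sketch with toy column names and values (note that DataFrame.append was removed in pandas 2.0, so on recent versions pd.concat does the same job):

import pandas as pd

df = pd.DataFrame({'month': ['2020-01', '2020-02'], 'sales': [100.0, 120.0]})
your_array = ['2020-03', 135.0]

# Build a one-row DataFrame from the dict and concatenate it
new_row = pd.DataFrame([dict(zip(df.columns, your_array))])
df = pd.concat([df, new_row], ignore_index=True)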
You can collect your results in a list of dictionaries and then append that list to the data frame.
Let's assume that you want to append your ARIMA forecast results to the end of the actual data frame, which has two columns, "datetime" (YYYY-MM-DD) and "value".
Steps to follow
First, find the max day in the datetime column of your actual data frame and convert it to a Python datetime. We want to assign future dates to the forecast results.
Create an empty list of dictionaries and fill it in a loop, incrementing the datetime value by one day each iteration and pairing it with the corresponding forecast result.
Append that list of dictionaries to your dataframe. Don't forget to assign the result back, since append returns a new copy rather than modifying the frame in place.
Reset the index of your data frame.
Code
from datetime import timedelta

# Last date present in the actual data (a pandas Timestamp)
lastDay = dfActualData['datetime'].max()
dtLastDay = lastDay.to_pydatetime()

# Build one dict per forecasted value, each dated one day after the previous one
listdict = []
for i in range(len(results)):
    forecastedDate = dtLastDay + timedelta(days=i + 1)
    listdict.append({'datetime': forecastedDate, 'value': results[i]})

# append returns a copy, so reassign (on pandas >= 2.0 use pd.concat instead)
dfActualData = dfActualData.append(listdict, ignore_index=True)
dfActualData = dfActualData.reset_index(drop=True)
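A more compact variant of the same idea, assuming the 'datetime' column is already datetime-typed and results holds the forecast values; it builds all the future dates in one call and also works on current pandas:

import pandas as pd

# All future dates in one shot, starting the day after the last actual date
future = pd.DataFrame({
    'datetime': pd.date_range(dfActualData['datetime'].max() + pd.Timedelta(days=1),
                              periods=len(results), freq='D'),
    'value': results,
})
dfActualData = pd.concat([dfActualData, future], ignore_index=True)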
I have the following script that produces an empty DataFrame using Pandas:
import pandas as pd
import datetime
#Creating a list of row headers with dates.
start=datetime.date(2017,3,27)
end=datetime.date.today() - datetime.timedelta(days=1)
row_dates=[x.strftime('%m/%d/%Y') for x in pd.bdate_range(start,end).tolist()]
identifiers=['A','B','C']
#Creating an empty Dataframe
df=pd.DataFrame(index=identifiers, columns=row_dates)
print(df)
Now, suppose I have a function (let's call it "my_function(index,date)") that requires two inputs: an index, and a date.
That function gives me an outcome and I want to fill the corresponding empty slot of the dataframe with that outcome. But for that, I need to be able to acquire both the index and the date.
For example, to fill the first slot of my DataFrame I need index 'A' and the first date, which is '03/27/2017', so I have this:
my_function('A', '03/27/2017')
How can I make that happen for my entire DataFrame? My apologies if any of this sounds confusing.
Added to the end of your code:
for col in df.columns:
    for row in df.iterrows():
        print(row[0], col)
That prints, for every cell, the index label (e.g. 'A') together with the column name (the date). There may be a faster way to do it.
If you want to just apply a function to every cell in your df, you can use .apply, .map, or .applymap, as necessary. https://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
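For completeness, a small sketch of filling every cell of the df built in the question this way; my_function is just a hypothetical stand-in for your real function:

def my_function(identifier, date):
    # Placeholder: replace with your real computation
    return f'{identifier} on {date}'

for col in df.columns:
    for idx in df.index:
        df.loc[idx, col] = my_function(idx, col)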
I am trying to combine 2 separate data series using one-minute data to create a ratio, then creating Open High Low Close (OHLC) files for the ratio for the entire day. I am bringing in two time series, then creating associated dataframes using pandas. The time series have missing data, so I am creating a datetime variable in each file, then merging the files on that variable using the pd.merge approach. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the index is being treated as plain strings, so the sort orders by month first rather than by the whole date, and the result isn't chronological. My goal is to get it sorted by date. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas the index is a date and not just a value? I tried various sort approaches, including sort_index, but since the dates in the index don't seem to be treated as dates, the sorts order by month regardless of year and my output file ends up out of order. More generally, I am not sure how to reference or manipulate the index of a pandas dataframe, so any associated material would be useful.
Thank you
Years later...
This fixes the problem.
# df is a dataframe
import pandas as pd

df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()  # sort on the converted index
This should get the sorting back into chronological order.
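A quick demonstration with index strings like the ones in the question (the values are made up):

import pandas as pd

df = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0, 5.0]},
                  index=['01/29/2013', '01/29/2014', '01/29/2015', '12/2/2013', '12/2/2014'])

df.index = pd.to_datetime(df.index)  # now a DatetimeIndex
df = df.sort_index()
print(df.index.strftime('%Y-%m-%d').tolist())
# ['2013-01-29', '2013-12-02', '2014-01-29', '2014-12-02', '2015-01-29']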