I have a full-year hourly series, which we may call "calendar":
from pandas import date_range, Series
calendar = Series(
    index=date_range("2006-01-01", "2007-01-01", freq="H", closed="left", tz="utc"),
    data=range(365 * 24)
)
Now I have a new index, which is another hourly series, but starting and ending at arbitrary datetimes:
index = date_range("2019-01-01", "2020-10-02", freq="H", tz="utc")
I would like to create a new series result that has the same index as index and, for each month-day-hour combination, it takes the value from the corresponding month-day-hour in the calendar.
I could iterate to have a working solution like so, with a try-except just to ignore February 29th:
result = Series(index=index, dtype="float")
for timestamp in result.index:
    try:
        calendar_timestamp = timestamp.replace(year=2006)
    except:
        continue
    result.loc[timestamp] = calendar.loc[calendar_timestamp]
This, however, is very inefficient, so does anybody know how to do it better? By better I mean specifically faster (CPU-time-wise).
Constraints/notes:
No Numba, nor Cython, just CPython and Pandas/NumPy
It is fine to leave February 29th with NaN values, since that day is not represented in the calendar
We can always assume that the index is properly sorted and has no gaps (the same applies to the calendar)
Let's try extracting the combination as string and map:
import pandas as pd

cal1 = pd.Series(calendar.values,
                 index=calendar.index.strftime('%m%d%H'))
result = index.to_series().dt.strftime('%m%d%H').map(cal1)
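As a rough sanity check (not part of the original answer), one can compare the mapped result against the loop on a slice of the new index that does not contain February 29th, assuming calendar, index and result as defined above:
# The vectorized mapping should agree with the original loop on the first week.
sample = index[:24 * 7]
loop_values = [calendar.loc[ts.replace(year=2006)] for ts in sample]
assert (result.loc[sample].values == loop_values).all()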
I have been stuck on this way too long. All I am trying to do is create a new column called Duration Target Date which derives from Standard Duration Days + Date/Time Created. Below is my code so far:
From my POV, I think that this code will iterate from 0 to the length of the data frame. If there is "No Set Standard Duration" in the Standard Duration Days column, it goes to my else statement and overwrites that given cell with a blank (the same as I initialized it). However, if the code sees anything other than "No Set Standard Duration", it should add the value in the Standard Duration Days column to the value in the Date/Time Created column. I want the new value to land in the new column Duration Target Date at the corresponding index.
newDF["Duration Target Date"] = ""
for i in range(0,len(newDF)):
if newDF.loc[i,"Standard Duration Days"] != "No Set Standard Duration":
newDF.loc[i,"Duration Target Date"] = (timedelta(days = int(newDF.loc[i,"Standard Duration Days"])) + newDF.loc[i,"Date/Time Created"])
else:
newDF.loc[i,"Duration Target Date"] == ""
I noticed that this works partially but then it eventually stops working... I also get an error when I run this: "KeyError 326"
I would just add the columns and leave the NaT (Not a Time) values.
import pandas as pd

df = pd.DataFrame({
    "Standard Duration Days": [3, 5, "No Set Standard Duration"],
    "Date/Time Created": ['2019-01-01', '2019-02-01', '2019-03-01']
})
# 1. Convert string dates to pandas timestamps.
df['Date/Time Created'] = pd.to_datetime(df['Date/Time Created'])
# 2. Create time deltas, coercing errors.
delta = pd.to_timedelta(df['Standard Duration Days'], unit='D', errors='coerce')
# 3. Create new column by adding delta to 'Date/Time Created'.
df['Duration Target Date'] = (df['Date/Time Created'] + delta).dt.normalize()
>>> df
Standard Duration Days Date/Time Created Duration Target Date
0 3 2019-01-01 2019-01-04
1 5 2019-02-01 2019-02-06
2 No Set Standard Duration 2019-03-01 NaT
Adding text to a numeric column converts the entire column to object, which takes more memory and is less efficient. Generally, one wants to leave empty values as np.nan, or possibly a sentinel value in the case of integers. Only for display purposes do those get converted, e.g. df['Duration Target Date'].fillna('').
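For instance, a minimal display-only sketch (reusing df from above; the underlying column stays datetime64):
# Render NaT as an empty string only when printing, without changing the dtype.
print(df['Duration Target Date'].dt.strftime('%Y-%m-%d').fillna(''))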
A couple of issues here. First, it looks like you're confusing loc with iloc. Very easy to do. loc looks up by the actual index labels, which may or may not be the integer positions, but your i in range(0, len(newDF)) iterates by integer position. So you're getting your KeyError: 326 because you're reaching the 326th row of your dataframe, but its index label is not actually 326. You can check this by looking at print(newDF.iloc[320:330]).
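A tiny illustration of the difference, using a made-up index (not the asker's data):
import pandas as pd

tmp = pd.DataFrame({'x': [10, 20, 30]}, index=[100, 200, 326])
tmp.iloc[2]   # third row by position        -> x = 30
tmp.loc[326]  # row whose index label is 326 -> x = 30
tmp.loc[2]    # raises KeyError: 2, because 2 is not an index label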
Second and more important issue: you almost never want to iterate through rows in a pandas dataframe. Instead, use a vectorized function that applies to a full column at a time. For your case where you want conditional assignment, the relevant function is np.where:
boolean_filter = newDF.loc[:, "Standard Duration Days"] != "No Set Standard Duration"
# coerce the sentinel strings to NaN, then convert the numeric days to timedeltas
days = pd.to_numeric(newDF.loc[:, "Standard Duration Days"], errors='coerce')
value_where_true = newDF.loc[:, "Date/Time Created"] + pd.to_timedelta(days, unit='D')
value_where_false = ""
newDF["Duration Target Date"] = np.where(boolean_filter, value_where_true, value_where_false)
Here's a way using .apply row-wise:
newDF['Duration Target Date'] = newDF.apply(
    lambda x: x["Date/Time Created"] + timedelta(days=int(x["Standard Duration Days"]))
    if x["Standard Duration Days"] != "No Set Standard Duration" else None,
    axis=1)
Note: Since you haven't provided any data, it is not tested.
I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represents the values of a generic stock market.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is terrible when using pandas on a big data problem), and it produces the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]

# The first row will be 1, because the values are expressed as percentages.
# If there is no yesterday, the value must be 1.
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index - 1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this taking advantage of dataFrame api, thus removing the for loop and creating a new column in place?
df['Change'] = df['Close'].pct_change()
or if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
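A minimal illustration with made-up numbers (note that pct_change returns fractions, and the first row is NaN rather than 1):
import pandas as pd

toy = pd.DataFrame({'Close': [100.0, 102.0, 99.96]})
toy['Change'] = toy['Close'].pct_change()
# Change: NaN, 0.02 (102 vs 100), -0.02 (99.96 vs 102)
Multiply by 100 and fill the first row if you want the same scale as the loop in the question.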
I would suggest first making the Date column a DateTime index. For this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then simply access any row and column by datetime indexing and do whatever operations you want. For example, to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['15-07-2019', 'Close'] - df_stock.loc['14-07-2019', 'Close'])
                          / df_stock.loc['14-07-2019', 'Close']) * 100
You can also use a for loop to do the operations for each date or row:
for Dt in df_stock.index:
Using diff
(-df['Close'].diff())/df['Close'].shift()
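For reference, the unsigned variant diff()/shift() matches pct_change() up to floating-point noise; the leading minus sign above flips the sign of the change. A small check with toy data (not from the original answer):
import numpy as np
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0])
# both start with NaN; the later values agree up to floating-point rounding
assert np.allclose(close.pct_change()[1:], (close.diff() / close.shift())[1:])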
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A']_current / df['A']_(current - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as index in order to use get_loc, and I create a new dataframe new_df starting from 1 minute after the start of df. In this way I'm sure I have all the data when I go look 1 minute earlier, even within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta] # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
A simple example:
import pandas as pd
import numpy as np
n_vals = 50
# Create a DataFrame with random values and 'unusual times'
# Create a DataFrame with random values and 'unusual times'
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])

# Demonstrate how to use .asof() to get the value that was the 'state'
# 1 min before each index entry. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values

# Note that there will be some NaNs to deal with; consider .fillna()
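From there, the ratio the question asks for is just a column division (a sketch reusing the columns created above):
# B = current value divided by the value as of one minute earlier;
# the earliest rows are NaN because there is no data a minute before them.
df['B'] = df['value'] / df['value_one_min_ago']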
I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time series dataframe that has per-weekly summaries (or per daily; or per custom date_range object) from a range of these quantities A and B.
I can generate week number and aggregate sales based on those, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want as it starts with the earliest datetime correct to the nanosecond. You could replace with datetimes which have 00:00:00 in the time by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect but passing a multiple like '2D' results in the 'ugly' index, that is, starting at the earliest correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is stick to singles like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
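If the 'week starting' label from the question is wanted rather than the default week-ending Sunday, one option (an assumption about the desired behaviour, not part of the original answer) is to anchor the weekly bins and label them by their left edge:
# Label each weekly bin by its start date instead of its end date.
resampled = (df[['quantityA', 'quantityB']]
             .resample('W-MON', label='left', closed='left')
             .agg(['sum', 'count']))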
Consider a CSV file:
string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
I can read this in, and reformat the date column into datetime format:
b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?
What I am currently trying is re-indexing by the date:
b.index = b['date']
I can access the month like so:
b.index.month
However I can't seem to find a function to lump together by month.
Managed to do it:
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])
Or
b.groupby(pd.Grouper(freq='M')) # update for v0.21+
(update: 2018)
Note that pd.TimeGrouper is deprecated and will be removed. Use instead:
df.groupby(pd.Grouper(freq='M'))
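For example, applied to the CSV data above (with b indexed by the parsed dates), summing the number column per month might look like:
# Sum the 'number' column per calendar month; each group is labelled
# with the month-end timestamp.
monthly = b.groupby(pd.Grouper(freq='M'))['number'].sum()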
To group time-series data you can use the method resample. For example, to group by month:
df.resample(rule='M', on='date')['Values'].sum()
The list of offset aliases can be found in the pandas documentation.
One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.
Normalise day of month
df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
                   'Values': [5, 10, 15, 20]})
# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)
Then use groupby as normal:
g = df.groupby('YearMonth')
res = g['Values'].sum()
# YearMonth
# 2017-09-01 20
# 2017-10-01 30
# Name: Values, dtype: int64
Comparison with pd.Grouper
The subtle benefit of this solution is, unlike pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:
some_group = g.get_group('2017-10-01')
Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.
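To illustrate, a sketch of extracting the same October group when grouping with pd.Grouper on the df built above (the group key has to be the month-end timestamp):
# pd.Grouper labels groups with the last day of the month, so the
# October key is 2017-10-31 rather than 2017-10-01.
g2 = df.groupby(pd.Grouper(key='Date', freq='M'))
october = g2.get_group(pd.Timestamp('2017-10-31'))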
Comparison with string conversion
An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).
A slightly alternative solution to @jpp's, but outputting a YearMonth string:
df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month}'.format(year=x.year, month=x.month))
res = df.groupby('YearMonth')['Values'].sum()
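A zero-padded variant (an alternative sketch, not from the original answer) keeps the strings sorting chronologically:
# '%Y-%m' pads the month, so '2017-09' sorts before '2017-10'.
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m')
res = df.groupby('YearMonth')['Values'].sum()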