I illustrate my question with the following example.
I have two pandas DataFrames.
The first has ten-second timesteps and is continuous. Example data for two days:
import pandas as pd
import random
t_10s = pd.date_range(start='1/1/2018', end='1/3/2018', freq='10s')
t_10s = pd.DataFrame(columns=['b'],
                     data=[random.randint(0, 10) for _ in range(len(t_10s))],
                     index=t_10s)
The next DataFrame has five-minute timesteps, but there is only data during daytime, and the logging starts at a different time each morning. Example data for two days, starting at two different times in the morning to resemble the real data:
t_5m1 = pd.date_range(start='1/1/2018 08:08:30', end='1/1/2018 18:03:30', freq='5min')
t_5m2 = pd.date_range(start='1/2/2018 08:10:25', end='1/2/2018 18:00:25', freq='5min')
t_5m = t_5m1.append(t_5m2)
t_5m = pd.DataFrame(columns=['a'],
                    data=[0 for _ in range(len(t_5m))],
                    index=t_5m)
Now what I want to do is, for each datapoint x in t_5m, find the average of the t_10s data in a five-minute window centred on x.
I have found a way to do this with a list comprehension as follows:
tstep = pd.to_timedelta(2.5, 'm')
t_5m['avg'] = [t_10s.loc[(t_10s.index >= t - tstep) &
                         (t_10s.index < t + tstep), 'b'].mean()
               for t in t_5m.index]
However, I want to do this for a time series spanning at least two years and for many columns, not just b as here (the current solution is to loop over the relevant columns), and the code then gets very slow. Can anyone think of a trick to do this more efficiently? I have thought about using resample or groupby, which would work if I had a regular 5-minute interval, but since the interval is irregular between days, I cannot make it work. Grateful for any input!
I have looked around some, e.g. here, but couldn't find what I need.
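For what it's worth, here is a sketch of one possible trick (it assumes pandas >= 1.3, which supports center=True with offset windows): compute a single centred 5-minute rolling mean over the whole 10-second frame, which handles every column at once, then pick out the rows nearest each 5-minute timestamp.

# sketch, assuming pandas >= 1.3 for center=True with an offset window
rolled = t_10s.rolling('5min', center=True).mean()
t_5m['avg'] = rolled.reindex(t_5m.index, method='nearest')['b']

The window-edge semantics (closed on the left vs. the right) may differ slightly from the loc-based version above, so it is worth spot-checking a few rows.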
I have a dataframe with 1-minute timestamps of open, high, low, close and volume for a token.
Using the expanding or resample function, one can get a new dataframe based on the time interval; in my case it's a 1-day interval.
I am looking to get that output as columns in the original dataframe. Please assist.
original dataframe:
desired dataframe:
Here "date_1d" is the time interval for my use case. i used expanding function but as the value changes in "date_1d" column, expanding function works on the whole dataframe
df["high_1d"] = df["high"].expanding().max()
df["low_1d"] = df["low"].expanding().min()
df["volume_1d"] = df["volume"].expanding().min()
Then the next challenge was how to find open and close based on the "date_1d" column.
Please assist, or ask more questions if my desired output is not clear.
FYI: the data is huge, 5 years of 1-minute data for 100 tokens.
Thanks in advance,
Sukhwant
I'm not sure if I understand it right, but to me it looks like you want to groupby each day and calculate first, last, min and max for them.
Is the column date_1d already there?
If not:
df["date_1d"] = df["date"].dt.strftime('%Y%m%d')
For the calculations:
df["open_1d"] = df.groupby("date_1d")['open'].transform('first')
df["high_1d"] = df.groupby("date_1d")['high'].transform('max')
df["low_1d"] = df.groupby("date_1d")['low'].transform('min')
df["close_1d"] = df.groupby("date_1d")['close'].transform('last')
EDIT:
Have a look at whether this works like you expect (until we have some of your data I can only guess, sorry :D):
df['high_1d'] = df.groupby('date_1d')['high'].expanding().max().values
It groups the data per "date_1d", but within each group the expanding window only considers the current row and the rows above it.
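A tiny self-contained demo of that grouped expanding behaviour, with made-up numbers (the data must be sorted by date_1d for .values to line up):

demo = pd.DataFrame({'date_1d': ['20200101'] * 3 + ['20200102'] * 3,
                     'high': [5, 3, 7, 2, 9, 4]})
# the running max restarts at each new date_1d: 5 5 7 | 2 9 9
demo['high_1d'] = demo.groupby('date_1d')['high'].expanding().max().values
print(demo)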
EDIT: Found a neat solution using the transform method. It removes the need for a "Day" column, since the df.groupby is done using the index.date attribute.
import datetime

import pandas as pd
import yfinance as yf

df = yf.download("AAPL", interval="1m",
                 start=datetime.date.today() - datetime.timedelta(days=6))
# first/last per calendar day, broadcast back to every minute of that day
df['Open_1d'] = df["Open"].groupby(df.index.date).transform('first')
df['Close_1d'] = df["Close"].groupby(df.index.date).transform('last')
# running extremes/sum within each calendar day
df['High_1d'] = df['High'].groupby(df.index.date).expanding().max().droplevel(level=0)
df['Low_1d'] = df['Low'].groupby(df.index.date).expanding().min().droplevel(level=0)
df['Volume_1d'] = df['Volume'].groupby(df.index.date).expanding().sum().droplevel(level=0)
df
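One detail worth noting: grouping by df.index.date (the calendar date) rather than df.index.day (the day of the month) keeps, say, March 3rd and April 3rd in separate groups once the history spans more than one month.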
Happy Coding!
I want to sample rows from a pandas dataframe without replacement. What I mean is this: in each iteration of the for loop, I sample a certain number of rows from COMBINED without replacement, and I want to ensure that over 50,000 iterations I never sample the same row twice. My code below tries to solve this sampling problem, but I get errors.
COMBINED, TEMP, MERGED, SAMPLE, SAMPLE_2 and PROBABILITY_GENERATED_POISSON are data frames; lst is a list.
Please see my code below:
#FOR LOOP TO SAMPLE FROM COMBINED BASED ON NUMBER OF EVENTS PER YEAR
#AVOIDING REPEATED SAMPLING OF SAME EVENTS
for i in range(50000):

    #IF THERE ARE NO EVENTS FOR THAT PARTICULAR YEAR, THERE WILL BE NO EVENT NUMBER AND NO LOSS
    if PROBABILITY_GENERATED_POISSON.iloc[i, :].item() == 0:
        lst.append(0)

    #IF THERE ARE MORE THAN 0 EVENTS FOR THAT YEAR, FOLLOW THE BELOW PROCESS
    else:
        SAMPLE = COMBINED.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i, :],
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
        SAMPLE['Sample'] = i

        #CREATE TEMP DATA FRAME WHICH CONSISTS OF ALL ROWS SAMPLED IN PREVIOUS ITERATIONS
        #try/except IS FOR ERROR HANDLING - IT PREVENTS THE LOOP FROM STOPPING MIDWAY
        try:
            TEMP = pd.DataFrame(lst)

            #PERFORM AN INNER JOIN - SELECTING COMMON ROWS FROM TEMP AND SAMPLE
            MERGED = TEMP.merge(SAMPLE, how = "inner")

            #AVOIDING DUPLICATION WITHIN LIST
            #IF THERE ARE NO COMMON ROWS (MERGED.shape[0] == 0), THEN INPUT SAMPLE INTO lst
            if MERGED.shape[0] == 0:
                lst.append(SAMPLE)
            else:
                #IF THERE ARE COMMON ROWS (MERGED.shape[0] > 0), THEN SAMPLE AGAIN, BUT AFTER EXCLUDING
                #THE COMMON ROWS FROM THE COMBINED DATA FRAME. BY EXCLUDING THE COMMON ROWS, WE ENSURE
                #THAT WE ARE NOT SAMPLING ROWS WHICH WERE SAMPLED IN PREVIOUS ITERATIONS.
                COMBINED_2 = COMBINED.subtract(SAMPLE)
                SAMPLE_2 = COMBINED_2.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i, :],
                                             replace = False,
                                             weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                             axis = 0)
                SAMPLE_2['Sample'] = i
                lst.append(SAMPLE_2)
        except:
            continue

    print(i)
The error I get is attached as a picture.
I would like some feedback on my question.
Thank you.
Here are two ways to solve this:
solution using pandas .sample function
n = 50000
COMBINED.sample(n, replace=False)
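If the draws really need to be split across iterations while never repeating a row, the same idea works in slices: shuffle once, then hand out consecutive chunks. A sketch — events_per_year is a hypothetical list of per-iteration sample sizes, whose total must not exceed len(COMBINED):

# weighted shuffle of the whole frame, drawn once without replacement
shuffled = COMBINED.sample(frac=1, replace=False,
                           weights=LOSS_EVENT_SAMPLE_PROBABILITY)
pos = 0
samples = []
for i, k in enumerate(events_per_year):
    chunk = shuffled.iloc[pos:pos + k].copy()  # rows can never repeat across chunks
    chunk['Sample'] = i
    samples.append(chunk)
    pos += k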
solution using a simple algorithm that does the same thing as .sample()
# use the diamonds dataset to illustrate and test the algorithm
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]          # empty frame with the same columns
df_temp = df_input.copy()      # this is where we're sampling from (a copy, so the original stays intact)
n_samples = 1000
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)  # remove the row so it cannot be drawn again
    df = pd.concat([df, sample])                    # .append() was removed in pandas 2.0
assert (df.index.value_counts() > 1).sum() == 0     # no row sampled twice
df
I fixed the error. PROBABILITY_GENERATED_POISSON needs to be a list.
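Presumably the fix looks something like this (a guess, since the frame itself isn't shown): DataFrame.sample needs a scalar integer n, so the per-year counts are pulled out as plain ints first.

# hypothetical: flatten the one-column frame into a plain list of ints
poisson_counts = PROBABILITY_GENERATED_POISSON.iloc[:, 0].astype(int).tolist()
# then inside the loop: SAMPLE = COMBINED.sample(n=poisson_counts[i], ...)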
I am doing this educational challenge on Kaggle: https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The training set is a file of daily sales numbers for some products, and the test set we need to predict is the sales of similar items for the month of November.
Now I would like to use my model to make daily predictions, and thus expand the test data set by 30 rows for each row.
I have the following code:
for row in test.itertuples():
    df = pd.DataFrame(index = nov15, columns = test.columns)
    df['shop_id'] = row.shop_id
    df['item_category_id'] = row.item_category_id
    df['item_price'] = row.item_price
    df['item_id'] = row.item_id
    df = df.reset_index()
    df.columns = ['date', 'item_id', 'shop_id', 'item_category_id', 'item_price']
    df = df[train.columns]
    tt = pd.concat([tt, df])
nov15 is a pandas date range from 1 Nov 2015 to 30 Nov 2015.
tt is just an empty dataframe that I fill by expanding it by 30 rows (Nov 1 to 30) for every row in the test set.
test is the original dataframe I am copying the rows from.
It runs, but it takes hours to complete.
Knowing pandas and learning from previous experiences, there is probably an efficient way to do this.
Thank you for your help!
So I have found a "more" efficient way, and then someone over at Reddit's r/learnpython has told me about the correct and most efficient way.
This above dilemma is easily solved by pandas explode function.
And these two lines do what I did above, but within seconds:
test['date'] = [nov15 for _ in range(len(test))]
test = test.explode('date')
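(explode repeats every other column's value once per list element, so each test row carrying its 30-date list becomes 30 dated rows in a single vectorised pass.)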
My own "more" efficient second solution, which is nowhere near as good, was simply to make 30 copies of the dataframe with a 'date' column added, roughly as sketched below.
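A reconstruction of that second approach (not the exact code): one copy of test per day, concatenated in one go, which means far fewer concat calls than one per row but is still much slower than explode.

copies = []
for day in nov15:                 # 30 days -> 30 copies of the frame
    c = test.copy()
    c['date'] = day
    copies.append(c)
tt = pd.concat(copies, ignore_index=True)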
I'm just getting into data visualization with pandas. At the moment I'm trying to visualize a DataFrame with matplotlib that looks like this:
Initiative_160608 Initiative_160570 Initiative_160056
Beschluss_BR 2009-05-15 2009-05-15 2006-04-07
Vorlage_BT 2009-05-22 2009-05-22 2006-04-26
Beratung_BT 2009-05-28 2009-05-28 2006-05-11
ABeschluss_BT 2009-06-17 2009-06-17 2006-05-17
Beschlussempf 2009-06-17 2009-06-17 2006-05-26
As you can see, I have a number of columns with five different dates (every date symbolizes one event in a total chain of five events). Now to the problem:
My plan is to visualize the shown data with a stacked horizontal bar chart, using the timedeltas between the five different events (how many days have passed between the first and last event, including the dates in between). Every column should represent one bar in the chart. The whole chart is not about the absolute time that has passed, but about the duration of the five events in relation to the overall duration of one column, which means that all bars should have the same overall length.
Yet I haven't found anything similar or managed to find a solution by myself. I would be extremely thankful for any kind of solution to proceed with the shown data.
I'm not exactly sure if this is what you are looking for, but if each column is supposed to be a bar, and you want the time deltas within each column, then you need the difference in days between each row, and I am guessing the first row should have a difference of 0 days (since it is the starting point).
Also for stacked barplots, the index is used to create the categories, but in your case, you want the columns as categories, and each bar to be composed of the different index values. This means you need to transpose your df eventually.
This solution is pretty ugly, but hopefully it helps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Initiative_160608": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160570": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160056": ['2006-04-07', '2006-04-26', '2006-05-11', '2006-05-17', '2006-05-26']})
df.index = ['Beschluss_BR', 'Vorlage_BT', 'Beratung_BT', 'ABeschluss_BT', 'Beschlussempf']
# convert everything to dates
df = df.apply(lambda x: pd.to_datetime(x, format="%Y-%m-%d"))
def get_days(x):
    # positional lookups via .iloc (plain integer [] on a Series is deprecated)
    diff_list = []
    for i in range(len(x)):
        if i == 0:
            diff_list.append(x.iloc[i] - x.iloc[i])    # first event: zero-day difference
        else:
            diff_list.append(x.iloc[i] - x.iloc[i-1])
    return diff_list
# get the difference in days, then convert back to numbers
df_diff = df.apply(lambda x: get_days(x), axis = 0)
df_diff = df_diff.apply(lambda x: x.dt.days)
# transpose the matrix so that each initiative becomes a stacked bar
df_diff = df_diff.transpose()
# replace 0 values with 0.2 so that the bars are visible
df_diff = df_diff.replace(0, 0.2)
df_diff.plot.bar(stacked = True)
plt.show()
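Since the question asks for bars of equal overall length (relative rather than absolute durations), the rows can also be normalised before plotting. A small addition building on df_diff from above:

# normalise each initiative's durations so every stacked bar sums to 1
df_rel = df_diff.div(df_diff.sum(axis=1), axis=0)
df_rel.plot.bar(stacked=True)
plt.show()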
I want to obtain the time intervals that remain after subtracting the time windows from the timeline. Is there an efficient way to do this using pandas Intervals and Periods?
I tried looking for a solution using the pandas Period and Interval classes on SO but could not find one, maybe because Intervals are immutable objects in pandas (ref https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.html#pandas-interval).
I found a relevant solution using a 3rd-party library (Subtract Overlaps Between Two Ranges Without Sets), but it does not deal specifically with datetime or Timestamp objects.
import pandas as pd
start = pd.Timestamp('00:00:00')
end = pd.Timestamp('23:59:00')
# input
big_time_interval = pd.Interval(start, end)
smaller_time_intervals_to_subtract = [
    (pd.Timestamp('01:00:00'), pd.Timestamp('02:00:00')),
    (pd.Timestamp('16:00:00'), pd.Timestamp('17:00:00'))]
# output
_output_time_intervals = [
    (pd.Timestamp('00:00:00'), pd.Timestamp('01:00:00')),
    (pd.Timestamp('02:00:00'), pd.Timestamp('16:00:00')),
    (pd.Timestamp('17:00:00'), pd.Timestamp('23:59:00'))]
output_time_intervals = list(
    map(lambda interval: pd.Interval(*interval), _output_time_intervals))
Any help would be appreciated.
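For what it's worth, the subtraction itself can be done with a simple left-to-right sweep; a sketch, assuming the windows are non-overlapping and lie inside the timeline:

def subtract_intervals(big, windows):
    # keep the gap before each window, then jump the cursor past it
    result = []
    cursor = big.left
    for lo, hi in sorted(windows):
        if cursor < lo:
            result.append(pd.Interval(cursor, lo))
        cursor = max(cursor, hi)
    if cursor < big.right:
        result.append(pd.Interval(cursor, big.right))
    return result

# reproduces the three gap intervals in the expected output above
subtract_intervals(big_time_interval, smaller_time_intervals_to_subtract)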