I have a DataFrame that tracks the 'Adj Closing' price for several global markets, which means the same dates repeat across tickers. To clean this up I use .set_index(['Index Ticker', 'Date']).
DataFrame sample
My issue is that the closing prices run as far back as 1997-07-02, but I only need 2020-01-01 and forward. I tried idx = pd.IndexSlice followed by df.loc[idx[:, '2020-01-01':], :], as well as df.loc[(slice(None), '2020-01-01':), :], but both methods return a syntax error on the : that I'm using to slice across a range of dates. Any tips on getting the data past a specific date? Thank you in advance!
Try:
# create a dataframe to approximate your data
import pandas as pd

df = pd.DataFrame({'ticker': ['A'] * 5 + ['M'] * 5,
                   'Date': pd.date_range(start='2021-01-01', periods=5).tolist()
                           + pd.date_range(start='2021-01-01', periods=5).tolist(),
                   'high': range(10)}
                  ).groupby(['ticker', 'Date']).sum()
                   high
ticker Date
A      2021-01-01     0
       2021-01-02     1
       2021-01-03     2
       2021-01-04     3
       2021-01-05     4
M      2021-01-01     5
       2021-01-02     6
       2021-01-03     7
       2021-01-04     8
       2021-01-05     9
# evaluate conditions against level 1 (Date) of your multiIndex; level 0 is ticker
df[df.index.get_level_values(1) > '2021-01-03']
                   high
ticker Date
A      2021-01-04     3
       2021-01-05     4
M      2021-01-04     8
       2021-01-05     9
Alternatively, if possible, remove the unwanted dates before setting your MultiIndex.
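For reference, a minimal sketch of both routes, using the 'Index Ticker' and 'Date' names from the question (my assumption about the frame's layout). The original syntax error comes from writing '2020-01-01': inside a plain tuple, where Python does not allow slice syntax; wrapping it in slice(...) or pd.IndexSlice avoids it, provided the Date level holds datetimes and the index is sorted.
import pandas as pd

# Option 1: slice the Date level of the MultiIndex.
# Partial date slicing needs a sorted index and a datetime Date level.
df = df.sort_index()
idx = pd.IndexSlice
recent = df.loc[idx[:, '2020-01-01':], :]                     # IndexSlice form
recent = df.loc[(slice(None), slice('2020-01-01', None)), :]  # equivalent tuple form

# Option 2: drop the unwanted dates before building the MultiIndex
# (assumes 'Date' is still an ordinary datetime column at that point).
recent = (df.reset_index()
            .loc[lambda d: d['Date'] >= '2020-01-01']
            .set_index(['Index Ticker', 'Date']))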
I have one year's worth of data at four-minute time series intervals. I always need to load 24 hours of data and run a function on this dataframe, stepping forward at eight-hour intervals. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
import pandas as pd

start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame(index=myIndex).reset_index().rename(columns={'index': 'Timestamp'})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
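If the function only needs to run once every eight hours rather than at every timestamp, one small variation (my assumption, not part of the original answer) is to apply it only to the window start times; at a 4-minute frequency, eight hours is 120 rows:
# one start time every 8 hours (8 h / 4 min = 120 rows)
window_starts = s.iloc[::120]
results = window_starts.apply(myfunc)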
Thankfully, my last question was answered well, so I was able to build the dataset without any problems. (Thank you Ecker!!)
But a new problem came up.
While calculating the cumulative weighted average through the same process, there were cases where the join no longer worked after the dataset and its types changed.
For example, take the dataset below:
firm  date        reviewer  compound
A     2021-01-01  a          0.6531
A     2021-01-01  b         -0.7213
A     2021-01-01  c         -0.3168
A     2021-01-02  d          0.3548
A     2021-01-02  e          0.5783
A     2021-01-03  f          0.4298
A     2021-01-04  g          0.8769
B     2021-01-01  h          0.7895
B     2021-01-01  i         -0.4924
B     2021-01-02  j          0.0245
B     2021-01-02  k          0.4982
B     2021-01-03  a          0.1597
B     2021-01-04  b          0.6254
The compound value is a real number (float64) between -1 and 1.
The count number is the number of reviewers on a specific date (int64).
I would like to add a column that calculates the cumulative weighted average as shown in the table below.
firm  date        reviewer  rate     cum_avg_compound
A     2021-01-01  a          0.6531  -0.12833
A     2021-01-01  b         -0.7213  -0.12833
A     2021-01-01  c         -0.3168  -0.12833
A     2021-01-02  d          0.3548   0.10962
A     2021-01-02  e          0.5783   0.10962
A     2021-01-03  f          0.4298   0.162983
A     2021-01-04  g          0.8769   0.264971
B     2021-01-01  h          0.7895   0.14855
B     2021-01-01  i         -0.4924   0.14855
B     2021-01-02  j          0.0245   0.20495
B     2021-01-02  k          0.4982   0.20495
B     2021-01-03  a          0.1597   0.1959
B     2021-01-04  b          0.6254   0.26748
Even after converting everything to the same float64 format, the result cannot be combined with the original frame using join.
The code I tried is as follows.
g = (
    df.groupby(['firm', 'date'])['compound']
      .agg(['sum', 'count'])
      .groupby(level='firm').cumsum()
)
df = df.join(
    g['sum'].div(g['count']).rename('cum_avg_compound'),
    on=['firm', 'date']
)
Is there any way to solve this problem?
Thank you in advance.
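No answer is recorded above for this one, but here is a minimal sketch of one common fix, assuming the join fails because the date key ends up as a plain string in one frame and as datetime64 in the other; converting the key to datetime before grouping and joining keeps both sides alignable (the column names follow the tables above):
import pandas as pd

# make the join key the same dtype everywhere
df['date'] = pd.to_datetime(df['date'])

# cumulative sum and count of compound per firm, then their ratio
g = (df.groupby(['firm', 'date'])['compound']
       .agg(['sum', 'count'])
       .groupby(level='firm').cumsum())

df = df.join(g['sum'].div(g['count']).rename('cum_avg_compound'),
             on=['firm', 'date'])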
Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the above example) with the number of rows I need to go back to find the current row's date minus 365 days (one year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows before I can find the date 2020-01-26.
If a perfect match is not available (like in the example df), I should reference the closest available date (or the closest smaller/larger date: it doesn't really matter in my case).
Any idea? Thanks
Edited to reflect OP's original question. Created a demo dataframe and a row_count column to hold the value. Then, for each row, build a filter that grabs all rows between that row's date and 365 days later; the shape[0] of that filtered dataframe is the number of rows in that window, and we write it into the appropriate field of the df.
# Import packages
import pandas as pd
from datetime import timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})

# create the column for the row count
df.insert(2, "row_count", '')

# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

for row in range(len(df['date'])):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # set the end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # fill row_count with the number of rows returned by the filter
    df.loc[row, 'row_count'] = filtered_df.shape[0]
You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
from io import StringIO
import pandas as pd

text = StringIO(
    """
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
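If the closest later date or simply the nearest date overall is preferred instead, the same call works with a different direction argument; only that keyword changes from the code above:
# "nearest" matches whichever existing date is closest;
# "forward" matches the closest later date instead
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="nearest",
)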
Question
Given a table (DataFrame) of events, where each event (row) has a start datetime, an end datetime, and an event category.
How can I transform this table into one where each row is a combination of day and category, with the number of hours that category's events were active on that day?
Example
Maybe it's easier to see an example than explain the problem:
I want to transform this DataFrame
datetime_start       datetime_end         event_category
2021-01-01 10:30:00  2021-01-03 16:30:00  'A'
2021-01-01 09:00:00  2021-01-01 15:30:00  'B'
2021-01-01 22:00:00  2021-01-01 23:00:00  'B'
Into this DataFrame
date        event_category  sum_of_hours_with_event_active
2021-01-01  'A'             13.5
2021-01-01  'B'              7.5
2021-01-02  'A'             24
2021-01-02  'B'              0
2021-01-03  'A'             16.5
2021-01-03  'B'              0
If you are certain there are no overlapping time periods on the same day within the same event category (or you want to double count those time periods) then you can create the basis of all dates by event categories and merge your timespans onto that DataFrame.
Then by subtracting with clipping we can calculate the total time that event contributes for that day only (resulting negative values don't correspond to that day so they get clipped to 0). Finally, we can sum within day by event.
import pandas as pd

# Enumerate all categories for every day.
dfb = pd.merge(pd.DataFrame({'event_category': df['event_category'].unique()}),
               pd.DataFrame({'date': pd.date_range(df.datetime_start.dt.normalize().min(),
                                                   df.datetime_end.dt.normalize().max(),
                                                   freq='D')}),
               how='cross')

# Merge the timespans onto the (category, date) basis
m = dfb.merge(df, on='event_category')

# Calculate the time each event contributes to that day only
m['sum_hours'] = ((m['datetime_end'].clip(upper=m['date'] + pd.offsets.DateOffset(days=1))
                   - m['datetime_start'].clip(lower=m['date']))
                  .clip(lower=pd.Timedelta(0)))

# Sum of hours per event category per day
m = (m.groupby(['event_category', 'date'])['sum_hours']
      .sum().dt.total_seconds().div(3600)
      .reset_index())

print(m)
  event_category       date  sum_hours
0              A 2021-01-01       13.5
1              A 2021-01-02       24.0
2              A 2021-01-03       16.5
3              B 2021-01-01        7.5
4              B 2021-01-02        0.0
5              B 2021-01-03        0.0
Data
import pandas as pd
start_times = pd.DatetimeIndex(['2021-01-01 10:30:00', '2021-01-01 09:00:00', '2021-01-01 22:00:00'])
end_times = pd.DatetimeIndex(['2021-01-03 16:30:00', '2021-01-01 15:30:00', '2021-01-01 23:00:00'])
categories = ['A', 'B', 'B']
df = pd.DataFrame({'datetime_start': start_times, 'datetime_end': end_times, 'event_category': categories})
Answer
First we groupby event_category so that the apply works per category. The concatenation of the two series represents the changes in the events, that is, the beginnings and ends of events. The groupby and sum inside the apply are needed in case there are multiple events which start or end at the same time in the same category. The cumulative sum (cumsum) gives the total number of events at the times that there were changes, that is, at the times when one or more event started or ended. Next we upsample with asfreq to the desired frequency. This should be at least equal to the time granularity of the data. Finally we resample again (implemented with groupby and Grouper objects) and sum.
Essentially we are counting the number of periods occupied by all the events in each category and multiplying by the length of a period (half hour in the example) and then grouping by day. The DateOffset object is used to parametrize the period.
step = pd.DateOffset(hours=0.5)  # Half hour steps

df.groupby('event_category') \
  .apply(lambda x: pd.concat([pd.Series(1, x['datetime_start']),
                              pd.Series(-1, x['datetime_end'])])
                     .groupby(level=0)
                     .sum()
                     .cumsum()
                     .asfreq(step, method='ffill')) \
  .groupby([pd.Grouper(level=0), pd.Grouper(level=1, freq='D')]) \
  .sum() * step.hours
This will work for overlapping events in the same category.
Results
event_category
A               2021-01-01    13.5
                2021-01-02    24.0
                2021-01-03    16.5
B               2021-01-01     7.5
dtype: float64
I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and run df = pd.read_clipboard() to reproduce) with the types as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with the 'date' via df['datetime'] = df['date'] + ' ' + df['time'] (with the aim of then converting the 'datetime' column with pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I convert 'hour' and 'minute' to str to concatenate the three columns to 'datetime', then I face the problem with the NaN values, which prevents me from converting the 'datetime' to the corresponding type.
I have also tried to first convert the 'date' column df['date']= df['date'].astype('datetime64[ns]') and again create the 'time' column df['time']= (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time to combine the two: df['datetime']= pd.datetime.combine(df['date'],df['time']) and it returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assume it could return 00:00:00)?
What if I have a row with all NaN values? Would it be possible to ignore all NaNs and have 'datetime' be NaN for this row?
Thank you in advance, ^_^
First convert date to datetimes, then add the hour and minute timedeltas, replacing missing values with a zero timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
                  pd.to_timedelta(df['hour'], unit='h').fillna(td) +
                  pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
                    .add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
                    .add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it reads a bit messy:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
         date  hour  minute              result
0  2021-01-01   7.0    15.0 2021-01-01 07:15:00
1  2021-01-02   3.0    30.0 2021-01-02 03:30:00
2  2021-01-02   NaN     NaN 2021-01-02 00:00:00
3  2021-01-03   9.0     0.0 2021-01-03 09:00:00
4  2021-01-04   4.0    45.0 2021-01-04 04:45:00
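Neither answer covers the all-NaN case from the question directly. One small variation on the first answer, assuming a fully missing row means the date itself is missing, leaves datetime as NaT for such rows:
# a missing date stays NaT because pd.to_datetime(NaN) is NaT
# and NaT plus any timedelta is still NaT
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date'], errors='coerce') +
                  pd.to_timedelta(df['hour'], unit='h').fillna(td) +
                  pd.to_timedelta(df['minute'], unit='m').fillna(td))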