I have a large dataset ('df'; ~400,000 rows) with a datetime index, describing features of cities.
eg.
df = pd.DataFrame([['2016-01-01 00:00:00','Jacksonville'], ['2016-01-01 01:00:00','Jacksonville'],
['2016-01-01 02:00:00','Jacksonville'], ['2016-01-01 03:00:00','Toronto']], columns=['timestamp','City'])
I want to merge this with another smaller dataset I've created ('public_holidays'; ~300 lines) that lists public holidays for those cities.
eg.
public_holidays = pd.DataFrame([['1/01/2016','New Year\'s Day','Jacksonville'], ['1/01/2016','New Year\'s Day','San Francisco'],
['25/12/2018','Christmas Day','Toronto'], ['26/12/2018','Boxing Day','Toronto']], columns=['timestamp','Holiday','City'])
Currently I've done this:
new_df = pd.merge(df, public_holidays, how='left', on=['timestamp', 'City'])
This works, however as 'df's timestamp contains every hour of each day, the merge only occurs at hour 00:00 (as 'public_holidays' "timestamp" only holds dates).
How can I get 'public_holidays' to map to every row that matches its date, regardless of time?
Many thanks for any assistance.
Add an auxiliary column to df containing the normalized (midnight) timestamp:
df['dat'] = df.timestamp.dt.normalize()
Then in the merge, instead of on=..., pass:
left_on=['dat', 'City'],
right_on=['timestamp', 'City'].
Finally (after the new_df is created) you can drop this auxiliary column.
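Putting it together, a minimal sketch (the pd.to_datetime calls, dayfirst=True and the suffixes argument are my assumptions about how the sample strings should be parsed and how the duplicate 'timestamp' column should be named):
df['timestamp'] = pd.to_datetime(df['timestamp'])
public_holidays['timestamp'] = pd.to_datetime(public_holidays['timestamp'], dayfirst=True)
df['dat'] = df['timestamp'].dt.normalize()  # midnight of each hourly timestamp
new_df = pd.merge(df, public_holidays, how='left',
                  left_on=['dat', 'City'],
                  right_on=['timestamp', 'City'],
                  suffixes=('', '_holiday'))
new_df = new_df.drop(columns='dat')  # drop the auxiliary column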
An alternative is to overwrite timestamp column with the normalized timestamp:
df.timestamp = df.timestamp.dt.normalize()
and perform the merge without any change.
Note: Not having run this against your actual data, the above advice is only "theoretical", not backed by an actual test run.
I'm trying to pull some data from yfinance in Python for different funds from different exchanges. To pull my data I just set up the start and end dates:
start = '2002-01-01'
end = '2022-06-30'
and pull the data with:
assets = ['GOVT', 'IDNA.L', 'IMEU.L', 'EMMUSA.SW', 'EEM', 'IJPD.L', 'VCIT',
'LQD', 'JNK', 'JNKE.L', 'IEF', 'IEI', 'SHY', 'TLH', 'IGIB',
'IHYG.L', 'TIP', 'TLT']
assets.sort()
data = yf.download(assets, start = start, end = end)
I guess you've noticed that the "assets" or the ETFs come from different exchanges such as ".L" or ".SW".
The result, however, is indexed by date-and-time rather than by date alone.
It seems to me that there is no overlap for a single instrument (i.e. two prices for the same day). So I don't think the data will be disturbed if any scrubbing or clean-up is done.
So my goal is to harmonize or consolidate the prices onto a date index rather than a date-and-time index, so that for a particular date the prices of all instruments sit side by side.
Thanks!
If you want the daily closing price from the yahoo-finance API, you could use the interval argument:
yf.download(assets, start=start, end=end, interval="1d")
Solution with Pandas:
Transforming the Index
You have an index where each row is a string representing the datetime. First, transform those strings into an actual DatetimeIndex, where each entry is of type datetime64; this makes it easy to work with dates in your dataset using datetime functionality. Finally, keep only the date part of each datetime64:
data.index = pd.to_datetime(data.index).date
Groupby
Now that you have an index of dates you can group by the index. First, deal with the NaN values. If a closing price should only be used to fill values within its own date, apply:
data = data.groupby(data.index).ffill()
Otherwise, if you think the closing price of (e.g.) the 1st of October can also be used to fill NaN values on the 2nd and 3rd of October, simply apply ffill() without the groupby:
data = data.ffill()
Lastly, take the last observed record in each date group (the index). Note that you can apply any aggregation you want here, even a custom lambda:
data = data.groupby(data.index).last()
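Putting the steps together (a short recap sketch, assuming data is the frame returned by yf.download above; converting the index back to a DatetimeIndex at the end is my own addition):
import pandas as pd

data.index = pd.to_datetime(data.index).date  # keep only the date part of the index
data = data.groupby(data.index).ffill()       # fill NaNs within each date
data = data.groupby(data.index).last()        # one row per date: the last observation
data.index = pd.to_datetime(data.index)       # back to a DatetimeIndex, if needed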
I am trying to drop specific rows in a dataframe whose index is a datetime at 1-hour intervals, based on the time of day (it is hourly stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your columns are datetime objects (Timestamps) and not strings, you can do something like this:
import pandas as pd

df = pd.DataFrame()
# ...input data, etc...

columns = df.columns
kept = []
for col in columns:
    # keep only the 06:30 and 10:30 columns
    if (col.hour == 6 or col.hour == 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
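If, as in your example, the timestamps are in the index rather than in the columns, a similar sketch using the DatetimeIndex attributes (df stands for your hourly frame; 06:30 and 10:30 are the times from your question):
mask = df.index.hour.isin([6, 10]) & (df.index.minute == 30)  # True for the 06:30 and 10:30 rows
df = df[mask]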
See about halfway down this page for more on working with time in pandas:
https://www.dataquest.io/blog/python-datetime-tutorial/
I have a large (+10m rows) dataframe with three columns: sales dates (dtype: datetime64[ns]), customer names and sales per customer. Sales dates include day, month and year in the form yyyy-mm-dd (e.g. 2019-04-19). I discovered the pandas to_period function and would like to use the period[A-MAR] dtype. As the business year (ending in March) differs from the calendar year, that is exactly what I was looking for. With to_period I can assign each sales date to the correct business year without creating new columns with additional information.
I convert the date column as follows:
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period("A-MAR")
Now a peculiar issue arises when I use pivot_table to aggregate the data and set margins=True. The aggfunc returns the correct values in the output table. However, the results in the last row (the totals created by the margins) are wrong: NaN is shown (or, in my case, 0, since I set fill_value=0). The function I use:
df_output = df_input.pivot_table(index="Customer",
                                 columns="Date",
                                 values="Sales",
                                 aggfunc={"Sales": np.sum},
                                 fill_value=0,
                                 margins=True)
When I do not convert the dates to a period but use a simple year (integer) instead, the margins are calculated correctly and no NaN appears in the last row of the pivot output table.
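For reference, the integer-year variant I mean looks roughly like this (using the datetime column before the to_period conversion; the 'Year' column name is just for illustration):
df_input['Year'] = pd.DatetimeIndex(df_input['Date']).year
df_output = df_input.pivot_table(index="Customer",
                                 columns="Year",
                                 values="Sales",
                                 aggfunc={"Sales": np.sum},
                                 fill_value=0,
                                 margins=True)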
I searched all over the internet but could not find a solution that was working. I would like to keep working with the period datatype and just need the margins to be calculated correctly. I hope someone can help me out here. Thank you!
I have a time series hourly_df, containing some hourly data:
import pandas as pd
import numpy as np
hourly_index = pd.date_range(start='2018-01-01', end='2018-01-07', freq='H')
hourly_data = np.random.rand(hourly_index.shape[0])
hourly_df = pd.DataFrame(hourly_data, index=hourly_index)
and I have a DatetimeIndex, containing some dates (as days as I wish), e.g.
daily_index = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-06'])
I want to select each row of hourly_df whose index date is in daily_index, so in my case all hourly data from the 1st, 5th and 6th of January. What is the best way to do this?
If I naively use hourly_df.loc[daily_index], I only get the rows at 0:00:00 for each of the three days. What I want is the hourly data for the whole day for each of the days in daily_index.
One possibility is to create a filter that takes the date of each element in the index of hourly_df and checks whether or not that date is in daily_index:
day_filter = [hour.date() in daily_index.date for hour in hourly_df.index]
hourly_df[day_filter]
This produces the desired output, but it seems the filter should be avoidable, with an expression similar to hourly_df.loc[daily_index.date].
Save the daily_index as a DataFrame, then
merge on the dates using hourly_df.merge(daily_df, how='inner', ...), as sketched below.
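A minimal sketch of that idea (the daily_df frame and the 'date' column are my own names; the hourly index is floored to its date so the inner merge keeps whole days):
daily_df = pd.DataFrame({'date': daily_index})                 # daily_index as a DataFrame
hourly = hourly_df.reset_index().rename(columns={'index': 'timestamp'})
hourly['date'] = hourly['timestamp'].dt.normalize()            # strip the time component
result = hourly.merge(daily_df, on='date', how='inner')        # keep only the matching days
result = result.drop(columns='date').set_index('timestamp')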
I am starting out learning this wonderful tool, and I am stuck at the simple task of loading several time series and aligning them with a "master" date vector.
For example: I have a csv file, Data.csv, where the first row contains the headers "Date1, Rate1, Date2, Rate2", where Date1 holds the dates for Rate1 and Date2 the dates for Rate2.
In this case, Rate2 has more observations (the start date is the same as Date1, but the end date is further out than the end date in Date1, and there are fewer missing values), and everything should be indexed according to Date2.
What is the preferred way to get the following DataFrame (or accomplish something similar)?
index(Date2)    Rate1    Rate2
11/12/06        1.5      1.8
12/12/06        NaN      1.9
13/12/06        1.6      1.9
...
11/10/06        NaN      1.2
12/10/06        NaN      1.1
13/10/06        NaN      1.3
I have tried to follow the examples in the official pandas.pdf and Googling, but to no avail. (I even bought the pre-edition of Mr. McKinney's pandas book, but the chapters concerning pandas were not ready yet :( )
Is there a nice recipe for this?
Thank you very much
EDIT: Concerning the answer suggesting splitting the series into two .CSV files:
But what if I have very many time series, e.g.
Date1 Rate1 Date2 Rate2 ... DateN RateN
And all I know is that the dates should be almost the same, with exceptions coming from series that contain missing values (where there is no Date or Rate entry). (This would be typical of financial economics time series, by the way.)
Is the preferred way to load this dataset still to split every series into a separate .CSV?
EDIT2: archlight is completely right, naively doing read_csv will mess things up.
Essentially my question then boils down to: how do I join several unaligned time series, where each series has a date column and a column for the series itself (a .CSV file exported from Excel)?
Thanks again
I don't think splitting up the data into multiple files is necessary. How about loading the file with read_csv and converting each date/rate pair into a separate time series? So your code would look like:
from pandas import read_csv, Series, DataFrame

data = read_csv('foo.csv')
# .values avoids aligning on the original integer index
ts1 = Series(data['rate1'].values, index=data['date1'])
ts2 = Series(data['rate2'].values, index=data['date2'])
Now, to join them together and align the data in a DataFrame, you can do:
frame = DataFrame({'rate1': ts1, 'rate2': ts2})
This will form the union of the dates in ts1 and ts2 and align all of the data (inserting NA values where appropriate).
Or, if you have N time series, you could do:
all_series = {}
for i in range(1, N + 1):   # the series are named rate1 .. rateN
    all_series['rate%d' % i] = Series(data['rate%d' % i].values, index=data['date%d' % i])

frame = DataFrame(all_series)
This is a very common pattern in my experience.
If you are sure that Date1 is a subset of Date2 and Date2 contains no empty values, you can simply do:
df = read_csv('foo.csv', index_col=2, parse_dates=True)
df = df[["rate1", "rate2"]]
but it will get complicated if Date2 has dates which Date1 doesn't have. I suggest you put each date/rate pair in a separate file, with the date as a common header:
df1 = read_csv('foo1.csv', index_col=0, parse_dates=True)
df2 = read_csv('foo2.csv', index_col=0, parse_dates=True)
df1.join(df2, how="outer")
EDIT:
This method doesn't look good, so for the NaN values in your datetime column you can do something like:
from datetime import datetime
from pandas import notnull, Series, DataFrame

# build datetimes from the non-missing Date2 strings (parsed as month/day/yy here)
dateindex2 = [datetime(int("20" + x.split("/")[2]), int(x.split("/")[0]), int(x.split("/")[1]))
              for x in df['Date2'].values if notnull(x)]
ts2 = Series(df["Rate2"].dropna().values, index=dateindex2)
# same for ts1
df2 = DataFrame({"rate1": ts1, "rate2": ts2})
The thing is, you have to watch out for the case where a date exists but the rate doesn't: dropna() will then shift the records so they no longer line up with the index.
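A sketch of a safer variant that avoids that pitfall by dropping the date and the rate together (dayfirst=True is an assumption based on the 13/12/06-style sample above):
from pandas import to_datetime

pair2 = df[['Date2', 'Rate2']].dropna()            # drop rows where either entry is missing
dateindex2 = to_datetime(pair2['Date2'], dayfirst=True)
ts2 = Series(pair2['Rate2'].values, index=dateindex2)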