I am starting out learning this wonderful tool, and I am stuck at the simple task of loading several time series and aligning them with a "master" date vector.
For example: I have a csv file, Data.csv, where the first row contains the headers "Date1, Rate1, Date2, Rate2", where Date1 holds the dates for Rate1 and Date2 holds the dates for Rate2.
In this case, Rate2 has more observations (the start date is the same as Date1's, but its end date extends further than the end date in Date1, and there are fewer missing values), and everything should be indexed according to Date2.
What is the preferred way to get the following DataFrame (or to accomplish something similar)?
index (Date2)   Rate1   Rate2
11/12/06          1.5     1.8
12/12/06          NaN     1.9
13/12/06          1.6     1.9
...
11/10/06          NaN     1.2
12/10/06          NaN     1.1
13/10/06          NaN     1.3
I have tried to follow the examples in the official pandas PDF and Googling, but to no avail. (I even bought the pre-edition of Mr McKinney's pandas book, but the chapters concerning pandas were not ready yet :( )
Is there a nice recipe for this?
Thank you very much
EDIT: Concerning the answer about separating the series into two .CSV files:
But what if I have very many time series, e.g.
Date1 Rate1 Date2 Rate2 ... DateN RateN
And all I know is that the dates should be almost the same, with exceptions coming from series that contain missing values (where there is no Date or Rate entry). (This would be typical of financial economics time series, by the way.)
Is the preferred way to load this dataset still to split every series into a separate .CSV?
EDIT2: archlight is completely right, just doing read_csv will mess things up.
Essentially my question then boils down to: how do I join several unaligned time series, where each series has a date column and a column for the series itself (a .CSV file exported from Excel)?
Thanks again
I don't think splitting up the data into multiple files is necessary. How about loading the file with read_csv and converting each date/rate pair into a separate time series? So your code would look like:
data = read_csv('foo.csv')
ts1 = Series(data['rate1'].values, index=data['date1'])  # .values so the rates are not reindexed against the new index
ts2 = Series(data['rate2'].values, index=data['date2'])
Now, to join them together and align the data in a DataFrame, you can do:
frame = DataFrame({'rate1': ts1, 'rate2': ts2})
This will form the union of the dates in ts1 and ts2 and align all of the data (inserting NA values where appropriate).
Or, if you have N time series, you could do:
all_series = {}
for i in range(1, N + 1):  # rates are numbered 1..N
    all_series['rate%d' % i] = Series(data['rate%d' % i].values, index=data['date%d' % i])
frame = DataFrame(all_series)
This is a very common pattern in my experience.
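A minimal end-to-end sketch of this approach (the make_series helper is purely illustrative, and it assumes the headers really are Date1, Rate1, Date2, Rate2 with day-first date strings as in the sample):
import pandas as pd

data = pd.read_csv('Data.csv')  # headers assumed: Date1, Rate1, Date2, Rate2

def make_series(df, date_col, rate_col):
    # drop rows where the date cell is blank, then index the rates by their own dates
    pair = df[[date_col, rate_col]].dropna(subset=[date_col])
    return pd.Series(pair[rate_col].values,
                     index=pd.to_datetime(pair[date_col], dayfirst=True))

ts1 = make_series(data, 'Date1', 'Rate1')
ts2 = make_series(data, 'Date2', 'Rate2')

# the DataFrame constructor takes the union of both date indexes and
# fills NaN wherever a series has no observation for a given date
frame = pd.DataFrame({'Rate1': ts1, 'Rate2': ts2})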
If you are sure that Date1 is a subset of Date2 and that Date2 contains no empty values, you can simply do
df = read_csv('foo.csv', index_col=2, parse_dates=True)
df = df[["rate1", "rate2"]]
But it gets complicated if Date2 has dates which Date1 doesn't have. I suggest you put each date/rate pair in a separate file with the date as a common header:
df1 = read_csv('foo1.csv', index_col=0, parse_dates=True)
df2 = read_csv('foo2.csv', index_col=0, parse_dates=True)
df1.join(df2, how="outer")
EDIT:
This method doesn't look good, so for the NaN entries in your date column you can do something like
dateindex2 = [datetime(int("20" + d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0]))
              for d in df['Date2'].values if notnull(d)]   # dd/mm/yy strings, skipping blank cells
ts2 = Series(df["Rate2"].dropna().values, index=dateindex2)
# same for ts1
df2 = DataFrame({"rate1": ts1, "rate2": ts2})
The thing is, you have to make sure there is no case where a date exists but its rate is missing, because dropna() will shift the records so they no longer line up with the index.
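As an aside, the manual string splitting can usually be replaced with pd.to_datetime; a short sketch, assuming the dates really are day-first dd/mm/yy strings as in the sample data:
import pandas as pd

df = pd.read_csv('foo.csv')
dates2 = pd.to_datetime(df['Date2'], format='%d/%m/%y')     # blank cells become NaT
ts2 = pd.Series(df['Rate2'].values, index=dates2).dropna()
# same for ts1, then align with DataFrame({'rate1': ts1, 'rate2': ts2})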
I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance: 2021-10-26 09:30:00-4:00, 2021-10-26 10:30:00-4:00, 2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00, etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your columns are datetime objects and not strings, you can do something like this:
df = pd.DataFrame()
# ... input data, etc. ...
kept = []
for col in df.columns:
    # each column label is a Timestamp, so .hour and .minute are available directly
    if col.hour in (6, 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
See the section about halfway down on working with time in pandas at this source:
https://www.dataquest.io/blog/python-datetime-tutorial/
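Note that the question puts the datetimes in the index rather than in the columns; a minimal sketch of the index-based variant (the sample frame below is made up purely for illustration) could be:
import pandas as pd
import numpy as np

# hypothetical hourly bars with a DatetimeIndex
idx = pd.date_range('2021-10-26 06:30', periods=48, freq='H')
df = pd.DataFrame({'close': np.random.randn(len(idx))}, index=idx)

# keep only the 06:30 and 10:30 rows of each day, drop everything else
mask = ((df.index.hour == 6) | (df.index.hour == 10)) & (df.index.minute == 30)
df = df[mask]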
Python newbie here, but I have some intraday financial data going back to 2012, so it has the same hours each day (the same trading session each day) but different dates. I want to be able to select certain times out of the data, check the corresponding OHLC data for that period, and then do some analysis on it.
So at the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string. Is it still possible to select between certain times, like between "09:00:00" and "10:00:00", or do I have to separate the time part from the date and make it its own column? If so, how?
I believe pandas has a between_time() function, but that seems to need a DataFrame, so how can I convert this to a DataFrame so I can use between_time to select between the times I want? Also, because there are obviously thousands of days, each with its own "xx:xx:xx" to "xx:xx:xx" window, I want to pull that same time period from every day, not just the first "xx:xx:xx" to "xx:xx:xx" range as it works its way down the data, if that makes sense. Thanks!!
Consider the dataframe df
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
Option 1: convert the index to a series, then use dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
Option 2: NumPy broadcasting
slc = (df.index.hour.values[:, None] == [9, 10]).any(1)
df.loc[slc]
Response to comment: to get the range within that time slot per day, use resample + agg + np.ptp (peak to peak):
df.loc[slc].resample('D').agg(np.ptp)
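Since the question also mentions between_time: once the timestamps are parsed and set as the index, DataFrame.between_time pulls the same window out of every day. A sketch, assuming the column names from the question and day-first timestamps:
import pandas as pd

data = pd.read_csv('data.csv')
data['date'] = pd.to_datetime(data['date'], dayfirst=True)   # "dd/mm/yyyy HH:MM:SS" strings
data = data.set_index('date')

# rows between 09:00 and 10:00 for every trading day in the file
morning = data.between_time('09:00', '10:00')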
I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the first date for which data is available in the shorter series, and remove the data in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NaNs in one of the columns for more recent dates).
You can use idxmax on isnull() of the reversed series s = df['osr'][::-1] and then take a subset of df:
print df
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print s
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print maxnull
#1990-08-20 00:00:00
print df[df.index > maxnull]
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
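An alternative, assuming the goal is simply to drop every row before the first non-NaN value in the shorter column, is first_valid_index:
start = df['osr'].first_valid_index()   # 1990-08-21, the first date with a non-NaN osr
df = df.loc[start:]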
EDIT: New answer based upon comments/edits
It sounds like the data is sequential, and once you reach rows that don't have data you want to throw them out. This can be done easily with dropna:
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good. Or perhaps you don't care about dropping rows in the middle... it depends on how sequential you need to be. If the data needs to stay sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series, i.e. the one with fewer non-NaN observations:
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the first date in that series which actually has data
remove_date = shorter_col.first_valid_index()
Now we want to remove the data before that date
df = df.loc[remove_date:]
Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is, within each ID, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem is that not all dates exist for all IDs, so a simple .shift(7) won't pull the correct value [in that case I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example this rough idea:
[
df[
df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
][
df['id'] == df['id'].iloc[i]
]['value']
for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within that, and I'm certain that would reduce the time the operation takes - but is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last-value interpolation. I'm not sure if you want this; possibly bfill, which will use the last value less than a week in the past, but that is simple to change. Alternatively, if you want NaN when nothing is available 7 days in the past, you can just remove the *fill step completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})

df = pd.concat([A, B])

with_lags = []
for key, group in df.groupby('id'):   # 'key' avoids shadowing the builtin id
    group = group.set_index(group.date)
    index = group.index
    # expand to a full daily index so that shift(7) really means "7 calendar days"
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]
    with_lags.append(group)

with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
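An alternative sketch that avoids reindexing each group is a self-merge on (id, date - 7 days); unlike the ffill step above, it leaves NaN whenever there is no observation exactly seven days earlier (column names as in the question, reusing the df built above):
lag = df[['id', 'date', 'value']].copy()
lag['date'] = lag['date'] + pd.Timedelta(weeks=1)           # each value lines up with the row one week later
lag = lag.rename(columns={'value': 'lag_value'})

with_lags = df.merge(lag, on=['id', 'date'], how='left')    # NaN where the lagged date is missing for that id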
I've created a pandas dataframe from a 205MB csv (approx 1.1 million rows by 15 columns). It holds a column called starttime that is dtype object (it's more precisely a string). The format is as follows: 7/1/2015 00:00:03.
I would like to create two new dataframes from this pandas dataframe. One should contain all rows corresponding with weekend dates, the other should contain all rows corresponding with weekday dates.
Weekend dates are:
weekends = ['7/4/2015', '7/5/2015', '7/11/2015', '7/12/2015',
            '7/18/2015', '7/19/2015', '7/25/2015', '7/26/2015']
I attempted to convert the string to datetime (pd.to_datetime), hoping that would make the values easier to parse, but when I do, it hangs for so long that I ended up restarting the kernel several times.
Then I decided to use df["date"], df["time"] = zip(*df['starttime'].str.split(' ').tolist()) to create two new columns in the original dataframe (one for the date, one for the time). Next I figured I'd use a boolean test to 'flag' weekend records (according to the new date field) as True and all others as False, create another column holding those values, and then group by True and False.
For example,
test1 = bikes['date'] == '7/1/2015' returns True for all 7/1/2015 values, but I can't figure out how to iterate over all the items in weekends so that I get True for every weekend date. I tried this and broke Python (it hung again):
for i in weekends:
    for k in df['date']:
        test2 = df['date'] == i
I'd appreciate any help (with both my logic and my code).
First, create a DataFrame of string timestamps with 1.1m rows:
df = pd.DataFrame({'date': ['7/1/2015 00:00:03', '7/1/2015 00:00:04'] * 550000})
Next, you can simply convert them to Pandas timestamps as follows:
df['ts'] = pd.to_datetime(df.date)
This operation took just under two minutes. However, it took under seven seconds when the format was specified:
df['ts'] = pd.to_datetime(df.date, format='%m/%d/%Y %H:%M:%S')
Now, it is easy to set up a weekend flag as follows (which took about 3 seconds):
df['weekend'] = [d.weekday() >= 5 for d in df.ts]
Finally, it is easy to subset your DataFrame, which takes virtually no time:
df_weekdays = df.loc[~df.weekend, :]
df_weekends = df.loc[df.weekend, :]
The weekend flag is to help explain what is happening. You can simplify as follows:
df_weekdays = df.loc[df.ts.apply(lambda ts: ts.weekday() < 5), :]
df_weekends = df.loc[df.ts.apply(lambda ts: ts.weekday() >= 5), :]
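If the list comprehension over df.ts ever becomes a bottleneck at 1.1 million rows, the .dt accessor gives a fully vectorized equivalent:
df['weekend'] = df['ts'].dt.weekday >= 5        # Monday=0 ... Sunday=6
df_weekdays = df.loc[~df['weekend']]
df_weekends = df.loc[df['weekend']]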