I'm having trouble selecting data in a dataframe by hour.
I have a month's worth of data at 10-minute intervals.
I would like to select the data for each hour of a specific day, creating a separate dataframe for each hour. However, I am having trouble constructing the expression.
This is how I did it to select the day:
x=all_data.resample('D').index
for day in range(20):
    c=x.day[day]
    d=x.month[day]
    print data['%(a)s-%(b)s-2009' %{'a':c, 'b':d} ]
but if I do it for hour, it will not work.
x=data['04-09-2009'].resample('H').index
for hour in range(8):
    daydata=data['4-9-2009 %(a)s' %{'a':x.hour[hour]}]
I get the error:
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named 4-9-2009 0'
which is true, as the index is in the format dd/mm/yyyy hh:mm:ss
I'm sure this should be easy and has something to do with resample. The trouble is I don't want to do anything with the data, just select the dataframe (to correlate it afterwards).
Cheers
You don't need to resample your data unless you want to aggregate it into a daily value (e.g., sum, max, median).
If you just want a specific day's worth of data, you can use the following example of the .loc attribute to get started:
import numpy
import pandas
N = 3700
data = numpy.random.normal(size=N)
time = pandas.date_range(start='2013-02-15 14:30', periods=N, freq='10min')  # 10-minute steps
ts = pandas.Series(data=data, index=time)
ts.loc['2013-02-16']
The great thing about using .loc on a time series is that you can be as general or as specific as you want with the dates. So for a particular hour, you'd say:
ts.loc['2013-02-16 13'] # notice that i didn't put any minutes in there
Similarly, you can pull out a whole month with:
ts.loc['2013-02']
The issue you're having with the string formatting is that you're manually padding the string with a 0. So if you have a 2-digit hour (i.e., in the afternoon), you end up with a 3-digit representation of the hour, which is not valid. So if I wanted to loop through a specific set of hours, I would do:
hours = [2, 7, 12, 22]
for hr in hours:
    print(ts.loc['2013-02-16 {0:02d}'.format(hr)])
The 02d format string tells Python to construct a string from a digit (integer) that is at least two characters wide and to pad the string with a 0 on the left side if necessary. Also, you probably need to format your date as YYYY-mm-dd instead of the other way around.
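As a rough sketch of the original goal (one dataframe per hour of a given day), the day can be selected with a partial date string and then split by the hour of its index. The `data` frame and its `value` column below are synthetic stand-ins, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data: two days at 10-minute intervals.
idx = pd.date_range("2009-09-04", periods=2 * 24 * 6, freq="10min")
data = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

# Partial-string indexing picks out one day (ISO order: YYYY-MM-DD).
day = data.loc["2009-09-04"]

# One DataFrame per hour of that day, keyed by the hour (0-23).
hourly = {hour: frame for hour, frame in day.groupby(day.index.hour)}

print(hourly[13])
```

Grouping on `day.index.hour` avoids building date strings by hand, so there is no padding to get wrong.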
Related
I have 7 months of data at hourly and minute resolution. I want to drop the night-time data (7:30pm to 5:10am) from every day.
If you are using a datetime as the index, you don't need to use dt. Also, dt.hour returns an integer value, but you are using it with a string expression. You can filter like this:
df2=df.loc[(df.index.hour >= 5) & (df.index.hour <= 19)]
That mask only compares whole hours, though. There is a simpler way that also takes the minutes into account: use between_time():
df=df.between_time('05:10:00', '19:30:00')
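To make the difference between the two filters concrete, here is a small sketch on synthetic 10-minute data (the `value` column and the dates are made up for illustration). The hour mask keeps rows such as 05:00 and 19:50, while `between_time` cuts exactly at 05:10 and 19:30, both ends included by default.

```python
import pandas as pd

# Two days of synthetic 10-minute data.
idx = pd.date_range("2021-01-01", periods=2 * 24 * 6, freq="10min")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Whole-hour mask: keeps everything from 05:00 through 19:50.
coarse = df.loc[(df.index.hour >= 5) & (df.index.hour <= 19)]

# between_time: keeps 05:10 through 19:30, endpoints included.
daytime = df.between_time("05:10", "19:30")

print(len(coarse), len(daytime))
```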
I'm trying to pull some data from yfinance in Python for different funds from different exchanges. In pulling my data I just set-up the start and end dates through:
start = '2002-01-01'
end = '2022-06-30'
and pulling it through:
assets = ['GOVT', 'IDNA.L', 'IMEU.L', 'EMMUSA.SW', 'EEM', 'IJPD.L', 'VCIT',
          'LQD', 'JNK', 'JNKE.L', 'IEF', 'IEI', 'SHY', 'TLH', 'IGIB',
          'IHYG.L', 'TIP', 'TLT']
assets.sort()
data = yf.download(assets, start = start, end = end)
I guess you've noticed that the "assets" or the ETFs come from different exchanges such as ".L" or ".SW".
Now the result is this:
It seems to me that there is no overlap for a single instrument (i.e. two prices for the same day). So I don't think the data will be disturbed if any scrubbing or clean-up is done.
So my goal is to harmonize or consolidate the prices onto a date index rather than a date-and-time index, so that the prices of all instruments sit side by side for a particular date.
Thanks!
If you want the daily last closing price from the Yahoo Finance API, you could use the interval argument:
yf.download(assets, start=start, end=end, interval="1d")
Solution with Pandas:
Transforming the Index
You have an index where each row is a string representing the datetime. First, you want to transform those strings into an actual DatetimeIndex, where each row is of type datetime64. This makes it easy to work with the dates in your dataset using datetime functionality. Finally, you pick the date from each datetime64:
data.index = pd.to_datetime(data.index).date
Groupby
Now that you have an index of dates, you can group by the index. First, you want to deal with the NaN values. If a closing price should only be used to fill values within the same date, apply:
data= data.groupby(data.index).ffill()
Otherwise, if you think that the closing price of (e.g.) 1 October can be used to fill NaN values not only on 1 October but also on 2 and 3 October, simply apply ffill() without the groupby:
data= data.ffill()
Lastly, take the last observed record for each date (the index). Note that you can apply any function you want here, even a custom lambda:
data = data.groupby(data.index).last()
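The effect of the final groupby can be sketched on a tiny made-up frame where two instruments report at different times on the same day. The prices and timestamps below are invented for illustration, and `index.normalize()` is used here as one possible way to drop the time part while keeping a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Two instruments quoted at different times of day (invented values).
idx = pd.to_datetime([
    "2022-06-01 00:00", "2022-06-01 07:00",
    "2022-06-02 00:00", "2022-06-02 07:00",
])
prices = pd.DataFrame(
    {"GOVT": [23.1, np.nan, 23.3, np.nan],
     "IDNA.L": [np.nan, 5.5, np.nan, 5.6]},
    index=idx,
)

# Group rows by calendar date; last() takes the last non-null value
# of each column, so the two instruments end up side by side.
daily = prices.groupby(prices.index.normalize()).last()

print(daily)
```

Because `last()` skips NaN within each group, no explicit fill is needed when each instrument has at most one price per day.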
I have a csv-file (called "cameradata") with columns MeetingRoomID, and Time (There are more columns, but they should not be needed).
I would like to get the number of times a certain MeetingRoomID ("14094020", from the column "MeetingRoomID") is used during one day. Luckily, the csv-file only consists of timestamps from one day in the "Time" column. One problem is that the timestamps are in the datetime format %H:%M:%S, and I want to categorize the occurrences by the hour in which they occurred (between 07:00-18:00).
The goal is to have the occurrences linked to the hours of the timestamps, so that I can plot a barplot with x = "timestamps (hourly)" and y = "a dataframe/series that maps the certain MeetingRoomID to the hour it was used".
How can I get a function for my y-axis that understands that the value_count for ID 14094020 and the timestamps are connected?
So far I've come up with something like this:
y = cameradata.set_index('Time').resample('H')
cameradata['MeetingRoomID'].value_counts()[14094020]
My code seems to work if I divide it, but I do not know how to connect it in a syntax-friendly way.
Clarification:
The code: cameradata['MeetingRoomID'].value_counts().idxmax() revealed the ID with the most occurrences, so I think I'm onto something there.
Grateful for your help!
This is what the printed DataFrame looks like; 'Tid' is time and 'MätplatsID' is what I called MeetingRoomID.
For some reason the "Time" column had a made-up year and month added to it when I converted it to datetime. I converted it to datetime with: kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format=('%H:%M:%S'))
This is an example of what the output should look like in the end
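One possible way to connect the two pieces, sketched on invented rows: filter on the ID first, then count rows per hour of the timestamp. The sample values are made up, and the column names follow the question's English naming ("MeetingRoomID", "Time") rather than the actual file.

```python
import pandas as pd

# Invented sample rows standing in for the csv-file.
cameradata = pd.DataFrame({
    "MeetingRoomID": [14094020, 14094020, 11111111, 14094020, 14094020],
    "Time": ["07:05:00", "07:40:00", "07:45:00", "09:10:00", "17:59:59"],
})
cameradata["Time"] = pd.to_datetime(cameradata["Time"], format="%H:%M:%S")

# Keep only the room of interest, then count rows per hour of day.
room = cameradata[cameradata["MeetingRoomID"] == 14094020]
counts = (
    room.groupby(room["Time"].dt.hour)
        .size()
        .reindex(range(7, 18), fill_value=0)  # hours 07..17, zeros where unused
)

print(counts)
# counts.plot.bar() would then give one bar per hour (needs matplotlib).
```

The placeholder year and month that `to_datetime` adds do not matter here, since only `.dt.hour` is used.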
I took the difference of two columns, each of type pandas._libs.tslibs.period.Period. The result is of the pandas.tseries.offsets.Day datatype. Now, I want to use the integer value of the calculated time difference in other calculations. How can I do that?
I want the values in the last column to be simply integers.
Here is what I have tried.
## Check if all dates are in same format and take time upto days only, which will be suitable for given data
data_dates['ExaminDate'] = pd.to_datetime(data_dates["ExaminDate"],errors='coerce', infer_datetime_format= True)
data_dates["DeathDate"] = pd.to_datetime(data_dates["DeathDate"],errors='coerce',infer_datetime_format= True)
data_dates['ExaminMY']= data_dates['ExaminDate'].dt.to_period('D')
data_dates['DeathMY']= data_dates['DeathDate'].dt.to_period('D')
## Make a new column representing time of observation for each patient, which will be difference of two columns (ExaminDate and DeathDate)
data_dates['Time(days)'] = data_dates['DeathMY'] - data_dates['ExaminMY']
It's unclear why you chose to convert your dates to time periods in the first place - it prevents you from achieving the goal of calculating the time difference (in days) between two dates. The following two lines should, therefore, be removed:
data_dates['ExaminMY']= data_dates['ExaminDate'].dt.to_period('D')
data_dates['DeathMY']= data_dates['DeathDate'].dt.to_period('D')
Explanation: with Period objects, there's no clear definition of the time difference (in days or otherwise) between two time periods (e.g. Q42019 and Q12020). You could be referring to the start date, the end date, or some combination of the two. Plus, periods (offsets, really) like '1 month' or '1 quarter' can differ in the number of days they contain.
If what you're interested in is the time difference, in days, between DeathDate and ExaminDate, just do the calculation on the original datetime fields:
# I don't think you need these three lines, as you're reading the date from a file. It's just
# to make sure the example works.
df = pd.DataFrame({"ExamineDate": ['2020-01-15'], "DeathDate": ["2020-04-20"]})
df.ExamineDate = pd.to_datetime(df.ExamineDate)
df.DeathDate = pd.to_datetime(df.DeathDate)
# This is where the real stuff begins
df["days_diff"] = df.DeathDate - df.ExamineDate
df["days_diff_int"] = df.days_diff.dt.days
print (df)
The result is:
  ExamineDate  DeathDate days_diff  days_diff_int
0  2020-01-15 2020-04-20   96 days             96
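If the Period columns need to stay for some other reason, the integer can also be read off the offset objects themselves: each element of the difference is a date offset whose `n` attribute holds the count. This is only a sketch of that alternative, using a one-row frame with the question's column names and the same example dates.

```python
import pandas as pd

df = pd.DataFrame({
    "ExaminDate": pd.to_datetime(["2020-01-15"]),
    "DeathDate": pd.to_datetime(["2020-04-20"]),
})

# Daily periods, as in the question.
exam = df["ExaminDate"].dt.to_period("D")
death = df["DeathDate"].dt.to_period("D")

# Subtracting Period columns gives offset objects such as <96 * Days>;
# .n extracts the integer multiple from each one.
diff = death - exam
df["Time(days)"] = diff.apply(lambda offset: offset.n)

print(df)
```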
So far I've read in two CSVs and merged them on a common element. I take the merged output and iterate through the unique values of the element they were merged on. While I have them separated, I want to generate a daily count line and a two-week rolling average going backward from the current date. I cannot index on the 'Date Opened' field, but I still need my outputs ordered by it, with the most recent first. Once these are sorted by date, my daily count plotting issue will be resolved. My remaining task would be to compute a two-week rolling average for the count within the week. I've looked into the pandas documentation and I think rolling_mean will work, but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesn't seem to work. I know there is an easier way to do this, but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isn't sorted (even though I think I've instructed it to be) and is unusable in line form.
def data_sort():
    data_merge = data_extract()
    domains = data_merge.groupby('PWx Domain')
    for domain in domains.groups.items():
        dsort = (data_merge.loc[domain[1]])
        print (dsort.head())
        open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
        #open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
        open_ct = open_dt.value_counts(sort= False)
        biwk_avg = pd.rolling_mean(open_ct, 28)
        plt.plot(open_ct,'bo')
        plt.show()
data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like this:
for _, domain in data_merge.groupby('PWx Domain'):
    # Use the opening date as the index
    domain.index = pd.to_datetime(domain['Date Opened'])
    # Sort by date
    domain.sort_index(inplace=True)
    # Resample to daily values, then take the 14-day rolling mean
    # (pd.rolling_mean was removed from later pandas versions; use .rolling(14).mean())
    rolling = domain.resample('1D').mean().rolling(14).mean()
    plt.plot(rolling,'bo')
    plt.show()
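For the daily count the question asks about, a variant of the same idea is to count rows per day with `size()` and then take the rolling mean of those counts. The sketch below uses a handful of invented opening dates in place of the merged CSV data; `min_periods=1` is an assumption made here so that the first two weeks are not left as NaN.

```python
import pandas as pd

# Invented opening dates standing in for one domain's rows.
opened = pd.to_datetime(
    ["2016-01-01", "2016-01-01", "2016-01-03", "2016-01-10", "2016-01-20"]
)
tickets = pd.DataFrame({"Date Opened": opened, "PWx Domain": ["A"] * 5})

# Rows per calendar day; days with no rows get a count of 0.
daily_count = tickets.set_index("Date Opened").resample("1D").size()

# 14-day (two-week) rolling average of the daily counts.
biwk_avg = daily_count.rolling(14, min_periods=1).mean()

print(biwk_avg.tail())
```

Resampling already returns the days in chronological order, which also takes care of the unsorted line plot; `sort_index(ascending=False)` would put the most recent day first if that ordering is needed for output.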