I have a DataFrame with a column of the time and a column in which I have stored a time lag. The data looks like this:
2020-04-18 14:00:00 0 days 03:00:00
2020-04-19 02:00:00 1 days 13:00:00
2020-04-28 14:00:00 1 days 17:00:00
2020-04-29 20:00:00 2 days 09:00:00
2020-04-30 19:00:00 2 days 11:00:00
Time, Length: 282, dtype: datetime64[ns] Average time lag, Length: 116, dtype: object
I want to plot the Time on the x-axis vs the time lag on the y-axis. However, I keep having errors with plotting the second column. Any tips on how to handle this data for the plot?
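One possible approach, sketched under assumptions: because the lag column has dtype object, it can first be parsed with pd.to_timedelta and then turned into a plain number of hours, which plots like any other numeric column. The column names and the five sample rows are taken from the data shown above; the name lag_hours is made up for illustration. The two columns above also report different lengths (282 vs 116), so the real data may need aligning first.

```python
import pandas as pd

# Sample rows copied from the question; the lag is stored as strings (dtype object).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2020-04-18 14:00:00", "2020-04-19 02:00:00", "2020-04-28 14:00:00",
        "2020-04-29 20:00:00", "2020-04-30 19:00:00",
    ]),
    "Average time lag": [
        "0 days 03:00:00", "1 days 13:00:00", "1 days 17:00:00",
        "2 days 09:00:00", "2 days 11:00:00",
    ],
})

# Parse the strings into timedeltas, then express them as float hours.
lag = pd.to_timedelta(df["Average time lag"])
df["lag_hours"] = lag.dt.total_seconds() / 3600  # hypothetical column name

print(df[["Time", "lag_hours"]])

# With matplotlib installed, the numeric column can then be plotted, e.g.:
# df.plot(x="Time", y="lag_hours")
```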
Related
I am trying to extract the hours, minutes and seconds from the data frame column below.
0 days 09:30:00
0 days 10:00:00
0 days 10:30:00
0 days 11:00:00
0 days 11:30:00
0 days 12:00:00
0 days 12:30:00
0 days 01:00:00
0 days 01:30:00
I want to remove the "0 days" part from the data frame column, so that it ends up in the format below:
09:30:00
10:00:00
10:30:00
11:00:00
11:30:00
12:00:00
12:30:00
01:00:00
01:30:00
If you use the .split() method and pass in a space, it will return a list of all the words in the string. i.e: ['0', 'days', '09:30:00']. Then, you just have to return the third entry, assuming that everything follows this format.
def foo(string):
    return string.split(' ')[2]
Alternatively, using Series.apply() to keep only the last eight characters (assuming the column holds strings), it will look like this:
df['newcolumn'] = df['originalcolumn'].apply(lambda x: x[-8:])
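If the column is a real timedelta column rather than strings, string methods will not apply directly. A minimal sketch of one alternative, assuming the column name originalcolumn from the answer above: pd.to_timedelta accepts both strings and timedeltas, and the .dt.components accessor exposes the hours, minutes and seconds as integers that can be zero-padded and joined. Note that this discards the day count, which is only harmless when every value is "0 days".

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({
    "originalcolumn": ["0 days 09:30:00", "0 days 10:00:00", "0 days 01:30:00"],
})

# Works whether the column holds strings or timedeltas.
lag = pd.to_timedelta(df["originalcolumn"])
parts = lag.dt.components  # DataFrame with days, hours, minutes, seconds, ...

# Zero-pad each component and join them as HH:MM:SS.
df["newcolumn"] = (
    parts["hours"].astype(str).str.zfill(2)
    + ":" + parts["minutes"].astype(str).str.zfill(2)
    + ":" + parts["seconds"].astype(str).str.zfill(2)
)

print(df)
```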
I made predictions with an ARIMA model that predicts the number of cars on the road for the next 168 hours (one week). I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
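A variation on the same idea, as a sketch: instead of working out the end timestamp, date_range can be given a start and a number of periods, so the column always matches the number of prediction rows. The DataFrame name predictions and its cars column are placeholders, not from the question; the frequency is passed as a pd.Timedelta to sidestep differences in frequency alias spelling between pandas versions.

```python
import pandas as pd

# Placeholder for the 168 ARIMA predictions (hypothetical values).
predictions = pd.DataFrame({"cars": range(168)})

# One timestamp per row, starting at 2021-01-01 00:00 and stepping one hour.
predictions["datetime"] = pd.date_range(
    start="2021-01-01 00:00",
    periods=len(predictions),
    freq=pd.Timedelta(hours=1),
)

print(predictions.head())
print(predictions.tail())
```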
I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to drop the per-minute rows and keep only the row for each full hour, using Python, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could any one help me? Thank you.
I believe you need to use pandas resample. Keep in mind that, because resampling converts the data to a new frequency, you must also say how the values in each new time bin should be combined (summing them, averaging them, taking the difference, etc.); otherwise you just get a DatetimeIndexResampler object back. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
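If the goal is only to keep the rows that fall exactly on the hour, without aggregating anything, a boolean filter on the index may be enough. This is a sketch under assumptions: the timestamps are parsed into a DatetimeIndex, and the column name data and its values are made up to stand in for the CSV contents.

```python
import pandas as pd

# Two hours of per-minute rows, standing in for the CSV file (made-up values).
index = pd.date_range("2012-01-01 00:00:00", periods=121, freq="min")
df = pd.DataFrame({"data": range(121)}, index=index)

# Keep only the rows whose timestamp is exactly on the hour.
on_the_hour = (df.index.minute == 0) & (df.index.second == 0)
hourly = df[on_the_hour]

print(hourly)
```

Unlike resample(...).sum(), this leaves the surviving values untouched; it simply discards the other rows.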
I am working on a data frame with a DatetimeIndex of hourly temperature data spanning a couple of years. I want to add a column with the minimum temperature between 20:00 of a day and 8:00 of the following day. Daytime temperatures - from 8:00 to 20:00 - are not of interest. The result can either be at the same hourly resolution as the original data or be resampled to days.
I have researched a number of strategies to solve this, but am unsure which is the most efficient (primarily in terms of coding effort, secondarily in terms of computation) and the most pythonic way to do it. Some of the possibilities I have come up with:
1. Attach a column with labels 'day', 'night' depending on df.index.hour and use groupby or df.loc to find the minimum.
2. Resample to 12h and drop every second value. Not sure how I can make the resampling period start at 20:00.
3. Add a multi-index - I guess this is similar to approach 1, but feels a bit over the top for what I'm trying to achieve.
4. Use df.between_time (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html#pandas.DataFrame.between_time), though I'm not sure if the date change over midnight will make this a bit messy.
5. Lastly, there is some discussion about combining rolling with a stepping parameter as a new pandas feature: https://github.com/pandas-dev/pandas/issues/15354
Original df looks like this:
datetime temp
2009-07-01 01:00:00 17.16
2009-07-01 02:00:00 16.64
2009-07-01 03:00:00 16.21 #<-- minimum for the night 2009-06-30 (previous date since periods starts 2009-06-30 20:00)
... ...
2019-06-24 22:00:00 14.03 #<-- minimum for the night 2019-06-24
2019-06-24 23:00:00 18.87
2019-06-25 00:00:00 17.85
2019-06-25 01:00:00 17.25
I want to get something like this (min temp from day 20:00 to day+1 8:00):
datetime temp
2009-06-30 23:00:00 16.21
2009-07-01 00:00:00 16.21
2009-07-01 01:00:00 16.21
2009-07-01 02:00:00 16.21
2009-07-01 03:00:00 16.21
... ...
2019-06-24 22:00:00 14.03
2019-06-24 23:00:00 14.03
2019-06-25 00:00:00 14.03
2019-06-25 01:00:00 14.03
or a bit more succinct:
datetime temp
2009-06-30 16.21
... ...
2019-06-24 14.03
Use the base option to resample:
rs = df.resample('12h', base=8).min()
Then keep only the rows for 20:00:
rs[rs.index.hour == 20]
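You can use pd.Grouper (the replacement for the older TimeGrouper) with freq='12h' and base=8 to split the DataFrame into 12-hour chunks, so that each night runs from 20:00 to 08:00 of the following day, and then just call .min() on each chunk.
Try this: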
import pandas as pd
from io import StringIO
s = """
datetime temp
2009-07-01 01:00:00 17.16
2009-07-01 02:00:00 16.64
2009-07-01 03:00:00 16.21
2019-06-24 22:00:00 14.03
2019-06-24 23:00:00 18.87
2019-06-25 00:00:00 17.85
2019-06-25 01:00:00 17.25"""
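# raw string avoids an invalid-escape warning; a regex separator needs the python engine
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")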
df['datetime'] = pd.to_datetime(df['datetime'])
result = df.sort_values('datetime').groupby(pd.Grouper(freq='12h', base=8, key='datetime')).min()['temp'].dropna()
print(result)
Output:
datetime
2009-06-30 20:00:00 16.21
2019-06-24 20:00:00 14.03
Name: temp, dtype: float64
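Both answers above rely on the base argument. As far as I know, base was deprecated in pandas 1.1 in favour of offset and is no longer accepted in pandas 2.x, so on a recent version the same 12-hour binning would need offset='8h' instead. The sketch below assumes pandas 1.1 or later and uses made-up hourly temperatures (simply 0, 1, 2, ...) so the result is easy to check by eye.

```python
import pandas as pd

# 36 hourly readings starting at 20:00; the temperatures are synthetic.
index = pd.date_range("2009-06-30 20:00", periods=36, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temp": range(36)}, index=index)

# 12-hour bins shifted by 8 hours, so bin edges fall at 08:00 and 20:00.
rs = df["temp"].resample("12h", offset="8h").min()

# Keep only the night bins, i.e. those labelled with their 20:00 start.
night_min = rs[rs.index.hour == 20]

print(night_min)
```

Each remaining row is labelled with the 20:00 timestamp at which that night starts, matching the shape of the output shown above.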
I have a df of crypto data and am trying to see whether there is a particular time of the day/week when prices move one way or the other. I have the timestamp, the day of the week and the return from the previous timestamp's close, as in the example data below.
Date Day Return
2019-06-22 01:00:00 Saturday -0.046910
2019-06-22 07:00:00 Saturday -0.018756
2019-06-22 13:00:00 Saturday 0.036842
2019-06-22 19:00:00 Saturday 0.000998
2019-06-23 01:00:00 Sunday 0.017672
2019-06-23 07:00:00 Sunday 0.021102
2019-06-23 13:00:00 Sunday -0.014737
2019-06-23 19:00:00 Sunday -0.039085
2019-06-24 01:00:00 Monday 0.009690
2019-06-24 07:00:00 Monday -0.004367
2019-06-24 13:00:00 Monday -0.005342
2019-06-24 19:00:00 Monday 0.001060
2019-06-25 01:00:00 Tuesday -0.027738
2019-06-25 07:00:00 Tuesday -0.001599
2019-06-25 13:00:00 Tuesday 0.006247
2019-06-25 19:00:00 Tuesday -0.036937
2019-06-26 01:00:00 Wednesday -0.064866
2019-06-26 07:00:00 Wednesday 0.012319
My first issue is that the timestamp is confusing. Because I get data from different exchanges, the timestamps differ across many of them, so I have abandoned the idea of trying to standardise the Date column and would now just like a new column that numbers the period within each day. So the first 6 hours of each Saturday would be Saturday_1 and so on. In the end I would have 28 different categories (4 time periods x 7 days of the week).
I would then like to group by this new column and get back the average return for each category.
Cheers
Assuming that your Day column is correct:
# ignore if already datetime
df.Date = pd.to_datetime(df.Date)
# hour block in the day
s = df.Date.dt.hour//6 + 1
# new column
df['group'] = df['Day'] + '_' + s.astype(str)
output:
0 Saturday_1
1 Saturday_2
2 Saturday_3
3 Saturday_4
4 Sunday_1
5 Sunday_2
6 Sunday_3
7 Sunday_4
8 Monday_1
9 Monday_2
10 Monday_3
11 Monday_4
12 Tuesday_1
13 Tuesday_2
14 Tuesday_3
15 Tuesday_4
16 Wednesday_1
17 Wednesday_2
Name: group, dtype: object
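To get the average return per category, the new group column can then be passed to groupby. A minimal sketch, reusing the Date, Day and Return column names from the question; the four rows are invented (two Saturdays, two periods each) so that every group contains more than one value.

```python
import pandas as pd

# Invented sample: two Saturdays, each with a 01:00 and a 07:00 observation.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-06-22 01:00:00", "2019-06-22 07:00:00",
        "2019-06-29 01:00:00", "2019-06-29 07:00:00",
    ]),
    "Day": ["Saturday", "Saturday", "Saturday", "Saturday"],
    "Return": [0.01, -0.02, 0.03, 0.04],
})

# Same labelling as in the answer above: 6-hour block number within the day.
block = df["Date"].dt.hour // 6 + 1
df["group"] = df["Day"] + "_" + block.astype(str)

# Average return for each day/period category.
avg_return = df.groupby("group")["Return"].mean()

print(avg_return)
```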