I have the following dataframe, named 'ORDdataM', with a DateTimeIndex column 'date', and a price point column 'ORDprice'. The date column has no timezone associated with it (and is naive) but is actually in 'Australia/ACT'. I want to convert it into 'America/New_York' time.
                     ORDprice
date
2021-02-23 18:09:00     24.01
2021-02-23 18:14:00     23.91
2021-02-23 18:19:00     23.98
2021-02-23 18:24:00     24.00
2021-02-23 18:29:00     24.04
...                       ...
2021-02-25 23:44:00     23.92
2021-02-25 23:49:00     23.88
2021-02-25 23:54:00     23.92
2021-02-25 23:59:00     23.91
2021-02-26 00:09:00     23.82
The line below is one that I have played around with quite a bit, but I cannot figure out what is erroneous. The only error message is:
KeyError: 'date'
ORDdataM['date'] = ORDdataM['date'].dt.tz_localize('Australia/ACT').dt.tz_convert('America/New_York')
I have also tried
ORDdataM.date = ORDdataM.date.dt.tz_localize('Australia/ACT').dt.tz_convert('America/New_York')
What is the issue here?
Your date is the index, not a column. Try:
df.index = df.index.tz_localize('Australia/ACT').tz_convert('America/New_York')
df
# ORDprice
#date
#2021-02-23 02:09:00-05:00 24.01
#2021-02-23 02:14:00-05:00 23.91
#2021-02-23 02:19:00-05:00 23.98
#2021-02-23 02:24:00-05:00 24.00
#2021-02-23 02:29:00-05:00 24.04
#2021-02-25 07:44:00-05:00 23.92
#2021-02-25 07:49:00-05:00 23.88
#2021-02-25 07:54:00-05:00 23.92
#2021-02-25 07:59:00-05:00 23.91
#2021-02-25 08:09:00-05:00 23.82
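Alternatively, if you would rather keep date as a regular column (a minimal sketch, assuming the index is named date), reset the index first so your original line works:
ORDdataM = ORDdataM.reset_index()
ORDdataM['date'] = (ORDdataM['date']
                    .dt.tz_localize('Australia/ACT')
                    .dt.tz_convert('America/New_York'))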
I have a data frame like so
Date_Time            Level
2017-08-08 23:55:01  239.0
2017-08-08 23:50:01  242.0
2017-08-08 23:45:01  246.0
2017-08-08 23:40:01  250.0
2017-08-08 23:35:01  254.0
...                    ...
2017-07-26 00:23:57   72.0
2017-07-26 00:18:57   67.0
2017-07-26 00:13:57   64.0
2017-07-26 00:08:57   64.0
2017-07-26 00:03:57   65.0
I want to calculate the average level for each day, split into waking hours and overnight hours, like this:
Date        Time                 AvgLevel
2017-08-08  00:00:00 - 06:00:00  178
2017-08-08  06:00:01 - 23:59:59  190
2017-09-08  00:00:00 - 06:00:00  174
2017-09-08  06:00:01 - 23:59:59  200
I've already tried splitting into separate tables and using for loops, but that uses too much memory and takes too much time.
You can do the following:
import pandas as pd

df = pd.read_csv("data.csv", sep=";")
print(df)

df["Date_Time"] = pd.to_datetime(df["Date_Time"])
df["Date"] = df["Date_Time"].dt.date
df["Time"] = df["Date_Time"].dt.time

# Label each row as "Waking" (06:00:00 - 23:59:59) or "Overnight".
df["Time_Period"] = "Overnight"
df.loc[(df["Time"] >= pd.to_datetime("06:00:00").time()) &
       (df["Time"] <= pd.to_datetime("23:59:59").time()), "Time_Period"] = "Waking"

grouped = df.groupby(["Date", "Time_Period"])["Level"].mean().reset_index()
grouped = grouped.rename(columns={"Time_Period": "Time", "Level": "AvgLevel"})
grouped["Time"] = grouped["Time"].map({
    "Waking": "06:00:01 - 23:59:59",
    "Overnight": "00:00:00 - 06:00:00"
})
print(grouped)
Basically, you classify each entry as overnight or waking and group on that.
This results in (I assume here that the expected output you show is for the entire dataframe):
         Date                 Time  AvgLevel
0  2017-07-26  00:00:00 - 06:00:00      66.4
1  2017-08-08  06:00:01 - 23:59:59     246.2
You can use np.where to differentiate between waking hours and overnight hours.
Creating sample data
import numpy as np
import pandas as pd

data = {
'Date_Time': [
'2017-08-08 00:00:00', '2017-08-08 23:50:01', '2017-08-08 06:45:01',
'2017-08-08 06:00:00', '2017-08-08 00:35:01',
'2017-07-26 00:23:57', '2017-07-26 00:18:57', '2017-07-26 07:13:57',
'2017-07-26 00:08:57', '2017-07-26 07:03:57'
],
'Level': [239.0, 242.0, 246.0, 250.0, 254.0, 72.0, 67.0, 64.0, 64.0, 65.0]
}
df = pd.DataFrame(data, columns=['Date_Time', 'Level'])
df['Date_Time'] = pd.to_datetime(df['Date_Time'])
df = df.set_index('Date_Time')
print(df)
Level
Date_Time
2017-08-08 00:00:00 239.0
2017-08-08 23:50:01 242.0
2017-08-08 06:45:01 246.0
2017-08-08 06:00:00 250.0
2017-08-08 00:35:01 254.0
2017-07-26 00:23:57 72.0
2017-07-26 00:18:57 67.0
2017-07-26 07:13:57 64.0
2017-07-26 00:08:57 64.0
2017-07-26 07:03:57 65.0
Creating a mask that is True for overnight hours (00:00:00 to 06:00:00)
mask = (df.index.time >= pd.to_datetime('00:00:00').time()) & (df.index.time <= pd.to_datetime('06:00:00').time())
df['Period'] = np.where(mask, '00:00:00 - 06:00:00', '06:00:01 - 23:59:59')
df
Level Period
Date_Time
2017-08-08 00:00:00 239.0 00:00:00 - 06:00:00
2017-08-08 23:50:01 242.0 06:00:01 - 23:59:59
2017-08-08 06:45:01 246.0 06:00:01 - 23:59:59
2017-08-08 06:00:00 250.0 00:00:00 - 06:00:00
2017-08-08 00:35:01 254.0 00:00:00 - 06:00:00
2017-07-26 00:23:57 72.0 00:00:00 - 06:00:00
2017-07-26 00:18:57 67.0 00:00:00 - 06:00:00
2017-07-26 07:13:57 64.0 06:00:01 - 23:59:59
2017-07-26 00:08:57 64.0 00:00:00 - 06:00:00
2017-07-26 07:03:57 65.0 06:00:01 - 23:59:59
Group by the date part of the index and the Period column, and calculate the average Level
result = df.groupby([df.index.date, 'Period'])['Level'].mean().reset_index()
result.columns = ['Date', 'Time', 'AvgLevel']
result
Date Time AvgLevel
0 2017-07-26 00:00:00 - 06:00:00 67.666667
1 2017-07-26 06:00:01 - 23:59:59 64.500000
2 2017-08-08 00:00:00 - 06:00:00 247.666667
3 2017-08-08 06:00:01 - 23:59:59 244.000000
Using pandas' freq option, mean, sum, etc. can be calculated over equal portions of time: freq='H' for hourly, freq='12H' for 12-hourly, freq='D' for daily, and freq='BH' for business-hour calculations.
An example is below:
avg_12_hours = df.groupby(pd.Grouper(freq='12H', key='Date_Time'))['Level'].mean()
Since you are asking for a calculation period that is not equally split, you need some custom calculation, such as the np.where approach above.
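For completeness, a minimal runnable sketch of the pd.Grouper line (the Date_Time column name is from the question; the small sample frame is an assumption for illustration):
import pandas as pd

df = pd.DataFrame({
    'Date_Time': pd.to_datetime(['2017-08-08 03:55:01', '2017-08-08 05:50:01',
                                 '2017-08-08 13:45:01', '2017-08-08 23:40:01']),
    'Level': [239.0, 242.0, 246.0, 250.0],
})

# Mean Level per 12-hour bucket; buckets are aligned to midnight by default.
avg_12_hours = df.groupby(pd.Grouper(freq='12H', key='Date_Time'))['Level'].mean()
print(avg_12_hours)
# 2017-08-08 00:00:00    240.5
# 2017-08-08 12:00:00    248.0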
ID START DATE END DATE
5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
I have a dataframe (this is just a part of it). When the ID value of one row equals the ID value of the next row, I want to check whether the dates of the two rows overlap, and if so create a new row that keeps the longest date span and drops the old ones, i.e. when the ID is 5193 I want my new row to be ID: 5193, START DATE: 2017-02-08, END DATE: 2017-04-10.
Is that even doable? I tried to approach it with the middle point of the dates but didn't get any results. Any suggestion would be highly appreciated.
Try with groupby and agg:
import pandas as pd
a = """5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()],
                  columns=["ID", "START DATE", "END DATE"])

# Group rows whose start dates fall in the same year-month and keep the
# earliest start and the latest end (ISO date strings sort correctly as text).
df = (df.assign(part_start_date=lambda x: x["START DATE"].str[:7])
        .groupby(["ID", "part_start_date"])
        .agg({"START DATE": "min", "END DATE": "max"})
        .reset_index()
        .drop("part_start_date", axis=1))
# Output. The longest span is where the start date is the min and the end date is the max.
ID START DATE END DATE
0 5188 2018-10-01 2018-11-30
1 5191 2019-02-28 2019-04-20
2 5191 2020-10-01 2020-11-20
3 5193 2017-02-08 2017-04-10
4 5193 2021-04-01 2021-05-15
5 5194 2019-05-15 2019-05-31
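Note that this groups on the year-month of the start date, so it only merges rows that start in the same month. A more general interval merge (a sketch of a different technique, not the answer's method) sorts by start date and opens a new group whenever a row starts after the running maximum end date of its ID:
import pandas as pd

# df as parsed from the string above, before the groupby.
df["START DATE"] = pd.to_datetime(df["START DATE"])
df["END DATE"] = pd.to_datetime(df["END DATE"])
df = df.sort_values(["ID", "START DATE"]).reset_index(drop=True)

# A row opens a new group when its ID changes or when it starts after the
# latest end date seen so far within the same ID.
running_end = df.groupby("ID")["END DATE"].cummax().shift()
new_group = (df["ID"] != df["ID"].shift()) | (df["START DATE"] > running_end)
df["group"] = new_group.cumsum()

merged = (df.groupby(["ID", "group"], as_index=False)
            .agg({"START DATE": "min", "END DATE": "max"})
            .drop("group", axis=1))
print(merged)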
I have the following data set in a pandas dataframe:
print(data)
Result:
Open High Low Close Adj Close Volume
Date
2018-05-25 12.70 12.73 12.48 12.61 12.610000 1469800
2018-05-24 12.99 13.08 12.93 12.98 12.980000 814800
2018-05-23 13.19 13.30 13.06 13.12 13.120000 1417500
2018-05-22 13.46 13.57 13.25 13.27 13.270000 1189000
2018-05-18 13.41 13.44 13.36 13.38 13.380000 986300
2018-05-17 13.19 13.42 13.19 13.40 13.400000 1056200
2018-05-16 13.01 13.14 13.01 13.12 13.120000 481300
If I print just a single column, say Low, it shows with the date index:
print(data.Low)
Result:
Date
2018-05-25 12.48
2018-05-24 12.93
2018-05-23 13.06
2018-05-22 13.25
2018-05-18 13.36
2018-05-17 13.19
2018-05-16 13.01
Is there a way to slice/print just the price values, so the output will be like:
12.48
12.93
13.06
13.25
13.36
13.19
13.01
In pandas, Series and DataFrames always need some index values.
A default RangeIndex can be created by:
print(data.reset_index(drop=True).Low)
But if you need to write only the values to a file, as a column with no index and no header:
data.Low.to_csv(file, index=False, header=False)
If you need to convert the column to a list:
print(data.Low.tolist())
[12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]
And for a 1d numpy array:
print(data.Low.values)
[12.48 12.93 13.06 13.25 13.36 13.19 13.01]
If you want an Mx1 array:
print(data[['Low']].values)
[[12.48]
[12.93]
[13.06]
[13.25]
[13.36]
[13.19]
[13.01]]
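If the goal is only to print the values without the index, Series.to_string also works (a small addition beyond the answer above):
print(data.Low.to_string(index=False))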
I'm new to Python and my English is not so good, so I'll try to explain my problem with the example below.
In :ds # is my dataframe
Out :DateStarted DateCompleted DayStarted DayCompleted \
1460 2017-06-12 14:03:32 2017-06-12 14:04:07 2017-06-12 2017-06-12
14445 2017-06-13 13:39:16 2017-06-13 13:40:32 2017-06-13 2017-06-13
14109 2017-06-21 10:25:36 2017-06-21 10:32:17 2017-06-21 2017-06-21
16652 2017-06-27 15:44:28 2017-06-27 15:44:41 2017-06-27 2017-06-27
30062 2017-07-05 09:49:01 2017-07-05 10:04:00 2017-07-05 2017-07-05
22357 2017-08-31 09:06:00 2017-08-31 09:10:31 2017-08-31 2017-08-31
39117 2017-09-08 08:43:07 2017-09-08 08:44:51 2017-09-08 2017-09-08
41903 2017-09-15 12:54:40 2017-09-15 14:00:06 2017-09-15 2017-09-15
74633 2017-09-27 12:41:09 2017-09-27 13:16:04 2017-09-27 2017-09-27
69315 2017-10-23 08:25:28 2017-10-23 08:26:09 2017-10-23 2017-10-23
87508 2017-10-30 12:19:19 2017-10-30 12:19:45 2017-10-30 2017-10-30
86828 2017-11-03 12:20:09 2017-11-03 12:24:56 2017-11-03 2017-11-03
89877 2017-11-06 13:52:05 2017-11-06 13:52:50 2017-11-06 2017-11-06
94970 2017-11-07 08:09:53 2017-11-07 08:10:15 2017-11-07 2017-11-07
94866 2017-11-28 14:38:14 2017-11-30 07:51:04 2017-11-28 2017-11-30
DailyTotalActiveTime diff
1460 NaN 35.0
14445 NaN 76.0
14109 NaN 401.0
16652 NaN 13.0
30062 NaN 899.0
22357 NaN 271.0
39117 NaN 104.0
41903 NaN 3926.0
74633 NaN 2095.0
69315 NaN 41.0
87508 NaN 26.0
86828 NaN 287.0
89877 NaN 45.0
94970 NaN 22.0
94866 NaN 148370.0
In the DailyTotalActiveTime column, I want to calculate how much active time each specific day has in total. The diff column is in seconds.
I tried this, but I had no results:
for i in ds['diff']:
if i <= 86400:
ds['DailyTotalActiveTime']==i
else:
ds['DailyTotalActiveTime']==86400
ds['DailyTotalActiveTime']+1 == i-86400
What can I do? Again, sorry for the explanation.
You should try = (assignment) instead of == (comparison).
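Even with that fixed, assigning to the whole column inside the loop overwrites it on every pass. A vectorized sketch of what the loop seems to intend, capping each day at 86400 seconds (the spillover of multi-day rows into following days still needs the row-splitting shown in the next answer):
# Cap each row's active time at one day (86400 s); the remainder of
# multi-day rows (e.g. the 148370 s row) belongs to later days.
ds['DailyTotalActiveTime'] = ds['diff'].clip(upper=86400)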
To get you halfway there, you could do something like the following (I am sure there must be a simpler way, but I can't see it right now):
df['datestarted'] = pd.to_datetime(df['datestarted'])
df['datecompleted'] = pd.to_datetime(df['datecompleted'])
df['daystarted'] = df['datestarted'].dt.date
df['daycompleted'] = df['datecompleted'].dt.date
df['Date'] = df['daystarted']  # This is the unique date per row.

# For rows spanning several days, add one copy of the row per extra day.
extra = []
for row in df.itertuples():
    if (row.daycompleted - row.daystarted) > pd.Timedelta(days=0):
        for i in range(1, (row.daycompleted - row.daystarted).days + 1):
            df2 = pd.DataFrame([row]).drop('Index', axis=1)
            df2['Date'] = df2['Date'] + pd.Timedelta(days=i)
            extra.append(df2)
# DataFrame.append was removed in pandas 2.0; concat does the same job.
df = pd.concat([df] + extra)
I have a simple dataframe with typical OHLC values. I want to calculate daily 52 weeks high/low (or other time range) from it and put the result into a dataframe, so that I can track the daily movement of all record high/low.
For example, if the time range is just 3-day, the 3-day high/low would be:
(3-Day High: Maximum 'High' value in the last 3 days)
Out[21]:
Open High Low Close Volume 3-Day-High 3-Day-Low
Date
2015-07-01 273.6 273.6 273.6 273.6 0 273.6 273.6
2015-07-02 276.0 276.0 267.0 268.6 15808300 276.0 267.0
2015-07-03 268.8 269.0 256.6 259.8 20255200 276.0 256.6
2015-07-06 261.0 261.8 223.0 235.0 53285100 276.0 223.0
2015-07-07 237.2 237.8 218.4 222.0 38001700 269.0 218.4
2015-07-08 207.0 219.4 196.0 203.4 48558100 261.8 196.0
2015-07-09 207.4 233.8 204.2 233.6 37835900 237.8 196.0
2015-07-10 235.4 244.8 233.8 239.2 23299900 244.8 196.0
Is there any simple way to do it and how? Thanks guys!
You can use rolling_max and rolling_min:
>>> df["3-Day-High"] = pd.rolling_max(df.High, window=3, min_periods=1)
>>> df["3-Day-Low"] = pd.rolling_min(df.Low, window=3, min_periods=1)
>>> df
Open High Low Close Volume 3-Day-High 3-Day-Low
Date
2015-07-01 273.6 273.6 273.6 273.6 0 273.6 273.6
2015-07-02 276.0 276.0 267.0 268.6 15808300 276.0 267.0
2015-07-03 268.8 269.0 256.6 259.8 20255200 276.0 256.6
2015-07-06 261.0 261.8 223.0 235.0 53285100 276.0 223.0
2015-07-07 237.2 237.8 218.4 222.0 38001700 269.0 218.4
2015-07-08 207.0 219.4 196.0 203.4 48558100 261.8 196.0
2015-07-09 207.4 233.8 204.2 233.6 37835900 237.8 196.0
2015-07-10 235.4 244.8 233.8 239.2 23299900 244.8 196.0
Note that in agreement with your example, this uses the last three recorded days, regardless of the size of any gap between those rows (such as between 07-03 and 07-06).
The above functions have been removed in recent versions of pandas.
Use this instead:
Series.rolling(min_periods=1, window=252, center=False).max()
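Applied to the question's example (a sketch; 252 trading days is roughly 52 weeks):
df["3-Day-High"] = df["High"].rolling(window=3, min_periods=1).max()
df["3-Day-Low"] = df["Low"].rolling(window=3, min_periods=1).min()
# 52-week high over the last 252 trading days (row-based window):
df["52-Wk-High"] = df["High"].rolling(window=252, min_periods=1).max()
# On a DatetimeIndex a calendar-based window counts days rather than rows:
# df["High"].rolling("365D", min_periods=1).max()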
You can try this (note that it returns a single high/low for the most recent three rows, not a rolling series):
three_days = df.index[-3:]
maxHigh = df['High'][three_days].max()
minLow = df['Low'][three_days].min()