I have a dataset with 3 columns, and I want to add more dates to it:
Date                 temperature  humidity
2015-01-01 00:00:00  5.9          NA
2015-01-01 01:00:00  5.5          NA
⋮                    ⋮            ⋮
2015-01-01 23:00:00  7            NA
I want to append the hourly timestamps from 1 May to 31 July to the Date column, so the result looks something like this:
Date                 temperature  humidity
⋮                    ⋮            ⋮
2015-01-01 23:00:00  7            NA
2015-05-01 00:00:00  ..           NA
2015-05-01 01:00:00  ..           NA
⋮                    ⋮            ⋮
until I get to
Date                 temperature  humidity
⋮                    ⋮            ⋮
2015-07-31 23:00:00  ..           NA
I've tried:
import datetime
date = datetime.datetime(2015,3,31,23,0,0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    print(date)
Is there an easier way to do it?
Well, you have a start with the iteration using datetime.datetime and datetime.timedelta.
You need to accumulate the dates in a list rather than printing them:
listofdates=[]
date = datetime.datetime(2015,3,31,23,0,0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    listofdates.append(date)
Then create a new dataframe from the existing one (let's call it df) and this list of dates. To do so, you can use pd.concat, which builds one dataframe from two dataframes.
So you need a dataframe built from your new list of dates, with the same column name:
newlines = pd.DataFrame({'Date':listofdates})
Which gives
Date
2015-04-01 00:00:00
⋮
2015-04-02 07:00:00
(Note that it starts at 4/1 00:00, not 3/31 23:00, because you add the timedelta before appending)
We can concatenate your dataframe with this one (missing columns will be filled with NA) like this
newdf = pd.concat([df, newlines])
Last remarks that I kept for the end to avoid confusion:
I would have stored the timedelta once and for all, rather than creating one each time (it is not that expensive, but still)
So, altogether
import datetime
import pandas as pd
date=datetime.datetime(2015,3,31,23,0,0)
dt=datetime.timedelta(hours=1)
listofdates=[]
for i in range(32):
    date += dt
    listofdates.append(date)
newlines = pd.DataFrame({'Date':listofdates})
newdf = pd.concat([df, newlines])
For this kind of usage, you can build the list directly with a list comprehension
listofdates=[date+k*dt for k in range(1,33)]
Or using numpy
listofdates=date+np.arange(1,33)*dt
Which allows for a one-liner
newdf = pd.concat([df, pd.DataFrame({'Date':date+np.arange(1,33)*dt})])
But don't try to understand this one before you have understood the longer version described above
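As for the "easier way" asked about in the question, pandas can generate the hourly timestamps itself. The following is only a sketch: it assumes df looks like the table in the question (the temperature values here are placeholders), and it uses pd.date_range with the May to July bounds from the question instead of the 32-step loop.

```python
import pandas as pd

# Placeholder stand-in for the question's dataframe (values are illustrative only)
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=24, freq=pd.Timedelta(hours=1)),
    "temperature": [5.9, 5.5] + [7.0] * 22,
    "humidity": [float("nan")] * 24,
})

# Every hour from 1 May 00:00 through 31 July 23:00, both ends included
new_dates = pd.date_range(
    "2015-05-01 00:00", "2015-07-31 23:00", freq=pd.Timedelta(hours=1)
)

# Append the new rows; temperature and humidity are left as NaN for them
newdf = pd.concat([df, pd.DataFrame({"Date": new_dates})], ignore_index=True)
```

ignore_index=True renumbers the rows so the appended block does not reuse index labels 0, 1, 2, ... from the original frame.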
I have massive data from CSV which spans every hour for a whole year. It has not been difficult plotting the whole data (or specific data) through the whole year.
However, I would like to take a closer look at a single month (for example, plot just January or February), and for the life of me, I haven't found out how to do that.
Date Company1 Company2
2020-01-01 00:00:00 100 200
2020-01-01 01:00:00 110 180
2020-01-01 02:00:00 90 210
2020-01-01 03:00:00 100 200
.... ... ...
2020-12-31 21:00:00 100 200
2020-12-31 22:00:00 80 230
2020-12-31 23:00:00 120 220
All of the columns are correctly formatted, the datetime is correctly formatted. How can I slice or define exactly the period I want to plot?
You can extract the month portion of a pandas datetime using .dt.month on a datetime series. Then check if that is equal to the month in question:
df_january = df[df['Date'].dt.month == 1]
You can then plot using your df_january dataframe. N.B. this will pick up data from other years as well if your dataset expanded to cover other years.
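To avoid picking up other years, the month test can be combined with a year test, or the frame can be sliced by label once Date is the index. A minimal sketch, assuming hourly data shaped like the table in the question (the constant values are placeholders):

```python
import pandas as pd

# Placeholder hourly data for 2020 (values are illustrative only)
idx = pd.date_range("2020-01-01", "2020-12-31 23:00", freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Date": idx, "Company1": 100, "Company2": 200})

# Option 1: boolean mask on both year and month
jan_2020 = df[(df["Date"].dt.year == 2020) & (df["Date"].dt.month == 1)]

# Option 2: partial string indexing, which needs Date as a DatetimeIndex
jan_by_label = df.set_index("Date").loc["2020-01"]
```

Either result can then be plotted, for example with jan_by_label.plot().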
@WakemeUpNow had the solution I hadn't noticed: defining xlim while plotting did the trick.
df.plot(x='Date', y='Company1', xlim=('2020-01-01 00:00:00', '2020-12-31 23:00:00'))
plt.show()
I am trying to return rows from pnr_df that data_df's dates fall in between. For example, the first date in data_df is 2022-09-01 and the airport is BDL. I would like to return the row in pnr_df that matches the airport name, BDL, and the data_df's date falls in between date_from and date_to dates.
data_df
index,airport,unit price,date
0,BDL ,3.8281,2022-09-01 00:00:00
1,BDL ,3.8281,2022-09-01 00:00:00
2,BDL ,3.8281,2022-09-01 00:00:00
3,MSY ,3.7206,2022-09-01 00:00:00
4,PIT ,3.7426,2022-09-01 00:00:00
5,PIT ,3.7426,2022-09-01 00:00:00
6,PIT ,3.7426,2022-09-01 00:00:00
pnr_df
index,PNR airport,PNR unit price,date from,date to
0,BDL,3.8281,2022-08-30 00:00:00,2022-09-05 00:00:00
1,BNA,3.51161,2022-08-01 00:00:00,2022-08-31 00:00:00
2,CAK,3.7386,2022-08-30 00:00:00,2022-09-05 00:00:00
3,CHS,3.49361,2022-08-01 00:00:00,2022-08-31 00:00:00
4,CMH,3.6256,2022-08-30 00:00:00,2022-09-05 00:00:00
5,HPN,3.7654,2022-08-30 00:00:00,2022-09-05 00:00:00
I have tried to merge the two tables, but that does not solve my date problem. I guess I need a way to match the airport names against pnr_df and then check whether the data_df date falls between that row's date_from and date_to.
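One possible approach, sketched here under assumptions, is to merge on the airport first and then keep only the rows whose date lies inside the range. The frames below reuse a few rows from the tables above; the column names are taken from the CSV headers, and the str.strip() call assumes the trailing space seen in "BDL " is really present in the data.

```python
import pandas as pd

data_df = pd.DataFrame({
    "airport": ["BDL ", "BDL ", "BDL ", "MSY ", "PIT "],
    "unit price": [3.8281, 3.8281, 3.8281, 3.7206, 3.7426],
    "date": pd.to_datetime(["2022-09-01"] * 5),
})
pnr_df = pd.DataFrame({
    "PNR airport": ["BDL", "BNA", "CAK"],
    "PNR unit price": [3.8281, 3.51161, 3.7386],
    "date from": pd.to_datetime(["2022-08-30", "2022-08-01", "2022-08-30"]),
    "date to": pd.to_datetime(["2022-09-05", "2022-08-31", "2022-09-05"]),
})

# Remove stray whitespace so "BDL " matches "BDL"
data_df["airport"] = data_df["airport"].str.strip()

# Pair every data row with the PNR rows for the same airport
merged = data_df.merge(pnr_df, left_on="airport", right_on="PNR airport", how="inner")

# Keep only pairs where the date lies inside [date from, date to]
in_range = merged["date"].between(merged["date from"], merged["date to"])
matches = merged[in_range]
```

If an airport has many PNR rows, the intermediate merge can get large, since every data row is paired with every range for that airport before filtering.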
I'm trying to print a dataframe in Jupyter with all datetimes for 2/29/2020 omitted. When I ran the conditional statement in the top cell of the picture linked below and displayed the dataframe in the bottom cell (showing all datetimes after 2/28/2020 22:00:00), only the row for the first hour of the day (2/29/2020 00:00:00) was omitted, not the rows for 2/29/2020 01:00:00 -> 2/29/2020 23:00:00 as I wanted. How can I change the conditional statement in the top cell so that all of the datetimes for 2/29/2020 disappear?
To omit all datetimes of 2/29/2020, you need to first convert the datetimes to dates in your comparison.
Change:
post_retrofit[post_retrofit['Unit Datetime'] != date(2020, 2, 29)]
To:
post_retrofit[post_retrofit['Unit Datetime'].dt.date != datetime(2020, 2, 29).date()]
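The same idea can be tried on a small frame. This sketch assumes 'Unit Datetime' already has a datetime dtype; the sample timestamps are made up for illustration. It uses .dt.normalize(), which resets each timestamp to midnight, as an alternative to .dt.date.

```python
import pandas as pd

# Illustrative hourly data spanning the leap day
post_retrofit = pd.DataFrame({
    "Unit Datetime": pd.date_range(
        "2020-02-28 22:00", "2020-03-01 01:00", freq=pd.Timedelta(hours=1)
    ),
})

leap_day = pd.Timestamp("2020-02-29")

# normalize() drops the time of day, so every hour of 2/29 compares equal to leap_day
without_leap_day = post_retrofit[
    post_retrofit["Unit Datetime"].dt.normalize() != leap_day
]
```

Only the two rows before and the two rows after 2/29 remain in without_leap_day.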
Your question is not clear.
Let's assume I have the following:
Data
post_retrofit_without_2_29=pd.DataFrame({'Unit Datetime':['2020-02-28 23:00:00','2020-02-28 22:00:00','2020-02-29 22:00:00']})
print(post_retrofit_without_2_29)
Unit Datetime
0 2020-02-28 23:00:00
1 2020-02-28 22:00:00
2 2020-02-29 22:00:00
Solution
To filter by date, I have to coerce the datetime to a date as follows:
post_retrofit_without_2_29['Unit Date']=pd.to_datetime(post_retrofit_without_2_29['Unit Datetime']).dt.strftime("%y-%m-%d")
print(post_retrofit_without_2_29)
Unit Datetime Unit Date
0 2020-02-28 23:00:00 20-02-28
1 2020-02-28 22:00:00 20-02-28
2 2020-02-29 22:00:00 20-02-29
Filter
post_retrofit_without_2_29[post_retrofit_without_2_29['Unit Date']>'20-02-28']
Unit Datetime Unit Date
2 2020-02-29 22:00:00 20-02-29
You can also create a pivot_table() with the datetimes as the index and filter on that index. Normalize the index to midnight before comparing; otherwise only the 00:00:00 row is dropped.
# assumes 'Unit Datetime' has a datetime dtype
post_retrofit_without_2_29_pivot = pd.pivot_table(data=post_retrofit_without_2_29, index=post_retrofit_without_2_29['Unit Datetime'])
post_retrofit_without_2_29_pivot.loc[post_retrofit_without_2_29_pivot.index.normalize() != pd.to_datetime("2020-02-29")]
I know this is a bit lengthy, but it's simple to understand.
Hope this answer helps :}
I have the following column
Time
2:00
00:13
1:00
00:24
in object format (strings). This time refers to hours and minutes ago from a time that I need to use as a start: 8:00 (it might change; in this example is 8:00).
Since the times in the column Time are referring to hours/minutes ago, what I would like to expect should be
Time
6:00
07:47
7:00
07:36
calculated as time difference (e.g. 8:00 - 2:00).
However, I am having difficulty doing this calculation and transforming the result into a datetime (keeping only hours and minutes).
I hope you can help me.
Since the Time column contains only Hour:Minute, I suggest using timedelta instead of datetime:
df['Time'] = pd.to_timedelta(df.Time+':00')
df['Start_Time'] = pd.to_timedelta('8:00:00') - df['Time']
Output:
Time Start_Time
0 02:00:00 06:00:00
1 00:13:00 07:47:00
2 01:00:00 07:00:00
3 00:24:00 07:36:00
You can do it using pd.to_datetime.
ref = pd.to_datetime('08:00') #here define the hour of reference
s = ref-pd.to_datetime(df['Time'])
print (s)
0 06:00:00
1 07:47:00
2 07:00:00
3 07:36:00
Name: Time, dtype: timedelta64[ns]
This returns a Series, which can be changed to a dataframe with s.to_frame(), for example
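Since the question asks to keep only hours and minutes, the resulting timedeltas can be turned into HH:MM strings. This is a sketch built on the timedelta approach above; the start of 8:00 comes from the question, and the string formatting step is an addition that is not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["2:00", "00:13", "1:00", "00:24"]})

start = pd.to_timedelta("8:00:00")             # reference time from the question
elapsed = pd.to_timedelta(df["Time"] + ":00")  # "2:00" -> "2:00:00" -> 2 hours
result = start - elapsed

# Convert each timedelta to whole minutes, then format as zero-padded HH:MM
total_minutes = result.dt.total_seconds().astype(int) // 60
hours = (total_minutes // 60).astype(str).str.zfill(2)
minutes = (total_minutes % 60).astype(str).str.zfill(2)
df["Start_Time"] = hours + ":" + minutes
```

The Start_Time column holds plain strings, so it is meant for display; keep result if further time arithmetic is needed.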
I have a DataFrame like this:
Date X
....
2014-01-02 07:00:00 16
2014-01-02 07:15:00 20
2014-01-02 07:30:00 21
2014-01-02 07:45:00 33
2014-01-02 08:00:00 22
....
2014-01-02 23:45:00 0
....
1)
My "Date" column is a datetime and has values for every 15 minutes of the day.
What I want is to remove ALL rows where the time is NOT between 08:00 and 18:00.
2)
Some days are missing in the data. How could I add the missing days to my dataframe and fill them with the value 0 for X?
My approach: create a new Series between two dates with a 15-minute frequency and concat my X column with the newly created Series. Is that right?
Edit:
Problem for my second Question:
#create new full DF without missing dates and reindex
full_range = pandas.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range,fill_value=0)
df.head()
Output:
Date X
2014-01-02 00:00:00 1970-01-01 0
2014-01-02 00:15:00 1970-01-01 0
2014-01-02 00:30:00 1970-01-01 0
2014-01-02 00:45:00 1970-01-01 0
2014-01-02 01:00:00 1970-01-01 0
That didn't work, as you can see.
The "Date" column is not an index, by the way. I need it as a column in my df.
And why did it use "1970-01-01"? 1970 as the year makes no sense to me.
What I want is to remove ALL Rows where the time is NOT between 08:00
and 18:00 o'clock.
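Create a mask with datetime.time. Example:
from datetime import time
import numpy as np
import pandas as pd
# Example data: a 15-minute index with placeholder values
idx = pd.date_range('2014-01-02', freq='15min', periods=10000)
df = pd.DataFrame({'x': np.empty(idx.shape[0])}, index=idx)
t1 = time(8); t2 = time(18)
times = df.index.time  # array of datetime.time objects
mask = (times >= t1) & (times <= t2)  # keep 08:00 through 18:00, inclusive
df = df.loc[mask]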
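pandas also has a built-in for this kind of filter, DataFrame.between_time, which works on a DatetimeIndex and includes both end times by default. A sketch for the case from the question, where Date is a regular column (the X values are placeholders):

```python
import numpy as np
import pandas as pd

# Two days of 15-minute data with Date as an ordinary column (placeholder X values)
idx = pd.date_range("2014-01-02", freq="15min", periods=96 * 2)
df = pd.DataFrame({"Date": idx, "X": np.arange(len(idx))})

kept = (
    df.set_index("Date")              # between_time needs a DatetimeIndex
      .between_time("08:00", "18:00") # keep 08:00 through 18:00, inclusive
      .reset_index()                  # turn Date back into a column
)
```

Each day contributes 41 rows (08:00, 08:15, ..., 18:00), so kept has 82 rows here.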
Some days are missing in the data...how could I put the missing days
in my DataFrame and fill them with the value 0 as X?
Build a date range that doesn't have missing data with pd.date_range() (see above).
Call reindex() on df and specify fill_value=0.
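The two steps above only work if the datetimes are the index. In the attempt from the question, Date is a regular column, so reindex() matched the new timestamps against the default integer index, found nothing, and filled every cell with 0; in a datetime column, 0 is displayed as 1970-01-01 (the Unix epoch). A sketch of the fix on a tiny made-up frame with a gap at 08:30:

```python
import pandas as pd

# Small illustrative frame: 08:30 and 09:00 are missing
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-01-02 08:00", "2014-01-02 08:15", "2014-01-02 08:45"]
    ),
    "X": [16, 20, 33],
})

full_range = pd.date_range("2014-01-02 08:00", "2014-01-02 09:00", freq="15min")

filled = (
    df.set_index("Date")                   # reindex matches on the index
      .reindex(full_range, fill_value=0)   # missing timestamps get X = 0
      .rename_axis("Date")                 # restore the index name
      .reset_index()                       # Date becomes a column again
)
```

After reset_index(), Date is an ordinary column again, as the question requires.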
Answering your questions in comments:
np.empty creates an uninitialized array (its contents are arbitrary). I was just using it to build some "example" data that is basically garbage. Here idx.shape is the shape of your index, a one-element tuple (length,). So np.empty(idx.shape[0]) creates an uninitialized 1d array with the same length as idx.
times = df.index.time creates a variable (a NumPy array) called times. df.index.time is the time for each element in the index of df. You can explore this yourself by just breaking the code down in pieces and experimenting with it on your own.