Plotting a timestamp column versus another column in Python

I have an .xls file with a timestamp column in the following format:
2018-04-01 00:01:45
2018-04-01 00:16:45
2018-04-01 00:31:46
2018-04-01 00:46:45
2018-04-01 01:01:46
2018-04-01 01:16:45
2018-04-01 01:31:50
2018-04-01 01:46:45
2018-04-01 02:01:46
I have another column in the same .xls file, named temperature, in the following format:
34
34
34
33
33
33
33
33
33
33
33
33
33
33
33
33
33
33
33
I want to plot the values versus time. I tried to plot it, but I am having issues because the timestamp is not being read correctly.
My code:
# changing the timestamp data from object dtype to datetime
w = df['Timestamp']
# the column name "Timestamp" was read in as the first value, so drop it
w = w.drop(w.index[0])
# converting the timestamp values from object to datetime
w = pd.to_datetime(w)
area = (12 * np.random.rand(N))**2 # 0 to 15 point radii
plt.xlabel('Temperature')
plt.ylabel('DateTime')
plt.title('Temperature and DateTime Relation')
plt.scatter(t, w, s=area, c='purple', alpha=0.5)
plt.show()
It gives me the error "TypeError: invalid type promotion".

I believe you need to_datetime with the format parameter first, and then for 15-minute data add resample with some aggregation function like mean:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')
s = df.resample('15Min', on='Timestamp')['Temperature'].mean()
s.plot()
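A self-contained sketch of this approach, using a small inline stand-in for the spreadsheet (with a real file you would load it via pd.read_excel; the column names Timestamp and Temperature are taken from the question):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# inline stand-in for the .xls data; a real file would be loaded with
# df = pd.read_excel('file.xls')
df = pd.DataFrame({
    'Timestamp': ['2018-04-01 00:01:45', '2018-04-01 00:16:45',
                  '2018-04-01 00:31:46', '2018-04-01 00:46:45'],
    'Temperature': [34, 34, 34, 33],
})

# parse the strings into real datetimes with an explicit format
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')

# average into 15-minute buckets and plot temperature over time
s = df.resample('15min', on='Timestamp')['Temperature'].mean()
ax = s.plot()
ax.set_xlabel('DateTime')
ax.set_ylabel('Temperature')
plt.show()
```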

Related

strftime is not recognizing the real datetime

I have a dataframe like this:
df = pd.DataFrame({"DateTime":["26/06/2014 22:05:16",
"25/06/2014 22:05:56",
"01/07/2014 22:05:30",
"01/08/2014 19:04:23"],
"Data":[20, 31, 25, 44]})
df
Out[9]:
DateTime Data
0 26/06/2014 22:05:16 20
1 25/06/2014 22:05:56 31
2 01/07/2014 22:05:30 25
3 01/08/2014 19:04:23 44
I would like to convert my DateTime column to datetime64 with a specified format. The original data is DAY/MONTH/YEAR, and I would like to output it as YEAR-MONTH-DAY. I tried this:
df["DateTime"] = pd.to_datetime(df["DateTime"])
df["DateTime"] = df["DateTime"].dt.strftime('%Y-%m-%d %H:%M:%S')
df
Out[11]:
DateTime Data
0 2014-06-26 22:05:16 20
1 2014-06-25 22:05:56 31
2 2014-01-07 22:05:30 25
3 2014-01-08 19:04:23 44
The first two dates are fine, but the last two did not convert correctly: the month became the day. It should be like this:
2 2014-07-01 22:05:30 25
3 2014-08-01 19:04:23 44
Anyone could show me the correct way to convert this datetime?
By default, pd.to_datetime parses ambiguous dates as MM/DD. Since your data is DD/MM, you should tell to_datetime to parse the day first with dayfirst=True:
df['DateTime'] = pd.to_datetime(df["DateTime"], dayfirst=True).dt.strftime('%Y-%m-%d %H:%M:%S')
When converting the datetime, specify dayfirst as True:
df["DateTime"] = pd.to_datetime(df["DateTime"], dayfirst=True)
df["DateTime"] = df["DateTime"].dt.strftime('%Y-%m-%d %H:%M:%S')
df
DateTime Data
0 2014-06-26 22:05:16 20
1 2014-06-25 22:05:56 31
2 2014-07-01 22:05:30 25
3 2014-08-01 19:04:23 44
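If the layout is always the same, an explicit format string is an equivalent alternative to dayfirst; it leaves no ambiguity about day/month order. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"DateTime": ["26/06/2014 22:05:16", "25/06/2014 22:05:56",
                                "01/07/2014 22:05:30", "01/08/2014 19:04:23"],
                   "Data": [20, 31, 25, 44]})

# %d/%m/%Y says explicitly that the day comes before the month
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M:%S")
print(df["DateTime"].dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
# ['2014-06-26 22:05:16', '2014-06-25 22:05:56',
#  '2014-07-01 22:05:30', '2014-08-01 19:04:23']
```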

Find missing hours from one column and save as txt file in Python

Given a dataset as follows:
   date              NO2  SO2  O3
0  2018/11/14 10:00    9   25  80
1  2018/11/14 12:00    9   26  88
2  2018/11/14 13:00    8   26  88
3  2018/11/14 14:00    8   34  88
4  2018/11/14 15:00    8   37  89
5  2018/11/14 17:00    8   72  40
6  2018/11/14 18:00    8   56  50
7  2018/11/14 19:00    7   81  22
I would like to find the missing hours in the date column and save these missing dates as missing_date.txt.
My code:
df = df.set_index(pd.to_datetime(df['date']))
df = df.sort_index()
df = df.drop(columns=['date'])
df = df.resample('H').first().fillna(np.nan)
missing = df[df['NO2'].isnull()]
np.savetxt('./missing_date.txt', missing.index.to_series(), fmt='%s')
Out:
2018-11-14T11:00:00.000000000
2018-11-14T16:00:00.000000000
The problems:
the code is not concise and could probably be improved;
the date format is not as expected; it should be: 2018/11/14 11:00, 2018/11/14 16:00.
How could I improve the code above? Thanks.
Use DataFrame.asfreq, which works when the datetimes are unique:
#create sorted DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
# if duplicates are possible
#df = df.resample('H').first()
# if there are no duplicates
df = df.asfreq('H')
missing = df[df['NO2'].isna()]
To write to a file, first convert the DatetimeIndex values to the custom format with DatetimeIndex.strftime, then write them with numpy or pandas:
s = missing.index.strftime('%Y/%m/%d %H:%M').to_series()
np.savetxt('./missing_date.txt', s, fmt='%s')
s.to_csv('./missing_date.txt', index=False, header=False)
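Putting the pieces together, a runnable sketch with the question's data inlined (only the NO2 column is needed to detect the gaps):

```python
import pandas as pd

# the question's data, reduced to the NO2 column
df = pd.DataFrame({
    'date': ['2018/11/14 10:00', '2018/11/14 12:00', '2018/11/14 13:00',
             '2018/11/14 14:00', '2018/11/14 15:00', '2018/11/14 17:00',
             '2018/11/14 18:00', '2018/11/14 19:00'],
    'NO2': [9, 9, 8, 8, 8, 8, 8, 7],
})

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()

# asfreq inserts a NaN row for every missing hour
df = df.asfreq('h')
missing = df.index[df['NO2'].isna()]

# format the missing hours and write one per line
formatted = missing.strftime('%Y/%m/%d %H:%M')
print(list(formatted))
# ['2018/11/14 11:00', '2018/11/14 16:00']
with open('missing_date.txt', 'w') as f:
    f.write('\n'.join(formatted) + '\n')
```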

how to plot only with the dates inside my df and not all the dates

I have this following df :
date values
2020-08-06 08:00:00 5
2020-08-06 09:00:00 10
2020-08-06 10:00:00 0
2020-08-17 08:00:00 8
2020-08-17 09:00:00 15
I want to plot this df, so I do: df.set_index('date')['values'].plot(kind='line'), but it shows all the dates between the 6th and the 17th.
How can I plot the graph only with the dates inside my df ?
I assume that the date column is of datetime type.
To draw only the dates actually present, the index must be built on
the principle "day number from the unique date list × 24 + hour".
But to suppress the default x tick labels, you then have to define
your own, e.g. every 8 hours within each plotted date.
Start by converting your DataFrame as follows:
idx = df['date'].dt.normalize().unique()
dateMap = pd.Series(np.arange(idx.size) * 24, index=idx)
df.set_index(df.date.dt.date.map(dateMap) + df.date.dt.hour, inplace=True)
df.index.rename('HourNo', inplace=True); df
Now, for your data sample, it has the following content:
date values
HourNo
8 2020-08-06 08:00:00 5
9 2020-08-06 09:00:00 10
10 2020-08-06 10:00:00 0
32 2020-08-17 08:00:00 8
33 2020-08-17 09:00:00 15
Then generate your plot and the x tick positions and labels:
fig, ax = plt.subplots(tight_layout=True)
df.loc[:, 'values'].plot(style='o-', rot=30, ax=ax)
xLoc = np.arange(0, dateMap.index.size * 24, 8)
xLbl = pd.concat([pd.Series(d + pd.timedelta_range(start=0, freq='8H', periods=3))
                  for d in dateMap.index]).dt.strftime('%Y-%m-%d\n%H:%M')
plt.xticks(ticks=xLoc, labels=xLbl, ha='right')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.set_title('Set the proper heading')
ax.grid()
plt.show()
I also added a grid.
The resulting plot shows points only at the hours present in the data, with the gap between the two dates removed.
And a final remark: avoid column names that are the same as existing
Pandas methods or attributes (e.g. values).
They are sometimes the cause of confusing errors: you intend to refer to
a column, but you actually refer to a method or attribute.
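A lighter-weight alternative (a common trick, not part of the answer above): plot against the positional index, so matplotlib has no calendar gap to fill, then label the ticks with the actual dates:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-06 08:00', '2020-08-06 09:00',
                            '2020-08-06 10:00', '2020-08-17 08:00',
                            '2020-08-17 09:00']),
    'values': [5, 10, 0, 8, 15],
})

# x axis is just 0..n-1, so only the rows present are drawn;
# the dates go on the tick labels instead
fig, ax = plt.subplots(tight_layout=True)
ax.plot(range(len(df)), df['values'], 'o-')
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['date'].dt.strftime('%Y-%m-%d\n%H:%M'), rotation=30)
plt.show()
```

The trade-off is that the x axis is no longer to time scale: equal spacing between points, regardless of how far apart the dates are.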

Resample dataframe based on time ranges, ignoring date

I am trying to resample my data to get sums. The resampling needs to be based solely on time of day: I want to group the times into 6-hour windows, so regardless of the date I will get 4 sums.
My df looks like this:
booking_count
date_time
2013-04-04 08:32:25 58
2013-04-04 18:43:11 1
2013-30-04 12:39:15 52
2013-14-05 06:51:33 99
2013-01-06 23:59:17 1
2013-03-06 19:37:25 42
2013-27-06 04:12:01 38
With this example data, I expect to get the following results:
00:00:00 38
06:00:00 157
12:00:00 52
18:00:00 43
To get around the date issue, I tried to keep only the time values:
df['time'] = pd.DatetimeIndex(df['date_time']).time
new_df = df[['time', 'booking_bool']].set_index('time').resample('360min').sum()
Unfortunately, this was to no avail. How do I go about getting my required results? Is resample() even suitable for this task?
I don't think resample() is a good method for this, because you need to group by hour independently of the day. Instead, try pd.cut with a custom bins parameter, followed by a usual groupby:
bins = np.arange(start=0, stop=24 + 6, step=6)
group = df.groupby(pd.cut(df.index.hour, bins, right=False,
                          labels=pd.date_range('00:00:00', '18:00:00',
                                               freq='6H').time)).sum()
group
# booking_count
# 00:00:00 38
# 06:00:00 157
# 12:00:00 52
# 18:00:00 44
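The same result can be had without pd.cut by flooring each hour to the start of its 6-hour window and grouping on that (a sketch, with the question's dates normalized to ISO year-month-day order since the sample mixes day/month positions):

```python
import pandas as pd

# the question's data with ISO-ordered dates
idx = pd.to_datetime(['2013-04-04 08:32:25', '2013-04-04 18:43:11',
                      '2013-04-30 12:39:15', '2013-05-14 06:51:33',
                      '2013-06-01 23:59:17', '2013-06-03 19:37:25',
                      '2013-06-27 04:12:01'])
df = pd.DataFrame({'booking_count': [58, 1, 52, 99, 1, 42, 38]}, index=idx)

# integer division floors each hour to 0, 6, 12 or 18
window_start = df.index.hour // 6 * 6
out = df.groupby(window_start)['booking_count'].sum()
print(out)
# window starts 0, 6, 12, 18 -> sums 38, 157, 52, 44
```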

How to make a histogram of pandas datetimes per specific time interval?

I want to plot some datetimes and would like to specify a time interval in order to bundle them together and make a histogram. For example, if there happen to be n datetimes within one hour, group them together, or parse them as year, month, day, hour, omitting minutes and seconds.
Let's say I have a data frame with some datetime values:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
print(df)
2018-06-19 17:10:32.076646 29
2018-06-20 17:10:32.076646 56
2018-06-21 17:10:32.076646 82
2018-06-22 17:10:32.076646 13
2018-06-23 17:10:32.076646 35
2018-06-24 17:10:32.076646 53
2018-06-25 17:10:32.076646 25
2018-06-26 17:10:32.076646 23
Ideally, I would like to specify a more flexible time interval, such as "6 hours" in order to make some sort of modulo operation on the datetimes. Is this possible?
pd.Grouper
Allows you to specify regular frequency intervals with which to group your data. Then use groupby to aggregate your df based on these groups. For instance, if col2 held counts and you wanted to bin together all of the counts over 2-day intervals, you could do:
import pandas as pd
df.groupby(pd.Grouper(level=0, freq='2D')).col2.sum()
Outputs:
test
2018-06-19 13:49:11.560185 85
2018-06-21 13:49:11.560185 95
2018-06-23 13:49:11.560185 88
2018-06-25 13:49:11.560185 48
Name: col2, dtype: int32
You group by level=0, that is your index labeled 'test' and sum col2 over 2 day bins. The behavior of pd.Grouper can be a little annoying since in this example the bins start and end at 13:49:11..., which likely isn't what you want.
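On pandas 1.1 or newer, the origin parameter of pd.Grouper addresses this: it anchors the bins at midnight of the first day rather than at the first timestamp. A sketch with fixed timestamps in place of datetime.now(), so the run is reproducible:

```python
import pandas as pd

# fixed timestamps standing in for datetime.now()-based data
rng = pd.date_range('2018-06-19 17:10:32', periods=8, freq='D')
df = pd.DataFrame({'col2': [29, 56, 82, 13, 35, 53, 25, 23]}, index=rng)

# origin='start_day' (pandas >= 1.1) snaps the 2-day bins to midnight
out = df.groupby(pd.Grouper(level=0, freq='2D', origin='start_day')).col2.sum()
print(out)
# 2018-06-19    85
# 2018-06-21    95
# 2018-06-23    88
# 2018-06-25    48
```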
pd.cut + pd.date_range
You have a bit more control over defining your bins if you define them with pd.date_range and then use pd.cut. Here for instance, you can define bins every 2 days beginning on the 19th.
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19', '2018-06-27', freq='2D'))).col2.sum()
Outputs:
(2018-06-19, 2018-06-21] 85
(2018-06-21, 2018-06-23] 95
(2018-06-23, 2018-06-25] 88
(2018-06-25, 2018-06-27] 48
Name: col2, dtype: int32
This is nice, because if you instead wanted the bins to begin on even days you can just change the start and end dates in pd.date_range
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-18', '2018-06-28', freq='2D'))).col2.sum()
Outputs:
(2018-06-18, 2018-06-20] 29
(2018-06-20, 2018-06-22] 138
(2018-06-22, 2018-06-24] 48
(2018-06-24, 2018-06-26] 78
(2018-06-26, 2018-06-28] 23
Name: col2, dtype: int32
If you really wanted to, you could specify 2.6 hour bins beginning on June 19th 2018 at 5 AM:
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19 5:00:00', '2018-06-28 5:00:00', freq='2.6H'))).col2.sum()
#(2018-06-19 05:00:00, 2018-06-19 07:36:00] 0
#(2018-06-19 07:36:00, 2018-06-19 10:12:00] 0
#(2018-06-19 10:12:00, 2018-06-19 12:48:00] 0
#(2018-06-19 12:48:00, 2018-06-19 15:24:00] 29
#....
Histogram.
Just use .plot(kind='bar') after you have aggregated the data.
(df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19', '2018-06-28', freq='2D')))
.col2.sum().plot(kind='bar', color='firebrick', rot=30))
