python pandas String to Timestamps conversion ambiguous

I'm trying to slice a DataFrame using a DatetimeIndex, but I ran into an issue.
When the sliced DataFrame crosses into a new month, the day and the month get swapped.
Here is my dataframe:
Valeur
date
2015-01-08 00:00:00 93
2015-01-08 00:10:00 90
2015-01-08 00:20:00 88
2015-01-08 00:30:00 103
2015-01-08 00:40:00 86
2015-01-08 00:50:00 88
2015-01-08 01:00:00 86
2015-01-08 01:10:00 84
2015-01-08 01:20:00 95
2015-01-08 01:30:00 88
2015-01-08 01:40:00 85
2015-01-08 01:50:00 92
... ...
2016-10-30 22:20:00 98
2016-10-30 22:30:00 94
2016-10-30 22:40:00 94
2016-10-30 22:50:00 103
2016-10-30 23:00:00 92
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
[65814 rows x 1 columns]
Here are my two Timestamps:
startingDate : 2015-10-31 23:50:00
lastDate : 2016-10-30 23:50:00
When I slice my df like this:
dfconso = dfconso[startingDate:lastDate]
I got something like this:
Valeur
date
2015-10-31 23:50:00 88
2015-01-11 00:00:00 83
2015-01-11 00:10:00 82
2015-01-11 00:20:00 87
2015-01-11 00:30:00 77
2015-01-11 00:40:00 72
2015-01-11 00:50:00 86
2015-01-11 01:00:00 77
2015-01-11 01:10:00 80
... ...
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
The problem is that the slice starts at the right date, but when the DatetimeIndex changes month, something goes wrong:
it jumps from 31 October 2015 to 11 January 2015.
And I don't understand why.
I printed the month and day to check, and I got this:
In:
print("Index 0 : month", dfconso.index[0].month, ", day", dfconso.index[0].day)
print("Index 1 : month", dfconso.index[1].month, ", day", dfconso.index[1].day)
Out:
Index 0 : month 10 , day 31
Index 1 : month 1 , day 11
Does anyone have an idea?
EDIT:
After calling df.sort_index() on my df, I can see that the conversion from String dates to Timestamp dates sometimes failed and swapped month and day.
String format:
"31/08/2015 20:00:00"
My code to convert from String to Timestamps:
dfconso.index = pd.to_datetime(dfconso.index, infer_datetime_format=True, format="%d/%m/%Y")

SOLUTION:
That was a misuse of pd.to_datetime; I replaced infer_datetime_format with dayfirst:
dfconso.index = pd.to_datetime(dfconso.index, dayfirst=True)
That solved my problem.
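A minimal sketch of the two unambiguous options, using the string format quoted above (the sample values are made up):

```python
import pandas as pd

# Two sample strings in the question's day/month/year format.
s = pd.Series(["31/08/2015 20:00:00", "01/09/2015 20:00:00"])

# dayfirst=True tells the parser to read every value as day/month/year.
idx = pd.to_datetime(s, dayfirst=True)

# Being fully explicit with format= removes the ambiguity entirely.
idx2 = pd.to_datetime(s, format="%d/%m/%Y %H:%M:%S")
```

Both calls parse "01/09/2015" as 1 September, never 9 January.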

The error might not be a mix-up of day and month, but just an ordering problem. Try sorting the data before slicing it (the provided part of your data looks fine, but who knows about the rest…).
Here is how reordering works: Sort a pandas datetime index

Related

Filtering dataframe given a list of dates

I have the following dataframe:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
3 1999-10-05 12:00:00 53
4 1999-10-10 16:00:00 43
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
I have a datetime list that I get from tolist() in another dataframe.
[Timestamp('1999-10-01 00:00:00'),
Timestamp('1999-10-02 00:00:00'),
Timestamp('1999-10-24 00:00:00')]
The purpose of the list is to filter the dataframe based on the dates inside it. The end result is:
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
Only the rows from 1st, 2nd, and 24th Oct should appear in the dataframe.
What is the approach for this? I have only found solutions that filter between two dates or on a single date.
Thank you.
If you want to compare Timestamps without the time component, use Series.dt.normalize:
df1 = df[df['Date'].dt.normalize().isin(L)]
Or Series.dt.floor:
df1 = df[df['Date'].dt.floor('d').isin(L)]
To compare by python dates, it is necessary to also convert the list to dates:
df1 = df[df['Date'].dt.date.isin([x.date() for x in L])]
print (df1)
Date Site
0 1999-10-01 12:00:00 65
1 1999-10-01 16:00:00 21
2 1999-10-02 11:00:00 57
5 1999-10-24 07:00:00 33
6 1999-10-24 08:00:00 21
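A self-contained sketch of the normalize approach, using a subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "1999-10-01 12:00:00", "1999-10-01 16:00:00",
        "1999-10-02 11:00:00", "1999-10-05 12:00:00",
        "1999-10-24 07:00:00",
    ]),
    "Site": [65, 21, 57, 53, 33],
})
L = [pd.Timestamp("1999-10-01"), pd.Timestamp("1999-10-02"),
     pd.Timestamp("1999-10-24")]

# normalize() zeroes the time-of-day, so each row compares by calendar date.
df1 = df[df["Date"].dt.normalize().isin(L)]
```

The 1999-10-05 row is dropped because its normalized date is not in the list.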

How to make a time index in dataframe pandas with 15 minutes spacing

How can I make a time index in a pandas dataframe with 15-minute spacing over 24 hours, without the date part (12\4\2020 00:15) and without doing it manually?
For example, all I want as an index is: 00:15 00:30 00:45.........23:45 00:00.
You can use pd.date_range to create dummy dates with your desired time frequency, then just extract them:
pd.Series(pd.date_range(
'1/1/2020', '1/2/2020', freq='15min', inclusive='left')).dt.time
0 00:00:00
1 00:15:00
2 00:30:00
3 00:45:00
4 01:00:00
...
91 22:45:00
92 23:00:00
93 23:15:00
94 23:30:00
95 23:45:00
Length: 96, dtype: object
You can use to_timedelta with an array of numbers; here I chose minutes.
pd.to_timedelta(np.arange(0, 24*60, 15), unit='min')
#TimedeltaIndex(['00:00:00', '00:15:00', '00:30:00', '00:45:00', '01:00:00',
# ....
# '23:45:00'],
# dtype='timedelta64[ns]', freq=None)
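If the goal is literal "00:15"-style labels, the timedeltas can be rendered as strings (a sketch; the formatting helper is my own, not from the answer):

```python
import numpy as np
import pandas as pd

# 15-minute offsets covering a full day: 0, 15, 30, ..., 1425 minutes.
td = pd.to_timedelta(np.arange(0, 24 * 60, 15), unit="min")

# Each Timedelta exposes .components; format hours and minutes as HH:MM.
labels = [f"{t.components.hours:02d}:{t.components.minutes:02d}" for t in td]
```

`labels` then runs from '00:00' through '23:45' and can be assigned as a plain string index.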

How do I group hourly data by day and count only values greater than a set amount in Pandas?

I am new to Pandas but have been working with Python for a few years now.
I have a large dataset of hourly data with multiple columns. I need to group the data by day, then count how many times the value is above 85 for each day, for each column.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help would be greatly appreciated.
(D_RH>85).sum()
The above code gets me close, but I also need a daily breakdown, not just the per-column totals.
One way would be to make date a DatetimeIndex and then groupby the result of the comparison to 85. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
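An equivalent formulation with resample, sketched on a made-up slice of the data:

```python
import pandas as pd

idx = pd.to_datetime(["2014-06-06 20:00", "2014-06-06 21:00",
                      "2014-06-07 01:00", "2014-06-07 02:00"])
df = pd.DataFrame({"KMRY": [84, 81, 86, 86],
                   "KSNS": [81, 86, 89, 89]}, index=idx)

# Compare against the threshold first, then count True values per calendar day.
daily = (df > 85).resample("D").sum()
```

resample("D") and groupby on the date give the same daily buckets here; resample additionally emits empty days inside the range, which may or may not be what you want.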

pandas quick way to break dataframe by timeindex

I have a dataframe with a timeindex. But the timeindex is not consecutive.
df with microsecond resolution timestamp index.
Time Bid
2014-03-03 23:30:30.383002 1.37315
2014-03-03 23:30:30.383042 1.37318
2014-03-03 23:30:30.383067 1.37318
2014-03-03 23:30:31.174442 1.37315
2014-03-03 23:30:32.028966 1.37315
2014-03-03 23:30:32.052447 1.37315
I want to check whether there are any minutes without data, so I resampled:
tick_count = e.resample('1Min', how=np.size)
Time Bid
2014-03-04 00:15:00 73
2014-03-04 00:16:00 298
2014-03-04 00:17:00 124
2014-03-04 00:18:00 318
2014-03-04 00:19:00 27
2014-03-04 00:20:00 0
2014-03-04 00:21:00 0
2014-03-04 00:22:00 241
2014-03-04 00:23:00 97
2014-03-04 00:24:00 52
2014-03-04 00:25:00 446
2014-03-04 00:26:00 867
Here I find two minutes with no data. How do I split the original df into multiple
dfs so that each one has data for every minute? In the case above,
the first df would run from 00:15 to 00:19, the second from 00:22 to 00:26, and so on.
Thank you!
Assuming the times are sorted, you could use
df['group'] = (df['Time'].diff() > np.timedelta64(60,'s')).cumsum()
to add a column to your DataFrame, which will classify the rows according to which group they belong to. The result looks like this:
Time Bid group
0 2014-03-04 00:15:00 73 0
1 2014-03-04 00:16:00 298 0
2 2014-03-04 00:17:00 124 0
3 2014-03-04 00:18:00 318 0
4 2014-03-04 00:19:00 27 0
5 2014-03-04 00:22:00 241 1
6 2014-03-04 00:23:00 97 1
7 2014-03-04 00:24:00 52 1
8 2014-03-04 00:25:00 446 1
9 2014-03-04 00:26:00 867 1
This is better than having multiple DataFrames, because you can apply fast numpy/pandas operations to the entire DataFrame. With a list of DataFrames you would be forced to use a Python loop to operate on the sub-DataFrames individually (assuming you want to perform the same operation on each one), which is generally slower.
Typically, the pandas-way to operate on the sub-DataFrames would be to use a groupby operation. For example,
>>> grouped = df.groupby(['group'])
>>> grouped['Bid'].sum()
group
0 840
1 1703
Name: Bid, dtype: int64
to find the sum of the bids in each group.
However, if you really wish to have a list of sub-DataFrames, you could obtain it using
subdfs = [subdf for key, subdf in grouped]
For those wanting to reproduce the result above, I put this in a file called data:
Time Bid
2014-03-04 00:15:00 73
2014-03-04 00:16:00 298
2014-03-04 00:17:00 124
2014-03-04 00:18:00 318
2014-03-04 00:19:00 27
2014-03-04 00:22:00 241
2014-03-04 00:23:00 97
2014-03-04 00:24:00 52
2014-03-04 00:25:00 446
2014-03-04 00:26:00 867
and ran
import pandas as pd
import numpy as np
df = pd.read_table('data', sep='\s{2,}', parse_dates=[0])
print(df.dtypes)
# Time datetime64[ns] # It is important that Time has dtype datetime64[ns]
# Bid int64
# dtype: object
df['group'] = (df['Time'].diff() > np.timedelta64(60,'s')).cumsum()
print(df)
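As a hypothetical follow-up, the start and end of each contiguous run can be read straight off the grouped object (a sketch with made-up data, same grouping trick as above):

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2014-03-04 00:15", "2014-03-04 00:16",
                        "2014-03-04 00:17", "2014-03-04 00:22",
                        "2014-03-04 00:23"])
df = pd.DataFrame({"Time": times, "Bid": [73, 298, 124, 241, 97]})

# Gaps longer than one minute start a new group.
df["group"] = (df["Time"].diff() > np.timedelta64(60, "s")).cumsum()

# First and last timestamp of each contiguous run.
runs = df.groupby("group")["Time"].agg(["first", "last"])
```

Each row of `runs` gives the boundaries of one gap-free stretch of data.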

matplotlib plots strange horizontal lines on graph

I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
<-- Data imported from Excel spreadsheet into DataFrame using openpyxl. -->
<-- Code omitted for ease of reading. -->
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']), tides['tide'], '-', xdate=True, label='Tides 2011', linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens: after the last point of day 1, the line disappears off to the right and then returns to plot the first point of day 2, but the data are plotted incorrectly on the y-axis. This happens throughout the dataset. A printout suggests the data are OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening and how can I correct it?
Thanks in advance.
Phil
Doh! Finally found the answer. The original workflow was quite complicated: I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range, converted that to a pandas DataFrame, converted the date-and-time variable to datetime format using pandas' .to_datetime() function, and finally plotted the data using matplotlib.
As I was preparing the data to post to this forum (as suggested by rauparaha) and paring the script down to its bare essentials, I noticed that Day 1 data were plotted on 01 Jan 2011 but Day 2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are in mixed formats: the last date given is '2011-12-31' (i.e. year-month-day) but the 2nd date, representing 2nd Jan 2011, is '2011-02-01' (i.e. year-day-month).
So it looks like I misunderstood how the pandas .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format argument (default=False) and had assumed any problems would be flagged up. But it seems pandas assumes dates are in a month-first format, unless they can't be, in which case it switches to a day-first format. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
Thanks again for your suggestions. And apologies for any confusion.
Cheers.
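The fix described above can be sketched like this, assuming a day/month/year source format:

```python
import pandas as pd

s = pd.Series(["01/01/2011 00:00:00", "02/01/2011 00:00:00"])

# An explicit format string leaves no room for month/day guessing:
# "02/01/2011" is parsed as 2 January, never 1 February.
parsed = pd.to_datetime(s, format="%d/%m/%Y %H:%M:%S")
```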
I have been unable to replicate your error, but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
ydata = np.sin(np.linspace(0, 10, num=200))
time_index = pd.date_range(start='2000-01-01', periods=200, freq='15min')
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the following plot
