I have a CSV file like this, and this is the code I wrote to filter by date:
example['date_1'] = pd.to_datetime(example['date_1'])
example['date_2'] = pd.to_datetime(example['date_2'])
example
date_1 ID date_2
2015-01-12 111 2016-01-20 08:34:00
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
2015-01-11 1140303 2015-12-15 08:43:00
2015-01-11 1140414 2015-12-15 08:43:00
example[(example['date_1'] <= '2016-11-01')
& (example['date_1'] >= '2015-11-01')
& (example['date_2'] <= '2016-12-16')
& (example['date_2'] >= '2015-12-15')]
Output:
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
I don't understand why it changes the format of the dates, and it seems to mix up the month and day. With the conditional filter, the expected result should be the same as the original dataset, but it removed several rows. Can someone help me with this? Many thanks.
Some locales format dates as dd/mm/YYYY, while others use mm/dd/YYYY. By default, pandas assumes the American mm/dd/YYYY format unless it can infer the alternative from the values (e.g. when a day number is greater than 12).
So if you know that your input date format is dd/mm/YYYY, you must tell pandas:
example['date_1'] = pd.to_datetime(example['date_1'], dayfirst=True)
example['date_2'] = pd.to_datetime(example['date_2'], dayfirst=True)
Once pandas has a Timestamp column, it internally stores the number of nanoseconds since 1970-01-01 00:00 and by default displays it according to ISO 8601, stripping the parts that are zero for the whole column (the full time, fractions of a second, or nanoseconds).
You should not care about this if you just want to process the Timestamps. If at the end you want to force a specific format, explicitly convert the column to its string representation:
df['date_1'] = df['date_1'].dt.strftime('%d/%m/%Y %H:%M')
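To see the effect, here is a minimal sketch (with a made-up ambiguous date string) showing how the same input parses differently with and without dayfirst:

```python
import pandas as pd

# '11/01/2016' is ambiguous: November 1st or January 11th?
s = pd.Series(['11/01/2016'])

print(pd.to_datetime(s)[0])                  # 2016-11-01 (month first, the default)
print(pd.to_datetime(s, dayfirst=True)[0])   # 2016-01-11 (day first)
```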
This is my first time posting a question here; if I don't explain it clearly, please give me a chance to improve the question. Thank you!
I have a dataset containing dates and times like this:
TIME COL1 COL2 COL3 ...
2018/12/31 23:50:23 34 DC 23
2018/12/31 23:50:23 32 NC 23
2018/12/31 23:50:19 12 AL 33
2018/12/31 23:50:19 56 CA 23
2018/12/31 23:50:19 98 CA 33
I want to create a new column with a format like '2018-12-31 11:00:00 PM' instead of '2018/12/31 23:10:23', where the time is rounded to the hour (e.g. 17:40 was rounded up to 6:00).
I have tried using .dt.strftime("%Y-%m-%d %H:%M:%S") to change the format, but when I try to convert the time from 24h to 12h, I get stuck here:
Name: TIME, Length: 3195450, dtype: datetime64[ns]
I found out the type of df['TIME'] is pandas.core.series.Series
Now I have no idea about how to continue. Please give me some ideas, hints or any instructions. Thank you very much!
From your example, it seems you want to floor to the hour rather than round. In any case, first make sure your TIME column is of datetime dtype:
df['TIME'] = pd.to_datetime(df['TIME'])
Now floor (or round) using the dt accessor and an offset alias:
df['newTIME'] = df['TIME'].dt.floor('H') # could use round instead of floor here
# df['newTIME']
# 0 2018-12-31 23:00:00
# 1 2018-12-31 23:00:00
# 2 2018-12-31 23:00:00
# 3 2018-12-31 23:00:00
# 4 2018-12-31 23:00:00
# Name: newTIME, dtype: datetime64[ns]
After that, you can format to string in a desired format, again using the dt accessor to access properties of a datetime series:
df['timestring'] = df['newTIME'].dt.strftime("%Y-%m-%d %I:%M:%S %p")
# df['timestring']
# 0 2018-12-31 11:00:00 PM
# 1 2018-12-31 11:00:00 PM
# 2 2018-12-31 11:00:00 PM
# 3 2018-12-31 11:00:00 PM
# 4 2018-12-31 11:00:00 PM
# Name: timestring, dtype: object
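Putting the two steps together, a minimal end-to-end sketch (with made-up sample rows, including the 17:40 case from the question):

```python
import pandas as pd

df = pd.DataFrame({'TIME': ['2018/12/31 23:50:23', '2018/12/31 17:40:00']})
df['TIME'] = pd.to_datetime(df['TIME'])

# floor to the hour, then format as a 12-hour string with AM/PM
df['timestring'] = df['TIME'].dt.floor('H').dt.strftime('%Y-%m-%d %I:%M:%S %p')
print(df['timestring'].tolist())
# ['2018-12-31 11:00:00 PM', '2018-12-31 05:00:00 PM']
```

Note that flooring turns 17:40 into 5:00 PM; use .dt.round('H') instead if you want it rounded up to 6:00 PM.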
I would like to subtract datetime values in pandas, but in chunks of two rows, and I don't know which function to use.
Timestamp
2020-11-26 20:00:00
2020-11-26 21:00:00
2020-11-26 22:00:00
2020-11-26 23:30:00
Explanation:
(2020-11-26 21:00:00) - (2020-11-26 20:00:00)
(2020-11-26 23:30:00) - (2020-11-26 22:00:00)
The result must be:
01:00:00
01:30:00
First, check that the column is of datetime dtype.
If not, convert it with pd.to_datetime():
demo = pd.DataFrame(columns=['Timestamps'])
demotime = ['20:00:00','21:00:00','22:00:00','23:30:00']
demo['Timestamps'] = demotime
demo['Timestamps'] = pd.to_datetime(demo['Timestamps'])
Your dataframe would look like:
Timestamps
0 2020-11-29 20:00:00
1 2020-11-29 21:00:00
2 2020-11-29 22:00:00
3 2020-11-29 23:30:00
After that, you can use a for or while loop and compute:
demo.iloc[i+1,0]-demo.iloc[i,0]
IIUC, you want to iterate over chunks of two rows and find the difference; one approach is:
res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)
Output
Timestamp
1 0 days 01:00:00
3 0 days 01:30:00
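As a self-contained sketch of the groupby approach, using the timestamps from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.to_datetime(
    ['2020-11-26 20:00:00', '2020-11-26 21:00:00',
     '2020-11-26 22:00:00', '2020-11-26 23:30:00'])})

# label rows 0,1 as group 0 and rows 2,3 as group 1, then diff within each pair;
# the first row of each pair becomes NaT and is dropped
res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)
#            Timestamp
# 1    0 days 01:00:00
# 3    0 days 01:30:00
```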
This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I am trying to understand the rolling function in pandas. Here is my example code:
# importing pandas as pd
import pandas as pd
# By default the "date" column was in string format,
# we need to convert it into date-time format
# parse_dates =["date"], converts the "date" column to date-time format
# Resampling works with time-series data only
# so convert "date" column to index
# index_col ="date", makes "date" column
df = pd.read_csv("apple.csv", parse_dates = ["date"], index_col = "date")
print (df.close.rolling(3).sum())
print (df.close.rolling(3, win_type ='triang').sum())
The CSV input file has 255 entries, but I only get a few entries in the output; I get "..." between 2018-10-04 and 2017-12-26. I verified the input file, and it has many more valid entries between these dates.
date
2018-11-14 NaN
2018-11-13 NaN
2018-11-12 578.63
2018-11-09 590.87
2018-11-08 607.13
2018-11-07 622.91
2018-11-06 622.21
2018-11-05 615.31
2018-11-02 612.84
2018-11-01 631.29
2018-10-31 648.56
2018-10-30 654.38
2018-10-29 644.40
2018-10-26 641.84
2018-10-25 648.34
2018-10-24 651.19
2018-10-23 657.62
2018-10-22 658.47
2018-10-19 662.69
2018-10-18 655.98
2018-10-17 656.52
2018-10-16 659.36
2018-10-15 660.70
2018-10-12 661.62
2018-10-11 653.92
2018-10-10 652.92
2018-10-09 657.68
2018-10-08 667.00
2018-10-05 674.93
2018-10-04 676.05
...
2017-12-26 512.25
2017-12-22 516.18
2017-12-21 520.59
2017-12-20 524.37
2017-12-19 523.90
2017-12-18 525.31
2017-12-15 524.93
2017-12-14 522.61
2017-12-13 518.46
2017-12-12 516.19
2017-12-11 516.64
2017-12-08 513.74
2017-12-07 511.36
2017-12-06 507.70
2017-12-05 507.97
2017-12-04 508.45
2017-12-01 510.49
2017-11-30 512.70
2017-11-29 512.38
2017-11-28 514.40
2017-11-27 516.64
2017-11-24 522.13
2017-11-22 524.02
2017-11-21 523.07
2017-11-20 518.08
2017-11-17 513.27
2017-11-16 511.23
2017-11-15 510.33
2017-11-14 511.52
2017-11-13 514.39
Name: close, Length: 254, dtype: float64
Thank you for your help.
The ... just means that pandas isn't showing you all the rows; that's where the 'missing' ones are.
To display all rows:
with pd.option_context("display.max_rows", None):
    print(df.close.rolling(3, win_type='triang').sum())
I created a Holiday calendar for Germany (not all days included) as followed:
from pandas.tseries.holiday import Holiday,AbstractHolidayCalendar
class GermanHolidays(AbstractHolidayCalendar):
rules = [Holiday('New Years Day', month=1, day=1),
Holiday('First of May', month=5, day=1),
Holiday('German Unity Day', month=10,day=3),
...]
cal = GermanHolidays()
Now I want a column that indicates whether a holiday occurs ("1" or "0"). So I did the following:
holidays = cal.holidays(start=X['Time (CET)'].min(), end = X['Time (CET)'].max())
X['Holidays'] = X['Time (CET)'].isin(holidays)
X['Holidays'] = X['Holidays'].astype(float)
X is a dataframe where Time (CET) is a column in the format %d.%m.%Y %H:%M:%S. Unfortunately this is not working: no error is raised, but all rows are marked with "0". So no matching is happening, and I really don't know why.
I thought it might be because the frequency of the holidays is daily, not hourly as in the Time (CET) column.
Would be great if you could help me! Thank you!
There might be a few reasons for that.
One of them, as mentioned by @unutbu, is a wrong (string) dtype. Make sure your X['Time (CET)'] column is of datetime dtype. This can be done as follows:
X['Time (CET)'] = pd.to_datetime(X['Time (CET)'], dayfirst=True, errors='coerce')
Another reason as you said is the time part.
Here is a demo:
In [28]: df = pd.DataFrame({'Date':pd.date_range('2017-01-01 01:01:01',
freq='9H', periods=1000)})
yields:
In [30]: df
Out[30]:
Date
0 2017-01-01 01:01:01
1 2017-01-01 10:01:01
2 2017-01-01 19:01:01
3 2017-01-02 04:01:01
4 2017-01-02 13:01:01
5 2017-01-02 22:01:01
6 2017-01-03 07:01:01
7 2017-01-03 16:01:01
8 2017-01-04 01:01:01
9 2017-01-04 10:01:01
.. ...
990 2018-01-07 07:01:01
991 2018-01-07 16:01:01
992 2018-01-08 01:01:01
993 2018-01-08 10:01:01
994 2018-01-08 19:01:01
995 2018-01-09 04:01:01
996 2018-01-09 13:01:01
997 2018-01-09 22:01:01
998 2018-01-10 07:01:01
999 2018-01-10 16:01:01
[1000 rows x 1 columns]
Filtering by holidays isn't working because the time parts don't match:
In [29]: df.loc[df.Date.isin(holidays)]
Out[29]:
Empty DataFrame
Columns: [Date]
Index: []
We can make it work by normalizing our datetime column (truncating the time part, i.e. setting the time to 00:00:00):
In [31]: df.loc[df.Date.dt.normalize().isin(holidays)]
Out[31]:
Date
0 2017-01-01 01:01:01
1 2017-01-01 10:01:01
2 2017-01-01 19:01:01
320 2017-05-01 01:01:01
321 2017-05-01 10:01:01
322 2017-05-01 19:01:01
734 2017-10-03 07:01:01
735 2017-10-03 16:01:01
This is basically what you already have. Given that this works and yours doesn't, it is likely because the values are text instead of timestamps, as noted already by @unutbu and @MaxU.
Also, your post states:
displays when a holiday appears or not with ("1" or "0")
Did you really want a text value? You tried to convert to floats, but you probably just want integers.
X = pd.DataFrame({'Time (CET)': pd.date_range(start='2017-01-01', end='2017-12-31', freq='12H')})
X = X.assign(Holidays=X['Time (CET)'].isin(cal.holidays()).astype(int))
>>> X
Time (CET) Holidays
0 2017-01-01 00:00:00 1
1 2017-01-01 12:00:00 0
2 2017-01-02 00:00:00 0
...
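A self-contained sketch tying the pieces together (holiday rules abbreviated to the three from the question, sample timestamps made up):

```python
import pandas as pd
from pandas.tseries.holiday import Holiday, AbstractHolidayCalendar

class GermanHolidays(AbstractHolidayCalendar):
    rules = [Holiday('New Years Day', month=1, day=1),
             Holiday('First of May', month=5, day=1),
             Holiday('German Unity Day', month=10, day=3)]

# intraday timestamps: midnight and noon over three days
X = pd.DataFrame({'Time (CET)': pd.date_range('2017-01-01', periods=6, freq='12H')})
holidays = GermanHolidays().holidays(start=pd.Timestamp('2017-01-01'),
                                     end=pd.Timestamp('2017-12-31'))

# normalize() zeroes the time part so intraday stamps can match daily holidays
X['Holidays'] = X['Time (CET)'].dt.normalize().isin(holidays).astype(int)
print(X['Holidays'].tolist())  # [1, 1, 0, 0, 0, 0] -- only Jan 1 matches
```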
I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling that pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day, 3pm in this case. If a file doesn't have a record at 3pm that day, I want to use the previous record.
Let me give a concrete example. I have data in two CSV files. Here are a couple small examples:
datetime value
2013-02-28 09:30:00 0.565019720442
2013-03-01 09:30:00 0.549536266504
2013-03-04 09:30:00 0.5023031467
2013-03-05 09:30:00 0.698370467751
2013-03-06 09:30:00 0.75834927162
2013-03-07 09:30:00 0.783620442226
2013-03-11 09:30:00 0.777265379462
2013-03-12 09:30:00 0.785787872851
2013-03-13 09:30:00 0.784873183044
2013-03-14 10:15:00 0.802959366653
2013-03-15 10:15:00 0.802959366653
2013-03-18 10:15:00 0.805413095911
2013-03-19 09:30:00 0.80816233134
2013-03-20 10:15:00 0.878912249996
2013-03-21 10:15:00 0.986393922571
and the other:
datetime value
2013-02-28 05:00:00 0.0373634672097
2013-03-01 05:00:00 -0.24700085273
2013-03-04 05:00:00 -0.452964976056
2013-03-05 05:00:00 -0.2479288295
2013-03-06 05:00:00 -0.0326855588777
2013-03-07 05:00:00 0.0780461766619
2013-03-08 05:00:00 0.306247682656
2013-03-11 06:00:00 0.0194146154407
2013-03-12 05:30:00 0.0103653153719
2013-03-13 05:30:00 0.0350377752558
2013-03-14 05:30:00 0.0110884755383
2013-03-15 05:30:00 -0.173216846788
2013-03-19 05:30:00 -0.211785013352
2013-03-20 05:30:00 -0.891054563968
2013-03-21 05:30:00 -1.27207563599
2013-03-22 05:30:00 -1.28648629004
2013-03-25 05:30:00 -1.5459897419
Note that a) neither file actually has a 3pm record, and b) the two files don't always have records for any given day. (2013-03-08 is missing from the first file, while 2013-03-18 is missing from the second, and the first file ends before the second.) As output, I envision a dataframe like this (perhaps just the date without the time):
datetime value
2013-Feb-28 15:00:00 0.6023831876517
2013-Mar-01 15:00:00 0.302535413774
2013-Mar-04 15:00:00 0.049338170644
2013-Mar-05 15:00:00 0.450441638251
2013-Mar-06 15:00:00 0.7256637127423
2013-Mar-07 15:00:00 0.8616666188879
2013-Mar-08 15:00:00 0.306247682656
2013-Mar-11 15:00:00 0.7966799949027
2013-Mar-12 15:00:00 0.7961531882229
2013-Mar-13 15:00:00 0.8199109582998
2013-Mar-14 15:00:00 0.8140478421913
2013-Mar-15 15:00:00 0.629742519865
2013-Mar-18 15:00:00 0.805413095911
2013-Mar-19 15:00:00 0.596377317988
2013-Mar-20 15:00:00 -0.012142313972
2013-Mar-21 15:00:00 -0.285681713419
2013-Mar-22 15:00:00 -1.28648629004
2013-Mar-25 15:00:00 -1.5459897419
I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do it. Further complicating my thinking, more complex CSV files might have multiple records for a single day (same date, different times).
It seems that I need to either generate a new pair of input dataframes with times at 15:00 and then sum across their value columns keying on just the date, or, during the sum operation, select the record with the greatest time on any given day with time <= 15:00:00. Given that datetime.time objects can't be compared for magnitude, I suspect I might have to group rows by date, then within each group select only the row nearest to (but not after) 3pm. Kind of at that point my brain explodes.
I got nowhere looking at the documentation, as I don't really understand all the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.
First combine your DataFrames:
df3 = pd.concat([df1, df2])
so that everything is in one table. Next, use groupby to sum across timestamps:
df4 = df3.groupby('datetime').sum()
Now df4 has a value column that is the sum of all rows with matching datetime values.
Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at any stage:
filtered = df[df['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]
I'm not sure exactly what you are trying to do; you may need to parse your timestamp columns before filtering.
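For the specific "last value at or before 3pm each day" part, one possible sketch uses Series.asof, which picks the last observation at or before each requested timestamp. The daily_at helper is hypothetical, and the tiny frames stand in for the two CSV files (values taken from the question's 2013-03-06 to 2013-03-08 rows):

```python
import pandas as pd

df1 = pd.DataFrame({'datetime': pd.to_datetime(['2013-03-06 09:30:00',
                                                '2013-03-07 09:30:00']),
                    'value': [0.75834927162, 0.783620442226]})
df2 = pd.DataFrame({'datetime': pd.to_datetime(['2013-03-06 05:00:00',
                                                '2013-03-07 05:00:00',
                                                '2013-03-08 05:00:00']),
                    'value': [-0.0326855588777, 0.0780461766619, 0.306247682656]})

def daily_at(df, hour=15):
    """Last observation at or before `hour`:00 on each day the file covers."""
    s = df.set_index('datetime')['value'].sort_index()
    days = pd.date_range(s.index[0].normalize(), s.index[-1].normalize(), freq='D')
    return s.asof(days + pd.Timedelta(hours=hour))

# align on the 15:00 stamps; a day missing from one file contributes
# the other file's value alone
total = daily_at(df1).add(daily_at(df2), fill_value=0)
print(total)
```

This reproduces the expected output for these days, e.g. 0.7256637127423 on 2013-03-06 and 0.306247682656 on 2013-03-08 (a date absent from the first file).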