Indexing datetime column in pandas - python

I imported a CSV file in Python and then converted the first column to datetime format.
datetime Bid32 Ask32
2019-01-01 22:06:11.699 1.14587 1.14727
2019-01-01 22:06:12.634 1.14567 1.14707
2019-01-01 22:06:13.091 1.14507 1.14647
I have seen three ways of indexing the first column.
df.index = df['datetime']
del df['datetime']
or
df.set_index('datetime', inplace=True)
and
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True)
My question is about the second and third ways. Why do some sources use pd.DatetimeIndex() together with df.set_index() (as in the third snippet) when the second one seems to be enough?

In case you are not converting the 'datetime' column with to_datetime() first:
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index('datetime', inplace=True) # option 2
print(type(df.index))
Result:
pandas.core.indexes.base.Index
vs.
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True) # option 3
print(type(df.index))
Result:
pandas.core.indexes.datetimes.DatetimeIndex
So the third one with pd.DatetimeIndex() makes it an actual datetime index, which is what you want.
Documentation:
pandas.Index
pandas.DatetimeIndex
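For completeness, the second option alone is enough once the column has actually been converted with to_datetime(), since set_index() then receives a datetime64 column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'datetime': ['2019-01-01 22:06:11.699', '2019-01-01 22:06:12.634'],
                   'Bid32': [1.14587, 1.14567],
                   'Ask32': [1.14727, 1.14707]})

# Convert the strings first; then option 2 already yields a DatetimeIndex
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)

print(type(df.index))  # pandas.core.indexes.datetimes.DatetimeIndex
```

So the pd.DatetimeIndex() wrapper in option 3 is only needed when the column still holds strings.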

Related

How to convert a column in a dataframe to an index datetime object?

I have a question about how to convert a column 'Timestamp' into a datetime index, and then drop the column once it's converted.
df = {'Timestamp': ['20/01/2021 01:00:00.12 AM', '20/01/2021 01:00:00.21 AM', '20/01/2021 01:00:01.34 AM'],
      'Value': ['14', '178', '158']}
I tried the following, but it obviously didn't work.
df.Timestamp = pd.to_datetime(df.Timestamp.str[0])
df=df.set_index(['Timestamp'], drop=True)
FYI, the df is actually the result of a lot of text processing, so unfortunately I cannot just do read_csv and parse the datetime there. :( So yes, the df is exactly as described above.
Thank you.
Don't enclose 'Timestamp' in square brackets, and build an actual DataFrame from the dict first.
import pandas as pd

df = pd.DataFrame({'Timestamp': ['20/01/2021 01:00:00.12 AM', '20/01/2021 01:00:00.21 AM', '20/01/2021 01:00:01.34 AM'],
                   'Value': ['14', '178', '158']})
df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)
df = df.set_index('Timestamp')
print(df)
## Output
                         Value
Timestamp
2021-01-20 01:00:00.120     14
2021-01-20 01:00:00.210    178
2021-01-20 01:00:01.340    158
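If you would rather not rely on pandas inferring the day-first order, pd.to_datetime also accepts an explicit format string. A sketch, where the format string is my reading of the sample data (day-first date, 12-hour clock, fractional seconds):

```python
import pandas as pd

s = pd.Series(['20/01/2021 01:00:00.12 AM', '20/01/2021 01:00:01.34 AM'])

# %d/%m/%Y = day-first date, %I ... %p = 12-hour clock, %f = fractional seconds
parsed = pd.to_datetime(s, format='%d/%m/%Y %I:%M:%S.%f %p')
print(parsed.iloc[0])  # 2021-01-20 01:00:00.120000
```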

Filter particular date in a DF column

I want to filter a particular date in a DF column.
My code:
df
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
It is showing no match.
Note: the df column also contains a time along with the date, which needs to be trimmed.
Thanks in advance.
The following script assumes that the 'Crawl Date' column contains strings. The reason your comparison shows no match is a type mismatch: after .dt.date the column holds datetime.date objects, while pd.to_datetime() returns a Timestamp, so call .date() on the comparison value as well:
import pandas as pd

column_names = ["Crawl Date"]
df = pd.DataFrame(columns=column_names)

# Populate dataframe with dates
df.loc[0] = ['03-21-2020 23:45:57']
df.loc[1] = ['03-22-2020 23:12:33']

df["Crawl Date"] = pd.to_datetime(df["Crawl Date"]).dt.date
date = pd.to_datetime("03-21-2020").date()  # .date() so both sides are datetime.date
df = df[df["Crawl Date"] == date]
Then df returns:
   Crawl Date
0  2020-03-21
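An alternative that avoids mixing datetime.date and Timestamp objects entirely is to keep the column as datetime64 and compare against the normalized (midnight) values. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Crawl Date': ['03-21-2020 23:45:57', '03-22-2020 23:12:33']})
df['Crawl Date'] = pd.to_datetime(df['Crawl Date'])

# dt.normalize() zeroes out the time part but keeps datetime64 dtype,
# so the elementwise comparison with a Timestamp just works
mask = df['Crawl Date'].dt.normalize() == pd.to_datetime('03-21-2020')
print(df[mask])
```

This also keeps the column usable for further datetime operations, which .dt.date (object dtype) does not.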

Find the same date from two sets of data

I am new to Python. I have two sets of data, shown below.
Set 1:
Gmt time,Open,High,Low,Close,Volume,RSI,,Change,Gain,Loss,Avg Gain,Avg Loss,RS
15.06.2017 00:00:00.000,0.75892,0.76313,0.7568,0.75858,107799.5406,0,,,,,,,
16.06.2017 00:00:00.000,0.75857,0.76294,0.75759,0.76202,94367.4299,0,,0.00344,0.00344,0,,,
18.06.2017 00:00:00.000,0.76202,0.76236,0.76152,0.76188,5926.0998,0,,-0.00014,0,0.00014,,,
19.06.2017 00:00:00.000,0.76189,0.76289,0.75848,0.75902,87514.849,0,,-0.00286,0,0.00286,,,
...
Set 2:
Gmt time,Open,High,Low,Close,Volume
15.06.2017 00:00:00.000,0.75892,0.75933,0.75859,0.75883,4777.4702
15.06.2017 01:00:00.000,0.75885,0.76313,0.75833,0.76207,7452.5601
15.06.2017 02:00:00.000,0.76207,0.76214,0.76106,0.76143,4798.4102
15.06.2017 03:00:00.000,0.76147,0.76166,0.76015,0.76154,4961.4502
15.06.2017 04:00:00.000,0.76154,0.76162,0.76104,0.76121,2977.6399
15.06.2017 05:00:00.000,0.7612,0.76154,0.76101,0.76151,3105.4399
...
I want to find lines in Set 2 in the same date with Set 1. I tried this: print(daily['Gmt time'][0].date == hourly['Gmt time'][0].date), but I don't know why it came out False. Isn't there a way to compare the date(just date, not including time) from two sets of data?
First read the data sets into dataframes:
import pandas as pd

df_one = pd.read_csv('data_set_one.csv', index_col=False)
df_two = pd.read_csv('data_set_two.csv', index_col=False)
Convert the date columns to plain dates:
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
Now compare both dataframes:
for i, row in df_one.iterrows():
    df_one_date = row['Gmt date']
    print('df_one_date', df_one_date)
    print(df_two[df_two['Gmt date'] == df_one_date])
    print('----')
It's still unclear how you want to handle dates from df_one that match df_two, but hopefully this gives you enough of an idea.
Since using iterrows can be slow, a better option might be to use merge.
import pandas as pd
# load data
df_one = pd.read_csv('data_set_one.csv', index_col=False)
df_two = pd.read_csv('data_set_two.csv', index_col=False)
# convert times to datetime and then strip off the time to leave the date
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
# merge
# selecting only the date column in each dataframe for clarity;
# reset_index() carries df_two's original row labels through the merge
df_merge = df_two[['Gmt date']].reset_index().merge(df_one[['Gmt date']].drop_duplicates(), on='Gmt date', how='inner')

# get list of indices from df_two where dates exist in both frames
ix = list(df_merge['index'].unique())
print(ix)
# [0, 1, 2, 3, 4, 5]
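If all you need is the row positions of the hourly frame whose dates also occur in the daily frame, Series.isin is an even simpler alternative to the merge. A sketch with inline sample data:

```python
import pandas as pd

daily = pd.DataFrame({'Gmt time': ['15.06.2017 00:00:00.000', '16.06.2017 00:00:00.000']})
hourly = pd.DataFrame({'Gmt time': ['15.06.2017 01:00:00.000', '17.06.2017 02:00:00.000']})

daily_dates = pd.to_datetime(daily['Gmt time'], dayfirst=True).dt.date
hourly_dates = pd.to_datetime(hourly['Gmt time'], dayfirst=True).dt.date

# Keep only the hourly rows whose date appears among the daily dates
ix = hourly.index[hourly_dates.isin(daily_dates)].tolist()
print(ix)  # [0]
```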

Upsample data and interpolate

I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and interpolate between the points. The interpolation part is straightforward; the reindexing part, on the other hand, is a bit tricky, at least for me.
If I use the DataFrame.reindex() method, it simply erases all the entries from the dataframe. I have tried doing it manually, using .loc[] to create new NaN entries between each pair of consecutive months, but that overwrites the entries I already have.
Any clue how to do it? Thanks!
I have to assume a start date, I chose 2009-12-31.
To get resample to work, you need a pd.DatetimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Reproducible code:
from io import StringIO
import pandas as pd

text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""

df = pd.read_csv(StringIO(text), decimal=',', sep=r'\s+')
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
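To see what the resample/interpolate pair produces, here is a tiny sketch with just two month-end points (the start date is, again, an assumption):

```python
import pandas as pd

df = pd.DataFrame({'Col_1': [0.121, 0.231]},
                  index=pd.to_datetime(['2010-01-31', '2010-02-28']))

# Weekly bins between the two month-ends, gaps filled by linear interpolation
weekly = df.resample('W').interpolate()
print(weekly)
```

Both dates fall on Sundays, so resampling at the default 'W' (week ending Sunday) gives five rows, with the three intermediate weeks linearly interpolated between 0.121 and 0.231.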

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as a character (e.g. 'NOV') for their numerical counterparts ('-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas TimeStamps like this:
import pandas as pd

df = pd.read_csv('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as TimeStamps is the ideal
form for the data in Pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally, in order to use Series.str.replace you need a column with the month string abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar

month_map = {calendar.month_abbr[m].upper(): '-{:02d}-'.format(m)
             for m in range(1, 13)}

df = pd.read_csv('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware
that if your data is in a file, you can parse the datestring columns into
TimeStamps directly using
import pandas as pd

df = pd.read_csv('data', sep=r'\s+', parse_dates=[1, 2],
                 date_format='%d%b%y:%H:%M:%S')
This might be the most convenient method of all (assuming you want TimeStamps).
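Incidentally, the reason the original replace() call left the data unchanged is that with plain string keys it only replaces entire cell values. Passing regex=True makes it substitute matching substrings instead, so the dictionary approach from the question can work directly. A sketch (dictionary abbreviated for brevity):

```python
import pandas as pd

month_map = {'JAN': '-01-', 'NOV': '-11-', 'DEC': '-12-'}

s = pd.Series(['11NOV14:13:12:56', '04DEC14:17:46:14'])
# regex=True treats the dict keys as patterns and replaces matching substrings
s = s.replace(month_map, regex=True)
print(s.tolist())  # ['11-11-14:13:12:56', '04-12-14:17:46:14']
```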
