How to index DateTime in Pandas dataframe - python

I have a few columns where some values are DateTime and some are simply the year. I'd like to index into the datetime values so that I = I[:4] gives me the year for variable I while looping through the column, instead of raising the error 'datetime.datetime' object is not subscriptable. Essentially, in a column that mixes ints whose values are already just the year with datetime instances, I'd like to drop everything from each datetime except the year.

I am not exactly sure what you are trying to do and unfortunately I cannot post comments (yet).
From what I can guess, you have a DataFrame with a column of mixed types, and you want to convert all values of type datetime to int. As an example, here is a DataFrame:
>>> from datetime import datetime
>>> import pandas as pd
>>> data = [[1, 1990], [2, datetime(1991, 1, 1)]]
>>> df = pd.DataFrame(data, columns=['id', 'time'])
>>> df
   id                 time
0   1                 1990
1   2  1991-01-01 00:00:00
You can use map to convert the second column:
>>> df.loc[:,'time'] = df.loc[:,'time'].map(lambda x: x.year if isinstance(x, datetime) else x)
>>> df
   id  time
0   1  1990
1   2  1991
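
Note that a column mixing ints and datetimes has object dtype, and it keeps that dtype even after the map. If you then want a proper integer column, a minimal follow-up (assuming every remaining value is year-like) is:
>>> df['time'] = df['time'].astype(int)
>>> df.dtypes
id      int64
time    int64
dtype: object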

Related

Add missing dates to datetime column in Pandas using last value

I've already checked out Add missing dates to pandas dataframe, but I don't want to fill in the new dates with a generic value.
My dataframe looks more or less like this:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
06/01/2000         d
So in this example, days 04/01/2000 and 05/01/2000 are missing. What I want to do is to insert them before the 6th, with a value of c, the last value before the missing days. So the "correct" df should look like:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
04/01/2000         c
05/01/2000         c
06/01/2000         d
There are multiple instances of missing dates, and it's a large df (~9000 rows).
Thanks for your time! :)
try this:
# If your date format is dayfirst, then use the following code
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
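For reference, here is a self-contained run on the question's sample data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'date (dd/mm/yyyy)': ['01/01/2000', '02/01/2000', '03/01/2000', '06/01/2000'],
    'value': ['a', 'b', 'c', 'd'],
})
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)  # 04/01 and 05/01 appear with value 'c', forward-filled from 03/01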
Assuming that your dates are drawn at a regular frequency, you can generate a DatetimeIndex with date_range, keep only the dates that are not already in your date column, create a dataframe to concatenate with NaN in the value column, and fill the gaps using the back- or forward-fill method.
# assuming your dataframe is df:
import numpy as np
import pandas as pd

all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')
known_dates = set(df.date.to_list())  # set is blazing fast on `in` compared with a list
unknown_dates = all_dates[~all_dates.isin(known_dates)]
df2 = pd.DataFrame({'date': unknown_dates})
df2['value'] = np.nan
df = pd.concat([df, df2])
df = df.sort_values('date').ffill()
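The same result can also be had in a single chain with reindex instead of concat (a sketch, reusing all_dates from above and assuming 'date' is a column, not yet the index):
df = (df.set_index('date')
        .reindex(all_dates)   # missing dates appear as NaN rows
        .ffill()              # fill them from the previous known value
        .rename_axis('date')
        .reset_index())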

Bug/Feature for pandas where a multi-indexed dataframe filtered by date returns all the unfiltered dates when extracting the date index level

This is easiest to explain by code, so here goes - imagine the commands in ipython/jupyter notebooks:
from io import StringIO
import pandas as pd
test = StringIO("""Date,Ticker,x,y
2008-10-23,A,0,10
2008-10-23,B,1,11
2008-10-24,A,2,12
2008-10-24,B,3,13
2008-10-25,A,4,14
2008-10-25,B,5,15
2008-10-26,A,6,16
2008-10-26,B,7,17
""")
# Multi-index by Date and Ticker
df = pd.read_csv(test, index_col=[0, 1], parse_dates=True)
df
# Output to the command line
                   x   y
Date       Ticker
2008-10-23 A       0  10
           B       1  11
2008-10-24 A       2  12
           B       3  13
2008-10-25 A       4  14
           B       5  15
2008-10-26 A       6  16
           B       7  17
ts = pd.Timestamp(2008, 10, 25)
# Filter the data by Date >= ts
filtered_df = df.loc[ts:]
# output the filtered data
filtered_df
                   x   y
Date       Ticker
2008-10-25 A       4  14
           B       5  15
2008-10-26 A       6  16
           B       7  17
# Get all the level 0 data (i.e. the dates) in the filtered dataframe
dates = filtered_df.index.levels[0]
# output the dates in the filtered dataframe:
dates
DatetimeIndex(['2008-10-23', '2008-10-24', '2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
# WTF!!!??? This was ALL of the dates in the original dataframe - I asked for the dates in the filtered dataframe!
# The correct output should have been:
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
So clearly, when one filters a multi-indexed dataframe, the index of the filtered dataframe retains all of the index values of the original dataframe, but only displays the used ones when viewing the entire dataframe. However, when extracting a single index level, as I did above, the entire set of level values, including the unused ones, is returned - which looks like a bug (or a feature, somehow?).
This is actually explained in the MultiIndex's User Guide (emphasis added):
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ... This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In your case:
>>> filtered_df.index.get_level_values(0)
DatetimeIndex(['2008-10-25', '2008-10-25', '2008-10-26', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
This reflects only the dates actually present in the filtered dataframe, which is what you expected (note each date repeats once per Ticker row).
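If you want the de-duplicated dates instead, you can chain unique(), or drop the stale levels from the index first (remove_unused_levels is available in pandas 0.20+):
>>> filtered_df.index.get_level_values(0).unique()
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
>>> filtered_df.index.remove_unused_levels().levels[0]
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)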

reindex to add missing dates to pandas dataframe

I am trying to parse a CSV file that looks like this:
dd.mm.yyyy  value
01.01.2000      1
02.01.2000      2
01.02.2000      3
I need to add the missing dates and fill the corresponding values with NaN. I used Series.reindex as in this question:
import pandas as pd
ts=pd.read_csv(file, sep=';', parse_dates='True', index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in the result, the values for certain dates are swapped because of the date format (i.e. mm/dd instead of dd/mm):
01.01.2000      1
02.01.2000      3
03.01.2000    NaN
...
31.01.2000    NaN
01.02.2000      2
I tried several ways (i.e. add dayfirst=True to read_csv) to do it right but still can't figure it out. Please, help.
Set parse_dates to the first column with parse_dates=[0], and pass dayfirst=True so dd.mm.yyyy is parsed correctly:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)  # rows for missing dates are filled with NaN by default
print(ts)
prints:
            value
2000-01-01    1.0
2000-01-02    2.0
2000-01-03    NaN
...
2000-01-31    NaN
2000-02-01    3.0
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
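Note that dayfirst=True is what actually resolves the dd/mm ambiguity; without it pandas prefers a month-first reading whenever both are possible. A minimal demonstration:
>>> import pandas as pd
>>> pd.to_datetime('01.02.2000')
Timestamp('2000-01-02 00:00:00')
>>> pd.to_datetime('01.02.2000', dayfirst=True)
Timestamp('2000-02-01 00:00:00')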

Slice by date in pandas without re-indexing

I have a pandas dataframe where one of the columns is made up of strings representing dates, which I then convert to pandas timestamps using pd.to_datetime().
How can I select the rows in my dataframe that meet conditions on the date?
I know you can use the index (like in this question) but my timestamps are not unique.
How can I select the rows where the 'Date' field is say, after 2015-03-01?
You can use a mask on the date, e.g.
df[df['date'] > '2015-03-01']
Here is a full example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range('2015-02-15', periods=5, freq='W'),
...                    'val': np.random.random(5)})
>>> df
        date       val
0 2015-02-15  0.638522
1 2015-02-22  0.942384
2 2015-03-01  0.133111
3 2015-03-08  0.694020
4 2015-03-15  0.273877
>>> df[df.date > '2015-03-01']
        date       val
3 2015-03-08  0.694020
4 2015-03-15  0.273877
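For a bounded window you can combine two comparisons, or use Series.between, which includes both endpoints by default (a sketch on the same frame):
>>> df[df['date'].between('2015-02-20', '2015-03-08')]
        date       val
1 2015-02-22  0.942384
2 2015-03-01  0.133111
3 2015-03-08  0.694020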

Calculating date_range over GroupBy object in pandas

I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so max minus min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
                            ptp
page
%2F                           0
/             13325984000000000
/-509606456     297697000000000
/-511484155                   0
/-511616154                   0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
    ...:                   columns=list('ab'),
    ...:                   index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
                     a  b
2013-08-22 00:00:00  1  2
2013-08-22 01:00:00  1  3
2013-08-22 02:00:00  2  4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps, we simply subtract one from the other (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
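Equivalently, you can aggregate the 'date' column directly without apply; this returns a Series of timedelta64 values keyed by page (names assumed from the question):
daterange = df.groupby('page')['date'].agg(lambda s: s.max() - s.min())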
