Convert column of date objects in Pandas DataFrame to strings - python

How do I convert a column consisting of datetime64 objects to strings that would read
01-11-2013 for today's date of November 1?
I have tried
df['DateStr'] = df['DateObj'].strftime('%d%m%Y')
but I get this error
AttributeError: 'Series' object has no attribute 'strftime'

As of version 0.17.0, you can format with the dt accessor:
df['DateStr'] = df['DateObj'].dt.strftime('%d%m%Y')
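A minimal end-to-end sketch, assuming the column starts out as strings and is parsed with pd.to_datetime first (using '%d-%m-%Y' to get the dashed 01-11-2013 form from the question):
import pandas as pd

# hypothetical frame; 'DateObj' starts out as strings
df = pd.DataFrame({'DateObj': ['2013-11-01', '2013-11-02']})
df['DateObj'] = pd.to_datetime(df['DateObj'])           # ensure datetime64[ns] dtype
df['DateStr'] = df['DateObj'].dt.strftime('%d-%m-%Y')   # '01-11-2013', '02-11-2013'
print(df)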

In [6]: df = DataFrame(dict(A = date_range('20130101',periods=10)))
In [7]: df
Out[7]:
A
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 2013-01-04 00:00:00
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 2013-01-08 00:00:00
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
In [8]: df['A'].apply(lambda x: x.strftime('%d%m%Y'))
Out[8]:
0 01012013
1 02012013
2 03012013
3 04012013
4 05012013
5 06012013
6 07012013
7 08012013
8 09012013
9 10012013
Name: A, dtype: object

strftime works directly if you first set the column as the index: you are then calling it on a DatetimeIndex rather than a Series.
df = df.set_index('DateObj').copy()
df['DateStr'] = df.index.strftime('%d%m%Y')

How to select an item by its ID and not by its index position [duplicate]

I have a pandas dataframe:
import pandas as pnd
d = pnd.Timestamp('2013-01-01 16:00')
dates = pnd.bdate_range(start=d, end = d+pnd.DateOffset(days=10), normalize = False)
df = pnd.DataFrame(index=dates, columns=['a'])
df['a'] = 6
print(df)
a
2013-01-01 16:00:00 6
2013-01-02 16:00:00 6
2013-01-03 16:00:00 6
2013-01-04 16:00:00 6
2013-01-07 16:00:00 6
2013-01-08 16:00:00 6
2013-01-09 16:00:00 6
2013-01-10 16:00:00 6
2013-01-11 16:00:00 6
I am interested in finding the integer location of one of the labels, say,
ds = pnd.Timestamp('2013-01-02 16:00')
Looking at the index values, I know that the integer location of this label is 1. How can I get pandas to tell me what the integer location of this label is?
You're looking for the index method get_loc:
In [11]: df.index.get_loc(ds)
Out[11]: 1
Get dataframe integer index given a date key:
>>> import pandas as pd
>>> df = pd.DataFrame(index=pd.date_range("2008-01-01", "2008-01-05"),
...                   columns=("foo", "bar"))
>>> df["foo"] = [10,20,40,15,10]
>>> df["bar"] = [100,200,40,-50,-38]
>>> df
foo bar
2008-01-01 10 100
2008-01-02 20 200
2008-01-03 40 40
2008-01-04 15 -50
2008-01-05 10 -38
>>> df.index.get_loc(df["bar"].argmax())
1
>>> df.index.get_loc(df["foo"].argmax())
2
In column bar, the integer position of the maximum value is 1
In column foo, the integer position of the maximum value is 2
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_loc.html
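Note that Series.argmax has changed behaviour across pandas versions (newer versions return the integer position directly, while older ones returned the label), so a version-safe sketch goes through idxmax, which always returns the index label:
>>> df.index.get_loc(df["foo"].idxmax())  # idxmax gives the label, get_loc maps it to a position
2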
get_loc can be used for rows and columns according to:
import pandas as pnd
d = pnd.Timestamp('2013-01-01 16:00')
dates = pnd.bdate_range(start=d, end = d+pnd.DateOffset(days=10), normalize = False)
df = pnd.DataFrame(index=dates)
df['a'] = 5
df['b'] = 6
print(df.head())
a b
2013-01-01 16:00:00 5 6
2013-01-02 16:00:00 5 6
2013-01-03 16:00:00 5 6
2013-01-04 16:00:00 5 6
2013-01-07 16:00:00 5 6
#for rows
print(df.index.get_loc('2013-01-01 16:00:00'))
0
#for columns
print(df.columns.get_loc('b'))
1
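As a quick sketch of where those integer positions come in handy, they can be fed straight to .iloc (same df as above):
print(df.iloc[df.index.get_loc('2013-01-01 16:00:00'), df.columns.get_loc('b')])
6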
Because get_loc returns a mask rather than a list of integer index locations when there are multiple instances of the key in the index, I was toying with an answer using reset_index():
# Add a duplicate!!!
dup = pd.Timestamp('2013-01-07 16:00')
df = pd.concat([df, pd.DataFrame([7], columns=['a'], index=[dup])])
df
a
2013-01-01 16:00:00 6
2013-01-02 16:00:00 6
2013-01-03 16:00:00 6
2013-01-04 16:00:00 6
2013-01-07 16:00:00 6
2013-01-08 16:00:00 6
2013-01-09 16:00:00 6
2013-01-10 16:00:00 6
2013-01-11 16:00:00 6
2013-01-07 16:00:00 7
# Only use this method if the key has duplicates
if df.loc[dup].index.has_duplicates:
    df.reset_index().loc[df.index.get_loc(dup)].index.to_list()
[4, 9]
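As a simpler sketch for the duplicate-key case, comparing the index against the key and using numpy.flatnonzero yields the integer positions directly (assumes the same df with the duplicated timestamp appended):
import numpy as np
np.flatnonzero(df.index == dup)  # integer positions of every row labelled dup
array([4, 9])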

resample dataframe for every hour

I want to resample the data in the SMS, CALL and INTERNET columns, replacing the values with their mean for every hour.
Code 1 tried:
df1.reset_index().set_index('TIME').resample('1H').mean()
error: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Code 2 tried:
df1['TIME'] = pd.to_datetime(data['TIME'])
df1.CALL.resample('60min', how='mean')
error: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Dataframe:
ID TIME SMS CALL INTERNET
0 1 2013-11-30 23:00:00 0.277204 0.273629 13.674575
1 1 2013-11-30 23:10:00 0.341536 0.058176 13.330858
2 1 2013-11-30 23:20:00 0.379427 0.054601 11.329552
3 1 2013-11-30 23:30:00 0.600781 0.218489 13.166163
4 1 2013-11-30 23:40:00 0.405565 0.134176 13.347791
5 1 2013-11-30 23:50:00 0.187700 0.080738 12.434744
6 1 2013-12-01 00:00:00 0.282651 0.135964 13.860353
7 1 2013-12-01 00:10:00 0.109826 0.056388 12.583463
8 1 2013-12-01 00:20:00 0.348638 0.053438 12.644995
9 1 2013-12-01 00:30:00 0.138375 0.054062 12.251733
10 1 2013-12-01 00:40:00 0.054062 0.163803 11.292642
df1.dtypes
ID int64
TIME object
SMS float64
CALL float64
INTERNET float64
dtype: object
You can use the on parameter in resample:
on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
df1['TIME'] = pd.to_datetime(df1['TIME'])
df = df1.resample('60min', on='TIME').mean()
print (df)
ID SMS CALL INTERNET
TIME
2013-11-30 23:00:00 1 0.365369 0.136635 12.880614
2013-12-01 00:00:00 1 0.186710 0.092731 12.526637
Or add set_index to get a DatetimeIndex:
df1['TIME'] = pd.to_datetime(df1['TIME'])
df = df1.set_index('TIME').resample('60min').mean()
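An equivalent sketch using pd.Grouper, which also avoids setting the index (assumes the same df1 with a 'TIME' column):
df1['TIME'] = pd.to_datetime(df1['TIME'])
df = df1.groupby(pd.Grouper(key='TIME', freq='60min')).mean()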

Pandas: Adding varying numbers of days to a date in a dataframe

I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I am having trouble finding something that will work for a Series rather than an individual int value. Any and all help would be very much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add the two:
In [174]:
df = pd.DataFrame({'Date_Min':[dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)], 'Mean_Days':[1,2,3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
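pd.to_timedelta does the same job and reads a little more directly here; a sketch against the question's merged frame, assuming 'Date_Min' is already datetime64[ns]:
merged['Recency_Date'] = merged['Date_Min'] + pd.to_timedelta(merged['Mean_Days'], unit='D')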

Pandas time series indexing -- re

I have a pandas dataframe indexed by time:
>>> dframe.head()
aw_FATFREEMASS raw aw_FATFREEMASS sym
TIMESTAMP
2011-12-08 23:13:23 139.3 H
2011-12-08 23:12:18 139.2 H
2011-12-08 22:31:53 139.2 H
2011-12-09 07:08:50 138.2 H
2011-12-10 21:36:20 137.6 H
[5 rows x 2 columns]
>>> type(dframe.index)
<class 'pandas.tseries.index.DatetimeIndex'>
I'm trying to do a simple time series query similar to this SQL:
SELECT * FROM dframe WHERE tstart <= TIMESTAMP <= tend
where tstart and tend are appropriately represented timestamps. With pandas I'm getting behavior I just don't understand.
This does what I expect:
>>> dframe['2011-11-01' : '2011-11-20']
Empty DataFrame
Columns: [aw_FATFREEMASS raw, aw_FATFREEMASS sym]
Index: []
[0 rows x 2 columns]
This does the same thing:
dframe['2011-11-01 00:00:00' : '2011-11-20 00:00:00']
However:
>>> from dateutil.parser import parse
>>> dframe[parse('2011-11-01 00:00:00') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : parse('2011-11-01')]
*** KeyError: Timestamp('2011-11-01 00:00:00', tz=None)
When I provide a time represented as a pandas Timestamp I get slice behavior I don't understand. Can someone explain this behavior and/or tell me how I can achieve the SQL query above?
docs are here
This is called partial string indexing. In a nutshell, providing a string gets you results that 'match', i.e. fall inside the specified interval, while if you specify a Timestamp/datetime the lookup is exact; it HAS to be in the index.
Can you show how you constructed the DatetimeIndex?
What version of pandas are you using?
In [4]: df = DataFrame(np.random.randn(20,2),index=date_range('20130101',periods=20,freq='H'))
In [5]: df
Out[5]:
0 1
2013-01-01 00:00:00 -0.339751 1.223660
2013-01-01 01:00:00 0.525203 -0.987815
2013-01-01 02:00:00 1.724239 0.213446
2013-01-01 03:00:00 -0.074797 -1.658876
2013-01-01 04:00:00 0.483425 -2.112314
2013-01-01 05:00:00 0.094140 0.327681
2013-01-01 06:00:00 -1.265337 -0.858521
2013-01-01 07:00:00 -1.470041 0.168871
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
2013-01-01 11:00:00 1.070209 0.720150
2013-01-01 12:00:00 0.887571 -0.611207
2013-01-01 13:00:00 1.669451 -0.022434
2013-01-01 14:00:00 -1.796565 -1.186899
2013-01-01 15:00:00 0.417758 0.082021
2013-01-01 16:00:00 -1.064019 -0.377208
2013-01-01 17:00:00 0.939902 0.430784
2013-01-01 18:00:00 -0.645667 1.611992
2013-01-01 19:00:00 -0.172148 -1.725041
[20 rows x 2 columns]
In [6]: df['20130101 7:00:01':'20130101 10:00:00']
Out[6]:
0 1
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
[3 rows x 2 columns]
In [7]: df.index
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 19:00:00]
Length: 20, Freq: H, Timezone: None
If you already have Timestamps/datetimes, then just construct a boolean expression:
df[(df.index > Timestamp('20130101 10:00:00')) & (df.index < Timestamp('20130101 17:00:00'))]
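Alternatively, label slicing with .loc accepts Timestamps and, on a sorted DatetimeIndex, gives the inclusive tstart <= TIMESTAMP <= tend behaviour of the SQL query (a sketch, assuming the index has been sorted with sort_index first):
df.loc[Timestamp('20130101 10:00:00'):Timestamp('20130101 17:00:00')]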

Pandas fillna on datetime object

I'm trying to run fillna on a column of type datetime64[ns]. When I run something like:
df['date'].fillna(datetime("2000-01-01"))
I get:
TypeError: an integer is required
Any way around this?
This should work in 0.12 and 0.13 (just released).
@DSM points out that datetimes are constructed like datetime.datetime(2012,1,1),
so the error is from failing to construct the value that you are passing to fillna.
Note that using a Timestamp WILL parse the string.
In [3]: s = Series(date_range('20130101',periods=10))
In [4]: s.iloc[3] = pd.NaT
In [5]: s.iloc[7] = pd.NaT
In [6]: s
Out[6]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 NaT
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 NaT
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
dtype: datetime64[ns]
datetime.datetime will work as well
In [7]: s.fillna(Timestamp('20120101'))
Out[7]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 2012-01-01 00:00:00
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 2012-01-01 00:00:00
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
dtype: datetime64[ns]
Right now, df['date'].fillna(pd.Timestamp("20210730")) works in pandas 1.3.1
This example works if you want to replace NaT values in one column with values from another datetime column:
df['column_with_NaT'].fillna(df['dt_column_with_thesame_index'], inplace=True)
It worked for me when I had updated some rows in a datetime column, the rows that were not updated still had NaT, and I needed to carry over the old series data; the code above solved that.
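A self-contained sketch of that pattern, with hypothetical column names:
import pandas as pd

df = pd.DataFrame({
    'date':   pd.to_datetime(['2021-01-01', None, '2021-01-03']),
    'backup': pd.to_datetime(['2020-12-31', '2021-01-02', '2021-01-04']),
})
df['date'] = df['date'].fillna(df['backup'])  # NaT in 'date' takes the value from 'backup'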
