Python: Converting datetime to ordinal

I have a list (actually a column in a pandas DataFrame, if that matters) of Timestamps and I'm trying to convert every element of the list to ordinal format. So I run a for loop through the list (is there a faster way?) and use:
import datetime as dt
a = a.toordinal()
or
import datetime as dt
a = dt.datetime.toordinal(a)
However, the following happened (simplified):
In [1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In [2]: b = dt.datetime.toordinal(a)
In [3]: b
Out[3]: 737418
In [4]: a = b
In [5]: a
Out[5]: Timestamp('1970-01-01 00:00:00.000737418')
The result makes absolutely no sense to me. What I was obviously trying to get is:
In [1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In [2]: b = dt.datetime.toordinal(a)
In [3]: b
Out[3]: 737418
In [4]: a = b
In [5]: a
Out[5]: 737418
What went wrong?
[console output screenshot]

What went wrong?
Your question is a bit misleading; the screenshot shows what is actually going on.
Normally, when you write
a = b
in Python, it binds the name a to the object bound to b. In that case you will have
id(a) == id(b)
In your case, however, contrary to your question, you're actually doing the item assignment
a[0] = b
This calls a method of a (its __setitem__), assigning b at index 0. What happens then is up to the object's class. Here, specifically, a is a pandas.Series, and it coerces the assigned object to conform to its dtype.
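A minimal sketch of that coercion, assuming an older pandas where integer item assignment into a datetime64[ns] Series is silently coerced (recent versions warn and upcast to object instead):
import pandas as pd

s = pd.Series([pd.Timestamp('2019-12-25')])  # dtype: datetime64[ns]
s[0] = 737418   # item assignment: the int is coerced to the Series dtype,
                # i.e. read as nanoseconds since the epoch
print(s[0])     # Timestamp('1970-01-01 00:00:00.000737418')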

Please don't loop. It's not necessary.
#!/usr/bin/env python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'dates': [datetime(1990, 4, 28),
                             datetime(2018, 4, 13),
                             datetime(2017, 11, 4)]})
print(df)
print(df['dates'].dt.day_name())  # .dt.weekday_name was removed in pandas 1.0
print(df['dates'].dt.weekday)
print(df['dates'].dt.month)
print(df['dates'].dt.year)
gives the dataframe:
       dates
0 1990-04-28
1 2018-04-13
2 2017-11-04
And the printed values
0 Saturday
1 Friday
2 Saturday
Name: dates, dtype: object
0 5
1 4
2 5
Name: dates, dtype: int64
0 4
1 4
2 11
Name: dates, dtype: int64
0 1990
1 2018
2 2017
Name: dates, dtype: int64
For toordinal, you need to "loop" with apply:
print(df['dates'].apply(lambda x: x.toordinal()))
gives the following pandas series
0 726585
1 736797
2 736637
Name: dates, dtype: int64
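Since the question also asks whether there is a faster way: pandas has no vectorized toordinal, but for naive datetime64[ns] values the ordinal can be computed arithmetically; a sketch, where 719163 is datetime(1970, 1, 1).toordinal():
import pandas as pd
from datetime import datetime

dates = pd.Series([datetime(1990, 4, 28), datetime(2018, 4, 13)])

NS_PER_DAY = 86_400_000_000_000
# nanoseconds since the epoch, floored to whole days, shifted to ordinals
ordinals = dates.astype('int64') // NS_PER_DAY + 719163
print(ordinals.tolist())  # [726585, 736797]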

Related

Not able to understand the use of .mode() in python

I have a requirement where I need to find the most popular start hour. Following is the code that helped me reach the correct solution.
import time
import pandas as pd
import numpy as np
# bunch of code comes
# here
# that help in reaching the following steps
df = pd.read_csv(CITY_DATA[selected_city])
# convert the Start Time column to datetime
df['Start Time'] = pd.to_datetime(df['Start Time'])
# extract hour from the Start Time column to create an hour column
df['hour'] = df['Start Time'].dt.hour
# extract month and day of week from Start Time to create new columns
df['month'] = df['Start Time'].dt.month
df['day_of_week'] = df['Start Time'].dt.day_name()  # weekday_name was removed in pandas 1.0
# find the most popular hour
popular_hour = df['hour'].mode()[0]
Here is a sample output I get when I run
print(df['hour'])
0 15
1 17
2 8
3 13
4 14
5 9
6 9
7 17
8 16
9 17
10 7
11 17
Name: hour, Length: 300000, dtype: int64
The output I get when I use
print(type(df['hour']))
<class 'pandas.core.series.Series'>
The value of the most popular start hour is stored in popular_hour and is equal to 17 (the correct value).
However, I am not able to understand the .mode()[0] part.
What does .mode() do, and why the [0]?
And will the same approach work for calculating the most popular month and most popular day of the week, irrespective of their datatypes?
mode returns a Series:
df['hour'].mode()
0    17
dtype: int64
From this, you take the first item by calling
df['hour'].mode()[0]
17
Note that a Series is always returned; if there are multiple modes, they are all returned:
pd.Series([1, 1, 2, 2, 3, 3]).mode()
0 1
1 2
2 3
dtype: int64
You would still take the first value each time and discard the rest. Note that when multiple modes are returned, they are always sorted.
Read the documentation on mode for more info.
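As for the last part of the question: mode works the same way regardless of dtype, so the .mode()[0] pattern applies to the month and day-of-week columns too. A minimal sketch:
import pandas as pd

days = pd.Series(['Monday', 'Friday', 'Friday', 'Sunday'])
print(days.mode()[0])  # Friday -- mode also works on object/string columns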

Need to convert entire column from string format to date format from a Dataframe

I am trying to convert an entire column from a Dataframe from a string to date format. Does anyone have any recommendations on how to do this?
I've successfully been able to change one element of this using strptime, but not sure the best way to apply this to the entire column:
Sample of the raw Data:
2016/06/28 02:13:51
In:
Var1 = Dataframename['columnname'][0]
temp = datetime.datetime.strptime(Var1, '%Y/%m/%d %H:%M:%S')
temp
Out:
datetime.datetime(2016, 6, 28, 2, 13, 51)
I think you are looking for this:
import datetime
data['colname'] = data['colname'].apply(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d %H:%M:%S'))
You can use to_datetime:
df = pd.DataFrame({'b': ['2016/06/28 02:13:51', '2016/06/28 02:13:51', '2016/06/28 02:13:51'],
                   'a': [4, 5, 6]})
print(df)
a b
0 4 2016/06/28 02:13:51
1 5 2016/06/28 02:13:51
2 6 2016/06/28 02:13:51
df['b'] = pd.to_datetime(df.b)
print(df)
a b
0 4 2016-06-28 02:13:51
1 5 2016-06-28 02:13:51
2 6 2016-06-28 02:13:51
print(df.dtypes)
a int64
b datetime64[ns]
dtype: object
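If the format is fixed and the column is large, passing an explicit format to to_datetime is typically faster and stricter than letting it infer one; a sketch using the sample's slash-separated format:
import pandas as pd

df = pd.DataFrame({'b': ['2016/06/28 02:13:51', '2016/06/28 02:13:51']})
# An explicit format skips inference and raises on malformed rows
df['b'] = pd.to_datetime(df['b'], format='%Y/%m/%d %H:%M:%S')
print(df['b'].dtype)  # datetime64[ns]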

How to convert datetime object to milliseconds

I am parsing datetime values as follows:
df['actualDateTime'] = pd.to_datetime(df['actualDateTime'])
How can I convert this datetime objects to milliseconds?
I didn't see mention of milliseconds in the doc of to_datetime.
Update (Based on feedback):
This is the current version of the code, which produces the error TypeError: Cannot convert input to Timestamp. The column Date3 must contain milliseconds (as a numeric equivalent of a datetime object).
import pandas as pd
import time
s1 = {'Date' : ['2015-10-20T07:21:00.000','2015-10-19T07:18:00.000','2015-10-19T07:15:00.000']}
df = pd.DataFrame(s1)
df['Date2'] = pd.to_datetime(df['Date'])
t = pd.Timestamp(df['Date2'])
df['Date3'] = time.mktime(t.timetuple())
print(df)
You can try pd.to_datetime(df['actualDateTime'], unit='ms')
The documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
says unit denotes the epoch unit, with variants 's', 'ms', 'ns', and so on.
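Note that unit applies when the input is numeric: it tells to_datetime how to read numbers as offsets from the epoch, rather than changing the output. For example:
import pandas as pd

# 1445325660000 is interpreted as milliseconds since the epoch
print(pd.to_datetime(1445325660000, unit='ms'))
# 2015-10-20 07:21:00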
Update
If you want an epoch timestamp of the form 14567899..
import pandas as pd
import time
t = pd.Timestamp('2015-10-19 07:22:00')
time.mktime(t.timetuple())
>> 1445219520.0
Latest update
import numpy as np

df = pd.DataFrame(s1)
df1 = pd.to_datetime(df['Date'])
pd.DatetimeIndex(df1)
>>>DatetimeIndex(['2015-10-20 07:21:00', '2015-10-19 07:18:00',
'2015-10-19 07:15:00'],
dtype='datetime64[ns]', freq=None)
df1.astype(np.int64)
>>>0 1445325660000000000
1 1445239080000000000
2 1445238900000000000
df1.astype(np.int64) // 10**9
>>>0 1445325660
1 1445239080
2 1445238900
Name: Date, dtype: int64
Timestamps in pandas are stored in nanoseconds (the default resolution).
This gives you milliseconds since the epoch (1970-01-01):
df['actualDateTime'] = df['actualDateTime'].astype(np.int64) / int(1e6)
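A concrete sketch of that conversion, reusing the question's sample dates (integer division keeps the result as int64 milliseconds):
import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(['2015-10-20T07:21:00.000',
                              '2015-10-19T07:18:00.000']))
ms = s.astype(np.int64) // 10**6  # ns since epoch -> ms since epoch
print(ms.tolist())                # [1445325660000, 1445239080000]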
This will return milliseconds from epoch
timestamp_object.timestamp() * 1000
pandas.to_datetime converts strings (and a few other datatypes) to pandas datetime64[ns].
In your case the initial 'actualDateTime' has no millisecond component. If you parse a column that does contain milliseconds, you will get that data.
for example,
df
Out[60]:
a b
0 2015-11-02 18:04:32.926 0
1 2015-11-02 18:04:32.928 1
2 2015-11-02 18:04:32.927 2
df.a
Out[61]:
0 2015-11-02 18:04:32.926
1 2015-11-02 18:04:32.928
2 2015-11-02 18:04:32.927
Name: a, dtype: object
df.a = pd.to_datetime(df.a)
df.a
Out[63]:
0 2015-11-02 18:04:32.926
1 2015-11-02 18:04:32.928
2 2015-11-02 18:04:32.927
Name: a, dtype: datetime64[ns]
df.a.dt.nanosecond
Out[64]:
0 0
1 0
2 0
dtype: int64
df.a.dt.microsecond
Out[65]:
0 926000
1 928000
2 927000
dtype: int64
For what it's worth, to convert a single Pandas timestamp object to milliseconds, I had to do:
import time
time.mktime(<timestamp_object>.timetuple())*1000
For Python >= 3.8, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [1, 2, 3]}, index=[pd.Timestamp.utcnow()] * 3)
convert to milliseconds:
times = df.index.view(np.int64) // int(1e6)
print(times[0])
gives:
1666925409051
Note: to convert to seconds, similarly e.g.:
times = df.index.view(np.int64) // int(1e9)
print(times[0])
1666925409
from datetime import datetime
print(datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])
>>>> OUTPUT >>>>
2015-11-02 18:04:32.926
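On Python 3.12+, datetime.utcnow() is deprecated; an equivalent sketch using an aware datetime:
from datetime import datetime, timezone

print(datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])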

Calculating date_range over GroupBy object in pandas

I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so the range max - min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps we have just subtracted (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
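These can also be phrased as a single aggregation over the grouped column rather than a frame-level apply; a sketch with made-up page names:
import pandas as pd

df = pd.DataFrame({'page': ['/', '/', '/x'],
                   'date': pd.to_datetime(['2013-01-01', '2013-01-05',
                                           '2013-01-02'])})
# Each group's access span comes back as a timedelta64 Series
spans = df.groupby('page')['date'].agg(lambda s: s.max() - s.min())
print(spans)
# page
# /     4 days
# /x    0 days
# Name: date, dtype: timedelta64[ns]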

Split a Pandas 'findall' result list into multiple items, to group by uniques

I've downloaded my Twitter archive and I'm trying to do some analysis on who I have talked to the most.
Tweets CSV columns look like this:
tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,timestamp,source
I've used read_csv() to import the tweets.csv file into a dataframe called "indata".
Then, to get a list of all the @handles mentioned in tweets, I used the following:
handles = indata['text'].str.findall('@[a-zA-Z0-9_-]*')
Result:
timestamp
...
2013-04-12 11:24:27 [@danbarker]
2013-04-12 11:22:32 [@SeekTom]
2013-04-12 10:50:45 [@33Digital, @HotwirePR, @kobygeddes, @]
2013-04-12 08:00:03 [@mccandelish]
2013-04-12 07:59:01 [@Mumbrella]
...
Name: text, dtype: object
What I'd like to be able to do is group by the individual handles and dates, to show counts of who I've spoken to the most over the years.
Any suggestions?
A purely pandas way might be to apply the Series constructor to put this into one DataFrame and stack it into a Series (so you can use value_counts). If you didn't care about the index/timestamp, you could use collections instead (which may be faster):
In [11]: df = pd.DataFrame([['@a @b'], ['@a'], ['@c']], columns=['tweets'])
In [12]: df
Out[12]:
tweets
0  @a @b
1  @a
2  @c
In [13]: at_mentions = df['tweets'].str.findall('@[a-zA-Z0-9_]+')
Note: I'd use + rather than * here since I don't think @ by itself should be included.
In [14]: at_mentions
Out[14]:
0    [@a, @b]
1    [@a]
2    [@c]
Name: tweets, dtype: object
Using collections' Counter this is very easy:
In [21]: from collections import Counter
In [22]: Counter(at_mentions.sum())
Out[22]: Counter({'@a': 2, '@b': 1, '@c': 1})
The pandas way will keep the index (time) information.
Apply Series constructor to get a DataFrame and stack it into a Series:
In [31]: all_mentions = at_mentions.apply(pd.Series)
In [32]: all_mentions
Out[32]:
    0    1
0  @a   @b
1  @a  NaN
2  @c  NaN
We can tidy the names here to be more descriptive about what's going on:
In [33]: all_mentions.columns.name = 'at_number'
In [34]: all_mentions.index.name = 'tweet' # this is timestamp in your example
Now when we stack, we see the names of the levels:
In [35]: all_mentions = all_mentions.stack()
In [36]: all_mentions
Out[36]:
tweet  at_number
0      0    @a
       1    @b
1      0    @a
2      0    @c
dtype: object
We could do lots of other analysis here, for example value_counts:
In [37]: all_mentions.value_counts()
Out[37]:
@a    2
@c    1
@b    1
dtype: int64
The final result is equivalent to pd.Series(Counter(at_mentions.sum())).
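On pandas 0.25+, Series.explode() collapses the apply(pd.Series)/stack reshaping into a single call while keeping the original index; a sketch:
import pandas as pd

df = pd.DataFrame([['@a @b'], ['@a'], ['@c']], columns=['tweets'])
at_mentions = df['tweets'].str.findall('@[a-zA-Z0-9_]+')

# explode() flattens the per-tweet lists, repeating the index as needed
print(at_mentions.explode().value_counts())
# @a    2
# @b    1
# @c    1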
