Calculate time difference between a pandas dataframe column and a datetime object - python

I am trying to get the difference between a pandas dataframe column and a datetime object by using a customized function (years_between). Here's what the pandas dataframe looks like:
input_1['dataadmissao'].head(5)
0 2018-02-10
1 2009-08-23
2 2015-05-21
3 2016-12-17
4 2019-02-01
Name: dataadmissao, dtype: datetime64[ns]
And here's my code:
###################### function to return difference in years ####################
def years_between(start_year, end_year):
    start_year = datetime.strptime(start_year, "%d/%m/%Y")
    end_year = datetime.strptime(end_year, "%d/%m/%Y")
    return abs(end_year.year - start_year.year)

input_1['difference_in_years'] = np.vectorize(years_between(input_1['dataadmissao'], datetime.now()))
Which returns:
TypeError: strptime() argument 1 must be str, not Series
How could I adjust the function so it returns an integer representing the difference in years between the pandas dataframe column and datetime.now()?

Use pandas.Timestamp.now:
>>> df
0 2018-02-10
1 2009-08-23
2 2015-05-21
3 2016-12-17
4 2019-02-01
Name: 1, dtype: datetime64[ns]
>>> pd.Timestamp.now() - df
0 1089 days 02:41:50.467993
1 4182 days 02:41:50.467993
2 2085 days 02:41:50.467993
3 1509 days 02:41:50.467993
4 733 days 02:41:50.467993
Name: 1, dtype: timedelta64[ns]
# If you want days
>>> (pd.Timestamp.now() - df).dt.days
0 1089
1 4182
2 2085
3 1509
4 733
Name: 1, dtype: int64
# If you want years
>>> (pd.Timestamp.now().year - df.dt.year)
0 3
1 12
2 6
3 5
4 2
Name: 1, dtype: int64

Simply subtract the series from datetime.datetime.now(), divide by the duration of one year, and convert to an integer:
import numpy as np
((datetime.now() - input_1['dataadmissao'])/np.timedelta64(1, 'Y')).astype(int)
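A minimal runnable sketch of this approach, using the sample dates from the question. Note one assumption: recent pandas versions reject the ambiguous 'Y' unit in np.timedelta64, so this sketch divides by a 365.25-day Timedelta instead of np.timedelta64(1, 'Y'):

```python
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for input_1['dataadmissao'] from the question
dates = pd.Series(pd.to_datetime(
    ["2018-02-10", "2009-08-23", "2015-05-21", "2016-12-17", "2019-02-01"]
))

# Elapsed time divided by an average year length (365.25 days);
# recent pandas rejects np.timedelta64(1, 'Y') as ambiguous
years = ((datetime.now() - dates) / pd.Timedelta(days=365.25)).astype(int)
print(years)
```

Unlike the subtraction of the year components alone, this counts whole elapsed years, so a date eleven months ago yields 0 rather than 1.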

Related

How to convert X min, Y sec string to timestamp

I have a dataframe with a duration column of strings in a format like:
index  duration
0      26 s
1      24 s
2      4 min, 37 s
3      7 s
4      1 min, 1 s
Is there a pandas or strftime() / strptime() way to convert the duration column to a min/sec timestamp?
I've attempted this way to convert strings, but I'll run into multiple scenarios after replacing strings:
for row in df['index']:
    if "min, " in df['duration'][row]:
        df['duration'][row] = df['duration'][row].replace(' min, ', ':').replace(' s', '')
    else:
        pass
Thanks in advance
Try:
pd.to_timedelta(df['duration'])
Output:
0 0 days 00:00:26
1 0 days 00:00:24
2 0 days 00:04:37
3 0 days 00:00:07
4 0 days 00:01:01
Name: duration, dtype: timedelta64[ns]
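If a "M:SS" string is wanted rather than a timedelta, one possible follow-up (a sketch, not part of the original answer) is to format each parsed value via total_seconds():

```python
import pandas as pd

df = pd.DataFrame({"duration": ["26 s", "24 s", "4 min, 37 s", "7 s", "1 min, 1 s"]})
td = pd.to_timedelta(df["duration"])

# Format each timedelta as "M:SS"; total_seconds() is exact here
# because no duration reaches an hour
mins_secs = td.apply(
    lambda t: f"{int(t.total_seconds() // 60)}:{int(t.total_seconds() % 60):02d}"
)
print(mins_secs.tolist())
```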

Can't group values in a Pandas column by a month

I would like to count the number of instances in the timelog, grouped by month. I have the following Pandas column:
print df['date_unconditional'][:5]
0 2018-10-15T07:00:00
1 2018-06-12T07:00:00
2 2018-08-28T07:00:00
3 2018-08-29T07:00:00
4 2018-10-29T07:00:00
Name: date_unconditional, dtype: object
Then I transformed it to datetime format
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'].dt.strftime('%m/%d/%Y'))
print df['date_unconditional'][:5]
0 2018-10-15
1 2018-06-12
2 2018-08-28
3 2018-08-29
4 2018-10-29
Name: date_unconditional, dtype: datetime64[ns]
And then I tried counting them, but I keep getting an error
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print df['date_unconditional'].groupby(pd.Grouper(freq='M')).count()
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Use parameter key in Grouper:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.groupby(pd.Grouper(freq='M',key='date_unconditional'))['date_unconditional'].count())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, Name: date_unconditional, dtype: int64
Or create a DatetimeIndex with DataFrame.set_index and then use GroupBy.size - the difference between them is that count excludes missing values while size does not.
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.set_index('date_unconditional').groupby(pd.Grouper(freq='M')).size())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, dtype: int64
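Another possible variant (a sketch, not from the original answers) is to group by a monthly period derived from the column itself. Note one difference from Grouper: months with no rows, such as July and September here, are simply absent rather than reported as 0:

```python
import pandas as pd

df = pd.DataFrame({"date_unconditional": [
    "2018-10-15T07:00:00", "2018-06-12T07:00:00", "2018-08-28T07:00:00",
    "2018-08-29T07:00:00", "2018-10-29T07:00:00",
]})
df["date_unconditional"] = pd.to_datetime(df["date_unconditional"], errors="coerce")

# Group by the monthly period of each timestamp and count rows
counts = df.groupby(df["date_unconditional"].dt.to_period("M")).size()
print(counts)
```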

Pandas datetime: can't convert object in dd-mm-YYYY format to datetime in correct format

I have a pandas dataframe
Sl.No Date1
1 08-09-1990
2 01-06-1988
3 04-10-1989
4 15-11-1991
5 01-06-1968
The dtype of Date1 is object
When I tried to convert this object to datetime format:
df["Date1"]= pd.to_datetime(df["Date1"])
I am getting the output as
0 1990-08-09
1 1988-01-06
2 1989-04-10
3 1991-11-15
4 2068-01-06
Also I tried with:
df["Date1"]= pd.to_datetime(df["Date1"],format='%d-%m-%Y')
and
df["Date1"]= pd.to_datetime(df["Date1"],format='%d-%m-%Y', dayfirst = True)
The problems are:
in index 0 the month and day are interchanged
in index 4 the year is taken incorrectly as 2068 instead of 1968
Pass dayfirst=True to to_datetime:
pd.to_datetime(df.Date1,dayfirst=True)
0 1990-09-08
1 1988-06-01
2 1989-10-04
3 1991-11-15
4 1968-06-01
Name: Date1, dtype: datetime64[ns]
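For completeness, a runnable sketch (sample dates copied from the question): in current pandas an explicit format string also removes the day/month ambiguity and parses the 1968 row correctly, since %Y reads the full four-digit year:

```python
import pandas as pd

dates = pd.Series(["08-09-1990", "01-06-1988", "04-10-1989", "15-11-1991", "01-06-1968"])

# An explicit format leaves no room for day/month guessing
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed)
```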

How to extend an object series in length in python

I have a series:
0 2018-08-02 00:00:00
1 2016-07-20 00:00:00
2 2015-09-14 00:00:00
3 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
I wish to extend the series in length by one and shift it, such that the expected output is (where today() is obviously today's date in the same format):
0 today()
1 2018-08-02 00:00:00
2 2016-07-20 00:00:00
3 2015-09-14 00:00:00
4 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
My current approach is to store the last value in the original series:
a = B[B.last_valid_index()]
then append:
B.append(a)
But I get the error:
TypeError: cannot concatenate object of type "<class 'pandas._libs.tslibs.timestamps.Timestamp'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
So I tried:
B.to_pydatetime() but with no luck.
Any ideas? I can neither append nor extend the series (ideally I'm appending), whose entries are objects because they are dates and times.
You can increment your index, add an item by label via pd.Series.loc, and then use sort_index.
It's not clear how last_valid_index is relevant given the input data you have provided.
s = pd.Series(['2018-08-02 00:00:00', '2016-07-20 00:00:00',
'2015-09-14 00:00:00', '2014-09-11 00:00:00'])
s = pd.to_datetime(s)
s.index += 1
s.loc[0] = pd.to_datetime('today')
s = s.sort_index()
Result
0 2018-09-05
1 2018-08-02
2 2016-07-20
3 2015-09-14
4 2014-09-11
dtype: datetime64[ns]
You can do appending here:
s = pd.Series([1,2,3,4])
s1 = pd.Series([5])
s1 = s1.append(s)
s1 = s1.reset_index(drop=True)
Simple and elegant output:
0 5
1 1
2 2
3 3
4 4
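One caveat worth hedging on: Series.append was deprecated and then removed in pandas 2.0, so on current versions the same prepend-and-renumber idea can be sketched with pd.concat (dates copied from the question):

```python
import pandas as pd

s = pd.Series(pd.to_datetime([
    "2018-08-02", "2016-07-20", "2015-09-14", "2014-09-11",
]), name="EUR6m3m")

# Series.append is gone in pandas >= 2.0; pd.concat is the replacement
new = pd.Series([pd.Timestamp("today").normalize()], name="EUR6m3m")
out = pd.concat([new, s], ignore_index=True)
print(out)
```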

pick month start and end data in python

I have stock data downloaded from yahoo finance. I want to pick up the rows corresponding to month start and month end. I am trying to do it with a python pandas dataframe, but I am not finding the correct method to get the start and end of the month. I would be grateful if somebody can help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick up the 2nd day's data. The same rule applies to the end of the month. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, you should convert your date column to datetime format, then group by month, then sort each group by date and take the first/last row using the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can merge the result dataframes using pd.concat().
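Putting the steps of this answer together as one runnable sketch (column names are placeholders; the sample rows are a subset of the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-01-05", "2016-01-04", "2015-12-31",
        "2015-12-30", "2015-12-29", "2015-12-28",
    ]),
    "close": [217.75, 220.70, 224.45, 225.80, 228.20, 228.90],
})

# Group by calendar month, sort each group by date, and take the
# first and last trading row of each month
monthly = df.groupby(df["date"].dt.to_period("M"), group_keys=False)
first_rows = monthly.apply(lambda g: g.sort_values("date").head(1))
last_rows = monthly.apply(lambda g: g.sort_values("date").tail(1))

# Merge both results into a single frame
both = pd.concat([first_rows, last_rows]).sort_values("date")
print(both)
```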
For the first / last day of each month, you can use .resample() with 'BMS' and 'BM' for Business Month (Start) like so (using pandas 0.18 syntax):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data have a DateTimeIndex as usual when downloaded from yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
Assuming you have downloaded data from Yahoo:
> import pandas.io.data as web
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, like the first of the selected month-start rows, you simply do (using .iloc, since the older .ix accessor has been removed from pandas):
df[df.index.is_month_start].iloc[0]
