Can't group values in a Pandas column by a month - python

I would like to count the number of instances in the timelog, grouped by month. I have the following Pandas column:
print df['date_unconditional'][:5]
0 2018-10-15T07:00:00
1 2018-06-12T07:00:00
2 2018-08-28T07:00:00
3 2018-08-29T07:00:00
4 2018-10-29T07:00:00
Name: date_unconditional, dtype: object
Then I converted it to datetime format:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'].dt.strftime('%m/%d/%Y'))
print df['date_unconditional'][:5]
0 2018-10-15
1 2018-06-12
2 2018-08-28
3 2018-08-29
4 2018-10-29
Name: date_unconditional, dtype: datetime64[ns]
Then I tried counting them, but I keep getting an error:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print df['date_unconditional'].groupby(pd.Grouper(freq='M')).count()
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

Use the key parameter of Grouper:
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.groupby(pd.Grouper(freq='M',key='date_unconditional'))['date_unconditional'].count())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, Name: date_unconditional, dtype: int64
Alternatively, create a DatetimeIndex with DataFrame.set_index and then use GroupBy.size. The difference between the two is that count excludes missing values, while size does not.
df['date_unconditional'] = pd.to_datetime(df['date_unconditional'], errors='coerce')
print (df.set_index('date_unconditional').groupby(pd.Grouper(freq='M')).size())
2018-06-30 1
2018-07-31 0
2018-08-31 2
2018-09-30 0
2018-10-31 2
Freq: M, dtype: int64
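A related option, sketched here on the sample data from the question, is to group on the month period of the column itself, which avoids the need for a DatetimeIndex. This is only an illustration, not part of the original answer; note that, unlike Grouper, it lists only months that actually occur, so empty months are not filled in with 0.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date_unconditional": [
        "2018-10-15T07:00:00",
        "2018-06-12T07:00:00",
        "2018-08-28T07:00:00",
        "2018-08-29T07:00:00",
        "2018-10-29T07:00:00",
    ]
})
df["date_unconditional"] = pd.to_datetime(df["date_unconditional"], errors="coerce")

# Convert each timestamp to its calendar month and count rows per month
monthly_counts = df.groupby(df["date_unconditional"].dt.to_period("M")).size()
print(monthly_counts)
```

Rows whose date could not be parsed (NaT) are dropped from the groups by default.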

Related

Calculate time difference between a pandas dataframe column and a datetime object

I am trying to get the difference between a pandas dataframe column and a datetime object by using a customized function (years_between), here's how pandas dataframe looks like:
input_1['dataadmissao'].head(5)
0 2018-02-10
1 2009-08-23
2 2015-05-21
3 2016-12-17
4 2019-02-01
Name: dataadmissao, dtype: datetime64[ns]
And here's my code:
###################### function to return difference in years ####################
def years_between(start_year, end_year):
    start_year = datetime.strptime(start_year, "%d/%m/%Y")
    end_year = datetime.strptime(end_year, "%d/%m/%Y")
    return abs(end_year.year - start_year.year)
input_1['difference_in_years'] = np.vectorize(years_between(input_1['dataadmissao'], datetime.now()))
Which returns:
TypeError: strptime() argument 1 must be str, not Series
How could I adjust the function to return an integer that represents the difference in years between the pandas dataframe column and datetime.now()?
Use pandas.Timestamp.now:
>>> df
0 2018-02-10
1 2009-08-23
2 2015-05-21
3 2016-12-17
4 2019-02-01
Name: 1, dtype: datetime64[ns]
>>> pd.Timestamp.now() - df
0 1089 days 02:41:50.467993
1 4182 days 02:41:50.467993
2 2085 days 02:41:50.467993
3 1509 days 02:41:50.467993
4 733 days 02:41:50.467993
Name: 1, dtype: timedelta64[ns]
# If you want days
>>> (pd.Timestamp.now() - df).dt.days
0 1089
1 4182
2 2085
3 1509
4 733
Name: 1, dtype: int64
# If you want years
>>> (pd.Timestamp.now().year - df.dt.year)
0 3
1 12
2 6
3 5
4 2
Name: 1, dtype: int64
Simply subtract the series from datetime.datetime.now(), divide by the duration of one year, and convert to an integer:
from datetime import datetime
import numpy as np
((datetime.now() - input_1['dataadmissao'])/np.timedelta64(1, 'Y')).astype(int)
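The two notions of "years" used in the answers above give different results, so a small comparison may help. The sketch below uses the sample dates from the question and a fixed reference date (an assumption chosen for reproducibility; in practice you would use pd.Timestamp.now()). It also approximates a year as 365.25 days instead of relying on the 'Y' timedelta unit, which recent pandas versions may reject as ambiguous.

```python
import pandas as pd

# Sample data copied from the question
admission = pd.Series(
    pd.to_datetime(["2018-02-10", "2009-08-23", "2015-05-21", "2016-12-17", "2019-02-01"]),
    name="dataadmissao",
)

# Fixed reference date (assumption, for reproducible output)
now = pd.Timestamp("2021-02-03")

# Difference between calendar years, ignoring month and day
calendar_years = now.year - admission.dt.year

# Whole elapsed years, approximating one year as 365.25 days
elapsed_years = ((now - admission).dt.days / 365.25).astype(int)

print(pd.DataFrame({"calendar": calendar_years, "elapsed": elapsed_years}))
```

The calendar-year difference counts a year as soon as the year number changes, while the elapsed version only counts completed years.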

Getting min value across multiple datetime columns in Pandas

I have the following dataframe
df = pd.DataFrame({
'DATE1': ['NaT', 'NaT', '2010-04-15 19:09:08+00:00', '2011-01-25 15:29:37+00:00', '2010-04-10 12:29:02+00:00', 'NaT'],
'DATE2': ['NaT', 'NaT', 'NaT', 'NaT', '2014-04-10 12:29:02+00:00', 'NaT']})
df.DATE1 = pd.to_datetime(df.DATE1)
df.DATE2 = pd.to_datetime(df.DATE2)
and I would like to create a new column with the minimum value across the two columns (ignoring the NaTs). However, the obvious approach returns NaN for every row:
df.min(axis=1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
If I remove the timezone information (the +00:00) from every single cell then the desired output is produced like so:
0 NaT
1 NaT
2 2010-04-15 19:09:08
3 2011-01-25 15:29:37
4 2010-04-10 12:29:02
5 NaT
dtype: datetime64[ns]
Why does adding the timezone information break the function? My dataset has timezones so I would need to know how to remove them as a workaround.
This is a good question; it looks like a bug with timezone-aware columns. Applying the reduction row by row works:
df.apply(lambda x : np.max(x),1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2014-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
Odd. Seems like a bug. You could keep the timezone format and use this:
df.apply(lambda x: x.min(),axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2010-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
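Since the question also asks how to remove the timezone as a workaround, here is a minimal sketch of that route. It assumes all values are in UTC (as in the sample data) and passes utc=True to to_datetime, which is my addition; tz_convert(None) then drops the timezone while keeping the UTC wall-clock time.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "DATE1": ["NaT", "NaT", "2010-04-15 19:09:08+00:00",
              "2011-01-25 15:29:37+00:00", "2010-04-10 12:29:02+00:00", "NaT"],
    "DATE2": ["NaT", "NaT", "NaT", "NaT", "2014-04-10 12:29:02+00:00", "NaT"],
})

naive = pd.DataFrame(index=df.index)
for col in ["DATE1", "DATE2"]:
    aware = pd.to_datetime(df[col], utc=True)   # timezone-aware (UTC)
    naive[col] = aware.dt.tz_convert(None)      # timezone-naive, still UTC times

# Row-wise minimum; NaT values are skipped, all-NaT rows stay NaT
row_min = naive.min(axis=1)
print(row_min)
```

If the timezone must be preserved in the result, the row-wise apply shown above remains the simpler choice.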

Pandas datetime: can't convert object in dd-mm-YYYY format to datetime in correct format

I have a pandas dataframe
Sl.No Date1
1 08-09-1990
2 01-06-1988
3 04-10-1989
4 15-11-1991
5 01-06-1968
The dtype of Date1 is object. I tried to convert it to datetime format:
df["Date1"]= pd.to_datetime(df["Date1"])
but I get this output:
0 1990-08-09
1 1988-01-06
2 1989-04-10
3 1991-11-15
4 2068-01-06
Also I tried with:
df["Date1"]= pd.to_datetime(df["Date1"],format='%d-%m-%Y')
and
df["Date1"]= pd.to_datetime(df["Date1"],format='%d-%m-%Y', dayfirst = True)
The problems are:
at index 0, the month and day are swapped
at index 4, the year is parsed as 2068 instead of 1968
Pass dayfirst=True to to_datetime:
pd.to_datetime(df.Date1,dayfirst=True)
0 1990-09-08
1 1988-06-01
2 1989-10-04
3 1991-11-15
4 1968-06-01
Name: Date1, dtype: datetime64[ns]
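For comparison, an explicit format string removes the guesswork entirely. The following is a small sketch using the dates from the question, starting from the original strings rather than from an already converted column; if the column was converted earlier with the wrong day/month order, it would need to be reloaded first.

```python
import pandas as pd

# Sample data copied from the question, still as strings (dtype object)
dates = pd.Series(["08-09-1990", "01-06-1988", "04-10-1989", "15-11-1991", "01-06-1968"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed)
```

With a four-digit %Y there is no century ambiguity, so 1968 is parsed as 1968.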

How to extend an object series in length in python

I have a series:
0 2018-08-02 00:00:00
1 2016-07-20 00:00:00
2 2015-09-14 00:00:00
3 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
I wish to extend the series in length by one and shift it, so that the expected output is as follows (where today() is today's date in the same format):
0 today()
1 2018-08-02 00:00:00
2 2016-07-20 00:00:00
3 2015-09-14 00:00:00
4 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
My current approach is to store the last value in the original series:
a = B[B.last_valid_index()]
then append:
B.append(a)
But I get the error:
TypeError: cannot concatenate object of type "<class 'pandas._libs.tslibs.timestamps.Timestamp'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
So I tried:
B.to_pydatetime(), but with no luck.
Any ideas? I cannot append to or extend the series (ideally I would append); its values are objects because they are dates and times.
You can increment your index, add an item by label via pd.Series.loc, and then use sort_index.
It's not clear how last_valid_index is relevant given the input data you have provided.
s = pd.Series(['2018-08-02 00:00:00', '2016-07-20 00:00:00',
'2015-09-14 00:00:00', '2014-09-11 00:00:00'])
s = pd.to_datetime(s)
s.index += 1
s.loc[0] = pd.to_datetime('today')
s = s.sort_index()
Result
0 2018-09-05
1 2018-08-02
2 2016-07-20
3 2015-09-14
4 2014-09-11
dtype: datetime64[ns]
You can also prepend by concatenating two Series (pd.concat replaces Series.append, which was removed in pandas 2.0):
s = pd.Series([1,2,3,4])
s1 = pd.Series([5])
s1 = pd.concat([s1, s])
s1 = s1.reset_index(drop=True)
Output:
0 5
1 1
2 2
3 3
4 4
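Applied to the dates from the question, the same concatenation idea might look like the sketch below. The variable names are illustrative, and normalize() is used on the assumption that "today" should be a midnight timestamp like the other values.

```python
import pandas as pd

# Sample data copied from the question, converted to real datetimes
s = pd.to_datetime(pd.Series(
    ["2018-08-02 00:00:00", "2016-07-20 00:00:00",
     "2015-09-14 00:00:00", "2014-09-11 00:00:00"],
    name="EUR6m3m",
))

# Today's date at midnight
today = pd.Timestamp("today").normalize()

# Put today's date first, then renumber the index from 0
extended = pd.concat([pd.Series([today], name=s.name), s], ignore_index=True)
print(extended)
```

Passing ignore_index=True renumbers the result, so no separate reset_index call is needed.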

python pandas parse date without delimiters 'time data '060116' does not match format '%dd%mm%YY' (match)'

I am trying to parse a date column that looks like the one below,
date
061116
061216
061316
061416
However, I cannot get pandas to accept the date format because there is no delimiter (e.g. '/'). I tried the following:
pd.to_datetime(df['Date'], format='%dd%mm%YY')
but I get this error:
ValueError: time data '060116' does not match format '%dd%mm%YY' (match)
You need to add the parameter errors='coerce' to to_datetime: months 13 and 14 do not exist, so those dates are converted to NaT:
print (pd.to_datetime(df['Date'], format='%d%m%y', errors='coerce'))
0 2016-11-06
1 2016-12-06
2 NaT
3 NaT
Name: Date, dtype: datetime64[ns]
Or maybe you need to swap months and days:
print (pd.to_datetime(df['Date'], format='%m%d%y'))
0 2016-06-11
1 2016-06-12
2 2016-06-13
3 2016-06-14
Name: Date, dtype: datetime64[ns]
EDIT:
print (df)
Date
0 0611160130
1 0612160130
2 0613160130
3 0614160130
print (pd.to_datetime(df['Date'], format='%m%d%y%H%M', errors='coerce'))
0 2016-06-11 01:30:00
1 2016-06-12 01:30:00
2 2016-06-13 01:30:00
3 2016-06-14 01:30:00
Name: Date, dtype: datetime64[ns]
Python's strftime directives.
Your date format is wrong. Each directive is a single letter, and you have days and months reversed. With a two-digit year, it should be:
%m%d%y
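To make the difference between the two readings concrete, here is a short sketch that parses the sample values both ways. It assumes the column holds strings; if the file was read as integers, the leading zero would be lost and the values would need to be zero-padded first.

```python
import pandas as pd

# Sample data copied from the question, kept as strings to preserve leading zeros
raw = pd.Series(["061116", "061216", "061316", "061416"], name="Date")

# Month-day-year reading: every value is a valid date
month_first = pd.to_datetime(raw, format="%m%d%y")

# Day-month-year reading: "months" 13 and 14 are invalid and become NaT
day_first = pd.to_datetime(raw, format="%d%m%y", errors="coerce")

print(month_first)
print(day_first)
```

Counting the NaT values after a coerced parse is a quick way to check whether the chosen format fits the data.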
