Pandas convert string to end of month date - python

I have this problem where one of the columns in my df is entered as a string, but I want to convert it into an end-of-month date in Python. For example,
Id Name Date Number
0 1 A 201601 5
1 2 B 201602 6
2 3 C 201603 4
The Date column has the year and month as a string. Ideally, my goal is:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
I was able to do this in Excel using an end-of-month function and string manipulation, but when I tried pd.to_datetime in Python, it didn't work. Thanks!

We can use MonthEnd:
from pandas.tseries.offsets import MonthEnd
df.Date = (pd.to_datetime(df.Date, format='%Y%m') + MonthEnd(1)).dt.strftime('%m/%d/%Y')
df
Out[1336]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
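For reference, format='%Y%m' parses each value to the first day of its month, and adding MonthEnd(1) rolls that timestamp forward to the month end. A minimal sketch on a single value:

import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.to_datetime('201602', format='%Y%m')   # Timestamp('2016-02-01')
month_end = ts + MonthEnd(1)                   # rolls forward to Timestamp('2016-02-29')
print(month_end.strftime('%m/%d/%Y'))          # 02/29/2016

(If a timestamp were already a month end, MonthEnd(1) would roll it to the next month's end; with the day-1 timestamps produced by format='%Y%m' that is not an issue.)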

You can use PeriodIndex:
In [36]: df['Date'] = pd.PeriodIndex(df['Date'].astype(str), freq='M').strftime('%m/%d/%Y')
In [37]: df
Out[37]:
Id Name Date Number
0 1 A 01/31/2016 5
1 2 B 02/29/2016 6
2 3 C 03/31/2016 4
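A further alternative that stays in datetime space, converting via a monthly period and taking the period end. This is only a sketch, assuming the Date column holds strings like '201601':

dates = pd.to_datetime(df['Date'].astype(str), format='%Y%m')
df['Date'] = dates.dt.to_period('M').dt.to_timestamp(how='end').dt.strftime('%m/%d/%Y')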

Related

Create a new column based on timestamp values to count the days by steps

I would like to add a column to my dataset which corresponds to the timestamp and counts the days in steps. That is, for one year there should be 365 "steps", and I would like all grouped payments for each account on day 1 to be labeled 1 in this column, all payments on day 2 to be labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x

df = df.groupby('account').apply(day_step)
However, it only counts within each month; once a new month begins it starts again from 1.
How can I fix this to make it provide the step count for the entire year?
Use GroupBy.transform with 'first' (or 'min') to get the first timestamp per account, subtract it from the time column, convert the timedeltas to days and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print(df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
A first idea, which works only if the first row per account is January 1:
df['steps'] = df['time'].dt.dayofyear
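For reference, a self-contained sketch of the transform approach above, with the sample frame rebuilt from the question:

import pandas as pd

df = pd.DataFrame({
    'account': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': ['2022.01.01', '2022.01.02', '2022.01.02',
             '2022.01.01', '2022.01.03', '2022.01.05'],
})
df['time'] = pd.to_datetime(df['time'], format='%Y.%m.%d')

# first timestamp per account, broadcast back to every row of that account
first = df.groupby('account')['time'].transform('first')

# elapsed days since that first timestamp, counted from 1
df['steps'] = df['time'].sub(first).dt.days.add(1)
print(df)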

Get row with closest date in other dataframe pandas

I have 2 dataframes
Dataframe1:
id date1
1 11-04-2022
1 03-02-2011
2 03-05-2222
3 01-01-2001
4 02-02-2012
and Dataframe2:
id date2 data data2
1 11-02-2222 1 3
1 11-02-1999 3 4
1 11-03-2022 4 5
2 22-03-4444 5 6
2 22-02-2020 7 8
...
What I would like to do is take the row from Dataframe2 with the closest date to date1 in Dataframe1. The id has to match, and date2 has to be before date1.
The desired output would look like this:
id date1 date2 data data2
1 11-04-2022 11-03-2022 4 5
1 03-02-2011 11-02-1999 3 4
2 03-05-2222 22-02-2020 7 8
How would I do this using pandas?
Try pd.merge_asof, but first convert date1 and date2 to datetime and sort both dataframes:
df1["date1"] = pd.to_datetime(df1["date1"])
df2["date2"] = pd.to_datetime(df2["date2"])
df1 = df1.sort_values(by="date1")
df2 = df2.sort_values(by="date2")
print(
    pd.merge_asof(
        df1,
        df2,
        by="id",
        left_on="date1",
        right_on="date2",
        direction="nearest",
    ).dropna(subset=["date2"])
)
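Two hedged notes on this sketch: the sample dates look day-first (e.g. 11-04-2022), so dayfirst=True may be needed when parsing, and because the question wants the closest date2 strictly before date1, direction="backward" (the merge_asof default) with allow_exact_matches=False arguably matches the requirement better than "nearest":

df1["date1"] = pd.to_datetime(df1["date1"], dayfirst=True)
df2["date2"] = pd.to_datetime(df2["date2"], dayfirst=True)

out = pd.merge_asof(
    df1.sort_values("date1"),
    df2.sort_values("date2"),
    by="id",
    left_on="date1",
    right_on="date2",
    direction="backward",        # only matches date2 <= date1
    allow_exact_matches=False,   # require date2 strictly before date1
).dropna(subset=["date2"])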

How to select the 3 last dates in Python

I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the last 3 dates, the last 4 dates, etc.)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data when building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
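One caveat with this one-liner: in the question the date column holds dd-mm-yyyy strings, and sorting those as strings is not chronological, so it is safer to convert to datetime first (dayfirst=True is an assumption based on the sample data):

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
last3 = df.sort_values('date').groupby('ID').tail(3).sort_values(['ID', 'date'])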
I tried this but with a non-datetime data type
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o

Creating a new Column "Week" from existing column of Date

I have a dataset which has a date column in a continuous format. I would like to add a new column which extracts the week from the value in the Date column.
A B
1 20050121
2 20050111
3 20050205
4 20050101
Here the B column indicates the date in YEAR|MONTH|DAY format. I would like to add a new column to this dataset which takes the date and tells us which week it belongs to, something like this:
A B C
1 20050121 3
2 20050111 2
3 20050205 5
4 20050101 1
The weeks start from the 1st of January 2005. I thought of splitting the month and day values and then calculating from those two values. How can I do this?
It seems you need strftime (format codes are listed at http://strftime.org/):
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.strftime('%W')
print (df)
A B C
0 1 20050121 03
1 2 20050111 02
2 3 20050205 05
3 4 20050101 00
If need ints:
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.strftime('%W').astype(int)
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 0
If you use weekofyear, you get more than 50 for the first week:
df['C'] = pd.to_datetime(df['B'], format='%Y%m%d').dt.weekofyear
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 53
But it is possible to mask it:
import numpy as np

dates = pd.to_datetime(df['B'], format='%Y%m%d')
m = (dates.dt.month == 1) & (dates.dt.weekofyear > 50)
df['C'] = np.where(m, 1, dates.dt.weekofyear)
print (df)
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 1
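A hedged side note: in newer pandas versions Series.dt.weekofyear is deprecated in favour of Series.dt.isocalendar().week, which gives the same ISO week number, so the masked version above could be written as:

dates = pd.to_datetime(df['B'], format='%Y%m%d')
week = dates.dt.isocalendar().week
df['C'] = np.where((dates.dt.month == 1) & (week > 50), 1, week)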
In general, this will work, but there is some confusion around the beginning of the year:
import datetime
date_from_str = datetime.datetime.strptime
df = pd.DataFrame([[1, 20050121],
                   [2, 20050111],
                   [3, 20050205],
                   [4, 20050101]], columns=['A', 'B'])
df['C'] = df['B'].astype('str').apply(
    lambda date: date_from_str(date, '%Y%m%d').isocalendar()[1])
df
Output is:
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 53
To avoid this, someone here suggested this ad hoc correction:
from datetime import date, timedelta

def correct(date_):
    year, week = date_.year, date_.isocalendar()[1]
    ret = date_from_str('%04d-%02d-1' % (year, week), '%Y-%W-%w')
    if date(year, 1, 4).isoweekday() > 4:
        ret -= timedelta(days=7)
    return ret.isocalendar()[1]

df['C'] = df['B'].astype('str').apply(lambda d: correct(date_from_str(d, '%Y%m%d')))
Then, output will be:
A B C
0 1 20050121 3
1 2 20050111 2
2 3 20050205 5
3 4 20050101 1

Python VLOOKUP based on dates - Pandas

Having an issue with a pandas df: I am trying to fill the "Posts" column based on dates. The code should search for each "date_range" value within the "Dates" column, and if it is present, the "Count" for that date should be copied into the "Posts" column of the corresponding date_range row.
E.g., for the date_range value 16/02/2017, the code searches for 16/02/2017 in the "Dates" column and makes "Posts" equal to the "Count" value of that date; if the date_range value does not appear, Posts should = 0.
Data Example:
Dates Count date_range Posts
0 07/02/2017 1 16/12/2016 (should = 5)
1 01/03/2017 1 17/12/2016
2 15/02/2017 1 18/12/2016
3 23/01/2017 1 19/12/2016
4 28/02/2017 1 20/12/2016
5 09/02/2017 2 21/12/2016
6 20/03/2017 2 22/12/2016
7 16/12/2016 5
My code looks like this:
DateList = df['Dates'].tolist()
for date in df['date_range']:
    if str(date) in DateList:
        df['Posts'] = df['Count']
    else:
        dates_df['Posts'] = 0
However this makes the data map the wrong values to "Posts"
Hopefully I explained this correctly! Thanks in advance for the help!
You can first create a dict of matching values and then map it over the date_range column:
print (df)
Dates Count date_range
0 07/02/2017 1 16/12/2016
1 01/03/2017 1 17/12/2016
2 15/02/2017 1 18/12/2016
3 23/01/2017 1 19/12/2016
4 28/02/2017 1 07/02/2017 <-change value for match
5 09/02/2017 2 21/12/2016
6 20/03/2017 2 22/12/2016
7 16/12/2016 5 22/12/2016
d = df[df['Dates'].isin(df.date_range)].set_index('Dates')['Count'].to_dict()
print (d)
{'16/12/2016': 5, '07/02/2017': 1}
df['Posts'] = df['date_range'].map(d).fillna(0).astype(int)
print (df)
Dates Count date_range Posts
0 07/02/2017 1 16/12/2016 5
1 01/03/2017 1 17/12/2016 0
2 15/02/2017 1 18/12/2016 0
3 23/01/2017 1 19/12/2016 0
4 28/02/2017 1 07/02/2017 1
5 09/02/2017 2 21/12/2016 0
6 20/03/2017 2 22/12/2016 0
7 16/12/2016 5 22/12/2016 0
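An equivalent sketch using a left merge instead of a dict, closer to the Excel VLOOKUP the title mentions (it assumes the Dates values are unique):

# lookup table: one Posts value per date (assumes Dates values are unique)
lookup = df[['Dates', 'Count']].rename(columns={'Dates': 'date_range', 'Count': 'Posts'})
merged = df[['date_range']].merge(lookup, on='date_range', how='left')
df['Posts'] = merged['Posts'].fillna(0).astype(int).to_numpy()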
