Combine month name and year in a column pandas python

df
Year Month Name Avg
2015 Jan 12
2015 Feb 13.4
2015 Mar 10
...................
2019 Jan 11
2019 Feb 11
Code
df['Month Name-Year']= pd.to_datetime(df['Month Name'].astype(str)+df['Year'].astype(str),format='%b%Y')
In the dataframe df, Avg is the output of a groupby keyed on month name and year, so month name and year are actually levels of a MultiIndex. I want to create a third column, Month Name-Year, so that I can work with the data (create plots, etc.).
The output I am getting using the code is as below:
Year Month Name Avg Month Name-Year
2015 Jan 12 2015-01-01
2015 Feb 13.4 2015-02-01
2015 Mar 10 2015-03-01
...................
2019 Nov 11 2019-11-01
2019 Dec 11 2019-12-01
and so on.
The output I want is 2015-Jan, 2015-Feb etc in Month Name-Year column...or I want 2015-01, 2015-02...2019-11, 2019-12 etc (only year and month, no days).
Please help

One solution is to convert to datetimes and then change the format with Series.dt.to_period or Series.dt.strftime:
df['Month Name-Year']=pd.to_datetime(df['Month Name']+df['Year'].astype(str),format='%b%Y')
# for monthly periods
df['Month Name-Year1'] = df['Month Name-Year'].dt.to_period('M')
#for 2010-02 format
df['Month Name-Year2'] = df['Month Name-Year'].dt.strftime('%Y-%m')
The simplest solution skips the datetime conversion: convert the years to strings and join them to the month names with -:
#format 2010-Feb
df['Month Name-Year3'] = df['Year'].astype(str) + '-' + df['Month Name']
...which gives the same result as converting to datetimes and then formatting them as custom strings:
#format 2010-Feb
df['Month Name-Year31'] = df['Month Name-Year'].dt.strftime('%Y-%b')
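The variants above can be put side by side in one small sketch. The sample rows and the column names period, year_month and year_mon are made up for illustration, and %b assumes English month abbreviations:

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "Year": [2015, 2015, 2019],
    "Month Name": ["Jan", "Feb", "Dec"],
    "Avg": [12.0, 13.4, 11.0],
})

# Parse "Jan2015"-style strings into timestamps (first day of each month).
dates = pd.to_datetime(df["Month Name"] + df["Year"].astype(str), format="%b%Y")

df["period"] = dates.dt.to_period("M")         # monthly Period values
df["year_month"] = dates.dt.strftime("%Y-%m")  # strings like "2015-01"
df["year_mon"] = df["Year"].astype(str) + "-" + df["Month Name"]  # "2015-Jan"

print(df)
```

For plotting, the period or year_month column is probably the better choice, because it sorts chronologically while "2015-Jan" strings sort alphabetically.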

Related

Convert column with month and year ("August 2020"...) to datetime

I have dataframe which contains one column of month and year as string :
>>>time index value
January 2021 y 5
January 2021 v 8
May 2020 y 25
June 2020 Y 13
June 2020 x 11
June 2020 v 10
...
I would like to change the column "time" into datetime format so I can sort the table by chronological order.
Is there any way to do it when the time is a string with a month name and a number?
#edit:
when I do :
result_Table['time']=pd.to_datetime(result_Table['time'],format='%Y-%m-%d')
I receive this error:
ValueError: time data January 2021 doesn't match format specified
Sample dataframe:
df=pd.DataFrame({'time':['January 2021','May 2020','June 2020']})
If you want to specify the format parameter then that should be '%B %Y' instead of '%Y-%m-%d':
df['time']=pd.to_datetime(df['time'],format='%B %Y')
#OR
#you can also simply use:
#df['time']=pd.to_datetime(df['time'])
output of df:
time
0 2021-01-01
1 2020-05-01
2 2020-06-01
For more info on format codes, see the "strftime() and strptime() Behavior" section of the Python datetime documentation.
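Since the goal was to sort the table chronologically, here is a minimal sketch of the conversion followed by the sort. The value column and the name df_sorted are assumptions added for illustration:

```python
import pandas as pd

# Hypothetical sample based on the question's data.
df = pd.DataFrame({
    "time": ["January 2021", "May 2020", "June 2020"],
    "value": [5, 25, 13],
})

# %B is the full month name, %Y the four-digit year.
df["time"] = pd.to_datetime(df["time"], format="%B %Y")

# Once the column holds datetimes, sorting is chronological.
df_sorted = df.sort_values("time").reset_index(drop=True)
print(df_sorted)
```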

Data normalization and rescaling value in Python

I have a dataset which contains URLs with publish date (YYYY-MM-DD) and visits. I want to calculate a benchmark (average) of visits for a complete year. Pages were published on different dates, so, for example, a page published in August with 10,000 visits should carry more weight than a page published in March with 11,000 visits.
Here is my dataset:
First step:
So first of all I want to add a column (i.e. time frame) in my data set which can calculate the time frame from the Publish date. For example: if the page was published on 2019-12-10, it can give the time frame/duration from my today's date, expected o/p: (Dec 2019, 9 Months). i.e. (Month Year on which the page was published, Total months from today)
Second step:
I want to normalize/rescale my data (visits) on the basis of calculated time frame column in step 1.
How can I calculate average/benchmark.
For the first step you can use the following code.
First, read the dataframe:
import pandas as pd
df = pd.read_csv("your_df.csv")
My example dataframe as below:
Pub.Dates Type Visits
0 2019-12-10 00:00:00 A 1000
1 2019-12-15 00:00:00 A 5000
2 2018-06-10 00:00:00 B 6000
3 2018-03-04 00:00:00 B 12000
4 2019-02-10 00:00:00 A 3000
To normalize the dates, first define a function that normalizes a single date:
from datetime import datetime
def normalize_date(date):  # input: '2019-12-10 00:00:00'
    date_obj = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")  # get datetime object
    date_to_str = date_obj.strftime("%B %Y")  # 'December 2019'
    diff_date = datetime.now() - date_obj  # find diff from today
    diff_month = int(diff_date.days / 30)  # convert days to months (approximation: 30 days per month)
    normalized_value = date_to_str + ", " + str(diff_month) + " months"
    return normalized_value  # 'December 2019, 9 months'
now apply the above method to all values of the date column:
df['Pub.Dates'] =list(map(lambda x: normalize_date(x), df["Pub.Dates"].values))
The normalized dataframe will be:
Pub.Dates Type Visits
0 December 2019, 9 months A 1000
1 December 2019, 9 months A 5000
2 June 2018, 27 months B 6000
3 March 2018, 31 months B 12000
4 February 2019, 19 months A 3000
5 July 2020, 2 months C 9000
For the second step, if there are multiple records per month, group by the date and any other columns you need, then take the mean. Note that the keys must be passed as a list; a tuple is treated as a single key:
average_in_visits = df.groupby(["Pub.Dates", "Type"]).mean()
the result will be:
Visits
Pub.Dates Type
December 2019, 9 months A 3000
February 2019, 19 months A 3000
July 2020, 2 months C 9000
June 2018, 27 months B 6000
March 2018, 31 months B 12000
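The rescaling itself is not spelled out above. One possible reading is "visits per month since publication". A minimal sketch under that assumption, using a fixed reference date instead of today's date so the result is reproducible; the names months_live and visits_per_month are made up for illustration:

```python
import pandas as pd

# Hypothetical sample based on the example dataframe.
df = pd.DataFrame({
    "Pub.Dates": ["2019-12-10 00:00:00", "2018-06-10 00:00:00"],
    "Visits": [1000, 6000],
})

pub = pd.to_datetime(df["Pub.Dates"])
today = pd.Timestamp("2020-09-15")  # assumed reference date, not the real "today"

# Whole calendar months between the publish month and the reference month.
df["months_live"] = (today.year - pub.dt.year) * 12 + (today.month - pub.dt.month)

# Rescale visits by time online; clip avoids dividing by zero for
# pages published in the reference month itself.
df["visits_per_month"] = df["Visits"] / df["months_live"].clip(lower=1)
print(df)
```

Counting calendar months this way avoids the 30-day approximation, and the average of visits_per_month could then serve as the benchmark.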

Python - Extract year and month from a single column of different year and month arrangements

I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])
Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can use the Series.str.extract string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and capture a Year1 column and a Year2 column, one for each position. Then use np.where to create a single Year column that pulls from whichever of these two columns is populated.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
    split_dates["Year1"].notna(),
    split_dates["Year1"],
    split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime
def parse_dt(x):
    try:
        return datetime.strptime(x, "%y-%b")  # e.g. '18-Jan'
    except ValueError:
        return datetime.strptime(x, "%b-%y")  # e.g. 'Oct-17'
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
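The same idea can also be written without apply. A minimal sketch, assuming the column only ever contains the two layouts shown in the question: parse with each format, let mismatches become NaT, then fill the gaps of one result from the other. The names first and second are just for illustration:

```python
import pandas as pd

# Hypothetical sample based on the question's data.
df = pd.DataFrame({
    "Date": ["18-Jan", "18-Feb", "Oct-17"],
    "Quantity": [3476, 789, 81],
})

# errors="coerce" turns values that do not match the format into NaT.
first = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
second = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")

# Take the YY-Mmm parse where it worked, otherwise the Mmm-YY parse.
df["timestamp"] = first.fillna(second)
df["Year"] = df["timestamp"].dt.year
df["Month"] = df["timestamp"].dt.strftime("%b")
print(df)
```

Any value that matches neither format stays NaT, which makes bad rows easy to spot with df["timestamp"].isna().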

Pandas to_datetime convert year-week to date in 2019, first week is wk0

I have been converting year and week number to a date for some purpose.
It worked well before 2019, but when I tried to import 2019 wk1 data, the result was weird.
2019 wk1 becomes 2019-01-07 ~ 2019-01-13.
On the contrary, if I convert a date to year and week, the result is correct.
May I know what's wrong with my code? Thanks.
d = {'year': [2018, 2018, 2018, 2019, 2019, 2019], 'week': [0, 1, 52, 0, 1, 2]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['year'].map(str) + df['week'].map(str) + '-4', format='%Y%W-%w')
df['yearwk'] = df['date'].dt.strftime('%YW%V')  # 'date' is already datetime, no need to re-parse
print(df)
year week date yearwk
0 2018 0 2018-01-04 2018W01
1 2018 1 2018-01-04 2018W01
2 2018 52 2018-12-27 2018W52
3 2019 0 2019-01-03 2019W01
4 2019 1 2019-01-10 2019W02
5 2019 2 2019-01-17 2019W03
I use the given year and week number to convert to a date. Ideally, 2019 WK1 should be between 2018-12-31 and 2019-01-05, but it became 2019-01-06 to 2019-01-13.
Then I use that date to convert back to year and week, and the result is what I expected.
According to strftime() and strptime() Behavior,
%W: Week number of the year (Monday as the first day of the week) as a decimal number. All days in a new year preceding the first Monday are considered to be in week 0.
%w: Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.
You put 4 as %w, indicating you want date of Thursday of the week number you provide.
First week (%W=1) of 2018 that starts with Monday is: Jan 1 - Jan 7 or Jan 4 (Thursday)
First week (%W=1) of 2019 that starts with Monday is: Jan 7 - Jan 13 or Jan 10 (Thursday)
For 2018, there's no Thursday the week before, so %W=0 will output the same as %W=1. However, for 2019, there is a Thursday the week before, so %W=0 will output Jan 3.
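If ISO 8601 week numbers are what is wanted (week 1 is the week containing the year's first Thursday, which is what %V produces when formatting), parsing can use the matching ISO directives %G, %V and %u instead of %Y, %W and %w. A minimal sketch with plain Python datetime, assuming Python 3.6 or later; the variable names are made up for illustration:

```python
from datetime import datetime

# %W style: week 1 starts at the first Monday of the year,
# so Thursday (%w = 4) of week 1 in 2019 is Jan 10.
w_style = datetime.strptime("2019-01-4", "%Y-%W-%w")

# ISO style: %G is the ISO year, %V the ISO week, %u the ISO weekday
# (1 = Monday ... 7 = Sunday), so Thursday of ISO week 1 is Jan 3.
iso_style = datetime.strptime("2019-01-4", "%G-%V-%u")

print(w_style.date(), iso_style.date())
```

The round trip is then consistent: formatting iso_style with '%GW%V' gives back 2019W01.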

convert year to a date with adding some number of day in pandas

I have a dataframe that looks like this:
Year vl
2017 20
2017 21
2017 22
2017 23
2017 24
2017 25
2017 26
...
I need to convert the year into the format dd.mm.yyyy. Every time start from the first day of the year. For example, 2017 will become 01.01.2017. And then, I need to multiply each value in the column "vl" by 7 and add them line by line to the column as the number of days, where the dates will be in the new format (as in the example 01.01.2017).
The result should be something like this:
Year vl new_date
2017 20 21.05.2017
2017 21 28.05.2017
2017 22 04.06.2017
2017 23 11.06.2017
2017 24 18.06.2017
2017 25 25.06.2017
2017 26 02.07.2017
...
Here is one option by pasting the Year (%Y) and Day of the year (%j) together and then parse and reformat it:
from datetime import datetime
df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
#0 21.05.2017
#1 28.05.2017
#2 04.06.2017
#3 11.06.2017
#4 18.06.2017
#5 25.06.2017
#6 02.07.2017
#dtype: object
Assign the column back to the original data frame:
df['new_date'] = df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
Unfortunately %U and %W aren't implemented in Pandas
But we can use the following vectorized approach:
In [160]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.vl.mul(7).astype(str) + ' days')
Out[160]:
0 2017-05-21
1 2017-05-28
2 2017-06-04
3 2017-06-11
4 2017-06-18
5 2017-06-25
6 2017-07-02
dtype: datetime64[ns]
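To get the dd.mm.yyyy strings the question asked for, the vectorized result can be formatted with dt.strftime. A minimal sketch on a shortened, hypothetical sample, passing the day counts to pd.to_timedelta with unit="D" instead of building ' days' strings:

```python
import pandas as pd

# Hypothetical sample based on the question's data.
df = pd.DataFrame({"Year": [2017, 2017, 2017], "vl": [20, 21, 26]})

# January 1st of each year.
start = pd.to_datetime(df["Year"].astype(str), format="%Y")

# vl weeks expressed as a number of days.
offset = pd.to_timedelta(df["vl"] * 7, unit="D")

df["new_date"] = (start + offset).dt.strftime("%d.%m.%Y")
print(df)
```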
